Introduction
This blog post introduces and gives usage examples of the Machine Learning (ML) data structure Tries with frequencies, [AA1], creation and usage through the Python package “TriesWithFrequencies”.
For the original Trie (or Prefix tree) data structure see the Wikipedia article “Trie”.
Setup
from TriesWithFrequencies import *
Creation examples
In this section we show a few ways to create tries with frequencies.
Consider a trie (prefix tree) created over a list of words:
tr = trie_create_by_split( ["bar", "bark", "bars", "balm", "cert", "cell"] )
trie_form(tr)
TRIEROOT => 6.0
├─b => 4.0
│ └─a => 4.0
│ ├─r => 3.0
│ │ └─k => 1.0
│ │ └─s => 1.0
│ └─l => 1.0
│ └─m => 1.0
└─c => 2.0
└─e => 2.0
├─r => 1.0
│ └─t => 1.0
└─l => 1.0
└─l => 1.0
Here we convert the trie with frequencies above into a trie with probabilities:
ptr = trie_node_probabilities( tr )
trie_form(ptr)
TRIEROOT => 1.0
├─b => 0.6666666666666666
│ └─a => 1.0
│ ├─r => 0.75
│ │ ├─k => 0.3333333333333333
│ │ └─s => 0.3333333333333333
│ └─l => 0.25
│ └─m => 1.0
└─c => 0.3333333333333333
└─e => 1.0
├─r => 0.5
│ └─t => 1.0
└─l => 0.5
└─l => 1.0
Shrinking
Here we shrink the trie with probabilities above:
trie_form(trie_shrink(ptr))
TRIEROOT => 1.0
└─ba => 1.0
└─r => 0.75
└─k => 0.3333333333333333
└─s => 0.3333333333333333
└─lm => 1.0
└─ce => 1.0
└─rt => 1.0
└─ll => 1.0
Here we shrink the frequencies trie using a separator:
trie_form(trie_shrink(tr, sep="~"))
TRIEROOT => 6.0
└─b~a => 4.0
└─r => 3.0
└─k => 1.0
└─s => 1.0
└─l~m => 1.0
└─c~e => 2.0
└─r~t => 1.0
└─l~l => 1.0
Retrieval and sub-tries
Here we retrieve a sub-trie with a key:
trie_form(trie_sub_trie(tr, list("bar")))
r => 3.0
└─k => 1.0
└─s => 1.0
Classification
Create a trie:
words = [*(["bar"] * 6), *(["bark"] * 3), *(["bare"] * 2), *(["cam"] * 3), "came", *(["camelia"] * 4)]
tr = trie_create_by_split(words)
tr = trie_node_probabilities(tr)
Show node counts:
trie_node_counts(tr)
{'total': 13, 'internal': 10, 'leaves': 3}
Show the trie form:
trie_form(tr)
TRIEROOT => 1.0
├─b => 0.5789473684210527
│ └─a => 1.0
│ └─r => 1.0
│ ├─k => 0.2727272727272727
│ └─e => 0.18181818181818182
└─c => 0.42105263157894735
└─a => 1.0
└─m => 1.0
└─e => 0.625
└─l => 0.8
└─i => 1.0
└─a => 1.0
Classify with the letters of the word \”cam\”:
trie_classify(tr, list("cam"), prop="Probabilities")
{'a': 0.5, 'm': 0.375, 'e': 0.12499999999999997}
References
Articles
[AA1] Anton Antonov, “Tries with frequencies for data mining”, (2013), MathematicaForPrediction at WordPress.
[AA2] Anton Antonov, “Removal of sub-trees in tries”, (2013), MathematicaForPrediction at WordPress.
[AA3] Anton Antonov, “Tries with frequencies in Java” (2017), MathematicaForPrediction at WordPress. GitHub Markdown.
[WK1] Wikipedia entry, Trie.
Packages
[AAp1] Anton Antonov, Tries with frequencies Mathematica Version 9.0 package, (2013), MathematicaForPrediction at GitHub.
[AAp2] Anton Antonov, Tries with frequencies Mathematica package, (2013-2018), MathematicaForPrediction at GitHub.
[AAp3] Anton Antonov, Tries with frequencies in Java, (2017), MathematicaForPrediction at GitHub.
[AAp4] Anton Antonov, Java tries with frequencies Mathematica package, (2017), MathematicaForPrediction at GitHub.
[AAp5] Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), MathematicaForPrediction at GitHub.
[AAp6] Anton Antonov, ML::TriesWithFrequencies Raku package, (2021), GitHub/antononcube.
Videos
[AAv1] Anton Antonov, “Prefix Trees with Frequencies for Data Analysis and Machine Learning”, (2017), Wolfram Technology Conference 2017, Wolfram channel at YouTube.