CS 61B | Lecture 15 ~ 16 | Prefix Operations and Tries, Range Searching and Multi-Dimensional Data


Tries

Reflection

In terms of the ADTs Set and Map, let’s recall the runtimes for both of these:

  • Balanced Search Tree:
    • contains(x): $\Theta(\log{N})$
    • add(x): $\Theta(\log{N})$
  • Resizing Separate Chaining Hash Table:
    • contains(x): $\Theta(1)$ (assuming even spread)
    • add(x): $\Theta(1)$ (assuming even spread and amortized)

Question: Can we do even better than this? That could depend on the nature of our problem. What if we knew a special characteristic of the keys?

Answer: Yes, for example if we always know our keys are ASCII characters then instead of using a general-purpose HashMap, simply use an array where each character is a different index in our array:

1
2
3
4
5
6
7
8
public class DataIndexedCharMap<V> {
private V[] items;
public DataIndexedCharMap(int R) {
items = (V[]) new Object[R];
}
public void put(char c, V val) { items[c] = val; }
public V get(char c) { return items[c]; }
}

The value R represents the number of possible characters (e.g. 128 for ASCII). So, by knowing the range of our keys as well as what types of values they will be we have a simple and efficient data structure.

A key takeaway is that we can often improve a general-purpose data structure when we add specificity to our problem, often by adding additional constraints. For example, we improved our implementation of HashMap when we restricted the keys to only be ASCII character, creating extremely efficient data structure.

Definition

The Trie is a specific implementation for Sets and Maps that is specialized for strings (like hash table is specialized for the non-comparable).

It is a short for Retrieval tree, but almost everyone pronounces it as “try”. The inventor Edward Fredkin suggested it be pronounced as “tree”.

Tries are a very useful data structure used in cases where keys can be broken into “characters” and share prefixes with other keys (e.g. strings).

With the previous implementations BST and RST Hash Table, we may have the following structures:

Key ideas:

  • Every node stores only one letter.
  • Nodes can be shared by multiple keys.

An important observation to make is that most of these words share the same prefixes.

To check if the trie contains a key, walk down the tree from the root along the correct nodes.

Since we are going to share nodes, we must figure out some way to represent which strings belong in our set and which don’t. To search, we will traverse our trie and compare to each character. There are only two cases when we would not be able to find a string:

  • Hit an unmarked node or the final node is not marked (not the ending node).
  • Fall off the tree (no node matching the current character).

A trie can be used as a map too.

Implementation

Approach 1

DataIndexedCharMap is for a node accessing its children. For example, node a can access node d by “d”, node m by “m”, and node p by “p”.

We can make an important observation: Each link corresponds to a character if and only if that character exists. Therefore, we can remove the Node’s character variable and instead base the value of the character from its position in the parent DataIndexedCharMap.

Note: In real implementation, contains(char) and contains(String) are different. In this case, we’d better use String.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
public class DataIndexedCharMap<V> {
private V[] items;
public DataIndexedCharMap(int R) {
items = (V[]) new Object[R];
}
public void put(char c, V x) {
items[c] = x;
}
public V get(char c) {
if (c < 0 || c > R - 1) return null;
return items[c];
}
public V contains(char c) {
return get(c) != null;
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
public class TrieSet {
// R indicates how many possible children a node may have
private static final int R = 128; // ASCII
private Node root; // root of trie

public TrieSet() {
root = new Node(false, R);
}

private static class Node {
// private char ch;
private boolean isKey; // is an ending node?
private DataIndexedCharMap next;

private Node(boolean b, int R) {
// ch = c;
isKey = b;
next = new DataIndexedCharMap<Node>(R);
}

public void setKey(boolean k) {
isKey = k;
}
}

// root
// / \
// a [c]
// /
// [d]
// Test: "a", "c", "d", "ad", "ae", "ca", "adc"
public boolean contains(String str) {
if (str == null || str.length == 0) return false;
Node p = root;
for (int i = 0; i < str.length(); i += 1) {
boolean end = (i == (str.length() - 1));
char c = str.charAt(i);
DataIndexedCharMap map = p.next;
if (map.contains(c) == false) return false; // fall off
p = map.get(c); // get the current node & set next
if (end && p.isKey) return p.isKey;
}
return false; // actually won't be executed
}

// Consider adding "same" first, and then add "sam"
// We need to update "m" rather than new an "m" node.

public void add(String str) {
if (str == null || str.length() == 0) return;
Node p = root;
for (int i = 0; i < str.length(); i += 1) {
boolean end = (i == (str.length() - 1);
char c = str.charAt(i);
DataIndexedCharMap map = p.next;
if (map.contains(c) == false) { // not existed -> add
map.put(c, new Node(end, R));
}
p = map.get(c); // set next
if (end) { // update last node's key if it exists.
p.setKey(true);
}
}
}
}

Problem: If we use a DataIndexedCharMap to track children, every node has $R$ links.

Observation: The letter stored inside each node is actually redundant. So we can remove from the representation.

Performance in Terms of N

Given a Trie with $N$ keys. What is the runtimes of add and contains? $\Theta(1)$.

Runtimes independent of number of keys. It doesn’t matter how many items we have in our trie.

Or in terms of $L$, the length of the key: $\Theta(L)$.

Approach 2

Alternate Idea #1: Hash Table

Alternate Idea #2: BST

So, when we implement a Trie, we have to pick a map to our children.

Now we have:

  • DataIndexedCharMap: Very fast, but memory hungry.
  • Hash Table: Almost as fast, uses less memory.
  • Balanced BST: A little slower than Hash Table, uses similar amount of memory.

Let’s look into the details:

  • DataIndexedCharMap
    • Space: 128 links per node.
    • Runtime: $\Theta(1)$
  • BST
    • Space: $C$ links per node, where $C$ is the number of children.
    • Runtime: $O(\log{R})$, where $R$ is the size of alphabet.
  • Hash Table
    • Space: $C$ links per node.
    • Runtime: $O(R)$. (all mapping to a bucket)

Note: Cost per link is higher in BST and Hash Table.

However, since $R$ is fixed (e.g. 128), we can think of all three as $\Theta(1)$.

You would have to do computational experiments to see which is best for your application.

String Operations

Theoretical asymptotic speed improvement is nice. but the main appeal of tries is their ability to efficiently support string specific operations like prefix matching.

  • Finding all keys that match a given prefix: keysWithPrefix(“sa”)
  • Finding longest prefix of: longestPrefixOf(“sample”)

collect()

Given an algorithm for collecting all the keys in a Trie.

collect() returns [“a”, “awls”, “sad”, “sam”, “same”, “sap”]

collect():

  • Create an empty list of returns x
  • For character c in root.next.keys():
    • Call colHelp(c, root.next.get(c))
  • Return x

colHelp(String s, List x, Node n):

  • If n.isKey == true, then x.add(s).
  • For character c in n.next.keys(): Call colHelp again.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
public List<String> collect() {
List<String> result = new ArrayList<>();
for (char c : root.next.keys()) { // All the first character
String cs = c + "";
colHelp(cs, result, root.next.get(c)); // get(c) -> the first node
}
return result;
}

private void colHelp(String s, List<String> x, Node n) {
if (n == null) return;
if (n.isKey) {
x.add(s);
}
for (char c : n.next.keys()) {
colHelp(s + c, x, n.next.get(c));
}
}

keysWithPrefix()

Example: keysWithPrefix(“sa”) is [“sad”, “sam”, “same”, “sap”].

  • Find the node a corresponding to the prefix string pf.
  • Create an empty list x.
  • For character c in a.next.keys():
    • Call colHelp("sa" + c, x, a.next.get(c)).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
public List<String> keysWithPrefix(String pf) {
// Find the node a
Node p = root;
for (int i = 0; i < pf.length(); i += 1) {
char c = pf.charAt(i);
p = p.next.get(c);
if (p == null) break;
} // now p points to the node a

List<String> x = new ArrayList<>();
for (char c : p.next.keys()) {
colHelp(pf + c, x, p.next.get(c));
}
return x;
}

longestPrefixOf()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
public String longestPrefixOf(String s) {
if (s == null || s.length() == 0) return null;

Node p = root;
String longestPrefix = "";
for (int i = 0; i < s.length(); i += 1) {
char c = s.charAt(i);
// get the current node
p = root.next.get(c); // may be null
if (p == null) break;
// match
longestPrefix += Character.toString(c);
}
return longestPrefix;
}

Autocomplete

Example: When we type “how are” into Google, we get 10 results.

One way to do this is to create a Trie-Based Map from strings to values.

  • Value: Represents how important Google thinks that string is.
  • Can store billions of strings efficiently since they share nodes.
  • When a user types in a string “hello”, we:
    • Call keysWithPrefix("hello").
    • Return the 10 strings with the highest value.

More Efficient:

If we enter a short string, the number of keys with the appropriate prefix will be too big.

Solution: Each node stores its own value, as well as the value of its best substring.

Search will consider nodes in order of “best”.

  • Consider sp before sm (since the value of node p is 20).
  • Can stop when top 3 matches are all better than best remaining.

Hint: Use a PQ.

Even More Efficient:

Can also merge nodes that are redundant!

  • This version of trie is known as a radix tree or radix trie.

Bottom line: Data structures interact in beautiful and important ways!

Ex (Guide)

C level

Ex 1 (TST)

Problem 5 from Princeton’s Spring 2008 final.

A ternary search tree (also called a prefix tree) is a special trie data structure where the child nodes of a standard trie are ordered as a binary search tree.

Unlike tries where each node contains 26 pointers for its children, each node in a ternary search tree contains only 3 pointers:

  • The left pointer must point to a node whose character value is less than the current node.
  • The right pointer must point to a node whose character is greater than the current node.
  • The equal pointer points to the next character in the word.

Wikipedia: link

Problem:

Consider the following TST, where the black nodes correspond to strings in the TST.

(a) Which 7 strings (in alphabetical order) are in the TST?

aac, aat, ac, ca, ct, g, ta

(b) Draw the results of adding the following strings into the TST above: cgt, aaca, tt

Ex 2 (TST)

Problem 8 from Princeton’s Spring 2011 final.

Consider the following ternary search trie, with string keys and integer value.

Circle which one or more of the following strings are keys in the TST.

B level

Throughout, we define $R$ as the size of the alphabet, and $N$ as the number of items in a trie.

Ex 1 (Single Character String)

When looking for a single character string in a Trie, what is the worst case time to find that string in terms of $R$ and $N$?

$\Theta(1)$

Ex 2 (Why TST?)

Problem 5 from Princeton’s Fall 2009 final.

(a) List (in alphabetical order) the set of strings that were inserted.

ear, fo, his, hitch, hold, holdup, hotel, hum, humble, ill.

(c) List two compelling reasons why a programmer would use a TST instead of a red-black tree.

  • Faster, especially for search miss.
  • Support character-based operations such as prefix match (autocomplete), longest prefix, and wildcard match.

Ex 3 (Height of TST)

Problem 9 from Princeton’s Fall 2012 final.

Suppose that the following set of 11 strings are inserted into a TST (in some order).

CANNON, CAP, CHARTER, CLOISTER, COLONIAL, COTTAGE,
IVY, QUAD, TERRACE, TIGER, TOWER

(a) Suppose that the strings are inserted into the TST in alphabetical order. What is the height of the resulting TST?

The height is 9.

(b) Suppose that the strings are inserted into the TST in an order which minimizes the height. What is the height then?

Since COLONIAL and CLOISTER each contains 8 characters, at least one of them will end up at depth 8 (height = 7) or higher. To control the height, we can add COLONIAL and CLOISTER before other items.

Ex 4 (Height of TST)

Problem 1 from Hug’s Fall 2013 final.

(a) List the 8 strings in the TST in alphabetical order.

ID, IDD, IDDQD, IDKFA, XC, XD, XEA, XYZZY.

(b) Give an example of a minimum length string that will increase the height of the TST.

IDAAAA or XDAAAA or XCAAAA.

A level

Ex 1 (Questions)

Problem 9 from Hug’s Fall 2014 midterm 2.

0%