Friday, 24 April 2015

Constructing a Trie in F#

This post might get a bit more context next week, but essentially, in continuation of looking at data structures in F# and C# I picked prefix trees (tries) as the next step. A trie is similar to a BST except the search is cumulative. A simple example would be building a spellchecker. The spell checker can use the structure to see if a word exists quickly and efficiently.

It works by creating a new node for each part of the word, using existing nodes if they're already in place. You can see in this diagram an example structure after an insertion of several words.

In the above diagram the words ANT, AND, BATS have been added into the data structure. You'll notice that the word BATS can also be the word BAT and that becomes a part of the structure, setting a value at each node to state if this is a place that a word terminates.


1
2
3
4
 // Node type for storing a trie.  
 type TrieNode =   
   | Node of char * TrieNode list * bool  
   | Root of TrieNode list  

I defined the type to have 2 options. The root node, which we want to basically ignore after we've looked through it's children and the node for storing data. It has a char for storing the value, a list of sub nodes and a bool to state whether this is where a word terminates.

The first part is to get inserting nodes into the structure sorted. There are some similarities to the BST, but my first attempt was very convoluted, and eventually was cut down into this:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// Basic check for equality on a node against a character.  
 let matchChar c = (fun node -> match node with  
                 | Root(_) -> false  
                 | Node(v, _, _) -> v = c)  
    
 // Insert a new node into the trie.  
 let rec Insert (chars: char list) = function  
   // If this is the root node and it starts on a known character continue, otherwise add a new child, then continue.  
   | Root(list) -> if List.exists(matchChar chars.Head) list = false  
             then Root(list @ [(Node(chars.Head, [], false) |> Insert chars)])  
           else Root(List.map (fun node -> Insert chars node) list)  
   // If we've ran out letters it's now a word. If there's no children, make some, as we have more letters.  
   | Node(c, list, isword) when c = chars.Head -> if chars.Tail = []   
                             then Node(c, list, true)  
                           else if list = []   
                             then Node(c, [(Node(chars.Tail.Head, [], false) |> Insert chars.Tail)], isword)  
                           // If we need to add a child to continue, do so.  
                           else if List.exists(matchChar chars.Tail.Head) list = false   
                             then Node(c, list @ [(Node(chars.Tail.Head, [], false) |> Insert chars.Tail)], isword)  
                           // Continue down the tree.  
                           else Node (c, List.map (fun node -> Insert chars.Tail node) list, isword)  
   // Default case for and existing node.  
   | Node(c, list, isword) -> Node(c, list, isword)

There are probably some ways to simplify this further, but it does state the intention of the function clearly.

Much like with the BST, insertion was the trickier part of the important steps, and looking up a word was similar to insertion but simpler. We just want to keep going down the tree until we either run out of letters for the word we're looking up, in which case we want to check for a flag to see if this is a word termination node, or until we run out of tree, in which case it's obviously not a word.


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
let Exists (word: string, tree: TrieNode) =   
   // Turn the string into a list of char.  
   let wordChars = [for character in word -> character]  
    
   // Look through the tree for the word.  
   let rec find (chars: char list) = function  
     // If it's the root, look at all the child nodes.  
     | Root(list) -> if list = []   
               then false  
             else list |> List.exists (fun node -> find chars node)  
     // If we've found the last character, is it a word? If there's no more tree, it's not a word, otherwise keep searching.  
     | Node(c, list, isword) when c = chars.Head -> if chars.Tail = []  
                               then isword  
                             else if list = []  
                               then false  
                             else list |> List.exists (fun node -> find chars.Tail node)  
     | Node(c, list, isword) -> false  
    
   // Find the word.  
   find wordChars tree  

Some testing verified that it behaved as expected, with a fairly simple program and a few breakpoints.


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
 [<EntryPoint>]  
 let main argv =   
   let root = Root([]);  
   let b = Insert ['a';'b'] root  
   let c = Insert ['b';'c'] b  
   let d = Insert ['a';'b';'c'] c  
   let e = Insert ['a';'b';'d'] d  
   let f = Insert ['a';'b';'c';'d';'e';'f';'g';'h'] e  
   let exists = Exists ("abd", f)  
   
   0 // return an integer exit code  

Hopefully in the next post, we'll use it for something a bit more interesting.