02.10.07
One Last Boggle Boost
Last time, we used Tries to get a 20x speedup of an already-fast Boggle program. This time, we’ll ferret out the largest remaining bottleneck to produce an even faster Boggler.
What are the main operations that we perform in solving a Boggle board? There’s 1) word lookup, 2) checking if we’ve previously found a word, 3) checking if a letter starts a word and 4) recursion. Here’s how much time they take:
Operation | Complexity |
---|---|
Word Lookup | Memory access (Trie::CompletesWord ) |
Previously found? | Hash lookup/insertion |
Starts a word? | Memory access (Trie::StartsWord ) |
Recursion | Subroutine call |
We’re not going to do better than a memory access or a subroutine call. But who knows how long a hash lookup/insert takes? It’s O(1), but who knows what that constant factor is? It’s certainly more time than a memory access. Let’s see if we can replace that hash lookup/insert with another memory access. We’ll go through a couple of ideas.
Trie annotation
The Trie reduces CompletesWord
and StartsWord
to memory lookups by storing their results in each node (the is_word
and children
properties). So why not add another boolean property, found
, to each Trie node.
This will clearly reduce the check for whether we’ve found a word to a memory access. The problem is that, if we’ve found a word on one board, we haven’t necessarily found it on the next. So when we’re done solving a board, we need to go back and clean up the Trie. That requires another data structure to track which Trie nodes we’ve marked and an extra loop to reset their found
property after every board. That’s at least four memory accesses, not just one.
What’s more, this approach destroys the thread safety of the Trie. Since the Boggler can write to the Trie as well as read it, multiple solvers could easily mess each other up.
This approach works fairly well in practice, but it’s more complex than it seems at first.
Bit Vector
This is inspired by the thread safety argument against Trie annotation. Instead of storing the found bit in the Trie, why not store it somewhere else? Each thread will have a store, so thread safety isn’t an issue. This approach isn’t so different than the hash table approach we used last time, except that it uses a bit vector so it’s faster but uses more memory.
This approach has many drawbacks. First of all, how do you map pointer addresses to locations in a bit vector? There’s a lot of places that a pointer can point to, and there needs to be a bit in the bit vector for all of them. In practice, we’d measure the extent of the Trie in memory and adjust for memory alignment and the number of bits in a word. Unfortunately, you still wind up with a bit vector that’s not that much smaller than the Trie itself.
The bit vector has the same clearing problem after the board has been solved. Using bzero
to clear the whole vector every time has a dramatic effect on perf, so you wind up using an extra array to track which bits you’ve set, just like the annotation strategy. This is way more complex than the first idea, but it is thread-safe.
Annotation revisited
This approach took me forever to come up with, but I think it’s the best.
Instead of annotating the Trie with a boolean found
value, we can annotate it with a full 32-bit integer. The Boggler keeps track of how many boards it’s solved and stores that value in the Trie node of each word it finds. When it finds a word again, it compares the found
value in that node against the board number. If they’re the same, that word has been found on this board. Otherwise it hasn’t. The beauty of this approach is that we don’t need to clean up after ourselves. It’s fine to leave a mess in the Trie, because that mess will never equal the next board number.
We’ll have to do a complete sweep of the Trie to clean out the board numbers when we overflow the board counter, but that happens only once every 2^32 = 4 billion boards, so it’s not a serious concern.
This approach is the simplest, since we don’t need any auxiliary data structures. It’s also the fastest, since we don’t have to clean up after ourselves. But it’s not thread-safe, which is its only major drawback.
The code
Here’s the new stuff in trie.h:
inline void Mark(unsigned mark) { mark_ = mark; } inline unsigned Mark() const { return mark_; } unsigned mark_;
We need to add a variable to count the number of boards we’ve processed in boggler.h:
unsigned num_boards_;
And here’s the goodies from boggler.cc:
void ClearMarks(Trie* t) { // NEW t->Mark(0); // NEW for (int i=0; i<kNumLetters; i++) // NEW if (t->StartsWord(i)) // NEW ClearMarks(t->Descend(i)); // NEW } // Returns the score from this portion of the search void Boggler::Solve(int x, int y, int len, Trie* t) { prev_[x][y] = true; len += (bd_[x][y]=='q'-'a' ? 2 : 1); if (t->IsWord()) { if (t->Mark() != num_boards_) { // NEW score_ += kWordScores[len]; // NEW t->Mark(num_boards_); // NEW } } for (int dx=-1; dx<=1; ++dx) { int cx = x + dx; if (cx<0 || cx>3) continue; for (int dy=-1; dy<=1; ++dy) { int cy = y + dy; if (cy<0 || cy>3) continue; if (prev_[cx][cy]) continue; if (t->StartsWord(bd_[cx][cy])) Solve(cx, cy, len, t->Descend(bd_[cx][cy])); } } prev_[x][y] = false; } int Boggler::Score(const char* letters) { if (!ParseBoard(letters)) return -1; score_ = 0; for (int i=0; i<16; ++i) Solve(i/4, i%4, 0, dict_->Descend(bd_[i/4][i%4])); num_boards_++; // NEW if (!num_boards_) { // NEW ClearMarks(dict_); // NEW num_boards_ = 1; // NEW } return score_; }
That’s about a 15 line change total. What does it buy us?
$ ./perf Loaded 172203 words. Testing perf on 50000 boards... 50000 boards in 4.82284 secs => 10367.35 bds/sec Average board score: 141.1734 pts/bd
That’s another 65% speedup from last time. It’s certainly not 20x, but impressive nonetheless.
Now that we’ve got a lightning-fast Boggle solver we can solving interesting problems with it. Until next time, here’s the source.
To recap, here’s how far we’ve come:
Program | Speed | Change | Improvement |
---|---|---|---|
Original | .004 b/s | - | Really simple Ruby |
Stems | 17.5 b/s | 4000x | Used stems to prune the DFS |
C++ | 310 b/s | 18x | Converted Ruby to C++ |
Tries | 6271 b/s | 20x | Use Tries instead of hashes |
Final | 10367 b/s | 65% | Annotate the Trie with board counter |
That’s a factor of six million improvement since our first try. It’s interesting to note that the final 600x of improvements all came from replacing O(1) operations with other O(1) operations.
danvk.org » How many Boggle boards are there? said,
August 2, 2007 at 8:45 pm
[...] Last time, we completed a board solver that could score 10,000 boards per second. So what to do with all that power? One option was to compute statistics on random boards. Another is to hunt for the holy grail of Boggle: the highest possible scoring Boggle board. The next few posts will be devoted to the search for this board. [...]
danvk.org » Solving Boggle by Taking Option Three said,
August 4, 2009 at 6:23 pm
[...] fewer points. In a previous post, I estimated that there were 2^69 possible boggle boards. At 10,000 boards/second, this would take 1.9 billion years of compute [...]
danvk.org » Breaking 3×3 Boggle said,
August 8, 2009 at 10:35 am
[...] there are so many boards to consider: 2^72 for the 4×4 case and 2^40 for the 3×3 case. At 10,000 boards/second former corresponds to about 2 billion years of compute time, and the latter just two years. Just [...]
Joe Rice said,
February 26, 2012 at 12:21 pm
Thanks for the tip about Tries. I am now using them in the solver for the game engine for my Facebook game called BitWordy. I’ve included an acknowledgement to “the person behind danvk.org” in my about box. Let me know if you’d like something more specific.