danvk.org blog 2025-02-13T22:18:44+00:00 https://danvk.github.io/ Dan Vanderkam danvdk@gmail.com Boggle Revisited: New Ideas in 2025 2025-02-13T00:00:00+00:00 https://danvk.github.io//2025/02/13/boggle2025 <p>Over the past few weeks I’ve revisited a 15-year old project of mine: trying to find the globally optimal <a href="https://en.wikipedia.org/wiki/Boggle">Boggle</a> board. In the <a href="https://www.danvk.org/2025/02/10/boggle34.html">last post</a>, I recapped the work I did in 2009 to find the globally-optimal 3x3 Boggle board.</p> <p>In this post, I’ll present a few optimizations I found in 2025 that add up to something like a 10x speed boost over the 2009 approach. Between better algorithms, faster CPUs, and the widespread availability of large cloud machines, I’ve now been able to find the globally-optimal 3x4 Boggle board.</p> <p>With an impressive 1,651 points, here it is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S L P I A E N T R D E S </code></pre></div></div> <p>This board is chock full of big words, including REPAINTED and STRAINED.</p> <p>A real Boggle board, of course, is 4x4 or even 5x5. Sadly, those problems still remain out of reach. The final post in this series will look at ideas on how to tackle them.</p> <p>The code for this post lives in the <img alt="GitHub Logo" src="/images/github-mark.svg" height="16" /> <a href="https://github.com/danvk/hybrid-boggle">danvk/hybrid-boggle</a> repo.</p> <h2 id="new-ideas-in-2025">New ideas in 2025</h2> <p>I explored many new ideas for speeding up Boggle solving, but there were five that panned out:</p> <ol> <li>Play a slightly different version of Boggle where you can find the same word as many times as you like.</li> <li>Build the “evaluation tree” used to calculate the max/no-mark bound explicitly in memory.</li> <li>Implement “pivot” and “lift” operations on this tree to synchronize letter choices across subtrees.</li> <li>Aggressively compress and de-duplicate the evaluation tree.</li> <li>Use three letter classes instead of four.</li> </ol> <h3 id="multi-boggle">Multi-Boggle</h3> <p>Looking at the <a href="https://www.danvk.org/wp/2009-08-08/breaking-3x3-boggle/index.html">best 3x3 board</a>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>P E R L A T D E S </code></pre></div></div> <p>There are two different ways to find the word LATE:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>P E R P E\R L-A-T L-A-T D E/S D E S </code></pre></div></div> <p>In regular Boggle you’d only get points for finding LATE once. But for our purposes, this will wind up being a global constraint that’s hard to enforce. Instead, we just give you two points for it. We’ll call this “Multi-Boggle”. The score of a board in Multi-Boggle is always higher than its score in regular Boggle, so it’s still an <a href="https://www.danvk.org/wp/2009-08-08/breaking-3x3-boggle/index.html#:~:text=Upper-,Bounds,-Now%20on%20to">upper bound</a>.</p> <p>If there are no repeat letters on the board, then the score is the same as in regular Boggle. 
In other words, while you can find LATE twice, you still can’t find LATTE because there’s only one T on the board.</p> <p>In practical terms, this means that we’re going to focus solely on the <a href="https://www.danvk.org/wp/2009-08-11/a-few-more-boggle-examples/index.html">max/no-mark</a> bound and forget about the <a href="https://www.danvk.org/wp/2009-08-11/some-maxno-mark-examples/index.html">sum/union</a> bound. The max/no-mark bound for a concrete board (one with a single letter on each cell) is its Multi-Boggle score.</p> <h3 id="the-evaluation-tree">The Evaluation Tree</h3> <p>Recall from the last post that we started by calculating an upper bound on one large board class:</p> <style> td, th { border: 1px solid #777; padding: 0.25em; } table { margin-bottom: 10px; } </style> <table> <tbody> <tr> <td>lnrsy</td> <td>chkmpt</td> <td>lnrsy</td> </tr> <tr> <td>aeiou</td> <td>aeiou</td> <td>aeiou</td> </tr> <tr> <td>chkmpt</td> <td>lnrsy</td> <td>bdfgjvwxz</td> </tr> </tbody> </table> <p>and then split it up into five smaller board classes, one with each vowel in the center cell, to reduce the upper bound:</p> <style> #five-boards > tbody > tr > td { border: none; } </style> <table id="five-boards"> <tr> <td> <table> <tr><td>lnrsy</td><td>chkmpt</td><td>lnrsy</td></tr> <tr><td>aeiou</td><td>a</td><td>aeiou</td></tr> <tr><td>chkmpt</td><td>lnrsy</td><td>bdfgjvwxz</td></tr> </table> </td> <td> <table> <tr><td>lnrsy</td><td>chkmpt</td><td>lnrsy</td></tr> <tr><td>aeiou</td><td>e</td><td>aeiou</td></tr> <tr><td>chkmpt</td><td>lnrsy</td><td>bdfgjvwxz</td></tr> </table> </td> <td>...</td> <td> <table> <tr><td>lnrsy</td><td>chkmpt</td><td>lnrsy</td></tr> <tr><td>aeiou</td><td>u</td><td>aeiou</td></tr> <tr><td>chkmpt</td><td>lnrsy</td><td>bdfgjvwxz</td></tr> </table> </td> </tr> </table> <p>The fundamental inefficiency in <a href="https://www.danvk.org/2025/02/10/boggle34.html">this approach</a> is that it results in an enormous amount of duplicated work. These five board classes have a lot in common. Every word that doesn’t go through the middle cell is identical. It would be nice if we could avoid repeating that work.</p> <p>Back in 2009, I <a href="https://github.com/danvk/performance-boggle/blob/master/3x3/ibuckets.cc">implemented</a> the max/no-mark upper bound by recursively searching over the board and the dictionary <a href="https://www.danvk.org/wp/2007-02-06/in-which-we-discover-that-tries-are-awesome/index.html">Trie</a>. This was a natural generalization of the way you score a concrete Boggle board. It didn’t use much memory, but it also didn’t leave much room for improvement.</p> <p>You can visualize a series of recursive function calls as a tree. The key advance in 2025 is to form this tree explicitly in memory. This is more expensive, but it gives us a lot of options to speed up subsequent steps.</p> <p>Here’s an example of what one of these trees looks like:</p> <p><img src="/images/boggle-2x3-tree.svg" alt="Evaluation tree for a 2x3 Boggle board" /></p> <p>This is visualizing a 2x3 board class, with the cells numbered 0-5:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> T I 0 3 AE . 1 . R . 2 . </code></pre></div></div> <p>The top “ROOT” indicates that you have to start on one of the four cells. “CH” ovals indicate that you have to make a choice on a cell, rectangles indicate what those choices are (A or E?) and double-outlined rectangles indicate complete words. You can read words by following a path down the tree. 
From the left we have: TAR, TIE, TIER, AIT, RAT, RET, REI. (It’s news to me that <a href="https://en.wiktionary.org/wiki/ait">AIT</a> is a word!)</p> <p>To get a bound from a tree, you take the sum of your children on rectangular nodes and the max of your children on oval (choice) nodes. Double-outlined boxes indicate how many points they’re worth. In this case the bound at the root is 3, coming from the top branch.</p> <p>This is a small tree with 30 nodes, but in practice they can be quite large. The tree for the 3x3 board class we’ve been looking at has 520,947 nodes and the 3x4 and 4x4 trees can be much larger.</p> <p>I actually tried <a href="https://github.com/danvk/performance-boggle/blob/master/tree_tool.cc">building these trees</a> in 2009, but I <a href="https://github.com/danvk/performance-boggle/commit/ec55e1c55cb1e5ad66e0784e3bd318a59c8812af">abandoned it</a> because I wasn’t seeing a big enough speedup in subsequent steps (scoring boards with a split cell) to justify the cost of building the tree.</p> <p>What did I miss in 2009? Sadly, I had a <a href="https://github.com/danvk/performance-boggle/blob/2710062fca308b93a6ee6a19980d6bcb4218b6e8/breaking_tree.cc#L34">TODO</a> that turned out to be critical: rather than pruning out subtrees that don’t lead to points in a second pass, it’s much faster and results in less memory fragmentation if you do it as you build the tree. A 33% speedup becomes a 2x speedup. Maybe if I’d discovered that in 2009, I would have kept going!</p> <p>The other discovery was that there’s a more important operation to optimize.</p> <h3 id="pivot-and-lift-operations">“Pivot” and “Lift” operations</h3> <p>After we found an upper bound on the 3x3 board class, the next operation was to split up the middle cell and consider each of those five (smaller) board classes individually. Now that we have a tree, the question becomes: how do you go from the initial tree to the trees for each of those five board classes?</p> <p>There’s another way to think about this problem. Why is the max/no-mark bound imprecise? Why doesn’t it get us the score of the best board in the class? Its flaw is that you don’t have to make consistent choices across different subtrees. You can see this by zooming in on the “0=t” subtree from the previous graph:</p> <!-- TODO: add bounds to these next few images --> <p><img src="/images/boggle-focus-t0.svg" alt="A subtree with choices made in inconsistent orders" /></p> <p>The bound on this tree is 3 (sum all the words). On the top branch (with a “T-” prefix), it makes the most sense to choose “A” for cell 1, so that you can spell “TAR.” But on the bottom branch (with a “TI-” prefix), it makes more sense to choose “E” so that you can spell “TIE” and “TIER.”</p> <p>Of course, the cell has to be either A or E. It can’t be both. The problem is that these choices happen far apart in the tree, so they’re not synchronized. If we adjusted the tree so that the first thing you did was make a choice for cell 1, then the subtrees would all be synchronized and the bound would go down:</p> <p><img src="/images/lift-choice1.svg" alt="The same subtree with choice 1 made first" /></p> <p>This represents the same tree, except that the choice on cell 1 has been pushed to the left. The bound is now 2, not 3 (you have to pick 1 point from the top branch or 2 points from the bottom branch).</p> <p>What we need is a “pivot” operation to lift a particular choice node up to the top of the tree. 
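<p>To make the rest of this section concrete, here’s a minimal Python sketch of the nodes in one of these trees and the bound they produce. The <code class="language-plaintext highlighter-rouge">SumNode</code> and <code class="language-plaintext highlighter-rouge">ChoiceNode</code> names are mine (the real hybrid-boggle structures are built in C++), so treat this as an illustration rather than the repo’s API.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass, field

@dataclass
class SumNode:             # rectangles: add points and children together
    points: int = 0        # points for any words completed at this node
    children: list = field(default_factory=list)

@dataclass
class ChoiceNode:          # ovals: pick one letter for this cell
    cell: int
    children: dict = field(default_factory=dict)  # one subtree per letter

def bound(node):
    """Max/no-mark upper bound for the subtree rooted at this node."""
    if isinstance(node, SumNode):
        return node.points + sum(bound(c) for c in node.children)
    return max((bound(c) for c in node.children.values()), default=0)
</code></pre></div></div>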
You can work out how to do this for each type of node.</p> <p>First of all, if a subtree doesn’t involve a choice for cell N, then we don’t have to change it. Easy.</p> <p>For the other two types of nodes (sum and choice), it helps a lot to draw the lift operation. Here’s a sum node (labeled “ROOT”).</p> <p><img src="/images/boggle/sum-lift-before.svg" alt="Sum node before lifting" /></p> <p>We’d like to pivot the tree so that the choice on cell 1 is at the root. Here’s what that looks like:</p> <p><img src="/images/boggle/sum-lift-after.svg" alt="Sum node after lifting" /></p> <p>The tree has gotten bigger and a little more complicated. That’s typical. We’ll be able to improve this, but more on that in a moment.</p> <p>Here’s a choice node:</p> <p><img src="/images/boggle/choice-lift-before.svg" alt="Choice node before lifting" /></p> <p>Here’s what it looks like after pivoting the choice on 1 to the root:</p> <p><img src="/images/boggle/choice-lift-after.svg" alt="Choice node after lifting" /></p> <p>Again, the tree has gotten more complex. In particular, notice how the <code class="language-plaintext highlighter-rouge">0=c</code> node with three points has been duplicated. This blowup is the cost of pivoting. The payoff is reduced bounds.</p> <p>If you lift the choice for the middle cell of the 3x3 board all the way to the top of the tree, you’ll wind up with this:</p> <!-- Maybe just rotate this one, too? --> <p><img src="/images/boggle-1-aeiou.svg" alt="Top of tree with choice of middle cell first" /></p> <p>There’s a choice node with five sum nodes below it, and the bound is lower. Now if you lift another cell to the top, you’ll get two layers of choice nodes with sum nodes below them:</p> <p><img src="/images/boggle-lift-4-5.svg" alt="Top of tree with two choices lifted" /></p> <p>I’ve rotated the tree because it’s already getting big. As before, the bound is lower. If you keep doing this, you build up a “pyramid” of choice nodes at the top of the tree. If the bound on any of these nodes drops below the highest score we know about, we can prune it out. This is equivalent to the “stop” condition from the 2009 algorithm, it’s just that we’re doing it in tree form.</p> <p>This “lift” operation is not cheap. The cost of making choices in an unnatural order is that the tree gets larger. Here’s the tree size if you lift all nine cells of the 3x3 board, along with the bound:</p> <table> <thead> <tr> <th>Step</th> <th>Nodes</th> <th>Bound</th> </tr> </thead> <tbody> <tr> <td>(init)</td> <td>520,947</td> <td>9,359</td> </tr> <tr> <td>1</td> <td>702,300</td> <td>6,979</td> </tr> <tr> <td>2</td> <td>1,315,452</td> <td>5,334</td> </tr> <tr> <td>3</td> <td>2,527,251</td> <td>4,069</td> </tr> <tr> <td>4</td> <td>5,158,477</td> <td>3,047</td> </tr> <tr> <td>5</td> <td>8,395,605</td> <td>2,318</td> </tr> <tr> <td>6</td> <td>14,889,665</td> <td>1,774</td> </tr> <tr> <td>7</td> <td>18,719,619</td> <td>1,373</td> </tr> <tr> <td>8</td> <td>11,205,272</td> <td>1,037</td> </tr> <tr> <td>9</td> <td>4,143,221</td> <td>804</td> </tr> </tbody> </table> <p>The node count goes up slowly, then more rapidly, and then it comes down again as we’re able to prune more subtrees. It increases a lot before it comes back down. For this board, there’s a 36x increase in the number of nodes after 7 lifts. At the end of this, there’s only 428 concrete boards (out of 5,625,000 boards in the class) that we need to test.</p> <p>Does 4 million nodes seem like a lot for only 428 Boggle boards? It is. 
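<p>Part of the reason is that the simplest way to express a lift copies whole subtrees. Reusing the toy <code class="language-plaintext highlighter-rouge">SumNode</code>/<code class="language-plaintext highlighter-rouge">ChoiceNode</code> sketch from above (again, my names, not the repo’s), the semantics are roughly:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def force(node, cell, letter):
    """Copy of the subtree with every choice on `cell` pinned to `letter`."""
    if isinstance(node, ChoiceNode):
        if node.cell == cell:
            child = node.children.get(letter)
            return force(child, cell, letter) if child else SumNode()
        return ChoiceNode(node.cell,
                          {k: force(v, cell, letter) for k, v in node.children.items()})
    return SumNode(node.points, [force(c, cell, letter) for c in node.children])

def lift(node, cell, letters):
    """Pivot the choice for `cell` up to the root of the tree."""
    return ChoiceNode(cell, {let: force(node, cell, let) for let in letters})
</code></pre></div></div>
<p>The actual implementation works on one node at a time, as the diagrams above show, rather than copying the whole tree at once. Either way, a naive lift duplicates a lot of structure.</p>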
There are a few important tweaks we can make to keep the tree small as we lift choices.</p> <h3 id="compression-and-de-duping">Compression and De-duping</h3> <p>Keeping the tree as small as possible is essential for solving Boggle quickly. There are two inefficient patterns we can identify and fix in our trees to keep them compact.</p> <ol> <li>Collapse chains of sum nodes into a single sum node.</li> <li>Merge sibling choice nodes.</li> </ol> <p>There’s no point in having trees of sum nodes without choice nodes in between them. We may as well add all their points and children to a single, bigger sum node. Here’s a tree we had after lifting through a sum node earlier:</p> <p><img src="/images/boggle/sum-lift-after.svg" alt="Tree after lifting through sum node" /></p> <p>Here’s what it looks like after collapsing sum nodes:</p> <p><img src="/images/boggle/sum-lift-compress.svg" alt="Tree after compressing sum nodes" /></p> <p>The points from the <code class="language-plaintext highlighter-rouge">0=a</code> and <code class="language-plaintext highlighter-rouge">0=b</code> nodes have moved into their parents, which lets us delete those two nodes, which makes the tree smaller.</p> <p>Here are the node counts when you add compression after each pivot:</p> <table> <thead> <tr> <th>Step</th> <th>Nodes</th> </tr> </thead> <tbody> <tr> <td>(init)</td> <td>520,947</td> </tr> <tr> <td>1</td> <td>669,156</td> </tr> <tr> <td>2</td> <td>1,054,515</td> </tr> <tr> <td>3</td> <td>1,726,735</td> </tr> <tr> <td>4</td> <td>2,675,250</td> </tr> <tr> <td>5</td> <td>2,620,720</td> </tr> <tr> <td>6</td> <td>1,420,925</td> </tr> <tr> <td>7</td> <td>301,499</td> </tr> <tr> <td>8</td> <td>39,667</td> </tr> <tr> <td>9</td> <td>9,621</td> </tr> </tbody> </table> <p>These are considerably better. After 7 lifts, we’ve gone from nearly 19 million nodes to a mere 300,000. The maximum increase is now only 5-6x. Compression on its own is a 2-3x speedup.</p> <p>Remember how we switched to playing “Multi-Boggle” earlier? This is where that change is crucial. If we had to track individual words, we couldn’t collapse sum nodes because they’d each reflect different words, and we’d need to keep track of that for the sum/union bound. But with Multi-Boggle, we’re free to collapse TIE and TIER into a single word that’s worth 2 points.</p> <p>Here’s a tree where we can merge sibling choice nodes:</p> <p><img src="/images/boggle/merge-choices-before.svg" alt="Tree with duplicate sibling choice nodes" /></p> <p>There are two choices for cell 0 that are siblings under the root node. The first is a choice between <code class="language-plaintext highlighter-rouge">0=a</code> and <code class="language-plaintext highlighter-rouge">0=b</code> (<code class="language-plaintext highlighter-rouge">0=c</code> nets zero points). The second is a choice between <code class="language-plaintext highlighter-rouge">0=b</code> and <code class="language-plaintext highlighter-rouge">0=c</code>. There’s no reason we should make those choices independently. They’re the same choice, and the tree should reflect that. We can implement that by merging the trees:</p> <p><img src="/images/boggle/merge-choices-after.svg" alt="Tree with duplicate sibling choice nodes merged" /></p> <p>There are fewer nodes now and (exciting!) the bound has gone down because we’ve synchronized two choices that were previously made independently. Subtree merging can be done recursively.</p> <p>The “lift” operation expands the tree because it can duplicate nodes or entire subtrees. 
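<p>Sticking with the toy sketch from earlier, the two clean-up passes described above might look something like this (a sketch of the idea, not the actual hybrid-boggle code):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def compress(node):
    """Collapse chains of sum nodes and merge sibling choices on the same cell."""
    if isinstance(node, ChoiceNode):
        return ChoiceNode(node.cell, {k: compress(v) for k, v in node.children.items()})

    points, choices = node.points, []
    for child in (compress(c) for c in node.children):
        if isinstance(child, SumNode):
            # A compressed sum node only has choice nodes left below it,
            # so its points and children can be absorbed directly.
            points += child.points
            choices.extend(child.children)
        else:
            choices.append(child)

    merged = {}  # maps a cell number to the single choice node for that cell
    for ch in choices:
        if ch.cell not in merged:
            merged[ch.cell] = ch
            continue
        # Two siblings that choose the same cell: merge them into one choice.
        for letter, sub in ch.children.items():
            prev = merged[ch.cell].children.get(letter)
            merged[ch.cell].children[letter] = (
                compress(SumNode(0, [prev, sub])) if prev else sub)
    return SumNode(points, list(merged.values()))
</code></pre></div></div>
<p>Even with this clean-up, lifting still produces many structurally identical copies of the same subtree.</p>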
Another optimization is to de-duplicate these, and make sure we only ever operate on unique nodes. This can be done by computing a hash for each node. Here’s a tree with structurally identical nodes marked in red:</p> <!-- poetry run python -m boggle.make_dot '. . . . rs e ae o .' 4 6 7 --> <p><img src="/images/boggle/tree-dedupe.svg" alt="Tree with duplicate nodes marked in red" /></p> <p>You can scan over the tree to find the canonical versions of each red subtree. For example, the red “4 CH” in the middle reads “4 CH - 4=r - 5 CH - 5=e (1)”, which is exactly the same as the line right above it.</p> <table> <thead> <tr> <th>Step</th> <th>Unique Nodes</th> </tr> </thead> <tbody> <tr> <td>(init)</td> <td>98,453</td> </tr> <tr> <td>1</td> <td>117,602</td> </tr> <tr> <td>2</td> <td>215,121</td> </tr> <tr> <td>3</td> <td>318,088</td> </tr> <tr> <td>4</td> <td>592,339</td> </tr> <tr> <td>5</td> <td>754,947</td> </tr> <tr> <td>6</td> <td>481,449</td> </tr> <tr> <td>7</td> <td>125,277</td> </tr> <tr> <td>8</td> <td>27,125</td> </tr> <tr> <td>9</td> <td>9,613</td> </tr> </tbody> </table> <p>Whereas compression is more effective at reducing node counts after many lifts, de-duplication is better at reducing (unique) node counts initially and after fewer lifts. Only processing each unique node once can potentially save us a lot of time.</p> <p>One way to think about this is that it allows us to <a href="https://en.wikipedia.org/wiki/Memoization">memoize</a> the pivot operation. Another is that it turns the tree into a <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">DAG</a>, similar to how you can compress a Trie by turning it into a <a href="https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton">DAWG/DAFSA</a> (Directed Acyclic Word Graph). Visualizing it as a DAG doesn’t work very well — there’s just too many crossing lines.</p> <h3 id="use-three-letter-classes-instead-of-four">Use three letter classes instead of four</h3> <p>The net effect of all these changes is that we’re able to “break” difficult board classes much more efficiently. For example, this 3x4 board class:</p> <table> <tbody> <tr> <td>lnrsy</td> <td>aeiou</td> <td>bdfgjqvwxz</td> </tr> <tr> <td>lnrsy</td> <td>aeiou</td> <td>aeiou</td> </tr> <tr> <td>chkmpt</td> <td>lnrsy</td> <td>lnrsy</td> </tr> <tr> <td>bdfgjqvwxz</td> <td>aeiou</td> <td>aeiou</td> </tr> </tbody> </table> <p>takes about 14 seconds to break using the 2009 technique but only 3 seconds to break using the tree techniques described above. Lifting five times reduces the bound from 51,639 to 6,695, at which point we can switch over to the 2009 technique.</p> <p>Compare that with an easier board:</p> <table> <tbody> <tr> <td>lnrsy</td> <td>lnrsy</td> <td>bdfgjqvwxz</td> </tr> <tr> <td>aeiou</td> <td>bdfgjqvwxz</td> <td>bdfgjqvwxz</td> </tr> <tr> <td>lnrsy</td> <td>aeiou</td> <td>chkmpt</td> </tr> <tr> <td>chkmpt</td> <td>lnrsy</td> <td>bdfgjqvwxz</td> </tr> </tbody> </table> <p>This takes 0.16 seconds with the 2009 technique and 0.12 seconds with the tree technique. It’s a win, but much less of a win. (The five lifts on this one reduce the bound from 8,138 to 1,231.)</p> <!-- Easy: /usr/bin/time -l poetry run python -m boggle.break_all 'bdfgjqvwxz aeiou lnrsy chkmpt' 1600 --size 34 --board_id 1412524 --breaker hybrid --switchover_level 5 --free_after_score Hard: 6579514 --> <p>If trees are most helpful on hard board classes, maybe we should have more of them? 
If we use three letter buckets instead of four, it significantly reduces the number of board classes we need to consider:</p> <ul> <li>Four letter buckets: 4^12/4 ≈ 4.2M board classes</li> <li>Three letter buckets: 3^12/4 ≈ 133k board classes</li> </ul> <p>The board classes with three letter buckets are going to be bigger and harder to break. But with our new tools, these are exactly the sort of boards on which we get the biggest improvement. So long as the average breaking time doesn’t go up by more than a factor of ~32x (4.2M/133k), using three buckets will be a win.</p> <table> <thead> <tr> <th>Code Year</th> <th>Buckets</th> <th>Pace (s/class)</th> </tr> </thead> <tbody> <tr> <td>2009</td> <td>4</td> <td>4.53</td> </tr> <tr> <td>2009</td> <td>3</td> <td>33.336 (7.4x slower)</td> </tr> <tr> <td>2025</td> <td>4</td> <td>1.175</td> </tr> <tr> <td>2025</td> <td>3</td> <td>5.321 (4.5x slower)</td> </tr> </tbody> </table> <p>So three buckets would have been better with the 2009 algorithm, but it’s an even bigger win in 2025. Since 32 / 4.5 ≈ 7, we’d expect this to be around a 7x speedup.</p> <p>Why not keep going to two classes, or even just one? The cost is memory and reduced parallelism. “Chunkier” board classes require bigger trees, and RAM is a finite resource. Moreover, the fewer letter buckets we use, the more skewed the distribution of breaking times gets. Some board classes remain trivial to break (ones with all consonants, for example), but others are real beasts. On the full 3x4 run, the fastest board was broken in 0.003s whereas the slowest took 2297s. It’s harder to distribute these uneven tasks evenly across many cores or many machines to get the full speedup you expect from distribution. I think using slightly bigger chunks could still help, say two classes (consonant/vowel) in the corners, three on the edges and five in the middle.</p> <h2 id="putting-it-all-together">Putting it all together</h2> <p>For each board class, the 2025 algorithm is:</p> <ol> <li>Build the evaluation tree.</li> <li>“Lift” a few choice cells to the top.</li> <li>Continue splitting cells without modifying the tree, à la the 2009 approach.</li> </ol> <p>The right number of “lifts” depends on the board and the amount of memory you have available. Harder boards benefit from more lifts, but this takes more memory.</p> <p>Using this approach, I was able to evaluate all the 3x4 Boggle board classes on a <a href="https://cloud.google.com/compute/docs/general-purpose-machines#c4_series">192-core C4 cloud instance</a> in 8–9 hours, roughly $100 of compute time. The results? There are exactly five boards that score more than 1600 points with the ENABLE2K word list:</p> <ul> <li>srepetaldnis (1651)</li> <li>srepetaldnic (1614)</li> <li>srepetaldnib (1613)</li> <li>sresetaldnib (1607)</li> <li>sresetaldnip (1607)</li> </ul> <p>The best one is the same one I found through simulated annealing. The others are 1-2 character variations on it. It would have been more exciting if there were a new, never-before-seen board. But we shouldn’t be too surprised that simulated annealing found the global optimum. After all, it did for 3x3 Boggle. And Boggle is “smooth” in the sense that similar boards tend to have similar scores. It would be hard for a great board to “hide” far away from any other good boards.</p> <h2 id="next-steps">Next Steps</h2> <p>This is an exciting result! 
After 15 years, it is meaningful progress towards the goal of finding the globally-optimal 4x4 Boggle board.</p> <p>There are still many optimizations that could be made. My 2025 code only wound up being a 3-4x speedup vs. my 2009 code when I ran it on the 192-core machine. This was because I had to dial back a few of the optimizations because I kept running out of memory. So <a href="https://github.com/danvk/hybrid-boggle/issues/12">changes that reduce memory usage</a> would likely be the most impactful.</p> <p>On the other hand, I don’t think there’s any tweaks to my current approach that will yield more than a 10x performance improvement. So while I might be able to break 3x4 Boggle more efficiently, it’s not going to make a big dent in 4x4 Boggle. Remember that 50,000x increase in difficulty from earlier. For 4x4 Boggle, we still need a different approach. (Or $500,000 of compute time!)</p> <p>In the next and final post, I’ll talk about a few ideas that might help to finally crack 4x4 Boggle. I’ll also share some thoughts on [pybind11], picking up an old project, and the experience of working on a hard optimization problem.</p> <p>You can find all the code for this post in the <img alt="GitHub Logo" src="/images/github-mark.svg" height="16" /> <a href="https://github.com/danvk/hybrid-boggle">danvk/hybrid-boggle</a> repo.</p> Boggle Revisited: Finding the Globally-Optimal 3x4 Boggle Board 2025-02-10T00:00:00+00:00 https://danvk.github.io//2025/02/10/boggle34 <p>Over 15 years ago (!) I wrote a series of <a href="https://www.danvk.org/wp/category/boggle/">blog posts</a> about the board game Boggle. <a href="https://en.wikipedia.org/wiki/Boggle">Boggle</a> is a word search game created by Hasbro. The goal is to find as many words as possible on a 4x4 grid. You may connect letters in any direction (including diagonals) but you may not use the same letter twice in a word (unless that letter appears twice on the board).</p> <p align="center"> <img src="/images/boggle.png" alt="A 4x4 Boggle board with the word &quot;THREAT&quot; shown" /> </p> <p>You score points for each word you find, with longer words being worth more points (3-4 letters are 1 point, 5=2 points, 6=3 points, 7=5 points, 8+ letters=11 points).</p> <p>Boggle is a fun game, but it’s also a fun Computer Science problem. There are three increasingly hard problems to solve as you go down this rabbit hole:</p> <ol> <li><strong>Write a program to find all the words on a Boggle board.</strong> This is a classic data structures and algorithms problem, and sometimes even an interview question. What’s wonderful about this problem is that it’s a perfect use for a <a href="https://www.danvk.org/wp/2007-02-06/in-which-we-discover-that-tries-are-awesome/index.html">Trie</a> (aka <a href="https://en.wikipedia.org/wiki/Trie">Prefix Tree</a>), and a counter to the idea that hash tables are always the best answer. You can find <a href="https://www.danvk.org/wp/2007-02-04/moving-boggle-to-c/index.html">many</a>, <a href="https://github.com/danvk/performance-boggle">many</a> Boggle solvers of this sort on the internet. 
Apparently Jeff Dean is a <a href="https://x.com/JeffDean/status/1887173255448121617">fan</a> of Boggle, and LLMs can even write these sorts of solvers.</li> <li><strong>Find high scoring Boggle boards.</strong> Once you’ve written a fast solver, a natural question is “what’s the Boggle board with the most points on it?” The usual approach is some variation on <a href="https://www.danvk.org/wp/2009-02-19/sky-high-boggle-scores-with-simulated-annealing/index.html">simulated annealing</a> or <a href="https://en.wikipedia.org/wiki/Hill_climbing">hill climbing</a>. Start with a random board and find all the words on it. Then change a letter or swap two letters and see if it improves things. Repeat until you stall out. This problem is less popular than the first, but you can still find <a href="http://www.robertgamble.net/2016/01/a-programmers-analysis-of-boggle.html">a few</a> about it and a <a href="https://codegolf.stackexchange.com/questions/5654/best-scoring-boggle-board?rq=1">Code Golf competition</a>. There’s even a <a href="https://digitalcommons.butler.edu/cgi/viewcontent.cgi?article=2722&amp;context=wordways">published article</a> from 1982 about it! (Fun fact: they found the wrong board.)</li> <li><strong>Prove that a Boggle board is the global optimum.</strong> If you do enough simulated annealing runs, you’ll see the same few boards pop up again and again. A natural next question is “are these truly the highest-scoring boards?” Are there any high-scoring boards that simulated annealing misses? Proving a global optimum is much harder than finding a few high-scoring boards and, so far as I’m aware, I’m the only person who’s ever spent significant time on this particular problem.</li> </ol> <p>The crowning achievement of my work in 2009 was <a href="https://www.danvk.org/wp/2009-08-08/breaking-3x3-boggle/index.html">proving</a> that this was the highest-scoring 3x3 Boggle board (using the <a href="https://everything2.com/title/ENABLE+word+list">ENABLE2K</a> word list), with 545 points:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R T S E A E P L D </code></pre></div></div> <p>Now, 15 years later, I’ve been able to prove that this is the best 3x4 board, with 1651 points:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S L P I A E N T R D E S </code></pre></div></div> <p>This post will summarize how I found the highest-scoring 3x3 Boggle board back in 2009, and <a href="https://www.danvk.org/2025/02/13/boggle2025.html">the next</a> will describe how I extended this to 3x4 in 2025. Alas, 4x4 Boggle still remains out of reach for now. Maybe in 2040?</p> <h2 id="why-is-this-a-hardinteresting-problem">Why is this a hard/interesting problem?</h2> <p>There are an <a href="https://www.danvk.org/wp/2007-08-02/how-many-boggle-boards-are-there/index.html">enormous number</a> of possible Boggle boards. Something like 26^16/8, which is around 5 billion trillion (5*10^21). This is far, far too many to check one by one. I <a href="https://www.danvk.org/wp/2007-08-02/how-many-boggle-boards-are-there/index.html">previously estimated</a> that it would take around 2 billion years on a single CPU.</p> <p>And yet… there’s a lot of structure in this problem that might be exploited to make it tractable. 
There’s an enormous number of possible optimizations to try, lots of interesting data structures and algorithms to read about and implement, and always the possibility that you’re one insight away from solving this problem.</p> <p>Boggle is a worthy adversary, and most ideas don’t pan out. But the possibility of achieving such an enormous speedup (2 billion years → a few hours) is what makes this problem exciting to me.</p> <h2 id="why-pick-it-back-up-now">Why pick it back up now?</h2> <p>Fifteen years is a long time! The world has changed a lot since my <a href="https://www.danvk.org/wp/2009-08-11/a-few-more-boggle-examples/index.html">last Boggle post</a>. Computers have gotten much faster. There have been five new versions of C++. Cloud computing is a thing now. Stack Overflow is a thing. So are LLMs. A cool language called TypeScript came out and I <a href="https://effectivetypescript.com/">wrote a book on it</a>. I even have an iPhone now!</p> <p>I’ve gotten in the <a href="https://effectivetypescript.com/2024/07/17/advent2023-zig/">habit</a> of doing the <a href="https://adventofcode.com/">Advent of Code</a>, a coding competition that’s held every December. It involves lots of data structures and algorithms problems, so it got me in that headspace.</p> <p>In addition, I’ve long been curious to write code using a mix of C++ and Python: C++ for the performance-critical parts, Python for everything else. Maybe it could be a best-of-both worlds: the speed of C++ with the convenience of Python. I thought Boggle would be a great problem to use as motivation. I wound up using <a href="https://pybind11.readthedocs.io/en/stable/index.html">pybind11</a> and I’m a fan. I’ll have some thoughts to share about it in a future post.</p> <h2 id="how-did-i-find-the-optimal-3x3-board-in-2009">How did I find the optimal 3x3 board in 2009?</h2> <p>Though I didn’t know it at the time, I used <a href="https://en.wikipedia.org/wiki/Branch_and_bound">branch and bound</a>, a ubiquitous optimization strategy first developed in the 1960s. There were three key ideas.</p> <h3 id="board-classes">Board classes</h3> <p>The first idea was to reduce the number of boards by considering whole classes of boards at once. Here’s an example of a class of 3x3 boards:</p> <style> td, th { border: 1px solid #777; padding: 0.25em; } table { margin-bottom: 10px; } </style> <table> <tbody> <tr> <td>l, n, r, s, y</td> <td>c, h, k, m, p, t</td> <td>l, n, r, s, y</td> </tr> <tr> <td>a, e, i, o, u</td> <td>a, e, i, o, u</td> <td>a, e, i, o, u</td> </tr> <tr> <td>c, h, k, m, p, t</td> <td>l, n, r, s, y</td> <td>b, d, f, g, j, v, w, x, z</td> </tr> </tbody> </table> <!-- lnrsy aeiou chkmpt chkmpt aeiou lnrsy lnrsy aeiou bdfgjvwxz --> <!-- XXX missing the letter q! --> <p>I’ve divided the alphabet up into four different “buckets.” Instead of having a single letter on each cell, this board has 5-9 possible letters from one of those buckets on each cell. There are 5,062,500 individual boards in this class. The highest-scoring 3x3 board (<code class="language-plaintext highlighter-rouge">rts/eae/pld</code>) is one of them, but there are many others.</p> <p>There are vastly fewer board <em>classes</em> than individual boards. For 3x3 Boggle, using four “buckets” of letters takes us from 26^9/8 = 6x10^11 boards → 4^9/8 = 32,768 board classes. 
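<p>As a quick sanity check on that counting (the divide-by-8 stands in for the eight symmetries of a square board: four rotations times two reflections):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>boards = 26**9 // 8   # individual 3x3 boards (the ~6x10^11 above)
classes = 4**9 // 8   # board classes with four letter buckets
print(f"{boards:,} boards vs. {classes:,} board classes")
</code></pre></div></div>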
If we can find the highest-scoring board in each class in a reasonable amount of time, this will make the problem of finding the global optimum tractable.</p> <h3 id="bound-upper-bounds-on-board-classes">Bound: Upper bounds on board classes</h3> <p>The second insight is that, rather than finding the highest-scoring board in a class, all we really need to do is establish an <a href="https://en.wikipedia.org/wiki/Upper_and_lower_bounds">upper bound</a> on its score. An upper bound is a concept from mathematics: if the highest-scoring board in a class has 500 points, then 500 is an upper bound on the score. So is 600. The upper bound doesn’t need to be achieved by any particular board, it just needs to be greater or equal to the score of every board.</p> <p>If the upper bound is less than 545 (the score of the best individual board we found through simulated annealing), then we know there’s no specific board in this class that beats our best board, and we can toss it out without having to score every single board in the class.</p> <p>As it turns out, establishing an upper bound is much, much easier than finding the best board in a class. I came up with <a href="https://www.danvk.org/wp/2009-08-08/breaking-3x3-boggle/index.html">two upper bounds</a> back in 2009:</p> <ul> <li><strong><a href="https://www.danvk.org/wp/2009-08-11/some-maxno-mark-examples/index.html">sum/union</a></strong>: the sum of the points for every word that can be found on any board in the class.</li> <li><strong><a href="https://www.danvk.org/wp/2009-08-11/a-few-more-boggle-examples/index.html">max/no-mark</a></strong>: a bound that takes into account that you have to choose one letter for each cell.</li> </ul> <p>You can read more about how these work in the linked blog posts from 2009. The max/no-mark bound is typically much lower than the sum/union bound, but not always. Usually neither of the upper bounds is low enough. On this board class, for example, the sum/union bound is 106,383 and the max/nomark bound is 9,359. Those are both much, much larger than 545!</p> <h3 id="branch-repeatedly-split-board-classes">Branch: Repeatedly split board classes</h3> <p>This brings us to the final insight: if the upper bound is too high, you can split up one of the cells to make several smaller classes. 
For example, if you split up the middle cell in the board class from above, then this single class becomes five classes, one for each choice of vowel in the middle:</p> <style> #five-boards > tbody > tr > td { border: none; } </style> <table id="five-boards"> <tr> <td> <table> <tr><td>lnrsy</td><td>chkmpt</td><td>lnrsy</td></tr> <tr><td>aeiou</td><td>a</td><td>aeiou</td></tr> <tr><td>chkmpt</td><td>lnrsy</td><td>bdfgjvwxz</td></tr> </table> </td> <td> <table> <tr><td>lnrsy</td><td>chkmpt</td><td>lnrsy</td></tr> <tr><td>aeiou</td><td>e</td><td>aeiou</td></tr> <tr><td>chkmpt</td><td>lnrsy</td><td>bdfgjvwxz</td></tr> </table> </td> <td>...</td> <td> <table> <tr><td>lnrsy</td><td>chkmpt</td><td>lnrsy</td></tr> <tr><td>aeiou</td><td>u</td><td>aeiou</td></tr> <tr><td>chkmpt</td><td>lnrsy</td><td>bdfgjvwxz</td></tr> </table> </td> </tr> </table> <p>These are the bounds you get on those five board classes:</p> <table> <thead> <tr> <th><strong>Middle Letter</strong></th> <th><strong>Max/No Mark</strong></th> <th><strong>Sum/Union</strong></th> </tr> </thead> <tbody> <tr> <td>A</td> <td>6,034</td> <td>55,146</td> </tr> <tr> <td>E</td> <td>6,979</td> <td>69,536</td> </tr> <tr> <td>I</td> <td>6,155</td> <td>58,139</td> </tr> <tr> <td>O</td> <td>5,487</td> <td>48,315</td> </tr> <tr> <td>U</td> <td>4,424</td> <td>37,371</td> </tr> </tbody> </table> <p>Those numbers are all still too high (we want them to get below 545), but they have come down considerably. Choosing “U” for the middle cell brings the bound down the most, while choosing “E” brings it down the least.</p> <h3 id="branch-and-bound">Branch and Bound</h3> <p>If we keep splitting up cells, we’ll keep getting more board classes with lower bounds. I didn’t know it in 2009, but this is <a href="https://en.wikipedia.org/wiki/Branch_and_bound">branch and bound</a>, a ubiquitous approch for solving optimization problems:</p> <ul> <li><strong>Branch</strong>: Split up a cell to get smaller board classes (subproblems).</li> <li><strong>Bound:</strong> Calculate an upper bound on the board class.</li> </ul> <p>If you iteratively break cells, your bounds will keep going down. If they drop below 545, you can stop. These recursive breaks form a sort of tree. The branches of the tree with lower scores (like the “U”) will require fewer subsequent breaks and will be shallower than the higher-scoring branches (the “E”). The sum/union bound converges on the true score, so if you break all 9 cells and still have more than 545 points, you’ve found a global optimum.</p> <p>Back in 2009, I reported that I checked all the 3x3 boards this way in around 6 hours. The board you find via simulated annealing is, in fact, the global optimum. In 2025, I’m able to run the same code on a single core on my laptop in around 40 minutes.</p> <p>This is something like a 400x speedup vs. 
scoring all the 3x3 boards individually.</p> <h2 id="how-much-harder-are-3x4-and-4x4-boggle">How much harder are 3x4 and 4x4 Boggle?</h2> <p>As you increase the size of the board, the maximization problem gets harder for two reasons:</p> <ol> <li>There are exponentially more boards (and board classes) to consider.</li> <li>Each board (and board class) has more words and more points on it.</li> </ol> <p>How bad is this?</p> <ul> <li>3x3: each class takes ~80ms to break and there a ~33,000 of them ⇒ ~40 minutes.</li> <li>3x4: each class takes ~1.6s to break and there are ~6.7M of them ⇒ ~78 days.</li> <li>4x4: each class takes ~10m to break and there are ~537M of them ⇒ ~10,000 years.</li> </ul> <p>So with the current algorithms, 3x4 Boggle is ~3,000x harder than 3x3 Boggle and 4x4 Boggle is around 50,000 times harder than that.</p> <p>That’s enough for today. In the <a href="https://www.danvk.org/2025/02/13/boggle2025.html">next post</a>, I’ll present a few optimizations on the 2009 approach that net us another ~10x speedup. Enough that it’s reasonable to solve 3x4 Boggle on a single beefy cloud machine, but not enough to bring 4x4 Boggle within reach.</p> Another Decade, Another Webdiff 2024-06-21T00:00:00+00:00 https://danvk.github.io//2024/06/21/webdiff <p>Over the past few weeks, I found the time to work on <a href="https://github.com/danvk/webdiff">webdiff</a>, an open source project of mine that I built <a href="https://www.danvk.org/wp/2014-07-03/introducing-git-webdiff/index.html">over a decade ago</a>. I still use it all the time, but hadn’t actively worked on it since 2015. Revisiting an old project is always an interesting experience, and this post presents my reflections on it.</p> <p>First off, what is webdiff? It’s a diff tool. Rather than running <code class="language-plaintext highlighter-rouge">git diff</code>, you run <code class="language-plaintext highlighter-rouge">git webdiff</code> and you get a two-column diff UI with syntax highlighting in your browser:</p> <p><img src="/images/webdiff.png" alt="Webdiff showing a two column diff with syntax highlighting" /></p> <p>Because it’s running in your browser, you get lots of nice things for free: web fonts, zoom, search. You can also look at diffs between images:</p> <p><img src="/images/webdiff-images.png" alt="Webdiff showing an image diff" /></p> <p>Installation is as simple as</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install webdiff </code></pre></div></div> <h2 id="background">Background</h2> <p>The project came out of my experience of <a href="https://www.danvk.org/2014/10/31/life-after-google-six-months-in.html">leaving Google</a> for the first time back in 2014. I was very accustomed to how software was built inside Google, and there were some tools that I missed. In particular, Google’s code review tools (Mondrian and later Critique) were light years ahead of GitHub’s in 2014. GitHub’s PR Review was quite barebones back then: no two-column diffs, no syntax highlighting. webdiff was my attempt both to improve this, and to learn how to build and publish a tool outside the Google ecosystem.</p> <p>I spent a good chunk of the summer of 2014 building webdiff, and I was happy with how it turned out. 
I released it that July and <a href="https://www.danvk.org/2014/11/07/github-integration-and-image-diff-improvements-headline-webdiff-0-8.html">continued to work on it</a>, even giving a <a href="https://www.danvk.org/2015/04/12/pycon-2015-make-web-development-awesome-with-visual-diffing-tools.html">PyCon talk</a> about it in 2015.</p> <p>While I haven’t worked much on webdiff since 2015, I’ve continued to be an active user. My years at Google trained me to look at two-column diffs with syntax highlighting in a browser, and I still much prefer this to looking at diffs in a terminal. GitHub’s Pull Request UI has <a href="https://github.blog/2016-03-15-more-code-review-tools/">improved significantly</a> over the years, but I like that I can run <code class="language-plaintext highlighter-rouge">git webdiff</code> locally (or on a plane) without having to push anything to GitHub. I also prefer the one-file-at-a-time UI, which matches Mondrian/Critique. <a href="https://vscode.one/diff-vscode/">VSCode’s diff viewer</a> is another interesting option these days, though I feel some mismatch between diff viewing as an ephemeral process and editing as a more persistent one.</p> <h2 id="realizations">Realizations</h2> <p>Using webdiff over the years, I had a few realizations. One was that I hadn’t understood git very well when I built it in 2014. Most of my experience had been with <code class="language-plaintext highlighter-rouge">git5</code>, a git wrapper around Perforce in use at Google at the time. In retrospect, this was an incredibly confusing way to learn git! My understanding of git improved while I <a href="https://www.danvk.org/2015/10/21/hammerlab-posts.html">worked at Hammerlab</a> in 2014–2015, then took another big step forward in 2016 when I watched the fantastic <a href="https://www.youtube.com/watch?v=MYP56QJpDr4">git from the bits up</a> talk.</p> <p>The other big realization was that I’d architected <code class="language-plaintext highlighter-rouge">webdiff</code> in the wrong way. webdiff takes two directories (before and after, or left and right) and tries to match up the files in them. This is usually straightforward, but there are some tricky edge cases, like a <a href="https://github.com/danvk/webdiff/issues/7">rename+change</a>. The original webdiff matched files up on its own, then calculated diffs for each file. The realization was that <code class="language-plaintext highlighter-rouge">git</code> <a href="https://github.com/danvk/webdiff/issues/129">is already really good at this</a>, and that I should rely on it to do all the diff calculations for me. webdiff should <em>display</em> diffs, never calculate them.</p> <h2 id="a-wall">A wall</h2> <p>This idea kicked around in the back of my head for a few years, until I had some time to work on it in the fall of 2022. I learned about <code class="language-plaintext highlighter-rouge">git diff --no-index</code>, which lets you use <code class="language-plaintext highlighter-rouge">git-diff</code> to diff two files or directories outside a git repo. And I learned about <code class="language-plaintext highlighter-rouge">git diff --raw</code>, which diffs two directories and matches files between them to produce adds, deletes, renames and changes. This all seemed promising! 
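<p>Here’s roughly how you might drive that from Python (a sketch, not webdiff’s actual code; the details of the <code class="language-plaintext highlighter-rouge">--raw</code> output format are worth double-checking against your version of git):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import subprocess

def pair_files(left_dir, right_dir):
    """Ask git to pair up the files in two directory trees."""
    # --no-index lets git diff run outside a repo; an exit code of 1 just
    # means "differences found", so the return code isn't checked here.
    # Adding -M should ask git to detect renames as well.
    out = subprocess.run(
        ["git", "diff", "--no-index", "--raw", left_dir, right_dir],
        capture_output=True, text=True,
    ).stdout
    pairs = []
    for line in out.splitlines():
        if not line.startswith(":"):
            continue
        meta, _, paths = line.partition("\t")
        status = meta.split()[-1]  # A (add), D (delete), M (modify), R... (rename)
        pairs.append((status, paths.split("\t")))
    return pairs
</code></pre></div></div>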
It even let me play around with flags like <code class="language-plaintext highlighter-rouge">git diff -w</code> which tells <code class="language-plaintext highlighter-rouge">git diff</code> to ignore whitespace changes.</p> <p>Then I ran into a wall: if you run <code class="language-plaintext highlighter-rouge">git difftool</code> from <code class="language-plaintext highlighter-rouge">HEAD</code>, one of the directories it produces will be filled with symlinks to files in your repo. This makes sense: it’s faster to create symlinks to the files than copies. And for webdiff, it meant that you could edit a file, reload the browser window, and see the new diff.</p> <p>Unfortunately for webdiff, <code class="language-plaintext highlighter-rouge">git diff --no-index</code> <a href="https://public-inbox.org/git/1489877673.24742.1.camel@kaarsemaker.net/t/">does not resolve symlinks</a>. This meant that, in order to produce a diff, I had to run <code class="language-plaintext highlighter-rouge">git difftool --no-symlinks</code>. This was slower, and it broke an important workflow: reloading the diff after editing a file no longer reflected your changes. This was frustrating, and enough to put me off the project.</p> <h2 id="a-breakthrough">A breakthrough</h2> <p>Fast-forward almost two years and I decided to pick up the project again. What had seemed like a fundamental issue in 2022 now just seemed like a nuisance. Before passing the directories to <code class="language-plaintext highlighter-rouge">git diff --no-index</code>, I could make a version of the directory that resolved the symlinks. This would let <code class="language-plaintext highlighter-rouge">git</code> pair up the files for me. Then I could resolve symlinks before running <code class="language-plaintext highlighter-rouge">git diff --no-index</code> to generate diffs for individual files. Elegant, no, but it let me get through the impasse.</p> <p>Once that was resolved, I was able to cut the first new release of webdiff in years. But once I was in there, I didn’t want to stop. When you look at ten year old code, it’s hard to resist the urge to modernize it. I’ve done quite a bit of that over the past few weeks.</p> <p>One advantage of stepping away from a project for so long is that you get to skip several generations of tooling. In this case I got to skip straight from Python’s vintage setuptools to <a href="https://python-poetry.org/">poetry</a> for managing dependencies and releases. The Python packaging situation has improved significantly over the past decade. I like poetry, and I like <code class="language-plaintext highlighter-rouge">pyproject.toml</code> over <code class="language-plaintext highlighter-rouge">setup.py</code>.</p> <p><a href="https://github.com/danvk/webdiff/pull/190">Migrating the diff UI</a> from jQuery to React was a real throwback to 2015. It was also a nice reminder of the beauty of React. There was considerable duplication between the code for building the initial diff UI and for filling in additional rows when you clicked a “Show 12 rows” link. Adding “show 10 more” links would have made it even worse.</p> <p>When I ported the code over to React, <a href="https://github.com/danvk/webdiff/pull/190">the duplication went away</a>. 
It was easy to show the additional skipped rows <em>in the data model</em> and trust React to render them appropriately.</p> <p>I’ve added quite a few new features and even started to play around with next-generation tooling like the <a href="https://github.com/danvk/webdiff/pull/209">React Compiler</a>. But after a few weeks, I can tell that I’ve hit the point of diminishing returns. I’d like to go back to being a user again.</p> <h2 id="whats-next">What’s next</h2> <p>What would I work on next for webdiff? There’s a <a href="https://github.com/danvk/webdiff/issues/66">long-standing, annoying bug</a> where the terminal process doesn’t quit when you close the diff in your browser. I’d like to try to fix that. And now that the diff UI is fully React-ified, <a href="https://github.com/danvk/webdiff/issues/182">generating it lazily</a> could make it easier to render diffs for large files like lockfiles (or <code class="language-plaintext highlighter-rouge">checker.ts</code>). I’d also be excited about a special mode for <a href="https://github.com/danvk/webdiff/issues/211">diffing minified JSON</a>.</p> <p>My biggest dream for a code review tool would be to have language services available while reviewing code. Google didn’t have this when I worked there, but it might have it now. I don’t think this is a feature webdiff will ever have, but if VS Code gets it right, it might be enough to make me switch.</p> AlphaGo vs. Lee Sedol 2016-03-15T00:00:00+00:00 https://danvk.github.io//2016/03/15/alphago-vs-lee-sedol <p>I was dimly aware of the <a href="https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol">ongoing competition between AlphaGo and Lee Sedol</a>, but I hadn’t paid much attention until I saw this chart <a href="https://www.reddit.com/r/dataisbeautiful/comments/4a8336/lee_sedol_vs_alphago_4th_game_thinking_time_in/">on reddit</a>:</p> <p><img src="http://i.imgur.com/3HcJKbk.png" border="0" width="700" alt="time per move during game 4" /></p> <p>It’s hard to read “Lee Sedol’s brilliant attack (78th)” and not get curious! This led me into a deep dive on the competition. You can read more about <a href="https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/">the move</a> or watch a <a href="https://www.youtube.com/watch?v=G5gJ-pVo1gs">15 minute summary</a> of the match. The full <a href="https://www.youtube.com/watch?v=yCALyQRN3hw&amp;feature=youtu.be&amp;t=3h10m20s">6 hour match</a>, including a press conference afterwards, is also online. You can learn a lot about Go from listening to the YouTube commentators. One of them is <a href="https://en.wikipedia.org/wiki/Michael_Redmond_(Go_player)">Michael Redmond</a>, the all-time top-rated American Go player. Even if you don’t understand how to play Go (I barely do), it’s fun to watch <a href="https://www.youtube.com/watch?v=SMqjGNqfU6I&amp;feature=youtu.be&amp;t=1h25m32s">the experts react</a> to this move: “Oh, this [is] very creative.”</p> <p><img src="/images/alphago.png" width="231" height="190" style="float: right; margin-left: 1em;" /> <a href="https://en.wikipedia.org/wiki/Lee_Sedol">Lee Sedol</a>’s win in the fourth match is being celebrated as a victory for humankind. But it’s surprising that we’re here at all. AlphaGo, Google DeepMind’s computer Go program, had already won the first three games and hence the best-of-five competition. This is a coming of age moment for the neural net community. 
Over the past ten years, thanks to the emergence of <a href="https://www.cs.toronto.edu/~kriz/cifar.html">large data sets</a> and <a href="http://www.nvidia.com/object/what-is-gpu-computing.html">GPUs</a>, the whole field has experienced a renaissance. But most of the great results have been on toy problems like image classification or traditional signal processing problems like speech recognition. Beating an elite Go player for the first time is a marquee result that transcends the field. As long as I’ve worked in software, Go has been the one game at which computers couldn’t compete. Everyone thought that this result was years away.</p> <p>Last fall, AlphaGo competed against <a href="https://en.wikipedia.org/wiki/Fan_Huio">Fan Hui</a>, a lower-ranked Go professional. It beat him 5-0. This was the first time that a computer Go program had defeated a professional. What happened next is suggestive. <a href="http://www.wired.com/2016/03/sadness-beauty-watching-googles-ai-play-go/">According to Wired</a>, he began consulting with the DeepMind team:</p> <blockquote> <p>As he played match after match with AlphaGo over the past five months, he watched the machine improve. But he also watched himself improve. The experience has, quite literally, changed the way he views the game. When he first played the Google machine, he was ranked 633rd in the world. Now, he is up into the 300s. In the months since October, AlphaGo has taught him, a human, to be a better player. He sees things he didn’t see before. And that makes him happy. “So beautiful,” he says. “So beautiful.”</p> </blockquote> <p>We learn most quickly from our betters—people who can review your work, say what you did well and point out what would have been better. But if you’re the best Go player in the world, who do you learn from? I wouldn’t be surprised if this result prompts humans to discover new styles of play. Computers may get the best of us at Go in the long run, but we’ll get better at it in the process.</p> <p>The fifth and final game happens tonight. If Lee Sedol wins, you could make the case that he just took a few games to figure out how to get the best of AlphaGo. If not, it means that it took completely brilliant play from the best in the world to beat a computer, and it’s likely to be the last time a human ever pulls this off. The match is being <a href="https://www.youtube.com/watch?v=mzpW10DPHeQ">streamed live on YouTube</a>.</p> Extending the Grid to Add 1,000 Photos to OldNYC 2016-01-19T00:00:00+00:00 https://danvk.github.io//2016/01/19/oldnyc-update <p>I recently added around 1,000 new photos to the map on OldNYC. Read on to find out how!</p> <div class="center"> <img src="/images/stuytown-update.gif" width="300" height="275" alt="OldNYC update around Stuytown" /> </div> <p>At its core, OldNYC is based on <a href="https://en.wikipedia.org/wiki/Geocoding">geocoding</a>: the process of going from textual addresses like “9th Street and Avenue A” to numeric latitudes and longitudes. There’s a bit of a mismatch here. The NYPL photos have 1930s addresses and cross-streets, but geocoders are built to work with contemporary addresses. OldNYC makes an assumption that contemporary geocoders will produce accurate results for these old addresses. For NYC, this is usually a good assumption! The street grid hasn’t changed too much in the past 150 years. 
But it is an assumption, and it doesn’t always pan out.</p> <p>Two of the most noticeable problem spots are Stuytown and Park Avenue South:</p> <div class="center"> <img src="/images/stuytown-pas-missing.png" width="600" height="550" alt="Stuytown and Park Avenue South have no images" /> </div> <p>The lettered Avenues (A, B, C, D) used to continue above 14th street. This was the <a href="https://en.wikipedia.org/wiki/Gashouse_District">Gas House district</a>. But in the 1940s, this area was destroyed to make way for the super-blocks of <a href="https://en.wikipedia.org/wiki/Stuyvesant_Town%E2%80%93Peter_Cooper_Village">Stuyvesant Town</a>. Intersections like “15th and A” do not exist in the contemporary Manhattan grid and geocoders can’t make sense of them. But there <em>are</em> <a href="http://www.oldnyc.org/#711319f-a">photos there</a>!</p> <p>The problem for Park Avenue South is different. <a href="https://en.wikipedia.org/wiki/Park_Avenue#History">Until 1959</a>, it was known as 4th Avenue. So photographs from the 1930s are recorded as being at, for example, “4th Avenue and 17th street”, an intersection which no longer exists. Again, contemporary geocoders can’t make sense of this.</p> <p>The frustrating thing here is that it’s perfectly obvious where all of these intersections <em>should</em> be. Manhattan has a <a href="https://en.wikipedia.org/wiki/Commissioners%27_Plan_of_1811">regular street grid</a>, after all. So I set out to build my own Manhattan street grid geocoder.</p> <p>To begin with, I gathered lat/lons for <a href="https://docs.google.com/spreadsheets/d/1Kv0xXYHXUrlFmyIF92NsWTMuuRWAl-heBEI_OejcBA0/edit#gid=0">every intersection</a> that I could. With <a href="https://github.com/danvk/oldnyc/blob/9aa15e0a4ce46d746531c7f9531ff4a50e17a53f/grid/gold.py#L21-L41">some simple logic</a>, this handled the Avenue renaming issue.</p> <p>My initial idea to geocode unknown intersections was to interpolate on the avenues. For example, to find where the intersection of 18th Street and Avenue A should be, you can assume that the intersections of numbered streets and Avenue A are evenly spaced and then find where the 18th street intersection would fall:</p> <div class="center"> <img src="/images/extrapolation.png" width="552" height="428" alt="Extrapolation of intersections" /> </div> <p>Mathematically, you fit linear regressions from cross-street to latitude and longitude. This feels like it should work but, because the streets aren’t all perfectly spaced, it winds up producing results that don’t quite look right.</p> <p>While I was playing around with this approach, I realized that I was checking the results using a different technique: continuing the straight lines of the streets until they intersected:</p> <div class="center"> <img src="/images/continued-lines.png" width="403" height="329" alt="Extrapolation of streets" /> </div> <p>Mathematically, this means that you fit a linear regression to the latitude→longitude mapping for each Street and Avenue. To find an intersection, you find the point where these lines intersect. This works so long as the Streets and Avenues are straight. Fortunately, with a few exceptions like Avenue C and the West Village, they are (r<sup>2</sup>&gt;0.99).</p> <p>This approach produced very good results. 
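<p>The whole trick fits in a few lines. Here’s a rough numpy sketch of the idea (my own function names, not the oldnyc code):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def fit_line(latlons):
    """Least-squares fit of lon = m * lat + b for one street or avenue."""
    lats, lons = np.array(latlons).T
    m, b = np.polyfit(lats, lons, 1)
    return m, b

def intersect(street_latlons, avenue_latlons):
    """Extend a street and an avenue as straight lines and find where they cross."""
    m1, b1 = fit_line(street_latlons)
    m2, b2 = fit_line(avenue_latlons)
    lat = (b2 - b1) / (m1 - m2)  # solve m1 * lat + b1 == m2 * lat + b2
    return lat, m1 * lat + b1
</code></pre></div></div>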
<p>The oddities which remained were as likely to be problems with the data as with the geocoder (<a href="http://digitalcollections.nypl.org/items/510d47dd-07a6-a3d9-e040-e00a18064a99">one image</a> was nonsensically labeled as “25th &amp; D”, which extrapolates to somewhere in the East River).</p> <p>While Stuytown and Park Avenue South were clear winners, new photos appeared all over the map:</p> <div class="center"> <img src="/images/oldnyc-update.gif" width="400" height="400" alt="Update for lower Manhattan" /> </div> <p>It even helped uptown:</p> <div class="center"> <img src="/images/oldnyc-uptown-update.gif" width="640" height="400" alt="Update for Uptown" /> </div> <p>All told, there are about 1,000 new images on the map. Go check them out! And please help transcribe the text on the back of them. My <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">OCR system</a> didn’t run on the new images, so they’re sorely lacking descriptions.</p> <p>Here are a few favorites:</p> <div class="center"> <a href="https://www.oldnyc.org/#716426f-a"><img src="/images/716426f-a.jpg" width="600" height="421" /></a> <br /> <i>Looking down Avenue B from 15th to 17th Street. This Avenue no longer exists.</i> </div> <div class="center"> <a href="https://www.oldnyc.org/#708194f-a"><img src="/images/708194f-a.jpg" width="482" height="348" /></a> <br /> <i>The Everett House hotel at Union Square in 1906. This building no longer exists, but there's a new hotel (the W) at the same location.</i> </div> My takeaways from NIPS 2015 2015-12-12T00:00:00+00:00 https://danvk.github.io//2015/12/12/nips-2015 <p>I’ve just wrapped up my trip to <a href="https://nips.cc/Conferences/2015">NIPS 2015</a> in Montreal and thought I’d jot down a few things that struck me this year:</p> <ul> <li> <p><strong>Saddle Points vs Local Minima</strong></p> <p>I heard this point repeated in a talk almost every day. In low-dimensional spaces (i.e. the ones we can visualize) local minima are the major impediment to optimizers reaching the global minimum. But this doesn’t generalize. In high-dimensional spaces, local minima are almost non-existent. Instead, there are saddle points: points which are a minimum in some directions but a maximum in others. Intuitively, this makes sense: in N dimensions, the odds of the curvatures all going the same way at a point are (1/2)^N. As Yoshua Bengio said, “it’s hard to build an n-dimensional wall.” This gives an intuition for why procedures like <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a> are effective at optimizing the thousands of weights in a neural net: they won’t get stuck in a local optimum. And it gives an intuition for why <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum">momentum</a> is helpful: it helps gradient descent escape from saddle points.</p> </li> <li> <p><strong>Model Compression</strong></p> <p>The tutorial on <a href="http://nips2015.sched.org/event/4FAy/high-performance-hardware-for-machine-learning"><em>Hardware for Deep Learning</em></a> was less about new hardware and more about how to make your software get the most out of existing hardware. Due to the high cost of uncached, off-chip memory reads, reducing the memory footprint of your models can be a huge performance win.
Bill Dally presented a result on model pruning that I found interesting: by iteratively removing small weights from a model and retraining, they were able to remove 90+% of the weights with zero loss of precision. This parallels an observation from <a href="http://www.ttic.edu/dl/dark14.pdf">transfer learning</a>, that small networks are most effectively trained using the output of larger networks. It would be nice if we could train these smaller networks directly. See the <a href="http://arxiv.org/abs/1510.00149">Deep Compression</a> paper.</p> </li> <li> <p><strong>The importance of canonical data sets / problems</strong></p> <p>Over and over, talks and posters referenced the same canonical data sets: the <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a> set of handwritten digits, the <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR</a> and <a href="http://www.image-net.org/">ImageNet</a> images, the <a href="https://catalog.ldc.upenn.edu/LDC93S1">TIMIT</a> speech corpus and the <a href="http://www.arcadelearningenvironment.org/">Atari/Arcade Learning Environment</a> (ALE). These have given researchers in their fields a shared problem on which to experiment, compete, collaborate and measure their progress. If you want to push a field forward, build a good challenge problem.</p> </li> <li> <p><strong>One-shot Learning</strong></p> <p>There was much high-level talk about how the human brain is very good at learning to perform new tasks quickly. Contrast this with neural nets, which require thousands or millions of training examples to reach human performance. This comparison is somewhat unfair because adult humans have years of experience interacting with the real world to draw on. There seems to be a great deal of interest in getting machines to do a better job of transferring general knowledge to specific tasks.</p> </li> <li> <p><strong>This conference has gotten huge!</strong></p> <p>I’ve read that registrations have gone up significantly over the last few years. This was palpable at the conference. Many of the workshop rooms were packed to the gills and I watched most of the larger talks in overflow rooms. This is probably a good thing for the ML community. I’m not sure if it’s a good thing for NIPS.</p> </li> </ul> <p>A few smaller bits that struck me:</p> <ul> <li> <p><a href="http://arxiv.org/abs/1505.00387">Highway Networks</a> are cool. They let you train very deep networks, where the depth has to be learned. They found that depth &gt; 20 was not helpful for MNIST, but was for CIFAR.</p> </li> <li> <p><a href="http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">AlexNet</a> seems to have become a canonical neural net for experimentation. I saw it referenced repeatedly, e.g. on the <a href="http://arxiv.org/abs/1510.00149">Deep Compression</a> poster.</p> </li> <li> <p>As an example of the above, I really enjoyed the <a href="http://arxiv.org/abs/1407.5104"><em>Pixels to Voxels</em></a> talk. Pulkit Agrawal &amp; co showed that the activations of individual layers of AlexNet correlate to activations in regions of the brain.
They were able to use this correspondence to learn what some of the <a href="https://en.wikipedia.org/wiki/Visual_cortex#V4">mid-level regions</a> of the visual cortex are doing.</p> </li> <li> <p>We heard several speakers say that “backprop is not biologically plausible.” I assume this is because we don’t consume nearly enough labels for it to be practical at the scale of a human brain?</p> </li> <li> <p>I asked someone at the NVIDIA booth whether the ML industry is large enough to drive GPU design &amp; sales (as opposed to the game industry). It is. The GPUs designed for ML tend to be more robust than those used in games. They have error correcting codes built-in. In a game, if an arithmetic unit makes a mistake, it’ll be fixed on the next frame. When you’re training a neural net, that mistake can propagate.</p> </li> </ul> <p>I do <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">relatively little</a> machine learning in my day-to-day. NIPS is always a bit over my head, but it’s a good way to rekindle my interest in the field.</p> Dan writes on HammerLab 2015-10-21T18:03:00+00:00 https://danvk.github.io//2015/10/21/hammerlab-posts <p>I haven’t written a substantial blog post on danvk.org <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">since January</a>. Instead, I’ve been writing over on my group’s blog at <a href="http://www.hammerlab.org/">hammerlab.org</a>.</p> <p>Here are the posts I’ve written or edited:</p> <ul> <li> <p><a href="http://www.hammerlab.org/2015/10/13/svg-canvas-the-pileup-js-journey/">SVG→Canvas, the pileup.js Journey</a> (13 Oct 2015)</p> <p>In which I explain why we changed from using SVG to using canvas for our genome browser (spoiler: it’s performance). This also introduces <a href="https://github.com/hammerlab/data-canvas">data-canvas</a>, which compensates for some of the drawbacks of canvas, e.g. difficulty in tracking clicks and writing tests.</p> </li> <li> <p><a href="http://www.hammerlab.org/2015/07/09/bundling-and-distributing-complex-es6-libraries-in-an-es5-world/">Bundling and Distributing Complex ES6 Libraries in an ES5 World</a> (09 Jul 2015)</p> <p>I helped edit this post, which was written by <a href="http://arman.aksoy.org/">Arman Aksoy</a>. It outlines our approach to writing ES6 JavaScript while distributing/bundling it for ES5 clients.</p> </li> <li> <p><a href="http://www.hammerlab.org/2015/06/19/introducing-pileup-js-a-browser-based-genome-viewer/">Introducing pileup.js, a Browser-based Genome Viewer</a> (19 Jun 2015)</p> <p>This post introduces pileup.js, the in-browser genome visualizer I’ve been working on for most of the past year.</p> </li> <li><a href="http://www.hammerlab.org/2015/02/14/testing-react-web-apps-with-mocha/">Testing React Web Apps with Mocha</a> (14 Feb 2015)</li> <li> <p><a href="http://www.hammerlab.org/2015/02/21/testing-react-web-apps-with-mocha-part-2/">Testing React Web Apps with Mocha (Part 2)</a> (21 Feb 2015)</p> <p>A two-parter in which I explain how we set up testing for our React web app using Mocha, rather than Jest. Since writing this post, I’ve completely changed my mind about how web testing should be done.
If your code is intended to be run in the browser, then you should test it in the browser, rather than using Node as these posts suggest.</p> </li> <li> <p><a href="http://www.hammerlab.org/2015/01/23/faster-pileup-loading-with-bai-indices/">Faster Pileup Loading with BAI Indices</a> (23 Jan 2015)</p> <p>BAM is a very widely used bioinformatics file format which stores aligned reads from a genome. These files can be huge, so the BAM Index (BAI) format was created to speed up retrieval. We found that the BAI was <em>also</em> too large for convenient access on the web, so we created an index of the index.</p> </li> <li> <p><a href="http://www.hammerlab.org/2014/12/05/igv-httpfs/">Streaming from HDFS with igv-httpfs</a> (05 Dec 2014)</p> <p>How we got data from HDFS into our genome viewer of choice by building a small piece of infrastructure. We’ve since stopped using this. Nowadays we have an NFS mount for our HDFS file system and serve from that using nginx.</p> </li> </ul> Launched: OldNYC 2015-06-04T02:24:00+00:00 https://danvk.github.io//2015/06/04/launched-oldnyc <p><a href="http://www.oldnyc.org/"><img src="/images/oldnyc-big.jpg" width="300" height="236" style="float:right; margin-left: 20px;" /></a> Two weeks ago I <a href="https://twitter.com/danvdk/status/601381598525796352">launched</a> my latest side project, <a href="http://www.oldnyc.org/">OldNYC</a>. It’s a collaboration with NYPL Labs which places around 40,000 historical photos of New York City on a map. Avid readers of this blog know that I’ve been <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">working on</a> this <a href="http://www.danvk.org/wp/2013-02-09/finding-pictures-in-pictures/">for years</a>.</p> <p>The response to OldNYC has been completely overwhelming. Hundreds of thousands of people have used the site. Millions of images have been viewed. Users have left nearly a thousand comments and fixed thousands of typos in the <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">OCR’d descriptions</a>.</p> <p>Rather than say more about the project myself, I’ll let you pick your write-up of choice. 
There have been many!</p> <ul> <li>New York Times: <a href="http://cityroom.blogs.nytimes.com/2015/05/26/new-york-today-new-views-of-the-past/?_r=0">New York Today: New Views of the Past</a></li> <li>Gothamist: <a href="http://gothamist.com/2015/05/26/old_nyc_photo_map.php">This Photo Map Will Bring You Back To Old NYC, Block By Block</a></li> <li>The Guardian: <a href="http://www.theguardian.com/us-news/2015/may/21/new-york-historical-photo-archive-lets-residents-track-changes-to-a-city-in-sepia">New York vintage photo archive lets you track a century of change to your block</a></li> <li>Citylab: <a href="http://www.citylab.com/design/2015/05/pictures-map-old-new-york-historical/393944/">Mapping the New York That Once Was</a></li> <li>Pix11 News (TV): <a href="http://pix11.com/2015/05/26/website-gives-a-virtual-tour-of-old-nyc-through-photo-maps/">Website gives a virtual tour of old NYC through photo maps</a></li> <li>Library Journal: <a href="http://lj.libraryjournal.com/2015/06/digital-content/developer-maps-library-photo-archives/">Developer Maps Library Photo Archives</a></li> <li>Kottke: <a href="http://kottke.org/15/05/mapping-photos-of-old-nyc">Mapping photos of old NYC</a></li> <li>Gizmodo: <a href="http://gizmodo.com/here-are-40-000-photos-of-old-new-york-plotted-on-a-cit-1706342527">Here Are 40,000 Photos Of Old New York Plotted on a City Map</a></li> <li>DNAInfo: <a href="http://www.dnainfo.com/new-york/20150522/midtown/see-old-photos-of-your-neighborhood-with-this-interactive-map">See Old Photos of Your Neighborhood With This Interactive Map</a></li> <li>West Side Rag: <a href="http://www.westsiderag.com/2015/05/21/throwback-thursday-new-tool-lets-you-see-photos-of-your-street-through-the-years">Throwback Thursday: New Tool Lets You See Photos of Your Street Through the Years</a></li> <li>Daily Mail: <a href="http://www.dailymail.co.uk/sciencetech/article-3093512/Step-time-OldNY-Interactive-map-allows-users-explore-New-York-40-000-old-photos.html">The Big Apple back in time: Forty thousand historic photographs on an interactive map offer slice of life from old New York</a></li> <li>Business Insider: <a href="http://www.businessinsider.com/nypl-old-photos-interactive-map-2015-5">This incredible map lets New Yorkers see vintage photos of their street corners</a></li> <li>EV Grieve: <a href="http://evgrieve.com/2015/05/immerse-yourself-in-archival-photos-of.html">Immerse yourself in archival photos of NYC</a></li> <li>The Awl: <a href="http://www.theawl.com/2015/05/old-new-york-mapped">Old New York, Mapped</a></li> <li>The Week: <a href="http://theweek.com/speedreads/556593/check-amazing-archive-historic-new-york-pictures">Check out this amazing archive of historic New York pictures</a></li> <li>Fast Company: <a href="http://www.fastcodesign.com/3046606/see-what-times-square-and-wall-street-looked-like-100-years-ago">See What Times Square And Wall Street Looked Like 100 Years Ago</a></li> <li>Free Williamsburg: <a href="http://freewilliamsburg.com/see-photos-of-old-williamsburg-thanks-to-the-ny-public-library/">See photos of old Williamsburg thanks to the NY Public Library</a></li> <li>PetaPixel: <a href="http://petapixel.com/2015/05/25/oldsf-and-oldnyc-historical-photos-plotted-on-maps/">OldSF and OldNYC: Historical Photos Plotted on Maps</a></li> <li>Brooklyn Magazine: <a href="http://www.bkmag.com/2015/05/21/here-are-thousands-of-historical-photos-of-new-york-city-all-on-one-interactive-map/">Here are Thousands of Historical Photos of New York City, All on One Interactive 
Map</a></li> <li>Mental Floss: <a href="http://mentalfloss.com/article/64253/tour-old-new-york-your-computer">Tour Old New York on Your Computer</a></li> </ul> PyCon 2015: Make web development awesome with visual diffing tools 2015-04-12T15:05:00+00:00 https://danvk.github.io//2015/04/12/pycon-2015-make-web-development-awesome-with-visual-diffing-tools <p>Here’s the <a href="https://youtu.be/jUUTqgzNR3M">video</a> of my talk from PyCon 2015, <a href="https://us.pycon.org/2015/schedule/presentation/395/">Make web development awesome with visual diffing tools</a>:</p> <iframe width="640" height="360" src="https://www.youtube.com/embed/jUUTqgzNR3M" frameborder="0" allowfullscreen=""></iframe> <p>Here are the <a href="http://bit.ly/pycon2015-visual-diffs">slides</a> for the talk:</p> <iframe src="https://docs.google.com/presentation/embed?id=1C6TcdHSBQNcLEIH6SFmmAOLJ7cy1gvIbmCj3cyd7fxE&amp;start=false&amp;loop=false&amp;" frameborder="0" width="520" height="405"></iframe> <p>The two tools referenced are:</p> <ul> <li><a href="https://github.com/bslatkin/dpxdt">dpxdt</a> for generating screenshots</li> <li><a href="https://github.com/danvk/webdiff">webdiff</a> for viewing image diffs</li> </ul> <p>I used <a href="http://www.comparea.org/">comparea</a> and <a href="http://www.dygraphs.com/">dygraphs</a> as sample apps for the demos.</p> <p>If you enjoyed my talk, you might also enjoy <a href="http://www.onebigfluke.com/">Brett’s</a> talk from a few years ago, <a href="https://www.youtube.com/watch?v=1wHr-O6gEfc">The Secret of Safe Continuous Deployment: Perceptual Diffs</a>. It goes into more depth on how screenshots can facilitate rapid deployment.</p> Training an Ocropus OCR model 2015-01-11T04:58:00+00:00 https://danvk.github.io//2015/01/11/training-an-ocropus-ocr-model <style> .bordered { border: 1px solid rgb(204, 204, 204); } </style> <p>In the <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">last post</a>, we walked through the steps in the <a href="https://github.com/tmbdev/ocropy">Ocropus</a> OCR pipeline. We extracted text from images like this:</p> <div class="center"> <img src="/images/ocropus/start.png" width="608" height="137" /> </div> <p>The results using the default model were passable but not great:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>O1inton Street, aouth from LIYingston Street. Auguat S, 1934. P. L. Sperr. NO REPODUCTIONS. </code></pre></div></div> <p>Over the <a href="http://digitalgallery.nypl.org/nypldigital/dgdivisionbrowseresult.cfm?div_id=hh">larger corpus</a> of images, the error rate was around 10%. The default model has never seen typewriter fonts, nor has it seen ALLCAPS text, both of which figure prominently in this collection. So its poor performance comes as no surprise.</p> <p>In this post I’ll walk through the process of training an Ocropus model to recognize the typewritten text in this collection. By the end of this post, the performance will be extremely good.</p> <h3 id="generating-truth-data">Generating truth data</h3> <p>Ocropus trains its model using <a href="http://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a>: it requires images of lines along with correct transcriptions. If you’re trying to recognize a known font, you can generate arbitrary amounts of labeled data (using <code class="language-plaintext highlighter-rouge">ocropus-linegen</code>). 
But in our case, we have to label some images by hand.</p> <p>This is tedious and involves a lot of typing. Amazon’s <a href="https://www.mturk.com/mturk/welcome">Mechanical Turk</a> is a popular way of farming out small tasks like this, but I prefer to do the transcription myself using <a href="http://www.danvk.org/wp/2013-04-20/generating-training-data/index.html">localturk</a>. It doesn’t take as long as you might think (I typed 800 lines in about an hour and 20 minutes). And it has the benefit of forcing you to look at a large sample of your data, something that’s likely to lead to insights.</p> <div class="center"> <img src="/images/ocropus/localturk.png" width="474" height="401" /> <br /><i>(localturk in action)</i> </div> <p>I used <a href="https://gist.github.com/danvk/abec6e0c7657e2c8e86f">this template</a> for the transcription. Ocropus expects truth data to be in <code class="language-plaintext highlighter-rouge">.gt.txt</code> files with the same name as the PNG files for the lines. For example:</p> <ul> <li><code class="language-plaintext highlighter-rouge">book/0001/010001.png</code></li> <li><code class="language-plaintext highlighter-rouge">book/0001/010001.gt.txt</code></li> </ul> <p>It’s important that you transcribe <em>lines</em>, not entire pages. I initially transcribed pages and tried to have Ocropus learn on them, but this doesn’t work at all.</p> <h3 id="training-a-model">Training a model</h3> <p>Ocropus trains a model by learning from its mistakes. It transcribes the text in a line, then adjusts the weights in the Neural Net to compensate for the errors. Then it does this again for the next line, and the next, and so on. When it gets to the last line of labeled data, it starts over again. As it loops through the training data over and over again, the model gets better and better.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ocropus-rtrain -o modelname book*/????/*.bin.png </code></pre></div></div> <p>This produces lots of output like this:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2000 70.56 (1190, 48) 715641b-crop-010002.png TRU: u'504-508 West 142nd Street, adjoining and west of Hamilton' ALN: u'504-5088 West 422nd Street, adjoining and west of Hammilton' OUT: u'3od-iS est 4nd Street, doning nd est of Sarilton' 2001 32.38 (341, 48) 726826b-crop-010003.png TRU: u'NO REPRODUCTIONS' ALN: u'NO REPRODUCTIONS' OUT: u'sO EROCoOri' ... </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">TRU</code> is the truth data. <code class="language-plaintext highlighter-rouge">OUT</code> is the output of the model. <code class="language-plaintext highlighter-rouge">ALN</code> is a variant of the model output which is aligned to the truth data. It’s used to adjust the model weights more precisely. It typically looks better than the model output, especially in early iterations. It lets you know that you’re making progress.</p> <p>Here’s a video that Thomas, the Ocropus developer, put together. 
It shows the network’s output for a single image as it learns (see the <a href="https://www.youtube.com/watch?v=czG5Jk9iC7c">YouTube page</a> for explanations of the different charts):</p> <div class="center"> <iframe width="560" height="315" src="//www.youtube.com/embed/czG5Jk9iC7c" frameborder="0" allowfullscreen=""></iframe> </div> <p>For my first model, I used 400 of the labeled lines as training data and held out the other 400 as test data. Ocropus saves models to disk every 1000 iterations, so it’s simple to evaluate the model’s performance as it learns:</p> <div class="center"> <img src="/images/ocropus/err-train-test.png" width="552" height="326" /> </div> <p>The error rate starts high (over 50%) but quickly comes down to about 2% after 10,000 iterations, eventually hitting a minimum of 0.96% at 16,000 iterations.</p> <p>The error rate on the test set is consistently about 3% higher than that on the training set. The best error rate on the test set was 4.20%.</p> <p>There’s a lot of variation in the error rate. You might expect it to slowly decrease over time, but that’s not at all the case. I’m not quite sure how to interpret this. Does the error rate spike at 17,000 iterations because the model tries to jolt itself out of a local minimum? Is it just randomness?</p> <p>In any case, it’s important to generate a chart like this. Choosing the wrong model could lead to needlessly bad performance.</p> <h3 id="training-with-more-data">Training with more data.</h3> <p>You’d expect that training on more data would yield a better model. So for my next model, I trained on all 800 labeled images (rather than just 400). I didn’t have a test set. Here’s what the error rate looked like:</p> <div class="center"> <img src="/images/ocropus/err-train800.png" width="472" height="294" /> </div> <p>This doesn’t make much sense to me. The lowest error rate on the 800 training images is 3.59%. But the model from the previous section achieved an error rate of 2.58% on the same data set (average of 0.96% and 4.20%). And it only saw half the data! How is that possible? Maybe this model just had bad luck.</p> <p>There’s the same pattern as before of occasional spikes in error rate. More disturbing, after around 40,000 iterations, I started seeing lots of <a href="https://github.com/tmbdev/ocropy/issues/5">FloatingPointErrors</a>. It’s unclear to me exactly what this means. Perhaps the model is diverging?</p> <p>Here’s another model that I trained for even longer:</p> <div class="center"> <img src="/images/ocropus/err-typewriter.png" width="522" height="275" /> </div> <p>It achieves an error rate of 0.89% at iteration 33,000, then spikes to over 15% at 37,000. It eventually gets back down to 0.85% after 53,000 iterations, then starts spiking again. By the time I stopped it, I was again seeing lots of <code class="language-plaintext highlighter-rouge">FloatingPointErrors</code>.</p> <p>The point of all this is that the error rates are quite erratic, so you need to look at them before choosing which model you use!</p> <h3 id="training-with-the-default-model">Training with the default model</h3> <p>So far we’ve built our models from scratch. But you can also build on top of an existing model.</p> <p>Even though it’s never seen typewriter text or ALLCAPS, the <a href="http://www.tmbdev.net/">default Ocropus model</a> presumably knows a lot about Latin characters and the relationship between them in English words. 
And I trust the Ocropus developers to build a good Ocropus model far more than I trust myself.</p> <p>You train on top of an existing model using the <code class="language-plaintext highlighter-rouge">--load</code> option:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ocropus-rtrain --load en-default.pyrnn.gz -o my-model *.png </code></pre></div></div> <p>Here’s what the error rate looks like:</p> <div class="center"> <img src="/images/ocropus/err-default.png" width="468" height="286" /> </div> <p>Now we’re getting somewhere: the error rate gets all the way down to 0.277%!</p> <p>Something interesting happens when you get the error rate significantly below 1%. The “mistakes” that the model makes are quite likely to be errors that <em>you</em> made while transcribing truth data! I noticed that I misspelled some words and even hallucinated new words like “the” into some of the lines.</p> <p>Even crazier, there were typos in the original images that I subconsciously corrected:</p> <div class="center"> <img src="/images/ocropus/milstein-typo.png" width="594" height="30" /> </div> <p>(Look at the second to last word.)</p> <p>A model with a 0.2% error rate is good enough to produce readable text. For example, here’s what it produces for the image from the <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">last post</a>:</p> <p><img class="bordered" src="/images/ocropus/line1.png" width="610" height="35" /><br /> → <code class="language-plaintext highlighter-rouge">Clinton Street, south from Livingston Street.</code> <br /> <img class="bordered" src="/images/ocropus/line2.png" width="163" height="25" /> <br /> → <code class="language-plaintext highlighter-rouge">P. L. Sperr.</code> <br /> <img class="bordered" src="/images/ocropus/line3.png" width="233" height="24" /> <br /> → <code class="language-plaintext highlighter-rouge">NO REPRODUCTIONS.</code> <br /> <img class="bordered" src="/images/ocropus/line4.png" width="208" height="27" /> <br /> → <code class="language-plaintext highlighter-rouge">August 5, 1934.</code></p> <p>i.e. it’s perfect. Here’s the output of the Neural Net for the last line:</p> <div class="center"> <img src="/images/ocropus/new-network.png" width="699" height="216" /> </div> <p>Compare that to what it was before:</p> <div class="center"> <img src="/images/ocropus/network-labeled.png" width="699" height="216" /> </div> <p>There’s still some ambiguity around <code class="language-plaintext highlighter-rouge">5</code>/<code class="language-plaintext highlighter-rouge">S</code>, but it makes the right call. The <code class="language-plaintext highlighter-rouge">a</code> vs <code class="language-plaintext highlighter-rouge">s</code> error is completely gone.</p> <h3 id="conclusions">Conclusions</h3> <p>At this point the model is good enough. If I were to improve it further, I’d either improve my <a href="http://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html">image cropper</a> or incorporate some kind of spell checking as a post-processing step.</p> <p>The behavior of the models as they’re trained is sometimes inscrutable. Finding a good one involves a lot of trial and error. To avoid flailing, measure your performance constantly and keep a list of ideas to explore. “Train a model starting with the pre-built one” was item #6 on my list of ideas and it took me a while to get around to trying it. 
But it was the solution!</p> <p>If you’re feeling lost or frustrated, go generate some more training data. At least you’ll be doing something useful.</p> <p>At the end of the day, I’m very happy with the OCR model I built. Ocropus has some rough edges, but it’s simple enough that you can usually figure out what’s going on and how to fix problems as they come up. And the results speak for themselves!</p> Extracting text from an image using Ocropus 2015-01-09T22:45:00+00:00 https://danvk.github.io//2015/01/09/extracting-text-from-an-image-using-ocropus <style> .bordered { border: 1px solid rgb(204, 204, 204); } </style> <p>In the <a href="http://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html">last post</a>, I described a way to crop an image down to just the part containing text. The end product was something like this:</p> <div class="center"> <img src="/images/ocropus/start.png" width="608" height="137" /> </div> <p>In this post, I’ll explain how to extract text from images like these using the <a href="https://github.com/tmbdev/ocropy">Ocropus</a> <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">OCR</a> library. Plain text has a number of advantages over images of text: you can search it, it can be stored more compactly and it can be reformatted to fit seamlessly into web UIs.</p> <p>I don’t want to get too bogged down in the details of why I went with Ocropus over its more famous cousin, <a href="https://code.google.com/p/tesseract-ocr/">Tesseract</a>, at least not in this post. The gist is that I found it to be:</p> <ol> <li>more transparent about what it was doing.</li> <li>more hackable</li> <li>more robust to <a href="http://stackoverflow.com/questions/27592430/how-can-i-tell-tesseract-that-my-font-has-a-particular-size">character segmentation issues</a></li> </ol> <p>This post is a bit long, but there are lots of pictures to help you get through it. Be strong!</p> <h3 id="ocropus">Ocropus</h3> <p>Ocropus (or Ocropy) is a collection of tools for extracting text from scanned images. The basic pipeline looks like this:</p> <div class="center"> <img src="/images/ocropus/ocropus-pipeline.png" width="304" height="341" /> </div> <p>I’ll talk about each of these steps in this post. But first, we need to install Ocropus!</p> <h3 id="installation">Installation</h3> <p>Ocropus uses the <a href="http://www.scipy.org/about.html">Scientific Python</a> stack. To run it, you’ll need <code class="language-plaintext highlighter-rouge">scipy</code>, <code class="language-plaintext highlighter-rouge">PIL</code>, <code class="language-plaintext highlighter-rouge">numpy</code>, <code class="language-plaintext highlighter-rouge">OpenCV</code> and <code class="language-plaintext highlighter-rouge">matplotlib</code>. 
Setting this up is a bit of a pain, but you’ll only ever have to do it once (at least until you get a new computer).</p> <p>On my Mac running Yosemite, I set up <a href="http://brew.sh">brew</a>, then ran:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew install python brew install opencv brew install homebrew/python/scipy </code></pre></div></div> <p>To make this last step work, I had to follow the workaround described in <a href="https://github.com/Homebrew/homebrew/issues/16016#issuecomment-42912638">this comment</a>:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /usr/local/Cellar/python/2.7.6_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages <span class="nb">rm </span>cv.py cv2.so <span class="nb">ln</span> <span class="nt">-s</span> /usr/local/Cellar/opencv/2.4.9/lib/python2.7/site-packages/cv.py cv.py <span class="nb">ln</span> <span class="nt">-s</span> /usr/local/Cellar/opencv/2.4.9/lib/python2.7/site-packages/cv2.so cv2.so </code></pre></div></div> <p>Then you can follow the instructions on the <a href="https://github.com/tmbdev/ocropy">ocropy site</a>. You’ll know you have things working when you can run <code class="language-plaintext highlighter-rouge">ocropus-nlbin --help</code>.</p> <h3 id="binarization">Binarization</h3> <p>The first step in the Ocropus pipeline is <em>binarization</em>: the conversion of the source image from grayscale to black and white.</p> <p>There are many ways to do this, some of which you can read about <a href="https://docs.google.com/presentation/d/1N1scoKZhmneH_qyLCjdVcAWKqqL65T3ahKrk2-1Tvcg/edit#slide=id.i39">in this presentation</a>. Ocropus uses a form of <a href="http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html">adaptive thresholding</a>, where the cutoff between light and dark can vary throughout the image. This is important when working with scans from books, where there can be variation in light level over the page.</p> <p>Also lumped into this step is <em>skew estimation</em>, which tries to rotate the image by small amounts so that the text is truly horizontal. This is done more or less through brute force: Ocropy tries 32 different angles between +/-2° and picks the one which maximizes the variance of the row sums. This works because, when the image is perfectly aligned, there will be huge variance between the rows with text and the blanks in between them. When the image is rotated, these gaps are blended.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ocropus-nlbin <span class="nt">-n</span> 703662b.crop.png <span class="nt">-o</span> book </code></pre></div></div> <div class="center"> <img src="/images/ocropus/binarized.png" width="608" height="137" /> </div> <p>The <code class="language-plaintext highlighter-rouge">-n</code> tells Ocropus to suppress page size checks. We’re giving it a small, cropped image, rather than an image of a full page, so this is necessary.</p> <p>This command produces two outputs:</p> <ul> <li><code class="language-plaintext highlighter-rouge">book/0001.bin.png</code>: binarized version of the first page (above)</li> <li><code class="language-plaintext highlighter-rouge">book/0001.nrm.png</code>: a “flattened” version of the image, before binarization. 
This isn’t very useful.</li> </ul> <p>(The Ocropus convention is to put all intermediate files in a <code class="language-plaintext highlighter-rouge">book</code> working directory.)</p> <h3 id="segmentation">Segmentation</h3> <p>The next step is to extract the individual lines of text from the image. Again, there are many ways to do this, some of which you can read about in this <a href="https://docs.google.com/presentation/d/1N1scoKZhmneH_qyLCjdVcAWKqqL65T3ahKrk2-1Tvcg/edit#slide=id.i151">presentation on segmentation</a>.</p> <p>Ocropus first estimates the “scale” of your text. It does this by finding connected components in the binarized image (these should mostly be individual letters) and calculating the median of their dimensions. This corresponds to something like the <a href="http://en.wikipedia.org/wiki/X-height">x-height</a> of your font.</p> <p>Next it tries to find the individual lines of text. The sequence goes something like this:</p> <ol> <li>It removes components which are too big or too small (according to <em>scale</em>). These are unlikely to be letters.</li> <li>It applies the <a href="http://www.cs.cornell.edu/courses/CS6670/2011sp/lectures/lec02_filter.pdf">y-derivative of a Gaussian kernel</a> (p. 42) to detect top and bottom edges of the remaining features. It then blurs this horizontally to blend the tops of letters on the same line together.</li> <li>The bits between top and bottom edges are the lines.</li> </ol> <p>A picture helps explain this better. Here’s the result of step 2 (the edge detector + horizontal blur):</p> <div class="center"> <img src="/images/ocropus/edge-blur.png" width="608" height="137" /> </div> <p>The white areas are the tops and the black areas are the bottoms.</p> <p>Here’s another view of the same thing:</p> <div class="center"> <img src="/images/ocropus/edge-boxes.png" width="608" height="137" /> </div> <p>Here the blue boxes are components in the binarized image (i.e. letters). The wispy green areas are tops and the red areas are bottoms. I’d never seen a Gaussian kernel used this way before: its derivative is an edge detector.</p> <p>Here are the detected lines, formed by expanding the areas between tops and bottoms:</p> <div class="center"> <img src="/images/ocropus/line-components.png" width="608" height="137" /> </div> <p>It’s interesting that the lines needn’t be simple rectangular regions. In fact, the bottom two components have overlapping y-coordinates. Ocropus applies these regions as masks before extracting rectangular lines:</p> <p><img class="bordered" src="/images/ocropus/line1.png" width="610" height="35" /><br /> <img class="bordered" src="/images/ocropus/line2.png" width="163" height="25" /><br /> <img class="bordered" src="/images/ocropus/line3.png" width="233" height="24" /><br /> <img class="bordered" src="/images/ocropus/line4.png" width="208" height="27" /></p> <p>Here’s the command I used (the <code class="language-plaintext highlighter-rouge">g</code> in <code class="language-plaintext highlighter-rouge">ocropus-gpageseg</code> stands for “gradient”):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ocropus-gpageseg <span class="nt">-n</span> <span class="nt">--maxcolseps</span> 0 book/0001.bin.png </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">--maxcolseps 0</code> tells Ocropus that there’s only one column in this image.
The <code class="language-plaintext highlighter-rouge">-n</code> suppresses size checks, as before.</p> <p>This has five outputs:</p> <ul> <li><code class="language-plaintext highlighter-rouge">book/0001.pseg.png</code> encodes the segmentation. The color at each pixel indicates which column and line that pixel in the original image belongs to.</li> <li><code class="language-plaintext highlighter-rouge">book/0001/01000{1,2,3,4}.bin.png</code> are the extracted line images (above).</li> </ul> <h3 id="character-recognition">Character Recognition</h3> <p>After all that prep work, we can finally get to the fun part: character recognition using a <a href="http://en.wikipedia.org/wiki/Artificial_neural_network">Neural Net</a>.</p> <p>The problem is to perform this mapping:</p> <div class="center"> <img class="bordered" src="/images/ocropus/line4.png" width="208" height="27" /> → <code>August 5, 1934.</code> </div> <p>This is challenging because each line will have its own quirks. Maybe binarization produced a darker or lighter image for this line. Maybe skew estimation didn’t work perfectly. Maybe the typewriter had a fresh ribbon and produced thicker letters. Maybe the paper got water on it in storage.</p> <p>Ocropus uses an <a href="http://en.wikipedia.org/wiki/Long_short_term_memory">LSTM Recurrent Neural Net</a> to learn this mapping. The default model has 48 inputs, 200 nodes in a hidden layer and 249 outputs.</p> <p>The inputs to the network are columns of pixels. The columns in the image are fed into the network, one at a time, from left to right. The outputs are scores for each possible letter. As the columns for the <code class="language-plaintext highlighter-rouge">A</code> in the image above are fed into the net, we’d hope to see a spike from the <code class="language-plaintext highlighter-rouge">A</code> output.</p> <p>Here’s what the output looks like:</p> <div class="center"> <img src="/images/ocropus/network.png" width="700" height="330" /> </div> <p>The image on the bottom is the output of the network. Columns in the text and the output matrix correspond to one another. Each row in the output corresponds to a different letter, reading alphabetically from top to bottom. Red means a strong response, blue a weaker response. The red streak under the <code class="language-plaintext highlighter-rouge">A</code> is a strong response in the <code class="language-plaintext highlighter-rouge">A</code> row.</p> <p>The responses start somewhere around the middle to right half of each letter, once the net has seen enough of it to be confident it’s a match. To extract a transcription, you look for maxima going across the image.</p> <p>In this case, the transcription is <code class="language-plaintext highlighter-rouge">Auguat S, 1934.</code>:</p> <div class="center"> <img src="/images/ocropus/network-labeled.png" width="699" height="216" /> </div> <p>It’s interesting to look at the letters that this model gets wrong. For example, the <code class="language-plaintext highlighter-rouge">s</code> in <code class="language-plaintext highlighter-rouge">August</code> produces the strongest response on the <code class="language-plaintext highlighter-rouge">a</code> row. But there’s also a (smaller) response on the correct <code class="language-plaintext highlighter-rouge">s</code> row. 
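<p>The “look for maxima going across the image” step can be made concrete with a few lines of numpy. This is just a sketch of the idea, not Ocropus’s actual decoder: it assumes an output matrix of shape (columns, classes), with class 0 meaning “no character here” and a dictionary mapping the other class indices to characters; both are assumptions on my part:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def greedy_decode(outputs, codec, blank=0):
    """outputs: (n_columns, n_classes) array of per-column character scores."""
    best = outputs.argmax(axis=1)  # strongest response in each column
    chars = []
    prev = blank
    for cls in best:
        # A letter spans several columns, so only emit a character when the
        # winning class changes, and skip the "no character" class entirely.
        if cls != prev and cls != blank:
            chars.append(codec[cls])
        prev = cls
    return "".join(chars)

# Toy example with three classes (blank, "a", "s") over six columns.
codec = {1: "a", 2: "s"}
outputs = np.array([
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.50, 0.30],  # "a" wins, "s" gets a smaller response
    [0.10, 0.55, 0.35],
    [0.80, 0.10, 0.10],  # back to blank
    [0.85, 0.10, 0.05],
    [0.90, 0.05, 0.05],
])
print(greedy_decode(outputs, codec))  # prints "a", the kind of mistake described above
</code></pre></div></div>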
There’s also considerable ambiguity around the <code class="language-plaintext highlighter-rouge">5</code>, which is transcribed as an <code class="language-plaintext highlighter-rouge">S</code>.</p> <p>My #1 <a href="https://github.com/tmbdev/ocropy/issues/16">feature request</a> for Ocropus is for it to output more metadata about the character calls. While there might not be enough information in the image to make a clear call between <code class="language-plaintext highlighter-rouge">Auguat</code> and <code class="language-plaintext highlighter-rouge">August</code>, a post-processing step with a dictionary would clearly prefer the latter.</p> <p>The transcriptions with the default model are:</p> <p><img class="bordered" src="/images/ocropus/line1.png" width="610" height="35" /><br /> → <code class="language-plaintext highlighter-rouge">O1inton Street, aouth from LIYingston Street.</code> <br /> <img class="bordered" src="/images/ocropus/line2.png" width="163" height="25" /> → <code class="language-plaintext highlighter-rouge">P. L. Sperr.</code> <br /> <img class="bordered" src="/images/ocropus/line3.png" width="233" height="24" /> → <code class="language-plaintext highlighter-rouge">NO REPODUCTIONS.</code> <br /> <img class="bordered" src="/images/ocropus/line4.png" width="208" height="27" /> → <code class="language-plaintext highlighter-rouge">Auguat S, 1934.</code></p> <p>This is passable, but not great. The Ocropus site explains why:</p> <blockquote> <p>There are some things the currently trained models for ocropus-rpred will not handle well, largely because they are nearly absent in the current training data. That includes all-caps text, some special symbols (including “?”), typewriter fonts, and subscripts/superscripts. This will be addressed in a future release, and, of course, you are welcome to contribute new, trained models.</p> </blockquote> <p>We’ll fix this in the <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">next post</a> by training our own model.</p> <p>The command to make predictions is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ocropus-rpred -m en-default.pyrnn.gz book/0001/*.png </code></pre></div></div> <p>I believe the <code class="language-plaintext highlighter-rouge">r</code> stands for “RNN” as in “Recurrent Neural Net”.</p> <p>The outputs are <code class="language-plaintext highlighter-rouge">book/0001/01000{1,2,3,4}.txt</code>.</p> <p>If you want to see charts like the one above, pass <code class="language-plaintext highlighter-rouge">--show</code> or <code class="language-plaintext highlighter-rouge">--save</code>.</p> <h3 id="extracting-the-text">Extracting the text</h3> <p>We’re on the home stretch!</p> <p>One way to get a text file out of Ocropus is to concatenate all the transcribed text files:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat book/????/??????.txt &gt; ocr.txt </code></pre></div></div> <p>The files are all in alphabetical order, so this should do the right thing.</p> <p>In practice, I found that I often disagreed with the line order that Ocropus chose. 
For example, I’d say that <code class="language-plaintext highlighter-rouge">August 5, 1934.</code> is the second line of the image we’ve been working with, not the fourth.</p> <p>Ocropus comes with an <code class="language-plaintext highlighter-rouge">ocropus-hocr</code> tool which converts its output to <a href="http://en.wikipedia.org/wiki/HOCR">hOCR format</a>, an HTML-based format designed by Thomas Breuel, who also developed Ocropus.</p> <p>We can use it to get bounding boxes for each text box:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ocropus-hocr -o book/book.html book/0001.bin.png $ cat book/book.html ... &lt;div class='ocr_page' title='file book/0001.bin.png'&gt; &lt;span class='ocr_line' title='bbox 3 104 607 133'&gt;O1inton Street, aouth from LIYingston Street.&lt;/span&gt;&lt;br /&gt; &lt;span class='ocr_line' title='bbox 3 22 160 41'&gt;P. L. Sperr.&lt;/span&gt;&lt;br /&gt; &lt;span class='ocr_line' title='bbox 1 1 228 19'&gt;NO REPODUCTIONS.&lt;/span&gt;&lt;br /&gt; &lt;span class='ocr_line' title='bbox 377 67 579 88'&gt;Auguat S, 1934.&lt;/span&gt;&lt;br /&gt; &lt;/div&gt; ... </code></pre></div></div> <p>Ocropus tends to read text more left to right than top to bottom. Since I know my images only have one column of text, I’d prefer to emphasize the top-down order. I wrote a <a href="https://github.com/danvk/oldnyc/blob/master/ocr/tess/extract_ocropy_text.py">small tool</a> to reorder the text in the way I wanted.</p> <h3 id="conclusions">Conclusions</h3> <p>Congrats on making it this far! We’ve walked through the steps of running the Ocropus pipeline.</p> <p>The overall results aren’t good (~10% of characters are incorrect), at least not yet. In the <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">next post</a>, I’ll show how to train a new LSTM model that completely destroys this problem.</p> Finding blocks of text in an image using Python, OpenCV and numpy 2015-01-07T00:00:00+00:00 https://danvk.github.io//2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy <p>As part of an <a href="http://www.danvk.org/wp/2013-02-09/finding-pictures-in-pictures/">ongoing project</a> with the New York Public Library, I’ve been attempting to <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">OCR</a> the text on the back of the <a href="http://digitalgallery.nypl.org/nypldigital/dgdivisionbrowseresult.cfm?div_id=hh">Milstein Collection</a> images. Here’s what they look like:</p> <div class="center"> <img src="/images/milstein-backing.jpg" width="332" height="512" /> </div> <p>A few things to note:</p> <ul> <li>There’s a black border around the whole image, gray backing paper and then white paper with text on it.</li> <li>Only a small portion of the image contains text.</li> <li>The text is written with a typewriter, so it’s monospace. But the typewriter font isn’t <a href="http://digitalgallery.nypl.org/nypldigital/dgkeysearchdetail.cfm?trg=1&amp;strucID=397983&amp;imageID=708193b&amp;total=683&amp;num=160&amp;word=4002&amp;s=1&amp;notword=&amp;d=&amp;c=&amp;f=13&amp;k=3&amp;lWord=&amp;lField=&amp;sScope=Name&amp;sLevel=&amp;sLabel=Brown%20Brothers%20%28New%20York%29&amp;sort=&amp;imgs=20&amp;pos=163&amp;e=w&amp;cdonum=0">always consistent</a> across the collection.
Sometimes a <a href="http://digitalgallery.nypl.org/nypldigital/dgkeysearchdetail.cfm?trg=1&amp;strucID=362388&amp;imageID=701471b&amp;total=534&amp;num=180&amp;word=Pumps&amp;s=3&amp;notword=&amp;d=&amp;c=&amp;f=2&amp;k=1&amp;lWord=&amp;lField=&amp;sScope=&amp;sLevel=&amp;sLabel=&amp;sort=&amp;imgs=20&amp;pos=184&amp;e=w&amp;cdonum=0">single image</a> has two fonts!</li> <li>The image is slightly rotated from vertical.</li> <li>The images are ~4x the resolution shown here (2048px tall)</li> <li>There are ~34,000 images: too many to affordably <a href="https://www.mturk.com/mturk/welcome">turk</a>.</li> </ul> <p>OCR programs typically have to do some sort of <a href="http://en.wikipedia.org/wiki/Document_layout_analysis">page-layout analysis</a> to find out where the text is and carve it up into individual lines and characters. When you hear “OCR”, you might think about fancy Machine Learning techniques like <a href="http://en.wikipedia.org/wiki/Artificial_neural_network">Neural Nets</a>. But it’s a dirty secret of the trade that page layout analysis, a much less glamorous problem, is at least as important in getting good results.</p> <p>The most famous OCR program is <a href="https://code.google.com/p/tesseract-ocr/">Tesseract</a>, a remarkably long-lived open source project developed over the past 20+ years at HP and Google. I quickly noticed that it performed much better on the Milstein images when I manually cropped them down to just the text regions first:</p> <div class="center"> <img src="/images/cropping.png" width="620" height="512" /> </div> <p>So I set out to write an image cropper: a program that could automatically find the green rectangle in the image above. This turned out to be surprisingly hard!</p> <p><a href="http://en.wikipedia.org/wiki/Computer_vision">Computer Vision</a> problems like this one are difficult because they’re so incredibly easy for humans. When you looked at the image above, you could immediately isolate the text region. This happened instantaneously, and you’ll never be able to break down exactly how you did it.</p> <p>The best we can do is come up with ways of breaking down the problem in terms of operations that are simple for computers. The rest of this post lays out a way I found to do this.</p> <p>First off, I applied the <a href="http://en.wikipedia.org/wiki/Canny_edge_detector">canny edge detector</a> to the image. This produces white pixels wherever there’s an edge in the original image. It yields something like this:</p> <div class="center"> <img src="/images/edges.png" width="332" height="512" /> </div> <p>This removes most of the background noise from the image and turns the text regions into bright clumps of edges. It turns the borders into long, crisp lines.</p> <p>The sources of edges in the image are the borders and the text. To zero in on the text, it’s going to be necessary to eliminate the borders.</p> <p>One really effective way to do this is with a <a href="http://scikit-image.org/docs/dev/auto_examples/applications/plot_rank_filters.html">rank filter</a>. This essentially replaces a pixel with something like the median of the pixels to its left and right. The text areas have lots of white pixels, but the borders consist of just a thin, 1 pixel line. The areas around the borders will be mostly black, so the rank filter will eliminate them. 
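<p>In code, the edge detection and rank filtering steps look something like this. It’s a sketch of the idea with made-up thresholds and window sizes, not the exact parameters of the real script (which is linked at the end of this post):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import cv2
import numpy as np
from scipy.ndimage import rank_filter

im = cv2.imread("milstein-backing.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(im, 100, 200)  # white pixels wherever there is an edge

# For each pixel, take the 4th-largest value in a 20px horizontal window and
# in a 20px vertical window. After the np.minimum below, an edge pixel only
# survives if it has several white neighbors in both directions. Clumps of
# text edges do; a 1px-thin border line does not, so the borders disappear.
horiz = rank_filter(edges, rank=-4, size=(1, 20))
vert = rank_filter(edges, rank=-4, size=(20, 1))
debordered = np.minimum(np.minimum(edges, horiz), vert)

cv2.imwrite("edges-rankfilter.png", debordered)
</code></pre></div></div>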
Here’s what the image looks like after applying a vertical and horizontal rank filter:</p> <div class="center"> <img src="/images/edges-rankfilter.png" width="332" height="512" /> </div> <p>The borders are gone but the text is still there! Success!</p> <p>While this is effective, it still leaves bits of text <em>outside</em> the borders (look at the top left and bottom right). That may be fine for some applications, but I wanted to eliminate these because they’re typically uninteresting and can confuse later operations. So instead of applying the rank filter, I found the <a href="http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_contours/py_contours_begin/py_contours_begin.html">contours</a> in the edge image. These are sets of white pixels which are connected to one another. The border contours are easy to pick out: they’re the ones whose <a href="http://en.wikipedia.org/wiki/Minimum_bounding_box">bounding box</a> covers a large fraction of the image:</p> <div class="center"> <img src="/images/edges-bordercontour.png" width="332" height="512" /> </div> <p>With polygons for the borders, it’s easy to black out everything outside them.</p> <p>What we’re left with is an image with the text and possibly some other bits due to smudges or marks on the original page.</p> <p>At this point, we’re looking for a crop <code class="language-plaintext highlighter-rouge">(x1, y1, x2, y2)</code> which:</p> <ol> <li>maximizes the number of white pixels inside it and</li> <li>is as small as possible.</li> </ol> <p>These two goals are in opposition to one another. If we took the entire image, we’d cover all the white pixels. But we’d completely fail on goal #2: the crop would be unnecessarily large. This should sound familiar: it’s a classic <a href="http://en.wikipedia.org/wiki/Precision_and_recall">precision/recall tradeoff</a>:</p> <ul> <li>The recall is the fraction of white pixels inside the cropping rectangle.</li> <li>The precision is the fraction of the image outside the cropping rectangle.</li> </ul> <p>A fairly standard way to solve precision/recall problems is to optimize the <a href="http://en.wikipedia.org/wiki/F1_score">F1 score</a>, the harmonic mean of precision and recall. This is what we’ll try to do.</p> <p>The set of all possible crops is quite large: <em>W</em><sup>2</sup><em>H</em><sup>2</sup>, where <em>W</em> and <em>H</em> are the width and height of the image. For a 1300x2000 image, that’s about 7 trillion possibilities!</p> <p>The saving grace is that most crops don’t make much sense. We can simplify the problem by finding individual chunks of text. To do this, we apply <a href="http://homepages.inf.ed.ac.uk/rbf/HIPR2/dilate.htm">binary dilation</a> to the de-bordered edge image. This “bleeds” the white pixels into one another. We do this repeatedly until there are only a few connected components. Here’s what it looks like:</p> <div class="center"> <img src="/images/edges-dilated.png" width="560" height="373" /> </div> <p>As we hoped, the text areas have all bled into just a few components. There are five connected components in this image. The white blip in the top right corresponds to the “Q” in the original image.</p> <p>By including some of these components and rejecting others, we can form good candidate crops. Now we’ve got a <a href="http://en.wikipedia.org/wiki/Subset_sum_problem">subset sum problem</a>: which subset of components produces a crop which maximizes the F1 score?</p> <p>There are 2<sup><em>N</em></sup> possible combinations of subsets to examine. 
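<p>Scoring a candidate crop is the easy part. Here’s a sketch of the F1 computation using exactly the definitions above; the function name and the (x1, y1, x2, y2) convention are mine:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def crop_f1_score(edges, crop):
    """Score a candidate crop of the (de-bordered) binary edge image.
    recall: fraction of all white pixels that fall inside the crop.
    precision: fraction of the image's area that falls outside the crop.
    """
    x1, y1, x2, y2 = crop
    total_white = np.count_nonzero(edges)
    white_inside = np.count_nonzero(edges[y1:y2, x1:x2])
    recall = white_inside / total_white
    precision = 1.0 - ((x2 - x1) * (y2 - y1)) / (edges.shape[0] * edges.shape[1])
    return 2 * precision * recall / (precision + recall)
</code></pre></div></div> <p>Under these definitions, taking the whole image as the crop has perfect recall but zero precision, so it scores zero; the F1 score only rewards crops that are both small and cover the text.</p>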
In practice, though, I found that a greedy approach worked well: order the components by the number of white pixels they contain (in the original image). Keep adding components while it increases the F1 score. When nothing improves the score, you’re done!</p> <p>Here’s what that procedure produces for this image:</p> <div class="center"> <img src="/images/edges-chosen.png" width="560" height="373" /> </div> <p>The components are ordered as described above. Component #1 contains the most white pixels in the original image. The first four components are accepted and the fifth is rejected because it hurts the F1 score:</p> <ol> <li>Accept #1, F1 Score → 0.886</li> <li>Accept #2, F1 Score → 0.931</li> <li>Accept #3, F1 Score → 0.949</li> <li>Accept #4, F1 Score → 0.959</li> <li>Reject #5 (F1 Score → 0.888)</li> </ol> <p>Applying this crop to the original image, you get this:</p> <div class="center"> <img src="/images/milstein-cropped.jpg" width="875" height="233" /> </div> <p>That’s 875x233, whereas the original was 1328x2048. That’s a <strong>92.5% decrease in the number of pixels, with no loss of text</strong>! This will help any OCR tool focus on what’s important, rather than the noise. It will also make OCR run faster, since it can work with smaller images.</p> <p>This procedure worked well for my particular application. Depending on how you count, I’d estimate that it gets a perfect crop on about 98% of the images, and its errors are all relatively minor.</p> <p>If you want to try using this procedure to crop your own images, you can find the <a href="https://github.com/danvk/oldnyc/blob/master/ocr/tess/crop_morphology.py">source code here</a>. You’ll need to install <code class="language-plaintext highlighter-rouge">OpenCV</code>, <code class="language-plaintext highlighter-rouge">numpy</code> and <code class="language-plaintext highlighter-rouge">PIL</code> to make it work.</p> <p>I tried several other approaches which didn’t work as well. Here are some highlights:</p> <ul> <li> <p>I ran the image through Tesseract to find areas which contained letters. These should be the areas that we crop to! But this is a bit of a chicken and the egg problem. For some images, Tesseract misses the text completely. Cropping fixes the problem. But we were trying to find a crop in the first place!</p> </li> <li> <p>I tried running the images through <a href="https://www.flameeyes.eu/projects/unpaper">unpaper</a> first, to remove noise and borders. But this only worked some of the time and I found unpaper’s interface to be quite opaque and hard to tweak.</p> </li> <li> <p>I ran canny, then calculated row and column sums to optimize the x- &amp; y-coordinates of the crop independently. The text regions did show up clearly in charts of the row sums: <img src="/images/milstein-rowsums.png" width="379" height="258" /><br /> The four spikes are the tops and bottoms of the two borders. The broad elevated region in the middle is the text. Making this more precise turned out to be hard. 
You lose a lot of structure when you collapse a dimension—this problem turned out to be easier to solve as a single 2D problem than as two 1D problems.</p> </li> </ul> <p>In conclusion, I found this to be a surprisingly tricky problem, but I’m happy with the solution I worked out.</p> <p>In the next post, I’ll talk about my experience running OCR tools over these cropped images.</p> Choosing an iOS Podcasting App 2014-12-29T00:00:00+00:00 https://danvk.github.io//2014/12/29/choosing-an-ios-podcasting-app <p>I recently switched back to iOS after a few years using Android. One aspect of Android that always bothered me was that I had trouble finding a great Podcasting app. I’d happily used <a href="http://www.macstories.net/reviews/instacast-3-review/">Instacast</a> on iOS, but I couldn’t find anything quite like it for Android.</p> <p>I eventually settled on <a href="https://play.google.com/store/apps/details?id=com.bambuna.podcastaddict&amp;hl=en">Podcast Addict</a>. It epitomizes a stereotype of Android apps: tons of features, gajillions of options and a UI that was clearly designed by an engineer.</p> <p>After switching back to iOS, I had a surprisingly hard time finding a Podcasting app which could do everything I wanted. For me, the hard requirements were:</p> <ol> <li>A view of podcasts ordered by most-recently updated</li> <li>Stream by default—don’t download episodes unless I ask!</li> <li>An option to disable play-through.</li> </ol> <p>In the last few years, <strong>Instacast</strong> received a <a href="http://vemedio.com/blog/posts/instacast-5-available-today">major update</a>. This introduced a new player interface which doesn’t work on the iOS lock screen. As a result, you can’t use the play/pause/volume features on your earbuds!</p> <p>It also lacks the ability to sort podcasts by most-recently-updated. I subscribe to a mix of podcasts that update <a href="http://www.npr.org/rss/podcast/podcast_detail.php?siteId=5183215">frequently</a> and <a href="http://www.dancarlin.com/hardcore-history-series/">infrequently</a>, so I find this to be a far more useful view than a complete list of all episodes. In particular, I don’t want an hourly podcast to crowd out less frequently updated shows.</p> <p>So I set out to find a new Podcasting app. I really wish the iOS App Store had a try-before-you-buy option. Order-by-date is a fairly obscure feature, and it was impossible to determine if an app supported it without buying it.</p> <p>Since I spent about $10 trying out apps and deleting them, here’s my guide to spare you that chore!</p> <h4 id="instacast">Instacast</h4> <p>As I mentioned, it’s missing the order by most recently updated feature and the ability to control playback via the lock screen.</p> <h4 id="overcast">Overcast</h4> <p>Overcast is the best looking of the bunch. I appreciate its freemium model and functional shoutouts to alternative apps—they made moving my list of feeds between apps painless.</p> <p>That being said, I had a few problems with Overcast:</p> <ul> <li>It doesn’t support streaming</li> <li>Disabling play-through is a paid feature (which is fine, but it’s worth noting!)</li> <li>It makes a distinction between all episodes and unplayed episodes for each Podcast which I don’t find to be very helpful. I’m not a completist—I tend to pick and choose episodes.</li> <li>It’s very aggressive about downloading new episodes!</li> </ul> <h4 id="podcasts-built-in-app">Podcasts (built-in app)</h4> <p>Podcasts is a barebones app for listening to Podcasts. 
I couldn’t tell if it supported streaming (I don’t think it does) and it definitely doesn’t support showing your podcasts ordered by when they were last updated. Next!</p> <h4 id="podcruncher">PodCruncher</h4> <p>This app came up when I searched for Downcast on the App store and looked promising. Unfortunately, it hasn’t been updated for the new “flat” look in iOS 7 and it doesn’t support the larger screen on the iPhone 6. It also lacks the most-recently-updated view.</p> <h4 id="downcast">Downcast</h4> <p>Finally I found <a href="http://www.downcastapp.com/">Downcast</a>, which meets all my Podcasting needs. It lets you order your podcasts by most-recently updated, though the option is tricky to find (Edit→Sort→Publication Date):</p> <p><img src="/images/sort-by-mru.png" width="375" height="327" alt="Sort by Publication Date option" /></p> <p>And there’s an option to disable play-through on the player screen, if you can recognize it:</p> <p><img src="/images/disable-playthrough.png" width="375" height="172" alt="Option to disable playthrough" /></p> <p>Next on my list to try were Pocket Casts and Castro, but I stopped when I found that I was happy with Downcast.</p> <p>After all this, I’m coming to appreciate that everyone has different ways of using their podcasting app, which makes designing them hard. I don’t care at all about Smart Playlists (which many apps emphasize) and I care about slightly obscure features like ordering my podcasts by when they were updated. This is a case of <a href="http://prog21.dadgum.com/160.html">dangling by a trivial feature</a>.</p> JavaScript String slice, substr, substring: which to use? 2014-11-17T00:00:00+00:00 https://danvk.github.io//2014/11/17/javascript-string-splice-substr-substring-which-to-use <p>I recently read Douglas Crockford’s <a href="http://www.amazon.com/JavaScript-Good-Parts-Douglas-Crockford/dp/0596517742"><em>JavaScript: The Good Parts</em></a>. It’s a classic (published in 2008) which is credited with reviving respect for JavaScript as a programming language. Given its title, it’s also famously short.</p> <p>One very specific thing it cleared up for me was what to do with all of JavaScript’s various substring methods:</p> <ul> <li><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substr">String.prototype.substr</a>(start[, length])</li> <li><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substring">String.prototype.substring</a>(start[, stop])</li> </ul> <p>One method takes an offset and a length. The other takes two offsets. The names don’t reflect this distinction in any way, so they’re impossible to remember. I resorted to writing an <a href="http://support.alfredapp.com/tutorials:clipboard-snippets">Alfred Snippet</a> with the syntaxes, to quickly look it up (since I always had to).</p> <p>They also have some other differences in behavior: <code class="language-plaintext highlighter-rouge">substring</code> doesn’t allow its offset to be negative, but <code class="language-plaintext highlighter-rouge">substr</code> does. This is arbitrary. It’s impossible to remember because there’s no rhyme or reason to it.</p> <p>Crockford cleared this all up. 
The solution is to never use <em>either</em> method!</p> <p>Instead, you should use the <code class="language-plaintext highlighter-rouge">slice</code> method:</p> <ul> <li><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/slice">String.prototype.slice</a>(start[, stop])</li> </ul> <p>This takes two offsets. Either can be negative. And it’s exactly the same as the corresponding <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/slice">Array <code class="language-plaintext highlighter-rouge">slice</code></a> method.</p> <p>So: stop using <code class="language-plaintext highlighter-rouge">substr</code> and <code class="language-plaintext highlighter-rouge">substring</code>. Use <code class="language-plaintext highlighter-rouge">slice</code>!</p> GitHub integration and Image Diff improvements headline webdiff 0.8 2014-11-07T00:00:00+00:00 https://danvk.github.io//2014/11/07/github-integration-and-image-diff-improvements-headline-webdiff-0-8 <p>I’ve released <a href="https://pypi.python.org/pypi/webdiff/0.8.0">webdiff 0.8.0</a>, which you can install via:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install --upgrade webdiff </code></pre></div></div> <p>The most interesting new features are GitHub pull request integration and expanded image diffing modes.</p> <p>You can view a GitHub Pull Request in webdiff by running something like:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>webdiff https://github.com/hammerlab/cycledash/pull/175 </code></pre></div></div> <p>Any GitHub Pull Request URL will do. This will pull down the files from GitHub to local disk and then diff them in the standard webdiff UI. My main use case for this is looking at screenshot diffs and thinking “I want to see bigger images in this PR diff”. Speaking of which…</p> <p><img src="/images/webdiff-swipe.gif" width="700" height="451" /></p> <p>This version includes a few improvements to the image diff mode:</p> <ol> <li>A “shrink to fit” option, which is enabled by default. This shrinks large images to fit in your browser window.</li> <li>Consistent use of red/green borders for before/after images.</li> <li>“Onion Skin” diff mode, which fades one image into the other.</li> <li>“Swipe” diff mode, which lets you drag a dividing line between the images.</li> </ol> <p>These are based on <a href="https://github.com/blog/817-behold-image-view-modes">GitHub’s image view modes</a>. I still find “blink” to be the most helpful for spotting small changes, but now you’ve got choices!</p> <p>There were a few smaller changes as well. Full release notes are <a href="https://pypi.python.org/pypi/webdiff/0.8.0">on PyPI</a>.</p> Life after Google, Six Months In 2014-10-31T00:00:00+00:00 https://danvk.github.io//2014/10/31/life-after-google-six-months-in <p>It’s been almost exactly six months since I ended an eight-year run at Google. One of the biggest reasons to do this was to come back up to speed with the open source ecosystem and to experience a different working environment (sample size 1→2!).</p> <p>When I joined Google, it seemed like our tech stack was years ahead of anything else.
This included tools like <a href="http://google-engtools.blogspot.com/2011/08/build-in-cloud-how-build-system-works.html">BUILD files</a> for managing dependencies and compilation, <a href="http://www.quora.com/What-is-Borg-at-Google">borg</a> for running jobs in a data center, and <a href="https://developers.google.com/closure/compiler/">closure compiler</a> for dealing with JavaScript’s idiosyncrasies.</p> <p>The open source world has come a long way since 2006. It’s solved many of the same problems that Google did, but in different ways. There’s the Hadoop stack for distributed work. And there are polyfills and CommonJS for JavaScript. These aren’t necessarily better or worse than Google’s solutions, but they are different. And they’re the tools that new developers are learning.</p> <p>Long term, this is a big problem for Google. “Ahead of the field” can rapidly turn into “Not Invented Here”. “Better” can become merely “different”.</p> <p>A simple example of this is <a href="http://docs.closure-library.googlecode.com/git/namespace_goog.html"><code class="language-plaintext highlighter-rouge">goog.bind</code></a>. It works around some confusing behavior involving <code class="language-plaintext highlighter-rouge">this</code> in JavaScript. It was a great tool in 2004. But modern JavaScript has its own solution (<a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Function/bind">Function.prototype.bind</a>). It’s been around for years and is supported in 90+% of browsers. Google will still be writing <code class="language-plaintext highlighter-rouge">goog.bind</code> long after IE8 ceases to be relevant.</p> <p>Facebook has a better solution to this: they always use the latest version of the JavaScript standard, and then <a href="https://github.com/facebook/jstransform">transpile</a> to something that older browsers will understand. This way they stay on the mainstream of technological development.</p> <p>Here are a few other things I’ve found notable:</p> <ul> <li> <p><strong>Almost any Google technology has an open-source equivalent.</strong><br /> Examples include Travis-CI (similar to TAP), the Hadoop stack (MapReduce, CNS, Dremel, …), CommonJS (Closure modules).</p> </li> <li> <p><strong>Package managers</strong><br /> For most Google engineers, something like 95+% of the code you work with is first-party (i.e. written at Google). For a smaller group, this ratio is going to be dramatically different. As a result, third-party package managers are much more important. And the good news is that they’ve gotten much better over the last eight years! Leaving Google has finally forced me to learn how to use tools like NPM and pip. I’ve found this incredibly empowering. I used to avoid external dependencies for personal projects. I suspect many Google engineers do the same.</p> </li> <li> <p><strong>Markdown is more pervasive than I’d realized</strong><br /> I was vaguely aware of <a href="http://daringfireball.net/projects/markdown/">Markdown</a> while I was at Google, but didn’t really see the point. Now that I’m out, I’ve been surprised to see how pervasive it is. I suspect that much of this comes from GitHub, where you use Markdown for READMEs, issues and code review comments. You also use it with GitHub pages, so I’m typing in it right now! It’s worthwhile to learn Markdown a bit better.
It’s like HTML with infinitely less boilerplate.</p> </li> <li> <p><strong>Knowledge of git</strong><br /> It’s completely reasonable for someone with ten years of experience at Google to never have run git. I’m happy I was involved enough with open source projects that I have years of basic experience with it. But others have clearly gone much deeper. My git skills have gotten much better since leaving Google. I know how to use <code class="language-plaintext highlighter-rouge">git rebase</code> now!</p> </li> <li> <p><strong>sysadmin knowledge</strong><br /> borg meant that I never had to learn any systems administration. It’s a particular blind spot for me. For example, <a href="http://upstart.ubuntu.com/">upstart</a> was released over eight years ago, just after I joined Google. It’s incredibly widely used. I’d never heard of it.</p> </li> <li> <p><strong>Google uses a lot of email</strong><br /> We use almost no email in my new group. HipChat takes the place of a group mailing list and your coworkers @mention you when they want you to chime in. The chat format keeps things short. I still get emails for code reviews, but I could imagine this going through GitHub notifications instead.</p> </li> <li> <p><strong>Buying your own lunch isn’t that big a deal.</strong><br /> As it turns out, there are people in NYC who will make you food in exchange for money, or even deliver it! The variety is much greater than what you get at a Google Cafe, and it’s nice to have a good reason to go outside during the day. I also like the more flexible eating schedule. Almost everyone in my office eats lunch very late, perhaps at 3 or 4.</p> </li> </ul> Fully Migrated to GitHub Pages 2014-10-23T00:00:00+00:00 https://danvk.github.io//2014/10/23/fully-migrated-to-github-pages <p><img src="/images/jekyll.png" alt="Jekyll Logo" width="249" height="115" style="float:right" /></p> <p>My danvk.org site is now fully hosted on GitHub pages. I changed the DNS entry last night.</p> <p>My hope was to do this without breaking anything. That didn’t prove to be possible, but I came close. And overall, the process wasn’t too bad! It was helpful to make a <a href="https://github.com/danvk/danvk.github.io/issues/1">census of material</a> that was on my old site using access logs. This turned up a few redirects I wouldn’t have thought of, and also reminded me of the many features my old site accumulated over the years. Some of these are now accessible under the “Features” menu on the new site.</p> <p>There were a few pain points:</p> <ul> <li> <p><strong>Redirects</strong></p> <p>It would have been enormously helpful if GitHub pages supported something like <a href="https://en.wikipedia.org/wiki/Rewrite_engine">mod_rewrite</a>. As it was I had to kill a few old links because I was <a href="http://stackoverflow.com/questions/26374707/can-jekyll-serve-content-based-on-a-url-parameter">completely unable to generate 301/302 redirects</a>. I wound up <a href="https://github.com/danvk/danvk.github.io/commit/f99fa0d6ef808a2ba468587d3f7eab800d448f1e">hard-coding JavaScript redirects</a> instead. It’s not ideal, and I’ll probably lose some pagerank, but it’s the best I could do.</p> </li> <li> <p><strong>Migrating the domain</strong></p> <p>I really hate DNS. It’s impossible to know whether your site isn’t working because you’ve misconfigured your DNS, or because the new records haven’t propagated out yet. I’m surprised more sites don’t go down because of DNS problems. 
The <a href="https://www.whatsmydns.net/">Global DNS Propagation Tracker</a> was indispensible as a sanity check.</p> </li> </ul> <p>One thing that worked really well was migrating my old WordPress blog. I’d expected this to be a complete pain, but it was nothing of the sort. I used <a href="http://www.httrack.com/">httrack</a> to mirror the rendered version of my blog site to a folder on local disk. Then I checked that folder into GitHub. Done.</p> <p>Finally, I learned a lot about <a href="http://jekyllrb.com/">Jekyll</a> from this process. It really is a static content generator. It’s not a serving system. This is why it can’t do things like 301/302 redirects. GitHub pages would be a much more powerful serving system if it included support for things like mod_rewrite. But then it would be less Jekyll-y. The beauty of the system is that it’s pure static content, and hence insanely fast and simple.</p> Filtering JSON with pyjsonselect and jss 2014-10-13T00:00:00+00:00 https://danvk.github.io//2014/10/13/filtering-json-with-pyjsonselect-and-jss <p>The data store for <a href="http://www.comparea.org/">Comparea</a> is a <a href="https://github.com/danvk/comparea/blob/master/comparea/static/data/comparea.geo.json">giant</a> 23MB <a href="http://geojson.org/">GeoJSON</a> file. Most of the space in that file is taken up by the giant lists of coordinates which define the boundaries of each shape. But there’s also some interesting metadata hidden amongst all those latitudes and longitudes:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w"> </span><span class="nl">"features"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"geometry"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polygon"</span><span class="p">,</span><span class="w"> </span><span class="nl">"coordinates"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="mf">-69.89912109375001</span><span class="p">,</span><span class="w"> </span><span class="mf">12.452001953124963</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="mf">-69.89570312500004</span><span class="p">,</span><span class="w"> </span><span class="mf">12.422998046875009</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Feature"</span><span class="p">,</span><span class="w"> </span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba is an island."</span><span class="p">,</span><span class="w"> </span><span class="nl">"wikipedia_url"</span><span class="p">:</span><span 
class="w"> </span><span class="s2">"http://en.wikipedia.org/wiki/Aruba"</span><span class="p">,</span><span class="w"> </span><span class="nl">"area_km2"</span><span class="p">:</span><span class="w"> </span><span class="mf">154.67007756254557</span><span class="p">,</span><span class="w"> </span><span class="nl">"population"</span><span class="p">:</span><span class="w"> </span><span class="mi">103065</span><span class="p">,</span><span class="w"> </span><span class="nl">"population_year"</span><span class="p">:</span><span class="w"> </span><span class="s2">"???"</span><span class="p">,</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ABW"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <p>I’d hoped that I could use <a href="http://stedolan.github.io/jq/">jq</a> to filter out all the coordinates and just look at the metadata. But I got bogged down reading through its extensive <a href="http://stedolan.github.io/jq/manual/">manual</a>. At the end of the day, I didn’t want to learn an ad-hoc language just for filtering JSON files.</p> <p>The maddening thing was that there’s already a great language for selecting elements in trees: <strong>CSS Selectors</strong>! I did some searching and learned that there’s already a standard for applying CSS-like selectors to JSON called <a href="http://jsonselect.org/#overview">JSONSelect</a>. It dates from 2011. It has a spec and <a href="https://github.com/lloyd/JSONSelectTests">conformance tests</a>, and it’s been implemented in a <a href="http://jsonselect.org/#code">number of languages</a>.</p> <p>So I picked my language of choice (Python) and began implementing a new command line tool for filtering JSON files.</p> <p>The first issue I ran into: the standard <a href="https://github.com/mwhooker/jsonselect">Python implementation</a> didn’t conform to the standard! It only implemented 2/3 levels of CSS selectors from the spec, and many of the interesting selectors are in level 3.</p> <p>The reference <a href="https://github.com/lloyd/JSONSelect/blob/master/src/jsonselect.js">JavaScript implementation</a> was only 572 lines of code and, with all those tests, I figured it wouldn’t be too hard to port it directly to Python. This was a fun project—there’s something very zen about coding against a spec, getting test after test to pass. I learned about a few nuances of JavaScript and Python by doing this:</p> <ul> <li>Their regular expressions differ in how they specify <a href="http://stackoverflow.com/questions/3835917/how-do-i-specify-a-range-of-unicode-characters">unicode ranges</a></li> <li>the reference implementation made use of the <code class="language-plaintext highlighter-rouge">null</code> vs. 
<code class="language-plaintext highlighter-rouge">undefined</code> distinction</li> <li>JavaScript’s <code class="language-plaintext highlighter-rouge">typeof</code> function is quite odd</li> <li>JavaScript’s <code class="language-plaintext highlighter-rouge">Array.prototype.concat</code> method is quite subtle in its behavior</li> </ul> <p>I wound up re-implementing all of these quirks these in Python.</p> <p>At the end of the day, I published <a href="https://github.com/danvk/pyjsonselect/">pyjsonselect</a>, the first fully-conformant JSONSelect implementation in Python. A small win for the open source world!</p> <h2 id="jss">jss</h2> <p>So, how does the tool work? You can read about installation and basic usage <a href="https://github.com/danvk/jss">on github</a>, but here are a few motivating examples.</p> <p>jss is a JSON→JSON converter. It supports three modes:</p> <ol> <li>select: find all the values that match a selector (1→N)</li> <li>filter out (<code class="language-plaintext highlighter-rouge">-v</code>): remove all values which match a selector (1→1)</li> <li>filter in (<code class="language-plaintext highlighter-rouge">-k</code>): keep only values which match a selector (1→1)</li> </ol> <p>Here’s how the filter out mode works:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ jss -v '.coordinates' comparea.geo.json </code></pre></div></div> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w"> </span><span class="nl">"features"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"geometry"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polygon"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Feature"</span><span class="p">,</span><span class="w"> </span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba is an island."</span><span class="p">,</span><span class="w"> </span><span class="nl">"wikipedia_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://en.wikipedia.org/wiki/Aruba"</span><span class="p">,</span><span class="w"> </span><span class="nl">"area_km2"</span><span class="p">:</span><span class="w"> </span><span class="mf">154.67007756254557</span><span class="p">,</span><span class="w"> </span><span class="nl">"population"</span><span class="p">:</span><span class="w"> </span><span class="mi">103065</span><span class="p">,</span><span class="w"> </span><span class="nl">"population_year"</span><span class="p">:</span><span class="w"> </span><span class="s2">"???"</span><span class="p">,</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ABW"</span><span class="w"> </span><span 
class="p">},</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <p>That knocked out all the <code class="language-plaintext highlighter-rouge">coordinates</code> keys from the GeoJSON file!</p> <p>I eventually did figure out how to do this in jq. Here’s what it looks like:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ jq 'del(..|.coordinates?| select(. != []))' comparea.geo.json (same output) </code></pre></div></div> <p>To come up with that incantation, I had to dig through jq’s <a href="https://github.com/stedolan/jq/issues/319">github issues</a>. It’s certainly not something I could re-type from memory! The <code class="language-plaintext highlighter-rouge">jss</code> version is clear as could be.</p> <p>It’s also significantly faster. For the 23MB <code class="language-plaintext highlighter-rouge">comparea.geo.json</code> file, the <code class="language-plaintext highlighter-rouge">jss</code> command runs in 1.7s on my laptop vs. 12.9s for <code class="language-plaintext highlighter-rouge">jq</code>. The trick to this speed is <a href="http://stackoverflow.com/questions/26221309/preorder-traversal-using-python-generators-with-a-mechanism-to-ignore-subtrees">appropriate pruning</a> of the selector search.</p> <p>Here’s how the “select” mode works:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ jss '.name' comparea.geo.json </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Aruba" "Afghanistan" "Angola" "Anguilla" "Albania" "Andorra" "United Arab Emirates" "Argentina" "Armenia" "American Samoa" ... </code></pre></div></div> <p>Unlike “filter out”, which maps one JSON object to another JSON object, “select” extracts multiple values from a single object. Each line of output is its own JSON object. This is why it’s 1→N, vs 1→1 for the other modes. It’s useful if you want to do more processing using grep, sed and other familiar line-oriented tools.</p> <h2 id="fancy-selectors">Fancy selectors</h2> <p>You can specify as operations as you like. 
Here’s a more complex invocation:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ jss -v .coordinates -k '.features&gt;*:has(:contains("ZAF"))' comparea.geo.json </code></pre></div></div> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w"> </span><span class="nl">"features"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"geometry"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polygon"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Feature"</span><span class="p">,</span><span class="w"> </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ZAX"</span><span class="p">,</span><span class="w"> </span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"South Africa, officially the Republic of South Africa, is a country located at the southern tip of Africa. It has 2,798 kilometres of coastline that stretches along the South Atlantic and Indian oceans."</span><span class="p">,</span><span class="w"> </span><span class="nl">"population_source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"World Factbook"</span><span class="p">,</span><span class="w"> </span><span class="nl">"sov_a3"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ZAF"</span><span class="p">,</span><span class="w"> </span><span class="nl">"freebase_mid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/m/0hzlz"</span><span class="p">,</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"South Africa"</span><span class="p">,</span><span class="w"> </span><span class="nl">"population_source_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://www.cia.gov/library/publications/the-world-factbook/fields/2119.html"</span><span class="p">,</span><span class="w"> </span><span class="nl">"area_km2_source_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://www.cia.gov/library/publications/the-world-factbook/fields/2147.html"</span><span class="p">,</span><span class="w"> </span><span class="nl">"population_date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"July 2014"</span><span class="p">,</span><span class="w"> </span><span class="nl">"wikipedia_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://en.wikipedia.org/wiki/South_Africa"</span><span class="p">,</span><span class="w"> </span><span class="nl">"area_km2"</span><span class="p">:</span><span class="w"> </span><span class="mi">1214470</span><span class="p">,</span><span class="w"> </span><span class="nl">"area_km2_source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"World Factbook"</span><span class="p">,</span><span class="w"> 
</span><span class="nl">"population"</span><span class="p">:</span><span class="w"> </span><span class="mi">48375645</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <p>After filtering out the <code class="language-plaintext highlighter-rouge">coordinates</code> fields, it keeps only elements directly under the <code class="language-plaintext highlighter-rouge">features</code> key (i.e. a top-level feature) which contains “ZAF” somewhere (the “sov_a3” field, in this case).</p> <p>Isn’t this just as complicated as the jq syntax? Sure! But at least you learned something useful. If you get better at writing CSS selectors as a result of filtering JSON files, then that’s great! You’ve become a better web developer in the process.</p> <p>You can install <code class="language-plaintext highlighter-rouge">jss</code> with <code class="language-plaintext highlighter-rouge">pip</code>. Read more <a href="https://github.com/danvk/jss">on github</a>!</p> Facebook (non-)Insights 2014-10-13T00:00:00+00:00 https://danvk.github.io//2014/10/13/facebook-non-insights <p>There was a spike of traffic to <a href="https://www.comparea.org">Comparea</a> over the weekend:</p> <p><img src="/images/comparea-spike.png" alt="Spike of traffic to comparea.org" title="Spike of traffic to comparea.org" /></p> <p>Awesome! All of it came from Facebook and went to a comparison of <a href="https://www.comparea.org/FXX+AUS">France vs Australia</a>:</p> <p><img src="/images/comparea-all-facebook.png" alt="All traffic coming from Facebook" title="All traffic coming from Facebook" /></p> <p>I can easily get insight into who <a href="https://twitter.com/TeachingIdeas/status/520656010391093248">tweeted it</a>. But Facebook is a big black box. In theory, I can use <a href="https://www.facebook.com/FacebookInsights">Facebook Insights</a> to track this. It claims that 50 actions have led to ~2400 visits to my site, but declines to say anything more:</p> <p><img src="/images/comparea-facebook-non-insight.png" alt="All information is suppressed" title="All information is suppressed" /></p> <p>I understand that Facebook is more private than Twitter, but this is frustrating. I’d hope to at least see which country the shares are happening in. Are these French people coming to Comparea? Australians? Facebook won’t even reveal that all the recent visits are to a single URL! I’d think that thousands of clicks would be enough to provide anonymity there.</p> <p>I’m not sure exactly what sorts of insights Facebook can share without revealing identities, but I’d very much hoped for more than this.</p> <p>Has anyone had positive experiences with Facebook Insights?</p> Trying out GitHub Pages 2014-10-01T00:00:00+00:00 https://danvk.github.io//2014/10/01/post <p>I’m going to try hosting my site and blog on GitHub pages. My hope is that blogging using GitHub and Markdown will lower the barrier to writing, and that GitHub pages will eliminate any worries about performance and security while hosting my own site.</p> <p>This is all very much a work in progress, so feedback is welcome!</p>