01.31.07
Reading a dictionary into a hash in Ruby
My experiment with code golf the other day forced me to think about ways to load a dictionary into a hash using only one line of code. I couldn’t think of a way then, but today I came up with this:
h = Hash[*(IO.readlines(Dict).map{|w|[w.chomp,1]}.flatten)]
Slow as hell, but it works! With little searching, I found another method, which is much better:
h = IO.readlines(Dict).inject({}) {|h,w| h[w.chop!]=1; h }
The “;” in there pushes the meaning of “one line”, but change “1; h” to “1 and h” and it’s less ambiguous. That got me thinking about all the different ways to load a dictionary into a hash. Which one is the fastest? Here’s all the ones I could think of, sorted from fastest to slowest. They were tested with Dict="/usr/share/dict/words"
:
Time | Command |
---|---|
0.443s | h = {}; open(Dict).read.split.each { |w| h[w]=1 } |
0.448s | h = {}; IO.read(Dict).split.each { |w| h[w]=1 } |
0.555s | h = {}; IO.readlines(Dict).map { |w| h[w.chop!] = 1 } |
0.568s | h = {}; IO.readlines(Dict).map { |w| h[w.chomp!] = 1 } |
0.882s | h = {}; IO.readlines(Dict).map { |w| h[w.chop] = 1 } |
0.897s | h = {}; IO.readlines(Dict).map { |w| h[w.chomp] = 1 } |
0.958s | h = {}; IO.foreach(Dict) { |w| h[w.chop!] = 1 } |
1.064s | h = open(Dict).read.split.inject({}) {|h,w| h[w.chop!]=1; h } |
1.065s | h = IO.read(Dict).split.inject({}) {|h,w| h[w.chop!]=1; h } |
1.306s | h = {}; IO.foreach(Dict) { |w| h[w.chomp] = 1 } |
1.475s | h = {}; open(Dict).read.scan(/\w+/) { |w| h[w]=1 } |
1.479s | h = {}; IO.read(Dict).scan(/\w+/) { |w| h[w]=1 } |
1.172s | h = IO.readlines(Dict).inject({}) {|h,w| h[w.chomp]=1; h } |
26.022s | h = Hash[*(IO.readlines(Dict).map{|w|[w.chomp,1]}.flatten)] |
There are some clear trends here. First and foremost, it pays to eschew line-oriented processing. The read
method slurps in the whole string and then split segments it into words. I suspect that split
with no parameter (=split on whitespace) is so common that it’s special-cased in the Ruby interpreter. A little digging into the Ruby C source confirms that guess. The relevant stuff is the awk_split
stuff starting at line 3514. Just for the record, the Ruby C source is some of the cleanest I’ve ever seen.
Another clear trend: using the self-modifying chomp!
and chop!
is much faster than chomp
and chop
. This makes a lot of sense, since Ruby doesn’t need to create bunches of new strings.
So in conclusion, the fastest technique is:
h = {}; open(Dict).read.split.each { |w| h[w]=1 }
and the fastest one-liner is:
h = open(Dict).read.split.inject({}) {|h,w| h[w.chop!]=1; h }
I suspect that a solution using scan
could win if Ruby had a special case for whitespace in that routine like it does in split
.