This article presents a set of Haskell bindings for JudySL, the functions in the Judy library that implement associative arrays with variable-length byte string keys and integer values. It concludes with a performance comparison between Judy arrays and hashing when applied to a simple dictionary lookup problem.
Judy arrays are really just compressed tries that integrates a collection of techniques to achieve a high degree of compression. Compressed tries are far from new; I used one when it was new thirty years ago in a spelling checker. Compression techniques have certainly evolved a lot in Judy arrays. It shouldn’t be surprising however that, as our computers can now hold most or all the data structures of programs in main memory, techniques like compressed tries are being rediscovered all over again.
In a previous article, I described the use of the hash table functions in the Db library as an associative array in a program. It worked OK, but I continued to look for a data structure that might be more time and space efficient. The Judy library seemed to be a good candidate because it is often mentioned in answer to complaints about poor performance of the
Data.HashTable module on Haskell mailing lists.
So I looked at two Haskell packages: HSJudy and judy. The first one has a very general design and parts of its interface beyond the basic functionalities are difficult to understand due to a lack of documentation. Looking at its source code, the rationale for some of it design choices seem unclear and unnatural (e.g., why not
ByteString instead of
String? And why does one need to
freeze before lazy IO? Why a mini GC?). I tinkered with its code a little but decided it wasn’t worth the effort. The second one covers only
JudyL, which maps integer keys to integer values.
This FFI is quite simple to use. The function
new returns a new, empty JudySL array, which is equivalent to initializing a pointer with a
NULL value in C code (the representation for an empty Judy array). The “foreign pointer” returned by
new has the added advantage of calling
JudySLFreeArray automatically (to free the Judy array) when Haskell determines that it is no longer in use. The functions
del are equivalent to their C counterparts. So are the functions
prev. For convenience, versions of these functions with the suffix
s are supplied (
gets, etc.); they operate on
String instead of
ByteString type keys.
The convenience function
update applies a function to change the value corresponding to a key if the latter exists in the JudySL array. Otherwise it does nothing.
Update0 also changes the value for a key, except when it doesn’t exist, its value is first initialized to zero before the update function is applied. Its only use is probably to implement some form of “counters”, such as frequency counters for n-grams. Versions of these two functions with the
s suffix are also provided.
There are also convenience functions for converting entire Judy arrays to and from lists of key-value pairs. The
toList function returns its results lazily, so a Judy array can be written to disk (say) very efficiently even when only a small part of main memory remains free for heap allocation.
So how well do Judy arrays work? Here’s a version of the program I used to test my Db 1.85 FFI and
Database.Berkeley.Db in my last article, now modified to use this JudySL FFI.
> module Main where > > import System.IO > import Data.Maybe (isJust) > import Control.Monad (foldM, mapM_, liftM2, liftM) > > import qualified JudySL as J
It reads all the words in the standard Unix word list file
/usr/share/dict/words, builds a Judy array with them, and writes it to the file
> buildWordDb = do > j <- J.new > > withFile "/usr/share/dict/words" ReadMode $ \ h -> do > c <- hGetContents h > mapM_ (\ w -> J.puts j w 1) (lines c) > > withFile "words.db" WriteMode $ \g -> > J.toList j >>= mapM_ (\ (k, _) -> hPutStrLn g k)
Then it reads the word list file again and verifies that each word in it is indeed in the Judy array.
> verifyWordDb = > withFile "words.db" ReadMode $ \ h -> do > c <- hGetContents h > j <- J.fromList [(w, 1) | w <- lines c] > withFile "/usr/share/dict/words" ReadMode $ \ h -> do > c <- hGetContents h > foldM (\r w -> liftM2 (&&) (liftM isJust (J.gets j w)) (return r)) > True > (lines c)
Here’s the main function which should print
True when it is run.
> main = do > buildWordDb > verifyWordDb >>= print
The running time for the tuned, Db 1.85 version of this program was:
$ /usr/bin/time -f "%E real, %U user, %S sys" ./DbTest True 0:01.20 real, 1.11 user, 0.09 sys $
The running time for the above program that uses Judy array is:
$ /usr/bin/time -f "%E real, %U user, %S sys." ./JudyTest True 0:00.60 real, 0.57 user, 0.02 sys. $
So a Judy array uses half the time required by the Db hash table version. And it doesn’t required any tuning! The
words.db file written by
JudyTest is simply a text file and contains the same text as the word list file (in my case 931708 bytes). The
words.db file used by
DbTest is a hash table file that is 2.5 Mb in size. This makes Judy arrays’ performance quite impressive really.
$ /usr/bin/time -f "%E real, %U user, %S sys." python3 BDBTest.py 0:02.63 real, 1.40 user, 1.16 sys. $
It’s only about 4x slower than the Haskell/Judy array solution, 2x slower than the tuned Haskell/Db solution, and slightly worse than the untuned Haskell/Db and Haskell/Berkeley DB solutions. Not so bad actually for an interpreted and dynamically typed language. But note that the bottleneck of the present task lies in the library routines being called. The results therefore reflect the relative performance of Berkeley DB 4.x, Db 1.85, and Judy. The size of the hash table written by this Python version is also 2.5 Mb in size.