Better compression, and MORE OF IT!

2005/09/25

Categories: GeekStuff

Well, this is a follow-up to yesterday’s “Compressing the bible”.

I added a couple of features: Searching (this is the only Bible program I’ve seen that will match a phrase that crosses a verse boundary, although it can only match two verses at once), and the World English Bible translation, a modern public domain translation. (It’s a work in progress, so it isn’t as complete as the KJV.)
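For the curious, here is a rough sketch of how a search can catch a phrase that straddles a verse boundary by looking at two verses at a time. It is not the program's actual code: the `search` function, the list-of-strings verse storage, and the single space used to join neighbours are all assumptions made for illustration.

```python
def search(verses, phrase):
    """Find a phrase in a list of verses, looking at two verses at a time.

    Returns a list of index tuples: (i,) when the phrase occurs inside
    verse i, and (i, i + 1) when it crosses the boundary between verse i
    and the next one. Matching is case-insensitive. (Hypothetical helper.)
    """
    needle = phrase.lower()
    lowered = [v.lower() for v in verses]
    hits = []
    if not needle:
        return hits
    for i, verse in enumerate(lowered):
        if needle in verse:
            hits.append((i,))
        if i + 1 < len(lowered):
            # Join with one space, assuming that is how the verses read on.
            joined = verse + " " + lowered[i + 1]
            start = joined.find(needle)
            while start != -1:
                end = start + len(needle)
                # Crossing match: starts in verse i, ends inside verse i + 1.
                if start < len(verse) and end > len(verse) + 1:
                    hits.append((i, i + 1))
                    break
                start = joined.find(needle, start + 1)
    return hits


verses = [
    "In the beginning God created the heaven and the earth.",
    "And the earth was without form, and void;",
]
print(search(verses, "the earth. And the earth"))
print(search(verses, "without form"))
```

Because only adjacent pairs are joined, a phrase spanning three or more verses is never found, which matches the two-verse limit mentioned above.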

The WEB increases the size of the source text by a few MB, but the compression is better. Thanks to an idea I got from a comment on the last entry, the hint list is no longer needed. (Maybe it would still help, but I’m too lazy to check.) Instead, I have a smarter algorithm for picking words that doesn’t need to look at spaces: I just use the high bit of each character to indicate “and then print a space after this”, whether it’s an abbreviation or a plain character. This improves compression quite noticeably.
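The high-bit trick can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the real encoder: the tiny `WORDS` table, the choice to store abbreviation codes in the unused control-character range (below 32), and the `encode`/`decode` names are all made up here, and the real program picks its words with a smarter algorithm.

```python
SPACE_FLAG = 0x80  # high bit: "and then print a space after this"

# Hypothetical abbreviation table. Codes 0..len(WORDS)-1 stand for whole
# words; anything else in the low 7 bits is a plain ASCII character.
WORDS = ["the", "and", "of", "God"]


def encode(text):
    """Pack 7-bit text into bytes, folding each space into the byte before it."""
    out = bytearray()
    tokens = text.split(" ")
    for i, token in enumerate(tokens):
        if token in WORDS:
            codes = [WORDS.index(token)]          # one byte for the whole word
        else:
            codes = list(token.encode("ascii"))   # one byte per character
            if any(c < 32 for c in codes):
                raise ValueError("control characters are reserved for words")
        if i < len(tokens) - 1:                   # a space follows this token
            if not codes:
                raise ValueError("no byte available to carry the space flag")
            codes[-1] |= SPACE_FLAG
        out.extend(codes)
    return bytes(out)


def decode(data):
    """Reverse encode(): expand abbreviations and re-insert the spaces."""
    parts = []
    for byte in data:
        code = byte & 0x7F
        parts.append(WORDS[code] if code < len(WORDS) else chr(code))
        if byte & SPACE_FLAG:
            parts.append(" ")
    return "".join(parts)


text = "In the beginning God created the heaven and the earth."
packed = encode(text)
print(len(text), len(packed))
print(decode(packed) == text)
```

The point of the design is that a space never costs a byte of its own, and the word picker no longer has to decide whether an abbreviation should include a trailing space.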

In fact, the whole program, including searching and CGI interfaces, and all the data, totals 49.5% of the size of the two source texts.

A block compression algorithm would do much better, but random access would be less convenient.
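To make the trade-off concrete, here is a small sketch using Python's standard `zlib` module. The block size, the newline separator, and the `build_blocks`/`get_verse` helpers are assumptions for illustration only; nothing like this is in the archive.

```python
import zlib

BLOCK_SIZE = 4  # verses per compressed block (an arbitrary, hypothetical value)


def build_blocks(verses, block_size=BLOCK_SIZE):
    """Compress verses in groups; assumes no verse contains a newline."""
    blocks = []
    for start in range(0, len(verses), block_size):
        chunk = "\n".join(verses[start:start + block_size])
        blocks.append(zlib.compress(chunk.encode("utf-8"), 9))
    return blocks


def get_verse(blocks, index, block_size=BLOCK_SIZE):
    """Fetch one verse: the whole block must be decompressed to reach it."""
    chunk = zlib.decompress(blocks[index // block_size]).decode("utf-8")
    return chunk.split("\n")[index % block_size]


verses = ["Verse number %d of the sample text." % i for i in range(10)]
blocks = build_blocks(verses)
print(get_verse(blocks, 6))
```

Bigger blocks give the compressor more context and so better ratios, but every lookup then pays to decompress a bigger chunk. With per-character codes, any verse can be decoded on its own.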

Now that the program has two versions, I’ve stopped calling the archive “kjv.tgz”, so now it’s bible.tgz. To the best of my knowledge, everything in that archive is in the public domain. Or you can just play with the program, which is available here.