Compressing the Bible

2005-09-24 01:57

This might be sort of a religious topic in some ways, but really it’s pure geek stuff.

There are dozens upon dozens of online Bibles. Most of them are neat, but totally useless for the one thing I want to do most often: Copy and paste passages. In particular, copy and paste passages to bulletin boards running vBulletin, phpBB, or other BBSs using the same formatting rules.

I have a fairly nice online King James Version; fairly nice, except that cut-and-paste from it produces “versenumber newline space space space newline newline verse newline versenumber space space space newline newline verse”. That’s a lot of text to edit away if you just wanted, say, the text of a short passage.

So, the first thing to do was find a public domain text (this pretty much implies the KJV, although I believe there are others) available with all the Apocrypha (or Deuterocanonicals) included. Then, process it into a machine-readable format.

My original goal was a command-line program that could print verses. I added the requirement that it work as a CGI script, have some pretty-printing, and so on.

Anyway, it’s easy enough to dump the entire text of the KJV into a header, record offsets into it, and work from there. But wait! There’s a lot of redundancy in this text, and it uses a fairly small set of characters.

In fact, the KJV uses only 65 distinct characters, once you’ve removed all the verse numbers and such. That means every other byte value, except newline and null, is fair picking for a very cheap compression scheme. Now, how to pick the words? The easiest way is just to count occurrences of words, and this does pretty well. Deciding whether to match “word” or “word ” is more challenging, and matching both is unlikely to be rewarding: there will be more matches without the trailing space, but the version with the trailing space saves one more character each time it hits.
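The scheme can be made concrete with a short sketch, here in Python rather than the program’s own language; the token table and the choice of sample verse are my own illustration, and the real table would be built by counting word frequencies over the whole text:

```python
# Byte-substitution compression: the text uses only ~65 distinct
# characters, so every other byte value (except newline and NUL) is
# free to stand for a whole common word or phrase.
# This token table is invented for illustration.
TOKENS = {
    0x80: "the ",
    0x81: "and ",
    0x82: "unto ",
}

def compress(text, tokens):
    """Greedy longest-match substitution of words by single bytes."""
    by_len = sorted(tokens.items(), key=lambda kv: -len(kv[1]))
    out = bytearray()
    i = 0
    while i < len(text):
        for byte, word in by_len:
            if text.startswith(word, i):
                out.append(byte)
                i += len(word)
                break
        else:
            out.append(ord(text[i]))  # plain ASCII passes through
            i += 1
    return bytes(out)

def expand(data, tokens):
    """Decompression is a trivial per-byte table lookup."""
    return "".join(tokens.get(b, chr(b)) for b in data)

verse = ("And the earth was without form, and void; and darkness "
         "was upon the face of the deep.")
packed = compress(verse, TOKENS)
assert expand(packed, TOKENS) == verse
print(len(verse), "->", len(packed), "bytes")
```

Decompression stays a one-byte-at-a-time table lookup, which is what makes cheap random access to any single verse possible.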

Eventually, I adopted the additional policy of specifying certain words and phrases I thought were good bets; for instance, “n the “. It turns out that the extra characters saved in the thousands of times this shows up are worth it, even though neither “on the “ nor “in the “ was common enough individually to be worth compressing.

As a result, I’ve been able to get roughly 5MB of text, plus header information storing the location of all the verses, a CGI interface, and a bit of other support code, into a program totalling under 3MB of storage.

Here it is.

The source archive, including the data files, is also available. Note that the source archive is over a megabyte smaller; gzip compression is dramatically better than the compression algorithm I used. (Mine was selected to be easy to implement and provide convenient random access to any verse without having to read other verses.)

The specific intent is for this to be runnable on nearly anything, although you do need Perl to build the headers. The resulting code should run on anything with the same character set. The particular goal is convenient cut-and-paste; to that end, the default behavior is to display a citation and the requested text, without verse numbers. Verse numbers may be optionally requested. Similarly, the program can do HTML or BBCode markup. (BBCode is the same as vBcode and everyone else’s markup; it’s the non-HTML markup that uses [b] for bold and [i] for italics.) Finally, there’s even an option to escape HTML entities, so that you get a chunk of text which could be pasted into, say, a LiveJournal or a blog and would render nicely.
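Those output options might fit together roughly like this; the function name, option names, and escaping details here are my invention for illustration, not the program’s actual interface:

```python
# Hypothetical sketch of the output options described above: optional
# verse numbers, BBCode or HTML markup, and HTML-entity escaping so
# the result pastes cleanly into a blog. Names are invented.
def escape_entities(text):
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))

def format_passage(citation, verses, markup="none",
                   numbers=False, escape=False):
    # verses is a list of (number, text) pairs
    body = " ".join((f"{n} {v}" if numbers else v) for n, v in verses)
    if escape:  # escape the passage text before wrapping it in markup
        body = escape_entities(body)
        citation = escape_entities(citation)
    if markup == "bbcode":
        return f"[b]{citation}[/b] {body}"
    if markup == "html":
        return f"<b>{citation}</b> {body}"
    return f"{citation} {body}"

print(format_passage("John 11:35", [(35, "Jesus wept.")], markup="bbcode"))
# [b]John 11:35[/b] Jesus wept.
```

Escaping before adding markup matters: entity escaping applies to the scripture text itself, while the surrounding tags (if any) must survive untouched.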

Bug reports, etc., welcome.

Edited to add: There’s a new entry about this, explaining updates since I wrote the original.

Peter Seebach

---

Comments

  1. I wrote a little Bible program once. It was originally intended to work with multiple versions, support user annotations, and all sorts of other bells and whistles, but those never got done; it does have the basic functions (read specified verses, search for particular words). Its data file (for the RSV, no deuterocanonicals; I screen-scraped the text from another program which was freely available but which I couldn't use because it wouldn't run on my computers) is about 1.2MB, from a raw text about 4MB in size. Actually it has four data files, but the others are much smaller.

    The basic scheme is word-based, with a bit of Huffmanesque coding. Quoting from my notes on the format:

    ------ quotation begins ------
    This is a byte stream, interpreted as follows:

    00 Verse boundary, with new paragraph
    01 Verse boundary, without new paragraph

    02 xx Character xx, followed by space
    03 xx Character xx, not followed by space

    04 .. EF Token -4 (thus 0..235)
    Fx yy Token 236 + x*253 + yy-2
    FF xx yy Token (xx-2)+253*(yy-2)+236+253*15

    A token is preceded by a space if the last item was a token.

    Book and chapter boundaries are treated as verse boundaries
    with new paragraph. (Every verse 1 does in fact have a new
    paragraph at the start.) The book and chapter positions are
    in another file.
    ------ quotation ends ------

    This provides per-chapter random access; I decided that linear search on a single chapter wouldn't be too costly, which I think was right.
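A rough decoder for the byte stream quoted above can be sketched in Python; the word list here is a stand-in for the real token file:

```python
# Sketch decoder for the quoted byte-stream format. The word list is
# a stand-in; the real token and position data live in separate files.
def decode(data, words):
    out = []
    i = 0
    prev_was_token = False
    while i < len(data):
        b = data[i]
        if b == 0x00:      # verse boundary, with new paragraph
            out.append("\n\n"); i += 1; prev_was_token = False
        elif b == 0x01:    # verse boundary, without new paragraph
            out.append("\n"); i += 1; prev_was_token = False
        elif b == 0x02:    # literal character, followed by space
            out.append(chr(data[i + 1]) + " "); i += 2; prev_was_token = False
        elif b == 0x03:    # literal character, not followed by space
            out.append(chr(data[i + 1])); i += 2; prev_was_token = False
        else:
            if b <= 0xEF:  # one byte: tokens 0..235
                tok = b - 4
                i += 1
            elif b < 0xFF: # two bytes: Fx yy -> 236 + x*253 + (yy-2)
                tok = 236 + (b - 0xF0) * 253 + (data[i + 1] - 2)
                i += 2
            else:          # three bytes: FF xx yy
                tok = (data[i + 1] - 2) + 253 * (data[i + 2] - 2) + 236 + 253 * 15
                i += 3
            if prev_was_token:  # tokens are space-separated
                out.append(" ")
            out.append(words[tok])
            prev_was_token = True
    return "".join(out)

words = ["In", "the", "beginning", "God", "created"]
data = bytes([0x00, 0x04, 0x05, 0x06, 0x07, 0x08])
print(decode(data, words))
# prints a paragraph break, then: In the beginning God created
```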

    To my considerable surprise, gzip -9v doesn't do much better on the text than my own compression -- but bzip and rzip both get it down to less than a megabyte, which is pretty impressive.

    These days I mostly just load the whole text into emacs and search/browse/cut-and-paste from there; memory is cheap :-). But of course this doesn't get me any nice formatting.

    g · 2005-09-24 17:56 · #

  2. It's nice to see someone else using the "book chapter:verse-chapter:verse" specification that I used for the Electric King James Bible. One thing I did do was include abbreviations and a spell-checking feature.

    Sean Conner · 2005-09-24 20:36 · #

  3. Mine is fairly tolerant about abbreviations. No spell-checking, though. :)

    I have since added the WEB (World English Bible) and a search feature. New entry coming up in a bit.

    You (g) may well be right about per-chapter linear search. I redid things substantially from early versions, and started using '\0' for end of verse; it turns out that tracking end of verse that way is no worse than storing the length of a verse, and is sometimes better!

    seebs · 2005-09-24 20:50 · #

 
---