September 25, 2005

Better compression, and MORE OF IT!

Well, this is a followup to Compressing the Bible from yesterday.

I added a couple of features: Searching (this is the only Bible program I've seen that will match a phrase that crosses a verse boundary, although it can only match two verses at once), and the World English Bible translation, a modern public domain translation. (It's a work in progress, so it isn't as complete as the KJV.)
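For the curious, the cross-verse matching can be sketched roughly like this. This is a Python illustration and not the program's actual code; the function name `search`, the sample verses, and the choice to join adjacent verses with a single space are all assumptions made for the example.

```python
def search(verses, phrase):
    """Return (first, last) verse-index pairs where phrase matches.

    A match is either inside one verse or straddles exactly two adjacent
    verses, which mirrors the "two verses at once" limit described above.
    """
    hits = []
    n = len(phrase)
    for i, verse in enumerate(verses):
        if phrase in verse:
            hits.append((i, i))
        if i + 1 < len(verses):
            # Assumption: adjacent verses are joined with one space.
            joined = verse + " " + verses[i + 1]
            lo = max(0, len(verse) + 2 - n)  # match must reach into verse i + 1
            hi = len(verse) - 1 + n          # match must start inside verse i
            if joined.find(phrase, lo, hi) != -1:
                hits.append((i, i + 1))
    return hits


verses = ["Jesus wept.", "Then said the Jews, Behold how he loved him!"]
print(search(verses, "wept. Then said"))
```

Restricting the window passed to `find` keeps a phrase that sits wholly inside one verse from being reported a second time as a boundary match.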

The WEB increases the size of the source text by a few MB, but the compression is better. Thanks to an idea I got from a comment on the last entry, the hint list is no longer needed. (Maybe it would help, but I'm too lazy to check.) Instead, I have a smarter algorithm for picking words, which doesn't need to look at spaces; I just use the high bit of each character to indicate "and then print a space after this", whether it's an abbreviation or a plain character. This improves compression quite noticeably.
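A rough sketch of the high-bit trick, in Python for brevity. It is not the real encoder: the table contents, the code values, and the greedy longest-match strategy are assumptions made for illustration. The only part taken from the description above is that the low seven bits hold either a plain character or an abbreviation code, and the high bit means "print a space after this".

```python
SPACE_FLAG = 0x80  # high bit: "and then print a space after this"


def encode(text, table):
    """Greedy encoder: longest abbreviation first, then fold a following space."""
    words = sorted(table.items(), key=lambda kv: -len(kv[1]))
    out = bytearray()
    i = 0
    while i < len(text):
        for code, word in words:
            if text.startswith(word, i):
                value, i = code, i + len(word)
                break
        else:
            value, i = ord(text[i]), i + 1  # plain 7-bit character
        if i < len(text) and text[i] == " ":
            value |= SPACE_FLAG             # absorb the space into this byte
            i += 1
        out.append(value)
    return bytes(out)


def decode(data, table):
    out = []
    for b in data:
        low = b & 0x7F
        out.append(table.get(low, chr(low)))
        if b & SPACE_FLAG:
            out.append(" ")
    return "".join(out)


# Hypothetical table: codes are 7-bit values the text itself never uses.
table = {1: "the", 2: "and", 3: "God", 4: "light"}
text = "and God said let there be light and there was light"
packed = encode(text, table)
print(len(text), len(packed), decode(packed, table) == text)
```

Because the space rides along in the high bit, the word picker never has to choose between "word" and "word "; one table entry covers both cases.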

In fact, the whole program, including searching and CGI interfaces, and all the data, totals 49.5% of the size of the two source texts.

A block compression algorithm would do much better, but random access would be less convenient.

Now that the program has two versions, I've stopped calling the archive "kjv.tgz", so now it's bible.tgz. To the best of my knowledge, everything in that archive is in the public domain. Or you can just play with the program, which is available here.

Posted by seebs at 02:24 AM | Comments (0)

September 24, 2005

Compressing the Bible

This might be sort of a religious topic in some ways, but really it's pure geek stuff.

There are dozens upon dozens of online Bibles. Most of them are neat, but totally useless for the one thing I want to do most often: Copy and paste passages. In particular, copy and paste passages to bulletin boards running vBulletin, phpBB, or other BBSs using the same formatting rules.

I have a fairly nice online King James Version; fairly nice, except that cut-and-paste from it produces "versenumber newline space space space newline newline verse newline versenumber space space space newline newline verse". That's a lot of text to edit away if you just wanted, say, the text of a short passage.

So, the first thing to do was find a public domain text (this pretty much implies KJV, although I believe there are others) available with all the Apocrypha (or Deuterocanonicals) included. Then, process it into a machine-readable format.

My original goal was a command-line program that could print verses. I added the requirement that it work as a CGI script, have some pretty-printing, and so on.

Anyway, it's easy enough to dump the entire text of the KJV into a header, record offsets into it, and work from there. But wait! There's a lot of redundancy in this text, and it uses a fairly small set of characters.
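To make the "dump it and record offsets" idea concrete, here is a minimal sketch. It is in Python, not the generated C header the real program uses, and the names `build_index` and `get_verse`, the null terminators, and the sample verses are assumptions for illustration only.

```python
def build_index(verses):
    """Concatenate verses into one blob and record where each one starts."""
    blob = bytearray()
    offsets = []
    for text in verses:
        offsets.append(len(blob))
        blob += text.encode("ascii") + b"\0"  # assumed terminator, C-string style
    return bytes(blob), offsets


def get_verse(blob, offsets, n):
    """Fetch verse n directly, without reading any other verse."""
    start = offsets[n]
    end = blob.index(b"\0", start)
    return blob[start:end].decode("ascii")


verses = [
    "In the beginning God created the heaven and the earth.",
    "And the earth was without form, and void;",
]
blob, offsets = build_index(verses)
print(get_verse(blob, offsets, 1))
```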

In fact, the KJV uses only 65 distinct characters, once you've removed all the verse numbers and such. What that means is that every other byte value, except newline and null, is free to stand for a whole word in a very cheap compression scheme. Now, how to pick words? The easiest way is to just count occurrences of words. This does pretty well. Deciding whether to match "word" or "word " is more challenging; matching both is unlikely to be rewarding. There will be more matches without the trailing space, but the version with the trailing space saves one more character each time it hits.
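Here is a minimal sketch of that counting approach. It is in Python, whereas the real headers are built with perl, and the scoring rule (bytes saved is word length minus one, times the count) and all of the names are my assumptions for the example, not a description of the actual build script. It also assumes plain ASCII input.

```python
from collections import Counter


def pick_words(text):
    """Map the most profitable words to byte values the text never uses."""
    used = {ord(c) for c in text} | {0, ord("\n")}  # null and newline stay reserved
    free = [b for b in range(256) if b not in used]
    counts = Counter(w for w in text.split() if len(w) > 1)
    # Replacing a word of length n with one byte saves n - 1 bytes per hit.
    ranked = sorted(counts, key=lambda w: (-(len(w) - 1) * counts[w], w))
    return dict(zip(free, ranked))


def compress(text, table):
    data = text.encode("ascii")
    for code, word in table.items():
        data = data.replace(word.encode("ascii"), bytes([code]))
    return data


def decompress(data, table):
    return "".join(table.get(b, chr(b)) for b in data)


sample = "and the earth was without form and the darkness was upon the face of the deep"
table = pick_words(sample)
packed = compress(sample, table)
print(len(sample), len(packed), decompress(packed, table) == sample)
```

Every verse can be decoded byte by byte on its own, which is what keeps random access cheap. Note that the sketch ignores the space the table itself takes up, which matters on a tiny sample like this one.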

Eventually, I adopted the additional policy of specifying certain words and phrases I thought were good bets; for instance, "n the ". It turns out that the extra characters saved in the thousands of times this shows up are worth it, even though neither "on the " nor "in the " was common enough individually to be worth compressing.
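The arithmetic behind the "n the " hint looks something like this on a toy sample. The sample text and the `savings` helper are made up for illustration, and the calculation ignores the cost of storing the table entry.

```python
def savings(text, phrase):
    """Bytes saved by replacing every occurrence of phrase with a one-byte code."""
    return text.count(phrase) * (len(phrase) - 1)


sample = "in the day, on the hill, in the sea, on the rock"
for phrase in ("in the ", "on the ", "n the "):
    print(repr(phrase), savings(sample, phrase))
```

With only one code to spend, the shared tail "n the " catches both phrases and beats either of them alone, even though it saves one byte less per hit.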

As a result, I've been able to get roughly 5MB of text, plus header information storing the location of all the verses, a CGI interface, and a bit of other support code, into a program totalling under 3MB of storage.

Here it is.

The source archive, including the data files, is also available. Note that the source archive is over a megabyte smaller; gzip compression is dramatically better than the compression algorithm I used. (Mine was selected to be easy to implement and provide convenient random access to any verse without having to read other verses.)

The specific intended function of this is to be runnable on nearly anything, although you do need to use perl to build the headers. The resulting code should run on anything with the same character set, though. The particular goal is convenient cut-and-paste. To that end, the default behavior is to display a citation and the text requested, without verse numbers. Verse numbers may be optionally requested. Similarly, the program can do HTML or BBCode markup. (BBCode is the same as vBcode and everyone else's markup; it's the non-HTML markup that uses [b] for bold and [i] for italics.) Finally, there's even an option to escape HTML entities so that you get a chunk of text which could be pasted into, say, a LiveJournal or a blog and it would render nicely.
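As a rough picture of how those output options combine, here is a small Python sketch. The function name, its parameters, and the decision to bold the citation are hypothetical; the real program's options and exact output may differ.

```python
import html


def format_passage(citation, verses, markup="plain", numbers=False, escape=False):
    """verses is a list of (verse_number, text) pairs."""
    # Hypothetical layout: bold citation on its own line, then the text.
    bold = {"plain": "{}", "html": "<b>{}</b>", "bbcode": "[b]{}[/b]"}[markup]
    parts = []
    for number, text in verses:
        parts.append(f"{number} {text}" if numbers else text)
    out = bold.format(citation) + "\n" + " ".join(parts)
    # Escaping turns the markup itself into entities, so it pastes as visible text.
    return html.escape(out) if escape else out


passage = [(35, "Jesus wept.")]
print(format_passage("John 11:35", passage, markup="bbcode"))
print(format_passage("John 11:35", passage, markup="html", escape=True))
```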

Bug reports, etc., welcome.

Edited to add: There's a new entry about this, explaining updates since I wrote the original.

Posted by seebs at 02:57 AM | Comments (3)

September 04, 2005

What. The. Fuck.

http://www.christianforums.com/t2050810-understanding-new-orleans.html

Mardi Gras was a demonic party and the city of New Orleans was one of the must demonic in the U.S.A.. I believe that the power of the air are being cast down from the frist heaven. And these are just the small fry's. If you think that they are coming down without bringing death with them you need to read the book. And if you have the silly idea the "chruch" is going to be rapture out you are in for a big let down. So pull out the sword and let start killing them as the Lord tells us.

What the fuck.

I mean, what else am I supposed to say? Am I supposed to explain away how this guy isn't a "real" Christian? I sure hope he's a real Christian, because it sounds to me like he could benefit a lot from the whole Jesus thing.

But wow.

To think! There is a point from which Pat Robertson is a calm moderate seeking balanced viewpoints and showing an exceptional witness of compassion.

Special thanks to Mr. Minion, who pointed this out.

Posted by seebs at 03:47 AM | Comments (13)

September 01, 2005

Pwnz3d, beeyotch

Well, well, well. Let's just cut to the chase; Defendant's motion to dismiss is DENIED. (Emphasis in original.)

Source Lending's surreal combination of argumentation has been weighed in the balance and found wanting. It's hard to express how silly their arguments have been. After they tired of accusing me of extortion, they showed up in court with a brand new argument; this is all a lark! There is no serious case here. Whether or not they committed thousands of violations of federal law is not the point. The big question is whether I am having too much fun; after all, if you're having fun in court, any remedial claims would be barred by res sola bozocata ("it's just clowning").

No, folks. This isn't a lark. A lark is one of these:

A lark
Image inverted left/right from original

Or one of these:

A lark
Image inverted left/right from original. Original copyright Carol Davis, used by kind permission.

This is not to say that every argument raised in this case has been serious or well-grounded. For instance, consider this knee-slapper, offered by defense attorneys Mahoney and Emerson as explanation for Defendant's refusal to offer even basic responses to interrogatories they have been sitting on for close to two years:

The time afforded Seebach has been more than ample to discover, on his own, the merits of his class action. [...] Despite an allegation of thousands of victims, and hundreds of days to identify them Seebach has failed to find even one more, much less the numbers required to make joinder imparcticable.

What's the judge say? The judge says this.

Plaintiff claims that Defendant has not cooperated in the discovery process which would gather from Defendant relevant and discoverable information necessary for the Court to evaluate the class action allegations. Pltf. Opp. pp 11-13. In its Reply, defendant does not deny these allegations. Deft. Reply pp. 4-5. Defendant only says that plaintiff has not been able to discover anything yet. This leaves the Court with the impression that Defendant has not cooperated in the discovery process.
[...]
In the present case, Plaintiff may have had the time to discover the necessary facts, but if Defendant has not cooperated, the amount of time means nothing. [...] The answer is non-responsive, especially since it appears from the question that this would be information to which Source Lending had knowledge or could easily obtain the knowledge. [...] Again, Source Lending should have this information or easy access to the information, yet, it gives another non-responsive answer. [...] The Court also notes that Defendant's claim of attorney-client privilege or the work product rule in its answers to those three interrogatories is questionable, at best.

The judge also contributed a few other choice morsels; "Defendant is incorrect." "These arguments are without merit." "The Court agrees with Plaintiff [...]"

Entirely absent from this order is any reference to Source Lending's ever more ludicrous simultaneous assertions that this amounts to some sort of blackmail or extortion (their words), or the even sillier argument that, if I am just going to give the money away, I am ill-qualified to represent a class. (Apparently, a good class rep should be amenable to bribes.) This is for the best; the "arguments" are so meaningless and ill-formed as to defy rebuttal, standing best on their own as a monument to the importance of having a case before calling names.

Of particular interest is Defendant's tiresome assertion that I am in privity with someone else who sued them. The judge has adequately addressed the argument, observing that "The only known relationship is that they both hired Attorney Appelget as their counsel."

Since Defendant's laziness about discovery covers both the answering and asking of questions, I will save them the trouble of striving to form an Interrogatory trying to figure out the actual relationship between myself and Bob ELIDE, which is apparently of great interest to them.

I met Bob online, talked to him on the phone, and got a referral to an attorney who represented him in some junk fax cases. He explained that he handled most of his cases pro se, but that he retained an attorney for difficult cases. When I presented Mr. Appelget with my junk faxes, he identified the Source Lending faxes based on Bob's existing pro se suit against Source. Bob sued Source Lending on April 20th, 2003; my case was not initiated until October 17th, 2003.

Defendant's allegation that I was necessarily aware of Bob's case is based on false beliefs. Defendant, of course, should be aware that Bob's case was originally filed pro se. They certainly can't honestly think these are the same cases, despite their claims to the contrary, given that settlement negotiations in both cases were simultaneous, and clearly distinct. (This includes the bizarre letter they sent proposing that they settle with me only if I gave them blanket indemnity against other junk fax cases!)

The question of why I didn't join my case with Bob's is easily answered; Bob was seeking financial relief and had already filed and pled the case in a way incompatible with my interests. My interests are focused around preventing future junk faxing by Source Lending; Bob wanted to get paid. While the instrumentality of the cases (unlawful faxes sent by Source) is similar, the relief desired is dissimilar. Joinder would have served neither party's interests. Source could have found this out by simply asking, but the discovery process itself seems to be a mystery to them.

If Defendant wishes to continue wasting everyone's time with this frivolous defense, I would suggest beginning by complying with discovery. Defendant's refusal to show up for a deposition on the grounds of "we dun wanna" (they refused to come to the first noticed deposition, then claimed that they were unavailable for the second; they have not returned calls seeking to reschedule) is hardly helpful. The judge has helpfully identified specific interrogatories to which a response simply must be given for the case to proceed. If Defendant's allegations that there is no basis for class certification are rooted in fact, why are they so unwilling to provide the evidence needed to let the court establish this?

What I want is the same thing I always wanted from this; I want Defendant to never send junk faxes again, and I want Defendant to make some kind of restitution to the victims of their unlawful advertising campaigns. If they wanted to settle on terms like that, we could be done with this. Of course, they'd still have to cough up the information about who sent their faxes, how many were sent, and so on. For now, it seems they'd rather call names, indulge in a sort of surreal parody of standard civil practice, and insult the judge's intelligence.

Posted by seebs at 07:30 PM | Comments (3)