Fractally Wrong: Counting GUIDs

2020/02/06

Categories: GeekStuff

I was recently pointed at some code, and the more I looked at it, the more impressed I was by how wrong it was. I think this may be the highest density of errors I’ve ever seen in a single line of code, in fact. The stated goal: Listing all GUIDs. This can’t be done, but this is more wrong than that.

If the underlying task were at all possible, the other bugs would probably get noticed, but since they all prevent each other from being relevant, we ended up with… This.

So, the source I started with was http://allguids.cdmansfield.com/, posted helpfully by someone in the Gopher Slack. It’s a capture from a Stack Overflow thread, now deleted. It looks suspiciously similar to an earlier version of code later commented on by Raymond Chen. The Raymond Chen post is about a year and a half later, and by that time, the only remaining error is the laws-of-physics error. This earlier version, though, is so much worse. I’m not going to quote the whole thing, just two relevant segments (the first trimmed to show the relevant logic only):

string start1 = "00000000000000000000000000000000";
string iteration = start;
for (Double k = 0; k < 5316911983139663491615228241121400000d; k++)
{
	iteration = NextLetter(iteration);
	// code elided
}

There’s quite a few possible problems here, but to fully appreciate them, you have to see how NextLetter is supposed to work.

public static string NextLetter(string letter)
{
const string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
if (!string.IsNullOrEmpty(letter))
{
    char lastLetterInString = letter[letter.Length - 1];

    // if the last letter in the string is the last letter of the alphabet
    if (alphabet.IndexOf(lastLetterInString) == alphabet.Length - 1)
    {
	//replace the last letter in the string with the first leter of the alphbat and get the next letter for the rest of the string
	return NextLetter(letter.Substring(0, letter.Length - 1)) + alphabet[0];
    }
    else
    {
	// replace the last letter in the string with the proceeding letter of the alphabet
	return letter.Remove(letter.Length - 1).Insert(letter.Length - 1, (alphabet[alphabet.IndexOf(letter[letter.Length - 1]) + 1]).ToString());
    }
}
//return the first letter of the alphabet
return alphabet[0].ToString();
}

You might want to also look at the code from the year-and-a-half later query to Raymond Chen:

 public static void Main(string[] args)
 {
   using (var w = new System.IO.StreamWriter("all_guids.txt")) {
     for (long l1 = 0; l1 <= 0xFFFFFFFFL; l1++)
     for (long l2 = 0; l2 <= 0xFFFFFFFFL; l2++)
     for (long l3 = 0; l3 <= 0xFFFFFFFFL; l3++)
     for (long l4 = 0; l4 <= 0xFFFFFFFFL; l4++)
       System.Console.WriteLine(
         new System.Guid(string.Format("{0:X8}{1:X8}{2:X8}{3:X8}",
                                   l1, l2, l3, l4)));
   }
}

So stop here for a moment, and see how many flaws you can identify. There’s a lot. Ready? Let’s go!

  1. As Raymond Chen explains so much better, you can’t actually iterate this space.
  2. If you could, the correct number of iterations would not be whatever that number is. (More on this later.)
  3. But that number is too large for counting by incrementing a double – around 2^53, incrementing a 64-bit float starts producing the same 64-bit float. Even if it were 128-bit floats, you’d still hit that point long before you got to this number.
  4. The counting logic and the generation-of-GUIDs logic are separate, and don’t agree with each other.
  5. The generation logic is using the whole alphabet, not hex digits, meaning it will iterate 36^32 values, rather than 16^32 values.
  6. The magic number is supposed to be 2^122, but it’s not.
  7. The letters are before the numbers, which means that we count from 000[…]09 to 000[…]1a. This also means that it will start generating 33-character values after generating only 10/36 of the space it’s supposedly iterating.
  8. The name NextLetter does not describe what the routine does.
  9. Ignoring the overall impossibility, this is possibly one of the most expensive ways I’ve ever seen someone try to iterate through possible strings.
  10. Oh wait, the new one a year and a half later is actually worse.

So, imagine that you had an infinitely fast machine and could somehow do this. What would it do? Well, it would loop forever at the first number that isn’t changed by incrementing, while continuing to generate new values. But say you had 256-bit floats and they could represent this. You’d generate the first 5,316,911,983,139,663,491,615,228,241,121,400,000 values from a sequence. It’ll count about 8 cycles of 36^23 values, so about the top 8 characters never change. It misses the majority of the possible values even within the range it counts.

Powers of two ending in zero, a very short list

That magic number is particularly interesting. It’s apparently derived from a quoted number elsewhere, which probably comes back to a mysterious article named PPCGuidGen.asp, which I think is from Microsoft’s PocketPC documentation, now in the Memory Hole. Various searches do turn up references to it, though. The article apparently stated:

Question: How many GUID combinations are there?

Answer: There are 122 random bits (128 - 2 for variant - 4 for version) so this calculates to 2^122 or 5,316,911,983,139,663,491,615,228,241,121,400,000 possible combinations.

The Internet Archive seems to have a copy, though!

That number is wrong in several ways. Most obviously, that value is not 2^122. 2^122 is 5,316,911,983,139,663,491,615,228,241,121,378,304. You don’t need to know that off the top of your head to spot this; it’s instantly obvious, because powers of two can’t end in a zero in base 10. For some mysterious reason, no matter how many times you multiply something by two, its set of factors never includes any fives. But it’s also not an answer to the question “how many GUID combinations are there” in general. It’s the answer to how many combinations there are which are Version 4, Variant 1. But nothing about the original question implies that we definitively know that all the GUIDs we want to find are version 4!

UPDATE (2021 September 1): A year and change later, I was thinking about this again, and I finally understood where that number comes from. The answer is: While 2^122 can be represented exactly as a 64-bit floating point value, you don’t need all of those digits to specify it. If you ignore the trailing zeros, this number is the shortest sequence of decimal digits which uniquely specifies 2^122 when represented in float64. So what’s really happening here is that the decimal-conversion routine used to render that number realized that it had produced all the digits it needed, so it stopped and then just zero-padded to get the order of magnitude right. Go and C stdlib will both do this.

Performance thoughts

I got curious about this, because it sure does look expensive. It’s not as bad as it might seem. I translated this naively to Go, and in Go, the compiler is smart enough to manage the temporary strings used to build the new string, in the non-carry case, without needing to allocate those temporaries. As such, it’s just one allocation per loop, or close to it. If you do it with hex digits, you start seeing that climb up, because carrying happens one time in 16 instead of one time in 36. On the other hand, doing this the way anyone would normally do it (using an array of bytes and incrementing them there) is about 20x faster. The new version used a year later, using nice fast integer counting, and then formatting the results, is dramatically slower. It’s not quite a fair comparison, but in Go, it came out around 10x slower for the 2^128 case (but only 5x slower if you only used one uint32). This is probably because most of what it’s doing is regenerating the same values it just generated last time – in practice, the number of changes is very close to exactly one in the last position (it only exceeds that 1/16 of the time, after all), and just altering that is a lot less work than redoing the formatting.

Conclusions

Everything in this code is wrong. You couldn’t fix this just by fixing one or two or three bugs. Even just granting the infinitely fast computer (with infinite storage; the original article has the GUIDs written to a file named c:\temp\scanner\guids.txt), you would have to fix at least five different bugs before this would do what it’s nominally supposed to do.

You might wonder why the person who wrote this question would completely ignore all the explanations of why this can’t actually be done, and the ethical questions about running a few quintillion queries against someone’s web server after they told you no, and then go ask the exact same question of Raymond Chen over a year later. But once you look at the code, it becomes clear; this is not a single isolated misunderstanding. Every part of this is wrong because someone looked around for a thing that could conceivably be an answer to part of a question and threw it in. There is no overall cognitive model of the problem or the solution.

Strings have both letters and numbers? It’s letters and numbers, so it’s [a-z0-9]. Starting point? Obviously all zeroes. But these questions weren’t connected with each other in the writer’s mind. The all zeroes starting point, and the comment in the text about ending with an all-f value, were not related to the decision to use the entire alphabet followed by the digits. Number of things to iterate? Look up some random blog post somewhere claiming a total number of GUIDs, generate that many, without connecting that in any way to the generation process, or considering whether that answer’s context fits the actual problem space. Floating point math? Floats are just numbers that can hold really big values. Nevermind that, at this scale, the smallest value that you can increment something by and get it to change is 2^70; it’s a number that can express values this big, therefore it can express every value this big.

Anyone can make a mistake. Anyone can make several mistakes. But looking at this code, we see something much deeper than mere “mistakes”. A mistake is when you have a mental model of a thing, and there is a flaw in that model, or you execute part of your design wrong. This code doesn’t reflect a flawed model; it reflects either no model or several models being used interchangeably. It’s not exactly iterating through a fixed range of values, it’s not exactly iterating until a counter ticks over, it’s a mix of both of those. It’s not generating hex strings, even though the writer sort of understands that the maximum value should look like ffff....

It’s also hypnotic. I thought I was done with this and then noticed the NextLetter name. My guess, having dealt with code like this before, is that originally the code was trying to do something like assigning the next letter after a given letter to a single position in the string, but that didn’t handle carrying, and that’s why we get this elaborate recursive solution.