Debugging with a Validate Method

I was live streaming a coding session today, wrapping up a Tetris clone in Mini Micro. All was going swimmingly until suddenly we noticed a bug: after clearing a line, subsequent pieces would sometimes stop before hitting anything. Upon closer inspection, we noticed that the higher-up lines, after moving down to fill the cleared row, often had blocks in the wrong place.

What ensued was an abnormally long debugging session. I tried confessional debugging; I tried walking through the code; I tried checking my assumptions by having the code print out what it was doing. After 20 minutes of this I was still stumped. In desperation, I reached for a tool I should have brought out much faster: the validate method.

A validate method is a method that checks the integrity of your data. In my case I had a 2D array called bin that was supposed to contain references to all the little blocks that were on the game board. And each block was supposed to know which cell of that bin array it was in. Moreover, every block in the bin should actually be on the sprite display. So I wrote a method to check all that:

validate = function()
	for col in range(0, kBinWidth-1)
		for row in range(0, kBinHeight-1)
			b = bin[col][row]
			if not b then continue  // no block here
			// check that the block's column/row matches where it is in the bin
			qa.assert b.column == col and b.row == row

			// check that it is actually on the screen
			qa.assert spr.sprites.indexOf(b) != null

			// check that its position on screen is correct
			qa.assert b.x == xForColumn(col)
			qa.assert b.y == yForRow(row)
		end for
	end for
end function

My validate method was fairly simple, as you can see above. It uses the qa module (in the /sys/lib folder in Mini Micro) to assert each fact that should be true about some block in the bin. I could have also iterated over all the sprites in the display sprite list, but I decided to stop here and see what I could catch with this net.

So once I had this method written, I called it from every sensible place I could think of in the rest of the program: when a piece is added to the bin; before clearing a completed row; after clearing a completed row. When I ran the program, all was quiet until I cleared a line. Then the screen went dark, with bright red messages telling me that one of my assertions had failed.

This may seem scary, but it's actually a very good thing! It means the bug has tripped my trip-wire. The call stack shown tells you exactly where in the code the problem was detected, and how it got there. Call stack entry 0 (line 235) told me that the assertion failed was the one about the sprite actually being on screen. And the next one, line 262, told me that this was found at the bottom of the clearLine function.

So my data was just fine at the start of clearLine, but bogus by the end of it. The bug had to be there! I studied the code in that function, but still couldn't see it.

So I went one step farther. The code in that function loops over all the columns of a row, and does basically two steps: it removes the block from the bin and screen, and then it updates all the higher blocks so they knew their new positions. I added another call to validate in between these steps, and tested again. And this time the stack trace showed that the error was in the first of these two steps (because it was this new, middle validate call that caught it).

At that point I had the bug narrowed down to about a half-dozen lines, and finally, looking at them with absolute assurance that the bug was there, I was able to see what had eluded me for a good half-hour.

What that bug was isn't important to this blog post. So I'll just wrap up...

Haha, just kidding. I'm sure you want to know what it was! It's such a simple thing I am embarrassed to admit it, but then, that's often the case in programming. I had a line in my clearLine function whose job it was to remove the deleted block from the sprite display:

	if b then
		print "Removing block at " + b.column + "," + b.row + " with tint " + b.tint
		spr.sprites.remove b
	end if

That was the intent, anyway. And that's what the print statement (which I added to confirm the logic) claimed it was doing. But that's not how remove works. The argument to remove is the index of the item to remove, not the value of the item. The fix was to change that remove line to:

		spr.sprites.removeVal b

(using the removeVal function added by the listUtil module, also in /sys/lib).

So that was it. I had looked ta this line many times before, but I saw what I expected to see, not what it was actually doing. By adding a validate method, and calling it liberally anywhere I thought the data should be in a valid state, I was quickly able to narrow it down to a handful of lines — an area small enough that I could really question everything and force myself to see what was actually there.

So the next time you're struggling with code that just seems to be doing something impossible, consider how you can check the integrity of your data with a validate method. With any luck, it will box your bug in to a very tiny space from which it can't escape.