Sudoku Solver v2.0 for the Apple II

Michael J. Mahon & Scott Hemphill – July 8, 2006

Version 2.0 revision – July 15, 2006

Introduction

"Sudoku" is the name given to a type of logic puzzle similar in concept to a crossword puzzle. It is based on a 9x9 grid composed of nine 3x3 subgrids, where each grid box can contain a single digit from 1 to 9. Initially, most of the grid is blank, with a dozen or so boxes already filled in. The object is to complete the entire grid in such a way that each 9-box column, each 9-box row, and each 9-box subgrid contains each digit 1 through 9 exactly once.

The definition of the puzzle is deceptively simple, allowing for the construction of puzzles of a wide range of difficulty. If you’ve solved some Sudoku puzzles, you know the kind of thinking that is involved in arriving at the (usually unique) solution. Some puzzles can be solved completely methodically by applying logic at each stage. Others have "branch points" where a digit choice must be made that is not determined logically, but is instead selected at random from the remaining possibilities.

This program, SUDOKU, solves all types of Sudoku puzzles very quickly on an Apple II computer. Many people are surprised that a 1MHz Apple II with only 64KB of memory can solve most difficult Sudoku puzzles in less than 30 seconds, and many in just a few seconds! Occasionally, a puzzle may be found which requires several minutes, but these are rare. The efficiency of solution is primarily a result of the algorithm and data structures used, and will be discussed in greater detail later.

SUDOKU v2.0 runs on any Apple II with Applesoft BASIC, lower case support, and 64KB or more of memory.

Using SUDOKU

SUDOKU is distributed as a 5.25" disk image and as a ShrinkIt disk archive with ProDOS 1.8, and is suitable for all 64KB machines with Applesoft and lower case support. The disk boots into a STARTUP program which runs SUDOKU. It can be run easily from a hard disk by copying the disk files, with the exception of ProDOS, BASIC.SYSTEM, and STARTUP to a directory. Note that the files include a subdirectory, PUZZLES, which should also be copied into the program directory.

SUDOKU opens with a help screen, which can be summoned from the edit screen at any time by pressing "?" or "/". Pressing any key takes you to the edit screen where puzzles can be entered or loaded and saved from disk.

The edit screen is an intuitive front end for moving around the grid and entering digits ("1" through "9") or space, to blank a grid entry. Grid navigation is done with the arrow keys or with the IJKM keys. Most potentially destructive operations are invoked by "Control" key combinations, to make errors less likely.

The key functions are as follows (as summarized on the screen):

 Key Function 0 to 9 and Space Enter/clear digit and advance to next box Arrow keys, IJKM Move within grid Ctl-X Clear grid Ctl-L Load saved puzzle Ctl-S Save puzzle Ctl-P Print puzzle Return Solve puzzle ? or / Display help screen Esc Exit the program to BASIC

As a puzzle is being solved, the grid is animated to show the digit choices being tried. When a solution is found, the solver displays the number of the solution and pauses for the user to examine the solution. At this point the user can either press Esc to skip any other solutions, or press any other key to continue to find other solutions, if any others exist. After the solver finishes, a summary of the number of solutions found is presented, and pressing any key will return you to the edit screen.

The Algorithm

A common approach to computer solution of Sudoku puzzles is to attempt to replicate the kind of logical thinking and process of elimination that is used by a human solver. Since there are multiple strategies required for various situations, such algorithms tend to be complex and therefore time consuming. The algorithm used by the 65C02 machine-language program SUDOKU.SOLVER is much simpler, and uses an exhaustive trial of all possible digits as it proceeds through the grid. At each point, it chooses to select a digit for the grid box for which the least possibilities remain. This heuristic serves to "prune" the search tree significantly, and is very important to its speed.

After it finds the empty box with the fewest choices (or the first such box, if there are several with the same number of choices), then tries the possibilities by selecting the next permitted digit and calling itself recursively. Since there are 81 boxes, the maximum depth of recursion is 81 levels.

When it finds no empty box, then it knows that it has found a solution. When it finds that the empty box with the fewest permitted choices has no choices, then it has reached an impasse, and it returns to try another choice at the previous level.

This kind of algorithm is known as a "backtrack" algorithm, since it goes forward making choices until it either succeeds or cannot go further, and then "backs up" to the previous choice to try another alternative. Note that it will try all possible solutions, and is therefore exhaustive. Also note that it only searches the tree of choices as constrained by the puzzle givens and by its previous choices, so it need not waste time examining a myriad of alternatives that are not possible. The "fewest-choice empty box first" heuristic is extremely effective in reducing the number of actual alternatives that it must try to complete its search.

Scott’s original C program is here, as is the Merlin listing of SUDOKU.SOLVER.

Performance

Since it is common for Sudoku solvers on machines running thousands of times faster than an Apple II to take several seconds to find a solution, it is evident that any Apple II program must be extremely efficient in both processor use and memory use to be practical. It is a well-accepted principle that "premature optimization is the root of all evil", and that one should first consider efficiency in the big picture—the algorithm and data structures—before becoming concerned with the coding details. While this has general validity, it is also true that the sooner the designer can identify the performance limiters of an algorithm, the sooner they can be addressed, either by selecting more efficient data representations or by changing the algorithm.

The fastest computation is the one you avoid doing! So the heuristic to go deeper in the search through the grid box with the least possible choices is critical. It avoids generating a search tree with a large fanout at each level, instead choosing the minimum fanout at each level of recursion. Although it is "only" a heuristic, and therefore does not always pick exactly the right search path, it is responsible for "pruning" the search tree by a huge factor.

When a search alternative has been chosen, Scott’s solver addresses some potentially very time consuming operations by the clever use of tables, which significantly speed up some of the most frequent operations. For example, when a digit is chosen for a box, it constrains a whole neighborhood of choices for other boxes in its row, column, and 3x3 sub-grid. Computing this neighborhood can be time-consuming, and it must be done each time a digit is chosen. He therefore uses an 81-entry array, each of 20 grid numbers, to specify the "neighbors" of a box to accelerate updates.

Another example of the use of tables is in computing the empty box with the least number of choices. This involves counting the number of "1" bits in a byte, and it is accomplished by table lookup.

A profile of the execution of the solver while solving "EVIL01" is available here. It shows the number of times that each instruction in the solver was executed.

Several other actual and potential optimizations for the solver were noticed during development, to be discussed next. You may be motivated to further improve the solver’s performance.

The hardest "representative" puzzle to solve was "EVIL01". It takes 15 seconds to find the solution, and an additional 9.5 seconds to search the rest of the space to prove that there are no other solutions. There is little if any correlation between how hard a puzzle is for a human and how long it takes the solver to solve it. All the other "evil" puzzles (from http://www.sudokusolver.co.uk/grids_nologic.html) are solved in much less time.

There is one puzzle (in two forms) that is much harder to solve, though I don’t know its difficulty to a human. It is in the PUZZLES directory as VEVIL, and rotated 180 degrees as RVEVIL. VEVIL takes 2:01 (minutes:seconds) to find the solution, and an additional 1:09 to rule out any others. Interestingly, the corresponding times for the same puzzle rotated 180 degrees, RVEVIL, are almost 3 times longer! Clearly, the order in which the grid is scanned can have major effects on the "bushiness" of the search tree.

A short side note on accelerators—I have an 8MHz Zip Chip in my machine, and the solver runs about 6.8 times faster than a stock 1MHz Apple II. That means that it has very good cache locality, since 8MHz is its peak speed, and that can only be achieved with perfect cache hits and no stores!

Development History

The development of the Apple II version of SUDOKU began with a posting on comp.emulators.apple2 by Scott Hemphill. He reported that he had a Sudoku solver written in C that solved a hard puzzle in 4 milliseconds on a 3GHz Pentium 4 machine. He noted that its speed might make it a good candidate for Apple II implementation.

A couple of days later he posted the C source code, and a quick examination convinced me that it would fit nicely on an Apple II.

I contacted Scott by email and we began a conversation about the project.

I had coded about 80% of his solver in assembly language when I sent a message asking a question about the ‘setbox’ procedure. He responded immediately, and I discovered that he was also coding the solver for the Merlin assembler! He noted that the two-dimensional matrix ‘others’, which contains the indices of the "neighbors" of a grid box, could be more efficiently addressed if its 20-byte rows were "rounded up" to 24 bytes, allowing a simple x3x8 multiply of the subscript. This also permitted adding "sentinel" values to the end of each row, making it easier to free up an index register for the loop body by dispensing with count control.

I suggested storing the indices premultiplied by 2 for indexing into the ‘bits’ array, but since 2x80 > 127, the sign bit of the sentinel could not be used to quickly test out of the loop, so this was not done. Another possibility was splitting the ‘setbox’ loop into two loops, one for cases of ‘digit’<9 and one for cases =9. This allows both loops to be sped up somewhat, since it pulls the decision out of the loop. The improvement looked small, though, so it was left for later.

A potentially more significant possibility was splitting the 2-byte-per-entry ‘bits’ array into two sequential byte vectors. This change would require another ‘bits’ indirect pointer, pointing at ‘bits’+81, but would eliminate the need to multiply indices by 2 and could save some register manipulation. This, too, was left for a later decision.

A couple of days later, Scott made a more radical suggestion for ‘setbox’: replacing the ‘others’ array of neighbor indices with an array of code that setbox would simply branch to. If the ‘bits’ array were placed at a static location (instead of floating at the top of the ‘bitmap’ stack, the 65C02 TSB instruction could replace each "neighbor" index, and the entire ‘setbox’ loop would reduce to executing 20 in-line instructions (followed by an RTS)! The downside of this approach is that if ‘bits’ is kept at a static address, then it must be copied from the stack when it is popped, as well as to it when pushed, thus doubling the ‘bits’ copying time. By the next day, Scott had profiled his code and found that almost half the time was spent copying ‘bits’, so this change seemed like a losing proposition in spite of the huge speed improvement in ‘setbox’.

However, the profile shone a bright light on the most time-consuming operation in the solver—the ‘bits’ stack push operation. Scott observed that the only time ‘bits’ needed to be saved was when there were additional non-forced digit choices for the box after returning from the recursive call. If there were no choices, then ‘select’ would never look at the old ‘bits’ values, so there was no need to save (or restore) them.

A quick statistical study of thirty puzzles revealed that, since the solver’s heuristic always selects the box with the fewest possible digit choices, there was a very high likelihood that when the recursive call to select was made, the ‘bits’ array would not need to be saved at all. In fact, ‘bits’ needed to be pushed less than 10% of the time, for a 90% potential saving in the most time-consuming operation!

Scott then coded the recursion in two flavors: expensive recursion, where ‘bits’ would be pushed, and cheap recursion, where only the value of a single byte, ‘best’, would be saved. This single change almost doubled the speed of the solver.

This was the first version Scott sent to me for integration with the Applesoft front-end. I modified it to animate the placement and removal of trial digits, so that the screen display of the grid would be "live" and a user could see that the program was running. Scott expressed some concern that doing a screen update for each ‘setbox’ and ‘unsetbox’ might slow the program down excessively, and suggested making the animation optional.

By the next day, the solver had been integrated, directly animating the screen display and communicating its status to the calling BASIC program through a return code in memory. I used a table of screen pointers to do the screen update, so it only took a few cycles. The net impact on speed was less than 5%, so I decided not to bother with an option to turn it off. Besides, having something interesting to watch makes subjective time go much faster!

I sent the integrated program to Scott for his trials, and he made some suggestions about both the animation and the (lack of) preservation of the grid upon return to the editor from the solver. I made both changes, and cleaned up the program further, including adding a grid consistency check so that the solver won’t try to solve a grid whose "givens" violate the rules.

I then turned to "packaging" the first version of the program for release, and to writing this article. As often happens when reviewing what you have done, you discover things forgotten—in this case, I had forgotten to include a command to print the puzzle grid! So that led to another quick turn-around, and brought us to the first release point.

There were several optimizations that we considered but didn't make in the first version, and some more ideas were contributed on comp.sys.apple2 after the first version's release. Anton Treuenfels in particular made several suggestions for optimization, and Paul Schlyter suggested a version that did not require a 65C02 or 80-column firmware. So I decided to develop version 2.0 to run on all Apple IIs and incorporate some optimizations.

I realized that my aversion to code modification had prevented me from making some changes to increase speed, and I decided to see what lifting this restriction could do. I also traded more space for time, by page-aligning the 'bits' stack and making the frame a whole page, wasting 94 bytes per frame, but speeding up address modifications. The 'count' bit-count table was also page-aligned for similar reasons. I also took Anton's suggestion and added a table of pointers to rows of the 'others' matrix, eliminating the multiply-by-24 code and enabling both shorter rows (no sentinels) and entries that were pre-multiplied by 2. In version 2.0, I used address modification in 'setbox' and 'select', the real time-consumers, to simplify the code and reduce register juggling. I also added a table for multplying an index register by 2 into the other register, which saved some time.

Although it had nothing to do with the solver speed, I added a machine-language routine to paint the grid on the screen, making interaction much faster on a 1MHz machine. A short routine to clear the bottom 3 lines of the screen was also added to remove the dependency on the 80-column firmware's "clear to end of screen" control character.

The net impact of all the solver optimizations was a little more than an 18% speedup--which almost makes this a case study in why "technical" optimizations should be left to the end. The payoff for a significant amount of code change was much less than a factor of two. Of course, there may still be gold left to mine...

From here on, it’s up to you—have fun!