Page 11 of 62 MCM 2008 Team #3780

[Figure 10: Examples of choice histograms. (a) Original histograms; (b) histograms with mean removed. x-axis: number of choices. Colors: brown = easy, blue = medium, green = hard, red = evil.]

do that, we need to find the direct restrictions on each cell by examining the row, column and block that it is located in. Doing so in the least efficient way that is still reasonable, we look at each of the 8 other cells in those three groupings, even though some are checked multiple times, resulting in 24 comparisons per cell. For a total of 81 cells, this results in 1,944 comparisons being made. Of course, we only check when the cell is empty, and so for any puzzle, the number of comparisons is strictly less than 1,944. That bound is constant for all puzzles, and so we conclude that finding the WNEF is a constant-time operation with respect to the puzzle difficulty.

4 Metric Calibration and Testing

4.1 Control Puzzle Sources

In calibrating and testing the metrics, we used published puzzles from several sources and at several levels of difficulty, as labeled by their authors. The puzzles we obtained include the following:

• WebSudoku [4]
  – 10 Easy puzzles.
  – 10 Medium puzzles.
  – 10 Hard puzzles.
  – 10 Evil puzzles.

• Games World of Sudoku [7]
  – 10 ★ puzzles.
  – 10 ★★ puzzles.
  – 10 ★★★ puzzles.
  – 10 ★★★★ puzzles.

• GNOME Sudoku [5]
  – 2000 Hard puzzles.

• “top2365”²
  – 2365 Evil puzzles.

² This list of puzzles was obtained from [9] and named by regulars of the Sudoku Player’s Forum. By forum tradition, lists of test puzzles tend to get short and minimal names. Other names for lists include “topn87” and “subig20”.
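The per-cell restriction scan described above can be sketched as follows. This is a minimal illustration, not the authors' code; the 9×9 list-of-lists representation with 0 for empty cells, and the name `candidates`, are our assumptions.

```python
def candidates(puzzle, r, c):
    """Return the set of values still legal for the empty cell (r, c).

    Scans all 9 cells of the cell's row, column, and 3x3 block. The paper
    counts only the 8 neighbours in each grouping (24 comparisons);
    rescanning the empty cell itself, or overlapping cells, is harmless
    because 0 is never a legal Sudoku value.
    """
    assert puzzle[r][c] == 0
    seen = set()
    for k in range(9):
        seen.add(puzzle[r][k])          # row neighbours
        seen.add(puzzle[k][c])          # column neighbours
    br, bc = 3 * (r // 3), 3 * (c // 3)
    for i in range(br, br + 3):
        for j in range(bc, bc + 3):
            seen.add(puzzle[i][j])      # block neighbours
    return set(range(1, 10)) - seen
```

Because the scan touches a constant number of cells per empty cell, the constant-time bound claimed above follows directly.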
4.2 Testing Method

4.2.1 Defining Difficulty Ranges

In analogy with published puzzle collections, we separated our control puzzles into four broad ranges of difficulty: easy, medium, hard and evil. For the sake of brevity, we will often refer to these by the indices 1, 2, 3 and 4, respectively.

4.2.2 Information Collection

We used the control puzzles described in Section 4.1 to calibrate and test the metrics by running programs designed to calculate the metrics on each puzzle. The information collected from the program for each puzzle Pi included:

• |E(Pi)|, the total number of empty cells in Pi.

• C(Pi) = Σ_{X ∈ Pi} X?, the total number of possible choices over all cells.

• The choice histogram c defined in Equation 2.

4.2.3 Statistical Analysis of Control Puzzles

When looking for a possible correlation between the data and the difficulty level, we found that the number of empty cells and the number of total choices lacked any correlation. However, when we looked at the choice histograms for each puzzle, we noticed trends in the data. In easier puzzles, there seemed to be more cells with fewer choices than in the more difficult puzzles (Figure 10).

We then calculated wnef(P) for the control puzzles to explore the relationship further, and found a clear negative correlation between the difficulty level of P and wnef(P) for the control puzzles (Figure 11). This leads us to introduce wnef(d) as the mean WNEF of all control puzzles having difficulty d.
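The per-puzzle statistics listed above can be sketched as follows. This is an illustrative helper, not the authors' program; we assume the candidate-set sizes for the empty cells have already been computed and are passed in as a list.

```python
from collections import Counter

def puzzle_stats(choice_counts):
    """Collect the three per-puzzle statistics described above.

    choice_counts: one entry per empty cell, giving that cell's number
    of possible choices.
    Returns (|E(P)|, C(P), choice histogram).
    """
    n_empty = len(choice_counts)         # |E(P)|: number of empty cells
    total_choices = sum(choice_counts)   # C(P): total choices over all cells
    histogram = Counter(choice_counts)   # histogram c: count of cells per
                                         # number of choices
    return n_empty, total_choices, histogram
```

It is the histogram, not the two scalar counts, that ends up carrying the difficulty signal, as the analysis above notes.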
In order to conclude that the WNEF produces distinct difficulty levels, which is to say that wnef(d) ≠ wnef(d + 1) for d ∈ {1, 2, 3}, we conducted a hypothesis test for d = 1, 2, 3 with the following hypotheses:

H0: wnef(d) = wnef(d + 1)
Ha: wnef(d) ≠ wnef(d + 1)

To test these hypotheses, we used the following test statistic:

t* = (wnef(d) − wnef(d + 1)) / sqrt(s²_d / n_d + s²_{d+1} / n_{d+1})

where n_d is the number of control puzzles having difficulty d and where s²_d is the sample variance of the WNEF over control puzzles at level d (this data is shown in Table 1). With a significance level of α = 0.0025, we performed a hypothesis test using the Student’s t distribution, and found that t* > t_α. Thus, we rejected the null hypothesis for each of d = 1, 2 and 3, and concluded that the WNEF is able to distinguish different difficulty levels.

4.3 Choice of Weight Function

As alluded to in Section 3.3, we tried three different weighting functions for finding WNEF values: exponential, quadratic and linear.

w_exp(n) = 2^(9 − n)
w_sq(n) = (10 − n)²
w_lin(n) = 10 − n

where n is the number of choices for a cell. We discovered that regardless of the type of weighting function we used, the graphs of puzzle weight vs. difficulty all looked very similar, in that they all produced a strong negative correlation (Figure 12). We concluded that we could choose any of the three weighting functions, as long as we used the same function throughout. We arbitrarily chose w_exp.

5 Generator Algorithm

5.1 Overview

The generator algorithm works by first creating a valid solved Sudoku board, and then “punching holes” in the puzzle by applying a mask. The solved puzzle is created via an efficient backtracking algorithm, and the masking is performed via the application of various strategies. A strategy is simply an algorithm which outputs cell locations to attempt to remove, based on some goal. After any cell is removed, the puzzle is
d              1            2            3            4
μ̂_d = E(y)    0.2680756    0.1108268    0.09244832   0.04078146
σ̂²_d = s²_d   0.00096963   0.000502135  0.000255063  0.000125557

Table 1: Estimated means and variances of control WNEF metrics.

[Figure 11: WNEF for control puzzles by difficulty. x-axis: difficulty level (1.0–4.0); y-axis: WNEF value (0.05–0.25).]

[Figure 12: WNEF correlations for various weighting functions. x-axis: difficulty level; y-axis: WNEF value. Legend: blue = linear, green = quadratic, red = exponential.]
checked to ensure that it still admits a unique solution. If this test succeeds, another round is started. Otherwise, the board’s mask is reverted, and a different strategy is consulted. Once all strategies have been exhausted, we do a final “cleanup” phase where additional cells are removed systematically, then return the completed puzzle. For harder difficulties, we introduce annealing.

5.2 Detailed Description

As mentioned, our algorithm for generating a deterministic Sudoku board consists of two stages. We first generate a solution, and then remove cells until we reach the desired difficulty, as measured by the WNEF metric. Also important is the uniqueness test algorithm used heavily in the process of removing cells.

5.2.1 Completed Puzzle Generation

Completed puzzles are generated via a method called backtracking. A solution is generated via some systematic method until a contradiction is found. At this point the algorithm reverts to a previous state and attempts to solve the problem via a slightly different method. All methods should be attempted in a systematic manner. If a valid solution is found, then we are done.

Backtracking can be a slow process, and as such one must take care to do so in a smart and efficient manner. In order to gain better efficiency, we take the 2D Sudoku board and view it as a 1D list of rows. The problem is now reduced to that of filling rows with values; if we cannot fill a row, then we backtrack to the previous row. We are finished if we complete the last row.

This recasting of the problem also simplifies the constraints; with a little care we can make it so that we only need concern ourselves with the values in each column, and the values in the three clusters (or blocks) that the current row intersects. These two constraints may be maintained by updating them each time a new value is added to a row. There exist, of course, implementation details that one would need to iron out.
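The backtracking generation described above can be sketched as follows. This is our own minimal Python sketch (the names `generate_board` and `fill` are illustrative), filling cells in row order while maintaining the column and block constraint sets described above; the authors' actual implementation is the pseudocode of Section 5.3.

```python
import random

def generate_board():
    """Fill a 9x9 board in row order via randomized backtracking.

    Column and block constraint sets are updated as values are placed,
    so each cell only consults its partial row, its column set, and the
    set for the block it lies in.
    """
    board = [[0] * 9 for _ in range(9)]
    cols = [set() for _ in range(9)]
    blocks = [[set() for _ in range(3)] for _ in range(3)]

    def fill(pos):                       # pos = r * 9 + c, row-major order
        if pos == 81:
            return True                  # last row completed: done
        r, c = divmod(pos, 9)
        row_vals = set(board[r][:c])     # values already in this row
        choices = [v for v in range(1, 10)
                   if v not in row_vals
                   and v not in cols[c]
                   and v not in blocks[r // 3][c // 3]]
        random.shuffle(choices)          # randomize so each run differs
        for v in choices:
            board[r][c] = v
            cols[c].add(v)
            blocks[r // 3][c // 3].add(v)
            if fill(pos + 1):
                return True
            board[r][c] = 0              # contradiction: undo and retry
            cols[c].remove(v)
            blocks[r // 3][c // 3].remove(v)
        return False                     # no value fits: backtrack

    fill(0)
    return board
```

The sketch backtracks cell by cell rather than row by row, but it maintains exactly the column and block sets the text describes.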
To see our implementation, see Section 5.3.

5.2.2 Cell Removal

Having a solved puzzle is nifty, yes; however, it is not very useful on its own. In order to change it into a puzzle that is actually entertaining to solve, we perform a series of removals that we shall call masking.

The basic idea behind masking is that one or more cells are removed from the puzzle (or masked out of the puzzle), and then the puzzle is checked to ensure that it still has a unique solution. If this is not the case, then the masking action is undone (that is, the cells are added back into the puzzle).

Random masking is one of the simplest and fastest forms of masking. Every cell is masked in turn, but in random order. Every cell that can be removed is, resulting in a minimal puzzle. This is very fast and has the potential to create any possible minimal puzzle, though with differing probability.

Tuned masking is slower and cannot create a puzzle any more difficult than one obtained with random masking (though easier puzzles can be created if they are not minimal). The idea behind tuned masking is that we can increase the probability that a given type of puzzle is generated. This depends heavily on probability, and hence takes some tweaking to make accurate. It can be done, however, such that the desired type of puzzle will be generated the majority of the time. As such, it is possible to ensure the generation of the puzzle type in question by regenerating until the given type is produced. This has a terrible worst case; however, probabilistic analysis may be used to show that, assuming the tuning is configured well, the probability of not gaining the desired puzzle type on a second try is very small.

The issue here is something we like to call bleeding. A given tuning, when run many times, will produce a probability curve. In all likelihood, the produced puzzle will be of the type that constitutes the mean of the curve.
However, should the puzzle lie far from this mean, on a tail, it could overlap with a different tuning’s curve and hence produce a conflict (for example, you attempt to generate a hard puzzle and end up with an evil one). Spacing the tunings out and minimizing the spread of their curves is crucial to creating accurate tunings.
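Random masking, the simpler of the two schemes above, can be sketched as follows. This is an illustrative sketch, not the authors' code; `has_unique_solution` stands in for the uniqueness test of Section 5.2.3 and is taken as a parameter here.

```python
import random

def random_mask(board, has_unique_solution):
    """Random masking: visit every cell once, in random order, and remove
    (mask out) each cell whose removal leaves a uniquely solvable puzzle;
    otherwise undo the removal. The result is a minimal puzzle: no
    further cell can be removed without losing uniqueness.

    `has_unique_solution(puzzle) -> bool` is assumed to be the uniqueness
    test described in Section 5.2.3.
    """
    puzzle = [row[:] for row in board]
    cells = [(r, c) for r in range(9) for c in range(9)]
    random.shuffle(cells)                # random visiting order
    for r, c in cells:
        saved = puzzle[r][c]
        puzzle[r][c] = 0                 # mask the cell out
        if not has_unique_solution(puzzle):
            puzzle[r][c] = saved         # undo: masking broke uniqueness
    return puzzle
```

Tuned masking replaces the random visiting order with strategy-chosen cells, but the mask/check/undo loop is the same.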
Behind the tuning algorithm is a series of strategies. A strategy is simply a function that examines the board and returns the cell it would like to try to remove. This should be based on some rule: perhaps the cell is in a cluster that has a lot of other filled cells in it, or its value is one that is currently very common. A set of these strategies defines how a tuning attempts to reduce a board.

The second stage of tuning is performed right after a value is removed from the board. That is, the board is evaluated to see if it is of the type that the tuning is seeking, and then the tuning’s strategy is adjusted accordingly. In our example, if a board is found to be too difficult, then we might add back in a cell that will decrease the overall difficulty.

For our tuning, we are seeking a board with a given WNEF. As such, we apply strategies that will reduce the WNEF until we have reduced it sufficiently. Strategies that should have a large effect on the WNEF should not be applied if a low WNEF is not being sought. In the case that we reach a minimum WNEF that is not low enough, we can use a method from mathematical optimization known as simulated annealing. Here we add some number of values back into the board and then optimize from there, in the hope that doing so will allow us to reach a lower minimum. State saving then allows us, after a time, to revert to the board with the lowest WNEF. Experimentally, we observed that annealing allowed us to produce puzzles with lower WNEF values than we could without applying the technique. The details of this test are given in Section 19.

5.2.3 Uniqueness Testing

In order to ensure we generate boards with only one solution, we must be able to test whether this condition is true. There is a fast and a slow way of doing this. The fast way will find the uniqueness of any board which can be solved using logic.
Any board which does not conform to the rules of logic, but may still have a single solution, will fail the fast test. The slow test can determine this for any board.

The fast solution utilizes the two basic logic rules of Sudoku solving: Hidden Single and Naked Single. That is, any cell with only one possible value can be filled in with that value (Naked Single), and any cell that is the only cell in some reference frame (such as its cluster, row, or column) with the potential for some value may be filled in with that value (Hidden Single). These two logic processes are performed on a board until either the board is solved, indicating a unique solution, or no logic applies, which indicates the need to guess and hence a high probability that the board has multiple solutions. If this test succeeds, then we know the board has a solution, as we generated the board from a solution. On the other hand, it may produce false negatives, and reject a board with a unique solution.

The slow solution is to try every valid value in some cell, and ask if the board is unique for each. If more than one value produces a unique result, then the board has more than one solution. This solution calls itself recursively to determine the uniqueness of the board with the added values. The advantage of this solution is that it is completely accurate, and will not result in false negatives.

A hybrid method is to utilize the slow solution in the case that the fast one fails. A further optimization is to restrict the number of times the slow solution may be used. This is similar to saying “if we had to guess more than twice, then we reject the board.” In the interest of expedience, it is the hybrid method that we adopt here. This allows us to prevent a large number of false negatives while still offering quick solutions.

5.3 Pseudocode

5.3.1 Completed Board Generation

Given an empty 9 × 9 array that we shall call “board”, do the following:

1. Fill the top row of the board with a random permutation of the sequence 1 through 9.

2. Initialize a 9-element array of lists. This shall hold all numbers placed so far in each column.

3. Initialize a 3-element array of lists. This shall hold all numbers placed in the three clusters that the current row (right now, this is the first row) spans.

4. Add the values of the first row to their respective column lists.