I sold my soul for 2400 ELO
They say chess is a game of strategy. Writing a chess engine, on the other hand, is a game of suffering... Do you enjoy suffering? Well, seeing as you’re on my blog, you probably do. So, let’s dive right into the obituary of my sanity.
A quick analogy
You know, building a chess engine kind of reminds me of going to Subway. You mix and match different parts that look like they’d work well together, and you end up creating a monstrous mess that tastes like absolute trash. Except instead of a bad-tasting sandwich, you end up with a program that barely works, hangs its queen, and dies with Segmentation fault (core dumped).
But maybe the real Subway was the friends we made along the way. Or more like the sanity we lost along the way.
Board representation
There are typically two options for representing the board: mailbox and bitboard.
Mailbox is the intuitive option: you store an 8x8 array of characters, each character denoting a piece type. It’s a lot easier to code, but also a lot slower.
Bitboards, on the other hand, rely on the fact that chess boards and 64-bit integers both have 64 slots. Essentially, we use several integers to represent the full board (one per piece type and color). Bitboards are a ton faster than mailbox boards, since we can use bitwise tricks to generate moves. If you want to take a look at our move generation (good luck understanding anything), it’s here: https://github.com/kevlu8/PZChessBot/blob/main/engine/movegen.cpp
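To give you a taste of why bitboards are fast, here’s a tiny sketch (not our actual code, and assuming the common a1 = bit 0 square mapping): generating every single-square pawn push for white is literally one shift and one mask.

```cpp
#include <cstdint>

using Bitboard = uint64_t; // one bit per square: a1 = bit 0, ..., h8 = bit 63

// All single pushes for white pawns: shift every pawn up one rank,
// then mask out destination squares that are already occupied.
Bitboard white_pawn_pushes(Bitboard white_pawns, Bitboard occupied) {
    return (white_pawns << 8) & ~occupied;
}
```

Two instructions move all eight pawns at once. Try doing that with a mailbox array.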
Obviously, bitboards were the natural choice. After all, why have a good time when you can have a bad time?
“Well, how hard could it be to just generate all legal moves? Surely it’s basically just an implementation exercise?”
Rooks and bishops start off simple: they just slide in straight lines! And the queen is just a rook-bishop fusion.
But the knight moves in an L shape. Okay, weird but not too bad.
Pawns just move forward, right? Wrong. They can move two squares on their first move, and they don’t even capture in the same direction as they move. Oh, and en passant. You know, I really do think the people who invented en passant wanted to cause computer scientists as much pain as possible.
The king? Well, he just moves one square. It can’t be that hard, can it? Oh, wait, there’s castling. Oh, wait, he can’t move into check. Oh, wait, if the rook has moved, the king can’t castle on that side either.
Actually, there’s an easier way to deal with kings. It’s simple: don’t check for the king moving into check at all, and let the search function deal with it instead. This works because the evaluation dips to -30000 whenever a king gets captured, so the engine realizes on its own that moving into check is never optimal.
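In code, the trick looks roughly like this (a sketch; `Board` and `can_capture_enemy_king()` are made-up names, not our actual API):

```cpp
constexpr int KING_VALUE = 30000; // effectively "game over"

int search(Board &board, int depth, int alpha, int beta) {
    // Moves are pseudo-legal, so the previous player may have left their
    // king hanging. If we can capture it, their last move was illegal:
    // report a winning score, which negates to -30000 at the parent node.
    if (board.can_capture_enemy_king())
        return KING_VALUE;
    // ... normal search continues ...
}
```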
After spending many painstaking hours hardcoding each rule into the move generation function, I run a test.
Expected moves: 20 Got: 68.
What’s going on? After opening my debugger, I realize that somehow I’m castling through my own pieces and taking my own pieces. Okay, easy fix. Now it’s generating depth-1 moves correctly. Time for depth 2!
Expected moves: 4902 Got: 4896
Oh... A few hours later, I realize that my engine isn’t handling double checks correctly. Okay, fixed it.
Expected moves: 4902 Got: 4901
What? Off-by-one? Seriously?? And why are depth-1 tests failing again?
I swear if this is an en passant issue- oh, yep, it was an en passant bug. Okay, now it’s fixed and passing all tests!
Surely that’s the hardest part of the engine! The rest is easy!
Evaluation
In all honesty, it wasn’t that bad. For starters, we can simply count up all the material for each side on the board, and compare it. Easy! We can also add piece-square tables, which are 2-dimensional arrays of values that denote how “good” each square is for each piece. For example, the pawns are better towards the center of the board than they are towards the edge. This allows the engine to understand basic positional chess.
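A minimal sketch of what this looks like (the numbers are illustrative, not our actual values):

```cpp
// Material values in centipawns: P, N, B, R, Q, K.
constexpr int PIECE_VALUE[6] = {100, 320, 330, 500, 900, 0};

// Piece-square table for white pawns, indexed a1 = 0 ... h8 = 63.
// Central and advanced pawns earn a bonus; edge pawns earn nothing.
constexpr int PAWN_PST[64] = {
     0,  0,  0,  0,  0,  0,  0,  0, // rank 1 (pawns can't be here)
     5,  5,  5,  0,  0,  5,  5,  5,
     5,  0, 10, 15, 15, 10,  0,  5,
     0,  0, 10, 25, 25, 10,  0,  0,
     5,  5, 10, 20, 20, 10,  5,  5,
    10, 10, 15, 20, 20, 15, 10, 10,
    30, 30, 30, 30, 30, 30, 30, 30, // rank 7: about to promote
     0,  0,  0,  0,  0,  0,  0,  0,
};

// A white pawn's contribution to the evaluation: material plus square bonus.
// For black, mirror the square vertically with (square ^ 56).
int score_white_pawn(int square) {
    return PIECE_VALUE[0] + PAWN_PST[square];
}
```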
Maybe I was right. The rest should be easy now.
Search
Well, it’s just recursion, right? After implementing a basic search algorithm, I try playing against it.
- e4
...
...
Hello? Are you going to play a move?
...
Hmm... right. I need to implement alpha-beta pruning. Just a few lines of code for a square-root-level speedup: with good move ordering, O(m^n) becomes O(m^(n/2)).
Alpha-beta pruning is a method to speed up search drastically by checking to see whether or not reaching a position is realistic. If we encounter a position that’s worse than the best guaranteed position we’ve encountered, we don’t need to consider it because we know we have a better alternative. In simple terms, if we know a move is losing, we don’t need to waste time computing exactly how it’s losing, since it wouldn’t matter to us either way.
Also, it is important to sort the positions we examine by how good we think they are, since alpha-beta is more effective when the positions are ordered from best-to-worst. If we establish a good guaranteed position, we can more effectively decide between that position and the others. Alternatively, if we start with the worst possible position, we would have to update the best guaranteed position every move, wasting a ton of time.
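Putting the two ideas together, the whole search is barely a dozen lines. Here’s a bare-bones negamax sketch, assuming hypothetical `Board`, `generate_moves`, and `evaluate` helpers (this is not our actual search):

```cpp
int alphabeta(Board &board, int depth, int alpha, int beta) {
    if (depth == 0)
        return evaluate(board); // static eval from the side to move's view
    for (Move m : generate_moves(board)) { // ideally sorted best-first
        board.make(m);
        // Negamax: the opponent's best score, negated, is our score.
        int score = -alphabeta(board, depth - 1, -beta, -alpha);
        board.unmake(m);
        if (score >= beta)
            return beta; // opponent has a better option elsewhere: prune
        if (score > alpha)
            alpha = score; // new best guaranteed score
    }
    return alpha;
}
```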
Anyways, let’s try playing a game against it!
- e4 a5
Huh. I guess I didn’t tell it to prioritize the center. But it’s fine as long as it doesn’t blunder pieces. - d4 Ra6
What? It just hung a rook? Okay, maybe it will take back at least. - Bxa6 Nc6
Okay bro. What the hell.
After another 5 hours spent debugging (and questioning life choices), the problem appears to be in the way I’m doing recursion. Oops... forgot a negative sign. Also, let’s tweak the evaluation function to consider control of the center.
- e4 e5
Is this it? Is the engine competent now? Let’s not jinx it. - Nf3 Nc6
Let’s see if it will take a hanging knight. - Nxe5 Nxe5
No way. No way. This is it.
So it’s playing coherent moves now. Nice! Let’s see how it does on some tactics.
After implementing tests for tactics, it’s the big moment:
Tests passed: 2 Tests failed: 8
Damn. Okay, maybe we need more than just bare alpha-beta search.
Time to implement quiescence search, where we keep searching captures and other “loud” moves after we’ve finished our regular search. Since our evaluation function is more accurate in quiet positions without explosive tactics, maybe this will help.
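As a sketch, quiescence search looks a lot like the main search, just restricted to captures and with the option to stop early (same hypothetical helpers as before):

```cpp
int qsearch(Board &board, int alpha, int beta) {
    // "Stand pat": we are never forced to keep capturing, so the static
    // evaluation of the current position is a lower bound for us.
    int stand_pat = evaluate(board);
    if (stand_pat >= beta)
        return beta;
    if (stand_pat > alpha)
        alpha = stand_pat;
    for (Move m : generate_captures(board)) {
        board.make(m);
        int score = -qsearch(board, -beta, -alpha);
        board.unmake(m);
        if (score >= beta)
            return beta;
        if (score > alpha)
            alpha = score;
    }
    return alpha;
}
```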
Tests running...
...
...
Okay, or it’s just going to hang. Also, is it just me or is it getting hot in here? Oh wait, that’s my CPU melting.
Well, this is expected. Of course quiescence search will slow down our search. I guess I’ll just change it to search to a lower depth. Depth 4 sounds reasonable.
Tests passed: 9 Tests failed: 1
Okay, more improvement! The only test it’s failing is a stalemate trap, but that’s easily fixable.
And running an SPRT (Sequential Probability Ratio Test) against the previous engine, we see a +60 ELO gain. Not bad!
Time to play against it.
I lost?? No way. The engine is competent now! Let’s enter it into the bot league tournament and see how it does.
...
Zero points. ZERO. Not even a draw against another bot. Oh man.
Well, to be fair, there are many improvements to be made. One of them is the addition of a transposition table – like DP memoization, but for chess positions. We can then use the cached results from the transposition table to improve our move ordering, which in turn speeds up alpha-beta by ensuring we explore fewer nodes.
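For the unfamiliar, an entry in the table looks roughly like this (a sketch assuming Zobrist hashing; real entries also store a bound type):

```cpp
#include <cstdint>

using Move = uint16_t; // from/to squares packed into 16 bits (sketch)

struct TTEntry {
    uint64_t key;   // Zobrist hash of the position
    Move best_move; // best move found when we last searched this position
    int16_t score;  // score from that search
    uint8_t depth;  // depth that score was searched to
};

// With 16-byte entries, 4M slots makes a 64 MB table.
TTEntry table[1 << 22]; // indexed by hash & (size - 1)
```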
After another 10 or so hours spent on debugging hashing, transposition tables were finally implemented. Time to run another SPRT test!
-150 ELO.
What????????????? How is it this much worse? I thought transposition tables were supposed to help.
Okay, I don’t want to deal with this mess. Maybe I should get some help from professionals.
After getting constructively cooked by Stockfish developers, I realized that I was actually using the transposition table wrong. Instead of trying to use entries from the table to sort moves, you’re actually only supposed to use the best move from the entry and search it first.
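Concretely, the fix amounts to something like this at the top of the search (sketch; `move_to_front` is a made-up helper):

```cpp
// Probe the table. On a hit, don't sort anything by cached scores --
// just make sure the stored best move gets searched first.
TTEntry &entry = table[board.hash() & ((1 << 22) - 1)];
if (entry.key == board.hash())
    move_to_front(moves, entry.best_move);
```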
It seems counterintuitive, but the SPRT test confirms the results: +50 ELO.
Also, at this point I realized that the standard library containers I was using were notoriously slow. I had known about the problem with std::unordered_map<>, which is why I wrote my own hashtable, but even std::vector<> was a problem. Writing my own vector led to a 40% speedup. No exaggeration.
I then hosted the bot on Lichess and had it play a couple of games to see how it did.
+45 ELO. Huh, I guess the LMR implementation I’d added in the meantime was wrong.
Yep, it was definitely wrong. LMR, or Late Move Reductions, is a heuristic that allows us to reduce the search depth for nodes that are ordered towards the end of the move list. Its accuracy heavily depends on the accuracy of the move ordering, and since our move ordering is basically nonexistent, it makes no sense to add LMR. Also, LMR requires a re-search whenever a move exceeds alpha (i.e. the result wasn’t as bad as what we initially thought), which I did not do. Oops.
But first of all, the more pressing issue is my handling of threefold repetition. I’ve seen my bot draw so many clearly winning games because it didn’t know about threefold repetition, so it’s towards the top of my to-do list.
Surprisingly, it only took me around 15 minutes to implement threefold detection, which was weird since I had tried before and failed miserably. Well, the SPRTs indicate another +60 ELO gain so I can’t complain!
Back to LMR. But first, a quick detour: I found an issue in our draw detection where the engine thought it could claim a threefold even while in check, which would lead to illegal moves in very rare situations. With that fixed, I implemented a basic LMR formula that reduces the search depth by 2 for moves ordered past index 31. Although this reduction seems very small, it roughly doubled search speed while still maintaining accuracy!
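In the move loop, the whole heuristic is just a few branches (a sketch with the same hypothetical helpers; real implementations add extra guards for checks and captures):

```cpp
int score;
if (move_index > 31 && depth >= 3) {
    // Late move: probe at reduced depth first (depth - 1, minus 2 more).
    score = -alphabeta(board, depth - 3, -alpha - 1, -alpha);
    if (score > alpha)
        // It beat alpha after all -- verify with a full-depth re-search.
        score = -alphabeta(board, depth - 1, -beta, -alpha);
} else {
    score = -alphabeta(board, depth - 1, -beta, -alpha);
}
```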
This is eerie... I’ve implemented several features without much ELO loss or too many bugs. I have a bad feeling about this...
Another interesting optimization is the MVV-LVA move ordering scheme, which stands for Most Valuable Victim-Least Valuable Attacker. This optimization prioritizes captures made by lower-value pieces on higher-value pieces, since they are usually better than captures made by higher-value pieces on lower-value pieces.
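The scoring function itself is tiny. A sketch (piece order P, N, B, R, Q, K):

```cpp
constexpr int VALUE[6] = {1, 3, 3, 5, 9, 20};

// Higher scores get searched first. PxQ scores 100*9 - 1 = 899, while
// QxP scores 100*1 - 9 = 91, so the pawn takes the queen first.
int mvv_lva(int victim, int attacker) {
    return 100 * VALUE[victim] - VALUE[attacker];
}
```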
Before this, we were sorting moves by a static evaluation of the board after playing each move, so I’m expecting good results. Before running an SPRT though, let’s take a look at the number of nodes the engine explores before and after.
Before: ~1M
After: ~50M
What? I’m not dealing with this. There’s probably a bug somewhere, but I really don’t want to expend the effort to find it. I’ll just disable MVV-LVA entirely.
Wait... Since MVV-LVA only works for captures, what if I only use it in quiescence search?
Before: ~33M
After: ~2.5M
Wait what? That’s so much better than expected. Straight to the SPRT!
+80 ELO
Nice! Although, some parameters had to be adjusted since sorting introduced quite a large overhead to the search speed.
Another improvement that I added was PV search. PVS, or Principal Variation Search, is a method where we only deeply search the main line (for those of you more acquainted with alpha-beta, we search the other moves on a null-window). The general idea is that the first move in the ordering (ideally) is probably the best line, and therefore we can search the other moves less. However, if we find a move that does better than the first, we re-run a full search on it. This scenario is less likely, so overall this approach does speed things up a bit.
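In the move loop, PVS looks something like this (sketch, same hypothetical helpers):

```cpp
int score;
if (move_index == 0) {
    // First move: presumably the principal variation, so use the full window.
    score = -alphabeta(board, depth - 1, -beta, -alpha);
} else {
    // Later moves: a cheap null-window search just to prove they're worse.
    score = -alphabeta(board, depth - 1, -alpha - 1, -alpha);
    if (score > alpha && score < beta)
        // Surprise -- it might actually be better. Re-search fully.
        score = -alphabeta(board, depth - 1, -beta, -alpha);
}
```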
With these improvements, the engine is now able to explore up to depth 10 each move. Pretty good!
Hmm. I’m still seeing some weird glitches in the bot’s evaluation. For instance, how on earth did it see this position after Kxd2 as 0.0?
Seeing as it was the day before the second bot league tournament, I decided to dig into the bug, even though it was 10PM and I had several assignments due tomorrow (blame AP Physics C, arguably my least favorite high school course). The first issue that I noticed was that the board somehow ...broke after trying out castling moves for black. Somehow, the black king and rook would change into a white king and rook. Why in the world would this be happening?
...Oh. So apparently trusting GitHub Copilot to generate repetitive code isn’t the greatest idea. We had originally only written the correct code for white kingside castling, and trusted Copilot to fill in the rest correctly. Turns out that it also used the WHITE_KING and WHITE_ROOK values for BLACK castling, leading to this bizarre behavior. How this wasn’t spotted earlier really baffles me.
Anyways, that’s fixed. But the engine still thinks it’s a draw?
Oh god. This is probably the worst-case scenario. My method of checking for threefold draws was completely wrong. A hashtable-like approach is bound to fail, due to the huge number of collisions that can occur in a single search.
Let’s do some very simple math. Don’t worry, this will make sense, unlike literally any other math blog.
A typical search (of ours) explores 10 million nodes. Our hashtable could hold 4 million distinct elements at once. By the pigeonhole principle alone, that’s far too small! A temporary fix would be to expand it, but even then, the probability of a collision is extremely high.
Instead, we can observe that threefold repetition can only happen down a single line. So, we can store an array of the position hashes along the current line, and whenever we need to check for a threefold, we loop through that array and count how many times the current position has already occurred. Easy!
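The whole check is a handful of lines (a sketch with hypothetical names):

```cpp
#include <cstdint>

// `history` holds the Zobrist hash of every earlier position on the
// current line. If the current hash already appears twice, this is the
// third occurrence: a draw by repetition.
bool is_threefold(const uint64_t *history, int length, uint64_t current) {
    int seen = 0;
    for (int i = 0; i < length; i++)
        if (history[i] == current && ++seen == 2)
            return true;
    return false;
}
```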
This method might seem slow - it is O(n), as opposed to the O(1) hashtable method. However, the engine is actually faster in practice with this method, because it is far more cache-friendly. The array is a contiguous chunk of memory, usually not exceeding 400 bytes, meaning it fits entirely in L1 cache - the fastest memory the CPU can access. Meanwhile, probing the 64-megabyte hashtable requires random memory accesses all over the place, which is much slower. This is worsened by the fact that virtually no consumer CPU has a cache large enough to hold the entire table, so we were performing extremely expensive fetches from RAM.
...Re-reading what I just wrote, it may have been too technical. Just know that small arrays stay in the CPU’s fastest memory (the L1 cache), whereas large ones (like hashtables) force expensive fetches from RAM. For reference, an L1 cache hit takes around 2 nanoseconds, while RAM can take upwards of 50 nanoseconds.
With this change, I was maybe expecting an ELO change of 50. Let’s run the SPRTs!
Results of new vs old (8+0.08, NULL, NULL, 8moves_v3.pgn):
Elo: 247.69 +/- 42.79, nElo: 353.60 +/- 43.96
LOS: 100.00 %, DrawRatio: 17.50 %, PairsRatio: 32.00
Games: 240, Wins: 169, Losses: 22, Draws: 49, Points: 193.5 (80.62 %)
Ptnml(0-2): [1, 2, 21, 41, 55], WL/DD Ratio: 6.00
LLR: 2.96 (-2.94, 2.94) [0.00, 8.00]
Uh. What. The. Hell.
???????????????????????
HUH??????????
I mean... I’ll take it. But I didn’t expect it to improve playing strength this much.
Anyways, time to continue with the improvements. Next on the list is... Null move pruning!
Null move pruning works off the null move observation - that is, the assumption that almost all of the time, doing nothing and skipping your turn is worse than the best move that can be played in a position. This is very rarely untrue, usually only in zugzwang positions.
So, we play a null move and continue the search from there. If the score we get back is still at least beta, we can cut off and return immediately. Literally 5 lines of code!
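Here they are, more or less (a sketch placed before the move loop; the depth reduction and the guards are common conventions, not necessarily our exact ones):

```cpp
// Give the opponent a free move. If they still can't get us below beta,
// this position is so good we can cut off immediately. Skipped when in
// check (a null move would be illegal) and in pawn-only endgames (zugzwang).
if (!board.in_check() && depth >= 3 && board.has_non_pawn_material()) {
    board.make_null_move();
    int score = -alphabeta(board, depth - 3, -beta, -beta + 1);
    board.unmake_null_move();
    if (score >= beta)
        return beta;
}
```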
My engine got +140 ELO from that. Great!
At this point, it was comfortably 2000 on Lichess. For reference, that’s around 2300 by human standards!
I wasn’t too sure how to proceed. I could either continue tuning my search, or start working on NNUE, a concept completely new to me.
Naturally, I picked the more painful option. NNUE time!
NNUE
NNUE stands for Efficiently-Updatable Neural Network. Yes, the acronym is reversed. No, I don’t know why. It was invented in 2018, initially in the Shogi community, and then someone tried adding it to Stockfish. This led to Stockfish gaining over 400 ELO points across the next few releases.
Why is NNUE so powerful? Well, HCE (hand-crafted evaluation) only goes so far when it comes to accurately assessing positions. There are simply way too many rules for HCE to capture. Neural networks, on the other hand, are really good at picking up these kinds of subtleties. The problem? Neural networks are generally very slow to evaluate. How can we fix this?
The key observation is that in a chess position, there generally isn’t too much change between moves. So, if our network is simple enough, we can reuse most of the values.
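Here’s a sketch of the “efficiently updatable” part, with made-up dimensions (real NNUE features are king-relative; this uses plain piece-on-square features, 12 piece types x 64 squares = 768):

```cpp
#include <cstdint>

constexpr int HIDDEN = 256;

int16_t accumulator[HIDDEN];  // the first layer's pre-activation output
int16_t weights[768][HIDDEN]; // one column of weights per (piece, square)

// A quiet move only toggles two features, so instead of recomputing the
// whole first layer, we subtract one weight column and add another.
void update_on_move(int from_feature, int to_feature) {
    for (int i = 0; i < HIDDEN; i++) {
        accumulator[i] -= weights[from_feature][i]; // piece leaves its square
        accumulator[i] += weights[to_feature][i];   // piece lands on a new one
    }
}
```

The rest of the network runs on top of the accumulator as usual; only this first (and biggest) layer gets the incremental treatment.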
I expected this to be extremely tedious and painful, but surprisingly, it was only around 200 lines of code. Plus, training only took 5 minutes! First off though, I only implemented the NN eval and left the UE part to be done later. SPRT time!
Elo: -inf +/- -nan
????
What... Surely there’s a bug in my NN implementation, right?
I meticulously debugged every part of the neural network, and saw nothing wrong. It was working exactly as expected.
At this point, I was completely stumped. I went to the Stockfish Discord again to ask, but everyone else was equally stumped. This should not be happening. Not in a million years. What in the world was going on?
[3 hours later]
Oh. My. God. You cannot be serious. This might actually be the stupidest bug I spent time trying to fix.
Turns out, I forgot to put the network weights in my testing folder. That meant that my engine was running off of... no network. That’s why it was losing every game.
Here are the chat logs of when I realized the issue:
kevlu8: FIXED
kevlu8: tu seras pizdecé quand tu verras la fixe
wdotmathree: OH
wdotmathree: mAIS
wdotmathree: NON
wdotmathree: C’est
wdotmathree: Pas
wdotmathree: Vrai
kevlu8: c’est ce que je dis
kevlu8: je suis une informaticien terrible
(Yes, we communicate in a mix of several languages. Don’t judge. Rough translation: “you’re going to lose your mind when you see the fix” / “OH BUT NO WAY, that can’t be true” / “that’s what I’m saying, I’m a terrible programmer.”)
Well, after coming down from the dopamine high of fixing that stupid bug, let’s take a look at the SPRT results.
Elo: 309.44 +/- 58.72
Great. Amazing. Wonderful, even. This network was trained from Stockfish evaluation data, though, so it wasn't entirely my own. But this is good progress! I’ll train the NNUE on my own data sooner or later.
I feel like this is a good stopping point for this blog -- the bot was now consistently hovering around 2400 on Lichess, able to beat most strong players.
So, what have I learned from all this?
- Writing a chess engine should be classified by the CIA as a form of cruel and unusual punishment
- The UCI protocol should really have more standardization
- People who do this for a living are inhuman
- The remaining tatters of my sanity are shriveling up while writing this
But was it worth it?
Don’t ask me that.
