
Preparing for next analysis

Chess, Software Development, Lichess
Recapping cheater gameplay and moving onwards

Introduction

For the past week, I've been working mainly on a new analysis of cheater gameplay. However, I ran into a lot of problems along the way. I am still working on it, but I wanted to post an update and maybe gather some ideas from readers.

Managing a large dataset

The Lichess monthly dataset comes as a 32GB PGN file. Most games in it have timestamps, but only a few also have evaluations. I select only the games that have both evaluations and timestamps and store them move by move, which results in a dataset of over 40GB. The size of the dataset is usually not a problem in itself, but some issues do follow from it.
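
The selection step can be sketched as a streaming filter, so the full file never has to sit in memory. This is only a minimal sketch, not the code used for this post. It assumes the annotations appear as `[%eval ...]` and `[%clk ...]` comments in the movetext, and that no movetext line starts with `[`. The function names are made up for illustration.

```python
from typing import Iterable, Iterator


def iter_games(lines: Iterable[str]) -> Iterator[str]:
    """Yield one game at a time (headers plus movetext) from PGN lines."""
    buffer: list[str] = []
    seen_moves = False
    for line in lines:
        # A header line that follows movetext marks the start of a new game.
        # Assumption: movetext lines never start with "[".
        if line.startswith("[") and seen_moves:
            yield "".join(buffer)
            buffer, seen_moves = [], False
        if line.strip() and not line.startswith("["):
            seen_moves = True
        buffer.append(line)
    # Emit the final game if anything non-blank is left over.
    if any(part.strip() for part in buffer):
        yield "".join(buffer)


def has_eval_and_clock(game: str) -> bool:
    """True when the game carries both engine evaluations and clock times."""
    return "[%eval" in game and "[%clk" in game


def evaluated_games(lines: Iterable[str]) -> Iterator[str]:
    """Keep only games that have both annotations."""
    return (game for game in iter_games(lines) if has_eval_and_clock(game))
```

An open file object can be passed directly as `lines`, since iterating over a text file yields one line at a time.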

Recapping on cheater analysis part 1

While working on the new analysis, I found a lot of limitations in my previous analysis:

  • players might not have cheated in the 1+0 bullet games
  • the timeframe of cheating is unclear
  • the proportion of cheated games is unclear

For the new analysis, I have been working with more time controls: 60+0, 180+0, 180+2, 300+0, and 600+0. I don't allow every time control because that complicates any analysis involving a time component. Since Lichess provides timestamps for most games, the analysis has to use some relative definition of time.
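
One possible relative definition, shown here only as a sketch, is the time spent on each move as a fraction of the base time. It assumes the time control is written as `base+increment` in seconds, that each clock value is one player's remaining time in `H:MM:SS` form after their move, and that the increment has already been added to that value. All function names are hypothetical.

```python
ALLOWED_TIME_CONTROLS = {"60+0", "180+0", "180+2", "300+0", "600+0"}


def is_supported(time_control: str) -> bool:
    """True for the time controls covered by the analysis."""
    return time_control in ALLOWED_TIME_CONTROLS


def parse_time_control(time_control: str) -> tuple[int, int]:
    """Split 'base+increment' (both in seconds) into two integers."""
    base, increment = time_control.split("+")
    return int(base), int(increment)


def clock_to_seconds(clock: str) -> int:
    """Convert an 'H:MM:SS' clock string to seconds."""
    hours, minutes, seconds = clock.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def relative_move_times(clocks: list[str], time_control: str) -> list[float]:
    """Time spent per move as a fraction of the base time.

    `clocks` holds one player's consecutive clock readings. The first
    reading is only a reference point, so the result has one entry fewer.
    """
    base, increment = parse_time_control(time_control)
    seconds = [clock_to_seconds(clock) for clock in clocks]
    fractions = []
    for previous, current in zip(seconds, seconds[1:]):
        # Assumption: the increment is already included in `current`.
        spent = max(previous + increment - current, 0)
        fractions.append(spent / base)
    return fractions
```

With this definition, 2 seconds spent in a 60+0 game counts the same as 20 seconds spent in a 600+0 game, which makes the time controls comparable.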

Incorporating most time controls into the analysis

This is not that big a problem in itself. The real problem is identifying the type of every account. Thankfully, a commenter on my previous post introduced me to berserk, a Python client for the Lichess API. However, because of the rate limit, it was not much more efficient than web scraping, which takes 1 minute per 100 accounts. About 1 million players played games in June, so the code would need around 6.9 days to check every player, which is not happening. If you have any good suggestions for identifying large numbers of Lichess accounts, please let me know in the comments. Please...
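
The 6.9-day figure, and what batching would change, can be checked with a short sketch. The batching part is hypothetical: `fetch_batch` stands in for whatever call looks up several accounts in one request and is not a real berserk function. The Lichess API documentation describes a bulk user lookup, but the batch size of 300 used here and the shape of the response are assumptions to verify against those docs.

```python
import time
from typing import Callable, Iterator, Sequence


def estimate_days(n_accounts: int, accounts_per_minute: float) -> float:
    """Days needed to check every account at a fixed throughput."""
    return n_accounts / accounts_per_minute / 60 / 24


def chunked(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Split a sequence into consecutive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_statuses(
    usernames: Sequence[str],
    fetch_batch: Callable[[list[str]], dict[str, str]],  # hypothetical lookup
    batch_size: int = 300,       # assumed limit, check the API docs
    pause_seconds: float = 1.0,  # assumed polite delay between requests
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Look up account statuses in batches, pausing between requests."""
    statuses: dict[str, str] = {}
    for batch in chunked(usernames, batch_size):
        statuses.update(fetch_batch(batch))
        sleep(pause_seconds)
    return statuses


print(round(estimate_days(1_000_000, 100), 1))  # one-by-one rate from the post
```

If batches of 300 with a one-second pause were allowed, 1 million accounts would take about 3,334 requests, which is under an hour instead of days. Whether the rate limit permits that pace is something to confirm first.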

Timeframe of cheating

This part is really hard to determine. The best I can do seems to be to fetch the account status as soon as possible and assume that some game in June got the account banned. This ties into the next topic.

The proportion of cheated games

Another useful comment pointed me to an API for detecting cheated games, but using it would require a lot of changes to my dataset. My first intuition was to identify accounts in order, from the players with the most games to those with the fewest. I decided to work on the top 1% of players by number of games played; these players each played more than 82 games in the past month. However, when split by account status, the gameplay of cheaters and normal players was not vastly different. This is probably because of the many games each player contributes to the sample: if a cheater played 82 games, most of those games would still be normal. I will try analyzing only the last few games they played (maybe 5?) and looking at the difference.
I then started to think it may be better to look at players who got banned after playing only 3 to 5 games on Lichess. These players may show stronger traces of cheating in the few games they played.
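
The three selections discussed above (the most active 1% of players, each player's last few games, and players with only 3 to 5 games) can be sketched as follows. The record layout is an assumption made for illustration: each game is a dict with `white`, `black`, and a sortable `utc` timestamp string.

```python
from collections import Counter


def games_per_player(games: list[dict]) -> Counter:
    """Count how many games each player appears in, as White or Black."""
    counts: Counter = Counter()
    for game in games:
        counts[game["white"]] += 1
        counts[game["black"]] += 1
    return counts


def top_fraction_threshold(counts: Counter, fraction: float = 0.01) -> int:
    """Smallest game count among the most active `fraction` of players."""
    ordered = sorted(counts.values(), reverse=True)
    cutoff = max(1, int(len(ordered) * fraction))
    return ordered[cutoff - 1]


def last_n_games(games: list[dict], player: str, n: int = 5) -> list[dict]:
    """The player's n most recent games, oldest first."""
    own = [g for g in games if player in (g["white"], g["black"])]
    return sorted(own, key=lambda g: g["utc"])[-n:]


def players_in_range(counts: Counter, low: int = 3, high: int = 5) -> list[str]:
    """Players whose total number of games lies between low and high."""
    return sorted(p for p, c in counts.items() if low <= c <= high)
```

Restricting banned accounts to `last_n_games` or to `players_in_range` is meant to raise the share of cheated games in the sample, on the assumption that the ban followed shortly after the cheating.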
If you have any suggestions about these categories as well, I would love to hear them.

Future posts

Until I get some good results, I will hold off on posting more cheater analysis. I am also disappointed with the level of my analysis so far. If you have any good ideas on this topic, I would love to read them in the comments. In the meantime, I will come up with some other chess-related data posts. Thanks for reading!