I’ve been on a chaos theory kick recently, ever since I read James Gleick’s book about it. In my opinion there’s something astonishing about chaos, and how it links together seemingly unrelated concepts like self-organized criticality, fractals, and statistical power laws. On that last one: since learning about power laws, I’m convinced that all the real systems that are interesting or important follow power laws (e.g. the economy, the rate of technological innovation, networks of friends), since they all share similar characteristics, like strong nonlinearities in outcomes and high interconnectedness. That these laws of chaos keep showing up in such disparate scenarios is itself a feature of chaos: universality! I soon devoured three other books influenced by chaos theory: Ormerod’s Why Most Things Fail, Ball’s Critical Mass, and Buchanan’s Ubiquity: Why Catastrophes Happen. In each of them, chaos theory reared its fractal, unsettling head.

One system in particular stood out from Ubiquity, although it also appeared in Gleick’s book and in Critical Mass: the Abelian sandpile model. In this model, grains of sand are continuously dropped onto a grid of cells. When a cell accumulates a certain number of grains, it becomes "unstable" and topples, distributing its grains to neighboring cells. The toppling continues until the entire system reaches a stable state. The Abelian sandpile model emerged from statistical physics in the late 1980s and early 1990s, particularly within the context of self-organized criticality (SOC). The concept of self-organized criticality, introduced by Per Bak, Chao Tang, and Kurt Wiesenfeld in 1987, proposed that complex systems could naturally evolve to a critical state without the need for external tuning or a central organizing principle (an idea further explored in Critical Mass). In these models, simple, local rules (like sandpile dynamics) can lead to global behavior characterized by criticality, where the system exhibits power-law distributions and scale invariance. In the sandpile model, the sizes of the topplings (also called avalanches) follow such a power-law distribution, with larger avalanches occurring less frequently than smaller ones. However, it is difficult (perhaps impossible) to predict ahead of time whether a single sand grain will cause a large avalanche or a small one.
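
To make the toppling rule concrete, here is a minimal sketch of the core loop, written from the description in this post rather than from any particular reference implementation (the grid size and drop count are arbitrary choices):

```python
import random
import numpy as np

THRESHOLD = 4  # per the rules described here: a cell holding more than four grains topples

def drop_and_topple(grid):
    """Drop one grain on a random cell, then topple until the grid is stable.

    Returns the avalanche size, i.e. the number of topplings this grain triggered.
    """
    rows, cols = grid.shape
    grid[random.randrange(rows), random.randrange(cols)] += 1

    avalanche_size = 0
    while np.any(grid > THRESHOLD):
        for i, j in zip(*np.where(grid > THRESHOLD)):
            grid[i, j] -= 4            # the unstable cell sheds four grains...
            avalanche_size += 1
            for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= ni < rows and 0 <= nj < cols:
                    grid[ni, nj] += 1  # ...one to each neighbor; grains pushed off an edge are lost
    return avalanche_size

# Example: drive a 10x10 grid toward its critical state and record avalanche sizes.
grid = np.zeros((10, 10), dtype=int)
sizes = [drop_and_topple(grid) for _ in range(100_000)]
```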

An animation of the sandpile model’s dynamics. Each turn, a grain is dropped on a random cell in a 10x10 grid, incrementing that cell’s grain count by one (darker blue denotes a higher grain count per cell). If any cell has more than four grains, it is said to be unstable and topples, which causes an avalanche: the toppling cell loses four grains, and the grain count of each of its available neighbors is incremented by one. The grid is then checked again for unstable cells, and the process repeats until no cells are unstable. Avalanches vary in size, with larger avalanches being less frequent.

A log-log plot of avalanche size vs. avalanche frequency. The distribution mostly follows a power law, aside from some noise at the upper end of the distribution.
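
A plot along these lines can be reproduced from the avalanche sizes recorded by the sketch above; logarithmically spaced bins are one reasonable choice (the details of the original figure may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

nonzero = np.array([s for s in sizes if s > 0])     # drops that caused no toppling are excluded
bins = np.logspace(0, np.log10(nonzero.max()), 30)  # log-spaced bins suit a power-law tail
counts, edges = np.histogram(nonzero, bins=bins)

plt.loglog(edges[:-1], counts, "o")
plt.xlabel("Avalanche size")
plt.ylabel("Frequency")
plt.show()
```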

Given the highly nonlinear dynamics of the sandpile, I thought it would be interesting to try to predict where avalanches would occur in order to avoid them. Naturally, to make this fun, and to gain experience, my solution would involve reinforcement learning (RL) methods, which have previously been used to master games such as Go (AlphaGo) and Pac-Man. I developed a game featuring the Abelian sandpile model in which an RL agent learns to maximize its score. The agent is allowed to move in any of the four cardinal directions (plus staying in place). The system is simulated as follows. During each step, the agent selects its move. A sand grain is then dropped on a random location in the grid, incrementing the number of grains at that location, per the rules of the game. If any location contains more than the threshold value of four grains, that location is toppled, decreasing its count by four and incrementing all available neighbors by one; if the location is at an edge or a corner, this leads to a net loss of grains. If the agent is at a location that topples, it is moved randomly into a neighboring cell (and if that neighboring cell is off the grid, the agent stops playing the game). This process continues until no locations on the grid exceed the threshold value of four. The next step then commences and the process repeats.
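
A rough sketch of a single turn under these rules may help make them concrete. The function below is illustrative and simplified (for example, clamping the agent’s own moves to the grid is my assumption), not the actual implementation from the repository:

```python
import random
import numpy as np

THRESHOLD = 4
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}

def step(grid, agent_pos, action):
    """One turn: the agent moves, a grain drops, and topplings resolve.

    Returns (agent_pos, was_avalanched, in_game).
    """
    rows, cols = grid.shape

    # 1. The agent selects its move (clamped to the grid -- an assumption).
    dr, dc = MOVES[action]
    agent_pos = (min(max(agent_pos[0] + dr, 0), rows - 1),
                 min(max(agent_pos[1] + dc, 0), cols - 1))

    # 2. A grain lands on a random cell.
    grid[random.randrange(rows), random.randrange(cols)] += 1

    # 3. Topple until stable; a toppling at the agent's cell shoves it into a random neighbor.
    was_avalanched, in_game = False, True
    while np.any(grid > THRESHOLD):
        for i, j in zip(*np.where(grid > THRESHOLD)):
            grid[i, j] -= 4
            for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= ni < rows and 0 <= nj < cols:
                    grid[ni, nj] += 1  # grains pushed off an edge or corner are lost
            if in_game and (i, j) == agent_pos:
                was_avalanched = True
                r, c = random.choice([(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)])
                if 0 <= r < rows and 0 <= c < cols:
                    agent_pos = (r, c)
                else:
                    in_game = False    # knocked off the grid: the agent's run ends

    return agent_pos, was_avalanched, in_game
```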

The agent’s score is calculated as follows: if the agent completes its move without being avalanched, it accrues a reward equal to the number of grains in its cell; if it is avalanched at all (i.e. moved by an avalanche), it accrues no reward for the turn; and if it is avalanched off the grid, it leaves the game.
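
In code form, the scoring rule might look like this (a sketch, paired with the step function above):

```python
def turn_reward(grid, agent_pos, was_avalanched, in_game):
    """Per-turn reward under the scoring rules above."""
    if not in_game:            # avalanched off the grid: the run is over, no reward
        return 0.0
    if was_avalanched:         # moved by an avalanche this turn: no reward accrues
        return 0.0
    return float(grid[agent_pos])  # otherwise, reward equals the grain count of the agent's cell
```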

The RL approach I chose was REINFORCE/policy gradients, which seemed like a straightforward initial choice. In my formulation, the agent is a policy, represented by a neural network, that encodes the strategy the agent should follow. This network takes the current state of the sandpile as input and outputs a probability distribution over the five possible actions. In a series of 10,000 Monte Carlo runs, I simulated the RL agent in the sandpile environment, allowing it to interact with the game and its dynamics. During each run, the agent selects actions according to its current policy, generating trajectories of states and corresponding actions. Policy gradients are used to compute the gradient of the expected cumulative reward with respect to the parameters of the policy, and the parameters are then updated in the direction that increases the probability of actions leading to higher rewards. The maximum number of simulated steps in the sandpile per run was 1,000. One quirk of the environment is that the sandpile game is primarily about survival: the reward from any individual cell caps out at three, while the potential penalty (getting avalanched off the grid) is massive. The full implementation is available on my GitHub.
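
Below is a minimal sketch of the policy network and the REINFORCE update, with an illustrative architecture and hyperparameters (the actual network and settings live in the GitHub repository and may differ):

```python
import torch
import torch.nn as nn

N_ACTIONS = 5  # up, down, left, right, stay

class Policy(nn.Module):
    """Maps a sandpile state (flattened grid plus agent position) to action probabilities."""
    def __init__(self, n_cells, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_cells + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, N_ACTIONS),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One policy-gradient step: scale each action's log-probability by its return-to-go."""
    returns, g = [], 0.0
    for r in reversed(rewards):                 # discounted return following each step
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # simple variance reduction

    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

During each run, `log_probs` would collect the log-probabilities of the sampled actions (e.g. via `torch.distributions.Categorical`) and `rewards` would collect the per-turn rewards from the environment.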

Above is an animation of a sample simulation of the fully trained RL agent in the sandpile environment. The red dot represents the RL agent. Black arrows appear when the agent moves of its own volition, i.e. when it makes the move it chose. Translucent red dots representing the agent’s past positions appear when the agent gets avalanched around. The strategy the agent seems to have settled on is to mostly stay near the center, moving within that area to the cells with the locally highest grain count, which reflects the dynamics of the environment. The game is primarily about surviving to the maximum step count, which is mainly threatened by large avalanches carrying the agent over the edge of the grid. While it is hard to predict precisely where a large avalanche that could knock the agent out of the game will occur, the odds of being moved far by one are lowest at the center of the board and highest near its edges. Thus, the policy the agent learned mostly sticks to the center so as to survive for the duration of the simulation.

This was a pretty fun project to implement! Given the project’s heavy reliance on the simulated sandpile environment, I thoroughly unit tested the sandpile mechanics and the agent movements to ensure that the RL agent was learning the true dynamics of the environment, free of bugs. If I were to continue this project, I’d consider a few improvements:

  • Implement Generalized Advantage Estimation (GAE): This seems like a standard RL technique to help convergence, given that RL is heavily sample-inefficient. I believe the sandpile environment would particularly benefit from GAE since, with plain REINFORCE, the learning signal only arrives via the full Monte Carlo return of each run (dominated by whether the agent eventually got avalanched off the grid), whereas GAE, paired with a learned value function, would also provide a per-step signal. A rough sketch of the computation follows this list.

  • Replace RL with a standard supervised learning approach: In my formulation of the sandpile game, the agent itself doesn’t impact the dynamics through its actions; it’s purely along for the ride generated by the avalanches. Thus, it would be feasible to generate a massive amount of data using simulated sandpiles with no agents, and reframe the challenge as a supervised learning problem in which a network learns to predict which cells are at highest risk of getting avalanched, a prediction a non-RL agent could then use to move about the grid (a sketch of how such training data could be generated also follows below). This points to a bigger thought: I was a bit unimpressed by RL’s sample efficiency compared with supervised or self-supervised methods.
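
Here is the rough GAE sketch referenced in the first bullet. It assumes an added value network whose estimates V(s_t) are passed in, and the gamma and lambda values are just common defaults:

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: per-step rewards for one run.
    values:  value-network estimates V(s_t) for every state in the run, plus one
             bootstrap value at the end (0.0 if the run terminated).
    """
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                         # discounted sum of TD errors
        advantages.append(gae)
    return torch.tensor(list(reversed(advantages)))
```

These advantages would then replace the raw normalized returns in the REINFORCE loss sketched earlier.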
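
And a sketch of how training data for the supervised reframing in the second bullet could be generated. Labeling each state with the cells that topple on the very next drop is my own arbitrary choice of target, just one of several reasonable definitions of "avalanche risk":

```python
import random
import numpy as np

THRESHOLD = 4

def drop_and_record(grid):
    """Drop one grain, topple to stability, and return a boolean mask of cells that toppled."""
    rows, cols = grid.shape
    grid[random.randrange(rows), random.randrange(cols)] += 1
    toppled = np.zeros(grid.shape, dtype=bool)
    while np.any(grid > THRESHOLD):
        for i, j in zip(*np.where(grid > THRESHOLD)):
            grid[i, j] -= 4
            toppled[i, j] = True
            for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= ni < rows and 0 <= nj < cols:
                    grid[ni, nj] += 1
    return toppled

def generate_dataset(n_examples, shape=(10, 10)):
    """(state, label) pairs from an agent-free sandpile: label marks cells that topple next turn."""
    grid = np.zeros(shape, dtype=int)
    data = []
    for _ in range(n_examples):
        state = grid.copy()
        label = drop_and_record(grid)
        data.append((state, label))
    return data
```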
