Interactive Demo

Learn the Rules

You'll see colored shapes and three buttons. Each combination of shape, color, and action has a hidden point value. Nobody will tell you the rules. You have to figure them out by trying things and observing what happens.

How it works

Each round, a colored shape appears. You pick one of three actions: A, B, or C. You'll see how many points you earned (or didn't). After 6 rounds, the game resets but your knowledge carries over.

Play as many games as you want. Your goal: maximize your score by discovering the hidden reward rules through trial and error.

There are 9 possible states (3 shapes × 3 colors) and 3 possible actions. Some combinations are worth more than others. Some are worth nothing.

The Hidden Rules

+15: Action C on any Green shape
+10: Action A on any Red shape
+5: Action B on any Circle (any color)
+0: Everything else
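
The rules above can be written as a small reward function. This is a sketch, not the demo's actual implementation; the shape and color names are assumed to mirror the table:

```python
def reward(shape: str, color: str, action: str) -> int:
    """Return the hidden point value for a state-action pair.

    The three scoring rules are keyed on different actions,
    so at most one rule applies to any given move.
    """
    if color == "green" and action == "C":
        return 15
    if color == "red" and action == "A":
        return 10
    if shape == "circle" and action == "B":
        return 5
    return 0
```

Note that a green circle still scores +5 under action B: the "any Circle" rule applies regardless of color, since each rule is tied to a different action.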

You just did reinforcement learning.

Nobody handed you a dataset. Nobody told you the rules upfront. You took actions in an environment, observed rewards, and adjusted your strategy over time. That's the core loop: act, observe, update. It's the same loop that trains game-playing AIs, robotic controllers, and recommendation systems.
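
That act-observe-update loop can be sketched as a tiny tabular learner playing this game. The environment and reward rules below are a hypothetical reconstruction of the demo, and the hyperparameters are illustrative:

```python
import random

random.seed(0)  # reproducible run

STATES = [(s, c) for s in ("circle", "square", "triangle")
                 for c in ("red", "green", "blue")]
ACTIONS = ["A", "B", "C"]

def reward(shape, color, action):
    # Hidden rules from the table above
    if color == "green" and action == "C": return 15
    if color == "red" and action == "A": return 10
    if shape == "circle" and action == "B": return 5
    return 0

# Q-table: the learner's running estimate of each (state, action) value
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha = 0.5  # learning rate

for step in range(2000):
    state = random.choice(STATES)       # act: a shape appears, pick a move
    action = random.choice(ACTIONS)     #      (pure exploration here)
    r = reward(*state, action)          # observe the points earned
    Q[state, action] += alpha * (r - Q[state, action])  # update the estimate

# After enough rounds, the table recovers the rules,
# e.g. action C is best for a green square:
best = max(ACTIONS, key=lambda a: Q[("square", "green"), a])
```

Because this game has no delayed consequences (each move's reward depends only on the current shape and action), a simple running average per state-action pair is enough; full Q-learning adds a term for future rewards.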

Exploration vs. exploitation.

Early on, you probably tried random things just to see what happened. Later, you started repeating what worked. That tension between exploring new actions and exploiting known good ones is one of the central challenges in reinforcement learning. Too much exploration wastes time. Too much exploitation means you might miss a better strategy you never tried.
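
A standard way to manage that tension is an epsilon-greedy policy: explore with probability epsilon, otherwise exploit the best-known action. A minimal sketch, with an illustrative Q-table:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon; otherwise the best known one."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore: try anything
    # exploit: highest estimated value (unseen pairs default to 0)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Example: values a learner might hold for a green square
Q = {(("square", "green"), "A"): 0.0,
     (("square", "green"), "B"): 0.0,
     (("square", "green"), "C"): 15.0}

# With epsilon=0 the policy is purely greedy and picks C
choice = epsilon_greedy(Q, ("square", "green"), ["A", "B", "C"], epsilon=0.0)
```

In practice, epsilon is often decayed over time: explore heavily early, when little is known, then shift toward exploitation, which matches how you probably played.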

The reward signal is everything.

You only learned because you could see the points. If the reward had been delayed (say, only shown at the end of 6 rounds with no per-move feedback), learning would have been much harder. This is the "credit assignment problem": figuring out which of your actions actually caused the reward. In complex systems, this is one of the hardest parts of RL.
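
The delayed-reward variant described above can be sketched directly: the learner sees its six moves and a single total, with no way to tell which move earned what. The function names and states here are illustrative:

```python
import random

def play_episode(policy, reward_fn, n_rounds=6):
    """Play one game, but reveal only the TOTAL reward at the end.

    The per-move rewards are computed internally and discarded,
    leaving the learner to guess which moves deserve the credit.
    """
    history, total = [], 0
    for _ in range(n_rounds):
        state = random.choice([("circle", "red"), ("square", "green")])
        action = policy(state)
        history.append((state, action))
        total += reward_fn(*state, action)
    return history, total  # which of the 6 moves caused the points? Unknown.
```

With only `total` to go on, a score of +20 over six moves is consistent with many different explanations (two +10 moves, one +15 and one +5, and so on), which is exactly why credit assignment is hard.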