You'll see colored shapes and three buttons. Each combination of shape, color, and action has a hidden point value. Nobody will tell you the rules. You have to figure them out by trying things and observing what happens.
Each round, a colored shape appears. You pick one of three actions: A, B, or C. You'll see how many points you earned (or didn't). After 6 rounds, the game resets but your knowledge carries over.
Play as many games as you want. Your goal: maximize your score by discovering the hidden reward rules through trial and error.
There are 9 possible states (3 shapes × 3 colors) and 3 possible actions. Some combinations are worth more than others. Some are worth nothing.
Nobody handed you a dataset. Nobody told you the rules upfront. You took actions in an environment, observed rewards, and adjusted your strategy over time. That's the core loop: act, observe, update. It's the same loop that trains game-playing AIs, robotic controllers, and recommendation systems.
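That act-observe-update loop can be sketched in a few lines. This is a minimal illustration, not the game's actual rules: the shapes, colors, and `HIDDEN_REWARD` table below are hypothetical stand-ins, and the "update" is a simple running average of observed rewards per (state, action) pair.

```python
import random

# Hypothetical hidden rules (an assumption for illustration): each
# (shape, color, action) combination has a fixed point value.
SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "green", "blue"]
ACTIONS = ["A", "B", "C"]

random.seed(0)
HIDDEN_REWARD = {(s, c, a): random.choice([0, 0, 5, 10])
                 for s in SHAPES for c in COLORS for a in ACTIONS}

values = {}   # (shape, color, action) -> running-average reward estimate
counts = {}

def update(state, action, reward):
    """The 'update' step: nudge the estimate toward the observed reward."""
    key = state + (action,)
    counts[key] = counts.get(key, 0) + 1
    values[key] = values.get(key, 0.0) + (reward - values.get(key, 0.0)) / counts[key]

for _ in range(1000):
    state = (random.choice(SHAPES), random.choice(COLORS))  # a shape appears
    action = random.choice(ACTIONS)                         # act
    reward = HIDDEN_REWARD[state + (action,)]               # observe
    update(state, action, reward)                           # update
```

Because the toy rewards here are deterministic, the running averages converge to the hidden values for every combination the loop happens to visit; the learner recovers the rules without ever being told them.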
Early on, you probably tried random things just to see what happened. Later, you started repeating what worked. That tension between exploring new actions and exploiting known good ones is one of the central challenges in reinforcement learning. Too much exploration wastes time. Too much exploitation means you might miss a better strategy you never tried.
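The simplest way to balance that tension is epsilon-greedy selection: with some small probability, pick a random action (explore); otherwise pick the action with the best estimate so far (exploit). The sketch below uses a hypothetical single-state version of the game where action "B" secretly pays best.

```python
import random

random.seed(1)

def epsilon_greedy(estimates, actions, epsilon):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: estimates.get(a, 0.0))   # exploit

# Hypothetical payoffs, unknown to the learner: "B" is the best action.
true_reward = {"A": 1.0, "B": 5.0, "C": 2.0}
estimates, pulls = {}, {}

for _ in range(500):
    a = epsilon_greedy(estimates, ["A", "B", "C"], epsilon=0.2)
    r = true_reward[a]
    pulls[a] = pulls.get(a, 0) + 1
    estimates[a] = estimates.get(a, 0.0) + (r - estimates.get(a, 0.0)) / pulls[a]
```

Early on the greedy choice is often wrong, because untried actions default to an estimate of zero; the 20% exploration rate is what eventually gets "B" sampled. After that, exploitation takes over and "B" dominates the pull counts. Set epsilon to 0 and the learner can lock onto "A" forever, never discovering the better option.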
You only learned because you could see the points. If the reward had been delayed (say, only shown at the end of 6 rounds with no per-move feedback), learning would have been much harder. This is the "credit assignment problem": figuring out which of your actions actually caused the reward. In complex systems, this is one of the hardest parts of RL.
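One standard answer to delayed reward is the Monte Carlo update: credit every action in the episode with the episode's total score, and let averaging over many episodes sort out which actions actually mattered. The sketch below uses a hypothetical two-step episode where only the first action earns points; the learner sees just the end-of-episode total.

```python
import random

random.seed(2)

STEPS = ["step1", "step2"]
CHOICES = ["good", "bad"]

def episode_score(actions):
    # Hypothetical hidden rule: only the first action earns points;
    # the second is irrelevant -- but the learner doesn't know that.
    return 10 if actions[0] == "good" else 0

estimates = {(s, a): 0.0 for s in STEPS for a in CHOICES}
counts = {k: 0 for k in estimates}

for _ in range(200):
    actions = [random.choice(CHOICES) for _ in STEPS]
    total = episode_score(actions)        # the only feedback: end-of-episode total
    for step, a in zip(STEPS, actions):   # Monte Carlo update: every action
        key = (step, a)                   # in the episode gets the same credit
        counts[key] += 1
        estimates[key] += (total - estimates[key]) / counts[key]
```

The first step's estimates separate cleanly (10 vs 0), but both of the second step's estimates hover near 5: the irrelevant action appears to be "worth" points, and only averaging over many episodes reveals that it makes no difference. That smearing of credit across every action in the episode is exactly why delayed reward makes learning slower.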