From the course: Machine Learning with Python: Foundations

What is reinforcement learning? - Python Tutorial

- [Narrator] Reinforcement learning is the science of learning to make decisions from interaction, or the process of learning through feedback. It has many applications, such as autonomous driving, robotics, trading, and gaming. Reinforcement learning is very similar to early childhood learning. A toddler sees something, does something, gets positive or negative feedback, then adjusts his or her future behavior accordingly. Reinforcement learning, along with unsupervised and supervised learning, forms the three major branches of machine learning. Unlike unsupervised learning, where the objective is to identify unknown patterns in unlabeled data, and supervised learning, where the objective is to learn patterns in previously labeled data, reinforcement learning attempts to tackle two distinct objectives. The first is finding previously unknown solutions to existing problems. An example of this learning objective is a machine that plays chess better than any human ever has. The second objective of reinforcement learning is finding solutions to problems that arise due to unforeseen circumstances. An example of this learning objective is a machine that is able to find an alternative route through a terrain after a mudslide has altered the expected route.

Reinforcement learning involves two primary entities that repeatedly interact with each other. One of them is the agent and the other is the environment. The agent interacts with the environment by taking actions. The environment responds to the actions of the agent by providing feedback, or observations, to the agent. The feedback provided by the environment comes in two forms: state and reward. The state describes the impact of the agent's previous actions on the environment and the possible actions the agent can take. Each action is associated with a numeric reward, which the agent receives as a result of taking a particular action. The agent's primary objective is to maximize the sum of rewards it receives over the long term.

To illustrate how reinforcement learning works, let's consider the familiar game of tic-tac-toe. In the game, two players take turns playing on a three-by-three board. One player plays Xs and the other Os, until one player wins by placing three marks in a row, diagonally, vertically, or horizontally, as shown here. Let's assume that each of the positions on the board is represented by the labels shown here: A1, A2, all the way to C3. Let's also assume that the first player is not the agent and plays Os, while the agent is the second player and plays Xs. The first move of the game could look something like this. The table to the right is known as a policy table. It represents states and rewards. Columns A1 to C3 are the positions on the board, while column D is the reward associated with each state. Each row represents an available state, or action that the agent can take, given that the first player has played O in position A3. One stands for player one and two stands for player two. Notice that column A3 is taken and is therefore grayed out and pre-filled with one. This means that the agent can play any position on the board except A3. Given the available actions and rewards, the agent must evaluate each possible action and choose the one that yields the highest reward. This is known as exploitation. Since all of the actions currently have the same reward, the agent randomly decides to play B2. In the second move, if the first player plays B3, then the state table will be as shown here.
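To make the exploitation step concrete, here is a minimal Python sketch. The dictionary-based policy table, the reward values, and the exploit helper are illustrative assumptions rather than code from the course; the sketch simply picks the move with the highest reward and breaks ties at random, just as the agent does above.

```python
import random

# Illustrative policy table for the agent's first move: each open position
# maps to the reward currently associated with playing there. All rewards
# start out equal (0.5), mirroring the example above. A3 is excluded
# because player one has already played there.
policy = {
    "A1": 0.5, "A2": 0.5, "B1": 0.5, "B2": 0.5,
    "B3": 0.5, "C1": 0.5, "C2": 0.5, "C3": 0.5,
}

def exploit(policy):
    """Choose the action with the highest reward (exploitation).

    When several actions share the top reward, pick one of them at random,
    just as the agent settles on B2 in the example.
    """
    best_reward = max(policy.values())
    best_actions = [a for a, r in policy.items() if r == best_reward]
    return random.choice(best_actions)

print(exploit(policy))  # e.g. 'B2' -- any position, since all rewards are tied
```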
Once again, the agent will choose the action that yields the highest reward. Since all of these actions have a reward of 0.5, the agent randomly settles on a play of C3. The process repeats a third time for player one and for player two. At the end of each player's third move, the environment determines that player two has won the game. This is known as a terminal state. The current cycle of learning has ended. At the end of the learning cycle, because the action taken by the agent in the third move resulted in the victory, the reward associated with that action is updated by the environment from 0.5 to one in the policy table. This is known as a backup. Using a mathematical equation, the reward associated with the agent's second move is also backed up in the policy table, as is the reward associated with the agent's first move. As a result of the higher rewards associated with the sequence of actions the agent took in the first learning cycle, during subsequent learning cycles, if the agent encounters a state similar to the one it encountered in the first cycle, it will choose to take the same action that it did in the first cycle, in order to maximize reward.

This brings up an important challenge with reinforcement learning, known as the exploration versus exploitation trade-off. If left unchecked, an agent will always prefer to take actions that it has tried in the past and found to be effective in maximizing reward. As previously mentioned, this is known as exploitation. However, in order to discover new sequences of actions with potentially higher rewards, the agent must try actions that it has not selected before, or that do not initially appear to maximize reward. In other words, the agent sometimes has to choose actions with little to no consideration for their associated reward. This is known as exploration. An agent that focuses only on exploitation will only be able to solve problems it has previously encountered. An agent that focuses only on exploration will not learn from prior experience. A balanced approach is needed for effective reinforcement learning.
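The backup step and the exploration-versus-exploitation trade-off can both be sketched in a few lines of Python. The back_up update rule, the step_size value, and the epsilon_greedy helper below are illustrative assumptions, not the exact equation or code used in the course; the epsilon-greedy strategy is simply one common way to balance the two behaviors, exploring a random action with a small probability and exploiting the best-known action otherwise.

```python
import random

def back_up(policy, actions_taken, final_reward, step_size=0.5):
    """Record the outcome of a finished game in the policy table.

    The reward of the final action is set to the outcome (1 for a win), and
    the rewards of earlier actions are then nudged toward the value of the
    action that followed them. This is an illustrative update rule, not the
    exact equation used in the course.
    """
    *earlier, last = actions_taken
    policy[last] = final_reward          # e.g. 0.5 -> 1.0 after a win
    target = policy[last]
    for action in reversed(earlier):
        policy[action] += step_size * (target - policy[action])
        target = policy[action]

def epsilon_greedy(policy, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(list(policy))                             # exploration
    best = max(policy.values())
    return random.choice([a for a, r in policy.items() if r == best])  # exploitation

# Illustrative policy table before the backup: every reward is still 0.5.
policy = {"B2": 0.5, "C3": 0.5, "C1": 0.5}

# Suppose the agent played B2, then C3, then C1, and its final move won (reward 1).
back_up(policy, ["B2", "C3", "C1"], final_reward=1.0)
print(policy)                  # rewards along the winning sequence move toward 1
print(epsilon_greedy(policy))  # usually the highest-reward action, occasionally a random one
```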