
Training an Autonomous Vehicle using Reinforcement Learning¶

Juan E. Rolon, 2017.¶

This project was submitted as part of the requirements for the Machine Learning Engineer Nanodegree from Udacity. It also forms part of the Artificial Intelligence curriculum.


Introduction¶

In this project, I worked towards constructing an optimized Q-Learning driving agent that navigates a Smartcab through its environment towards a goal. Since the Smartcab is expected to drive passengers from one location to another, the driving agent is evaluated on two very important metrics: Safety and Reliability.

A driving agent that gets the Smartcab to its destination while running red lights or narrowly avoiding accidents would be considered unsafe. Similarly, a driving agent that frequently fails to reach the destination in time would be considered unreliable.

Maximizing the driving agent's safety and reliability would ensure that Smartcabs have a permanent place in the transportation industry.

Safety and Reliability are measured using a letter-grade system as follows:

Grade | Safety | Reliability
A+ | Agent commits no traffic violations, and always chooses the correct action. | Agent reaches the destination in time for 100% of trips.
A | Agent commits few minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 90% of trips.
B | Agent commits frequent minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 80% of trips.
C | Agent commits at least one major traffic violation, such as driving through a red light. | Agent reaches the destination on time for at least 70% of trips.
D | Agent causes at least one minor accident, such as turning left on green with oncoming traffic. | Agent reaches the destination on time for at least 60% of trips.
F | Agent causes at least one major accident, such as driving through a red light with cross-traffic. | Agent fails to reach the destination on time for at least 60% of trips.

To assist in evaluating these important metrics, we use visualization code that tracks the traveling path of the driving agent.

In [1]:
# Import the visualization code
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

Understanding the World¶

One of the major components of building a self-learning agent is understanding the characteristics of the agent, including how it operates. To begin, we run an agent that has not been trained at all. The simulation is run for some time to observe its various working components. In the visual simulation, the white vehicle represents the Smartcab.

The following points should be observed during the simulation when running the default agent.py code:

  • Does the Smartcab move at all during the simulation?
  • What kind of rewards is the driving agent receiving?
  • How does the light changing color affect the rewards?

Observations:¶

  • According to the visual simulation, the Smartcab stays idle (it does not move).

  • The driving agent seems to be receiving positive and negative rewards.

  • The agent receives a positive reward if it stays idle while the traffic light is red.

  • The agent receives negative rewards (penalty) if it stays idle while the traffic light is green and there is no oncoming traffic.

Understanding the Code¶

Attempting to create a driving agent would be difficult without having at least explored the "hidden" devices that make everything work. In the /smartcab/ top-level directory, there are two folders: /logs/ (which will be used later) and /smartcab/. Open the /smartcab/ folder and explore each Python file included, then answer the following question.

Sections of code relevant to driving agent simulation:¶

  • In the agent.py Python file, we choose three flags that can be set and explain how they change the simulation.
  • In the environment.py Python file, we look at the Environment class function that is called when an agent performs an action.
  • In the simulator.py Python file, we look for the difference between the 'render_text()' function and the 'render()' function.
  • In the planner.py Python file, we assess whether the 'next_waypoint()' function considers the North-South or East-West direction first.

Analysis:¶

World parameters:¶

grid_size: Defines the number of intersections in the simulation grid; essentially, this sets the spatial size of the environment. It is a tuple of two integers (number of rows, number of columns).

learning: If set to True, it forces the driving agent to use the Q-learning algorithm.

epsilon: Defines the exploration factor. It is a continuous value between 0 and 1.

alpha: Defines the learning rate. It is a continuous value between 0 and 1.
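
As a rough illustration, these parameters are typically wired together in agent.py's run() routine. The sketch below is a minimal example, assuming the constructor keywords of the Udacity starter code (Environment, create_agent, set_primary_agent, Simulator); the exact names may differ slightly in your copy.

# Minimal sketch of agent.py's run() routine (keyword names follow the Udacity
# starter code and may differ slightly in your copy).
from environment import Environment
from simulator import Simulator

def run():
    env = Environment(verbose=False, grid_size=(8, 6))   # 8 x 6 grid of intersections
    agent = env.create_agent(LearningAgent,              # LearningAgent is defined in agent.py
                             learning=True,              # use the Q-learning algorithm
                             epsilon=1.0,                # initial exploration factor
                             alpha=0.5)                  # learning rate
    env.set_primary_agent(agent, enforce_deadline=True)
    sim = Simulator(env, update_delay=0.01, display=False, log_metrics=True)
    sim.run(n_test=10)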

Action parameters:¶

The function act, with header act(self, agent, action), is called to implement an agent's action.

In addition, this function defines the reward scheme for all of the agent's possible actions.

Visualization parameters:¶

The function render_text() produces the non-GUI (text-based) display of the simulation, while render() produces the GUI display.

Planning parameters:¶

In the planner.py Python file, the 'next_waypoint()' function considers the East-West direction first.
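
As a hypothetical illustration of that ordering (this is not the actual planner.py source, and the real function returns 'forward'/'left'/'right' relative to the agent's heading), the axis check might be sketched as follows:

# Hypothetical sketch only: shows which axis is inspected first, using compass
# directions instead of the planner's actual heading-relative waypoints.
def next_waypoint_sketch(location, destination):
    dx = destination[0] - location[0]   # East-West displacement, checked first
    dy = destination[1] - location[1]   # North-South displacement, checked second
    if dx != 0:
        return 'east' if dx > 0 else 'west'
    if dy != 0:
        return 'south' if dy > 0 else 'north'
    return None   # already at the destination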


Implementing a Basic Driving Agent¶

The first step to creating an optimized Q-Learning driving agent is getting the agent to actually take valid actions. In this case, a valid action is one of None (do nothing), 'left' (turn left), 'right' (turn right), or 'forward' (go forward). In the first implementation, we navigate to the 'choose_action()' agent function and make the driving agent randomly choose one of these actions.

We have access to several class variables that help add this functionality, such as 'self.learning' and 'self.valid_actions'. Once implemented, we run the agent file and simulation briefly to confirm that the driving agent takes a random action at each time step.
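
A minimal sketch of this random behavior inside 'choose_action()' (assuming 'self.valid_actions' holds the four actions listed above):

import random

def choose_action(self, state):
    # Basic (non-learning) agent: ignore the state and pick uniformly at random
    # from the valid actions, assumed to be [None, 'forward', 'left', 'right'].
    action = random.choice(self.valid_actions)
    return action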

Basic Agent Simulation Results¶

To obtain results from the initial simulation, we need to adjust the following flags:

  • 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
  • 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
  • 'log_metrics' - Set this to True to log the simulation results as a .csv file in /logs/.
  • 'n_test' - Set this to '10' to perform 10 testing trials.

Optionally, we may disable the visual simulation (which can make the trials go faster) by setting the 'display' flag to False. Flags that have been set here should be returned to their default setting when debugging. It is important to understand what each flag does and how it affects the simulation!

Once the initial simulation is completed (there should have been 20 training trials and 10 testing trials), we run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!

We run the agent.py file, after setting the flags, from the projects/smartcab folder instead of projects/smartcab/smartcab.

In [3]:
# Load the 'sim_no-learning' log file from the initial simulation results
vs.plot_trials('sim_no-learning.csv')

Analysis¶

From the initial simulation results it is necessary to provide an analysis and make several observations about the driving agent.

Questions that need to be answered at this point are:

  • How frequently is the driving agent making bad decisions? How many of those bad decisions cause accidents?

  • Given that the agent is driving randomly, does the rate of reliability make sense?

  • What kind of rewards is the agent receiving for its actions? Do the rewards suggest it has been penalized heavily?

  • As the number of trials increases, does the outcome of results change significantly?

  • Would this Smartcab be considered safe and/or reliable for its passengers? Why or why not?

Answers:¶

  • How frequently is the driving agent making bad decisions? How many of those bad decisions cause accidents?

As shown in the visualization panel, the percentage of bad actions taken by the driving agent varies between ~42% and ~48% during the course of 20 trials. Out of these bad actions, between 4% and 6% cause minor accidents and 4% to 6% cause major accidents; this yields between 8% and 12% of bad actions causing some type of accident.

  • Given that the agent is driving randomly, does the rate of reliability make sense?

The rate of reliability fluctuates between 10% and 20%. This poor performance metric is consistent with a driving agent that behaves erratically (i.e. randomly).

  • What kind of rewards is the agent receiving for its actions? Do the rewards suggest it has been penalized heavily?

The driving agent is heavily and frequently penalized with negative rewards as a result of its erratic behavior.

  • As the number of trials increases, does the outcome of results change significantly?

Since the agent is not learning anything, but just behaving randomly, the outcomes do not change significantly; they fluctuate around a meaningless average value as the number of trials increases.

  • Would this Smartcab be considered safe and/or reliable for its passengers? Why or why not?

Due to its erratic behavior and numerous traffic violations and accidents, this driving agent is unsafe and unreliable for its passengers. Its overall safety and reliability ratings are both F.


Driving Agent Navigation Policy¶

The second step to creating an optimized Q-learning driving agent is defining a set of states that the agent can occupy in the environment.

Depending on the input, sensory data, and additional variables available to the driving agent, a set of states can be defined for the agent so that it can eventually learn what action it should take when occupying a state.

The condition of 'if state then action' for each state is called a policy, and is ultimately what the driving agent is expected to learn.

Without defining states, the driving agent would never understand which action is most optimal -- or even what environmental variables and conditions it cares about.

Identifying the Agent States¶

Inspecting the 'build_state()' agent function shows that the driving agent is given the following data from the environment:

  • 'waypoint', which is the direction the Smartcab should drive leading to the destination, relative to the Smartcab's heading.
  • 'inputs', which is the sensor data from the Smartcab. It includes
    • 'light', the color of the light.
    • 'left', the intended direction of travel for a vehicle to the Smartcab's left. Returns None if no vehicle is present.
    • 'right', the intended direction of travel for a vehicle to the Smartcab's right. Returns None if no vehicle is present.
    • 'oncoming', the intended direction of travel for a vehicle across the intersection from the Smartcab. Returns None if no vehicle is present.
  • 'deadline', which is the number of actions remaining for the Smartcab to reach the destination before running out of time.

Policy Aspects¶

In order to assess the constraints defining the agent's policy, we need to consider the following:

  • We need to consider the features available to the agent that are most relevant for learning both safety and efficiency.

  • We need to assess whether these features are appropriate for modeling the Smartcab in the environment.

Analysis:¶

To conduct this assessment, it is necessary to explore lines 275-331 in the body of the act() function defined in environment.py, which contain the code defining the conditions for valid actions and the traffic violation rules.

Safety:

  • Traffic light color (inputs['light'])

The agent's motion, as well as traffic violations and potential accidents, are conditioned by the traffic light status. Therefore this feature is absolutely necessary.

  • Oncoming traffic percept (inputs['oncoming'])

This feature is also required. It conditions situations that could result in major traffic violations and accidents, such as driving into an intersection with oncoming cross traffic.

  • LH side traffic percept (inputs['left'])
  • RH side traffic percept (inputs['right'])

For example, the agent's sensors should perceive traffic approaching from the left while attempting to make a right turn while the traffic light is red. Similarly, in the extreme case in which the agent makes a left turn on a red light (accidentally runs a red light), the agent must perceive traffic approaching from the right to avoid a collision.

Efficiency:

  • Waypoint (As defined in the route planner module)

This is a feature the agent needs in order to learn to plan a route that leads to a successful completion of its trip. It reflects a component of the agent's final policy, and amounts to learning the optimal conditions under which the agent decides to move forward or make a turn.

With this feature included, the agent will traverse a route between origin and destination that varies from simulation to simulation due to changing conditions in the environment, produced by the status of the traffic lights and the presence of other vehicles in the grid.

  • Deadline (As defined in the environment module)

When enforced, the agent is effectively and progressively penalized in terms of the remaining time deducted from the hard deadline assigned during initialization. If we include deadline as a feature, the agent must learn, in addition to learning to drive safely, to take actions that maximize the remaining trip time.

In my opinion, the deadline itself is not absolutely required as a training feature to teach an agent to drive with optimal safety and reliability scores, at least not at the level of this simulation.

Since this feature considerably impacts the training time and the complexity of the simulation, I have decided not to include deadline as a state component.

State Space¶

When defining a set of states that the agent can occupy, it is necessary to consider the size of the state space. In other words, if we expect the driving agent to learn a policy for each state, we need to have an optimal action for every state the agent can occupy. If the number of all possible states is very large, it might be the case that the driving agent never learns what to do in some states, which can lead to uninformed decisions.

For example, when the following features are used to define the state of the Smartcab:

('is_raining', 'is_foggy', 'is_red_light', 'turn_left', 'no_traffic', 'previous_turn_left', 'time_of_day').

How frequently would the agent occupy a state like (False, True, True, True, False, False, '3AM')? Without a near-infinite amount of time for training, it's doubtful the agent would ever learn the proper action!

Further considerations on the size of the state space¶

  • If a state is defined using the features selected from the previous analysis, it is necessary to consider the size of the state space, for that feature selection.

  • We need to determine whether the driving agent could learn a policy for each possible state within a reasonable number of training trials.

Size of the state space:¶

The agent's state considered here is specified by the following tuple:

  • state = (waypoint, inputs['light'], inputs['oncoming'], inputs['left'], inputs['right'])

The values of each of the state components are:

  • waypoint: ['forward', 'left', 'right'] (count = 3)
  • inputs['light']: ['green', 'red'] (count = 2)
  • inputs['oncoming']: [None, 'forward', 'left', 'right'] (count = 4)
  • inputs['left']: [None, 'forward', 'left', 'right'] (count = 4)
  • inputs['right']: [None, 'forward', 'left', 'right'] (count = 4)

Using the multiplication principle we can compute the total number of states, N, resulting from the different values of the state components.

The above yields: N=3 x 2 x 4 x 4 x 4 = 384.
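
The same count can be checked quickly with itertools.product:

from itertools import product

waypoints = ['forward', 'left', 'right']
lights = ['green', 'red']
traffic = [None, 'forward', 'left', 'right']   # values of the oncoming, left and right percepts

states = list(product(waypoints, lights, traffic, traffic, traffic))
print(len(states))   # 384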

Assessing whether the agent could learn a policy for each possible state within a reasonable number of training trials:

The number of states considered above, N=384, is fairly small and it can be handled by readily available memory/compute resources. In the current implementation, the agent did not have issues learning an optimal policy compatible with the constraints of the problem.

Updating the Driving Agent State¶

To carry out the second implementation, we navigate to the 'build_state()' agent function and set the 'state' variable to a tuple of all the features necessary for Q-Learning.

At this point, we need to confirm that the driving agent is updating its state by running the agent file and simulation briefly and note whether the state is displaying.

If the visual simulation is used, we need to confirm that the updated state corresponds with what is seen in the simulation.

Note: It is necessary to reset the simulation flags to their default setting when making this observation.
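
A sketch of the resulting 'build_state()' body is shown below; the helper names (self.planner.next_waypoint(), self.env.sense(self)) follow the Udacity starter code and may differ slightly in your copy.

def build_state(self):
    # Gather the percepts used as state components; deadline is intentionally
    # excluded, as discussed in the feature analysis above.
    waypoint = self.planner.next_waypoint()   # next waypoint from the route planner
    inputs = self.env.sense(self)             # sensor data: light, oncoming, left, right
    state = (waypoint, inputs['light'], inputs['oncoming'],
             inputs['left'], inputs['right'])
    return state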


Implementing a Q-Learning Driving Agent¶

The third step to creating an optimized Q-Learning agent is to begin implementing the functionality of Q-Learning itself. The concept of Q-Learning is fairly straightforward: For every state the agent visits, create an entry in the Q-table for all state-action pairs available.

When the agent encounters a state and performs an action, it updates the Q-value associated with that state-action pair based on the reward received and the iterative update rule implemented. Of course, additional benefits come from Q-Learning, such as having the agent choose the best action for each state based on the Q-values of all possible state-action pairs.

In this project phase we implement a decaying, $\epsilon$-greedy Q-learning algorithm with no discount factor.

Note that the agent attribute self.Q is a dictionary: This is how the Q-table will be formed. Each state will be a key of the self.Q dictionary, and each value will then be another dictionary that holds the action and Q-value. Here is an example:

{ 'state-1': { 
    'action-1' : Qvalue-1,
    'action-2' : Qvalue-2,
     ...
   },
  'state-2': {
    'action-1' : Qvalue-1,
     ...
   },
   ...
}

When using a decaying $\epsilon$ (exploration) factor, as the number of trials increases, $\epsilon$ should decrease towards 0. This is because the agent is expected to learn from its behavior and begin acting on its learned behavior.

Additionally, the agent will only be tested on what it has learned after $\epsilon$ has decayed below a certain threshold (the default threshold is 0.05). For the initial Q-Learning algorithm, we implement a linearly decaying function for $\epsilon$.
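
A condensed sketch of the pieces involved (creating Q-table entries, $\epsilon$-greedy action choice, and the undiscounted update) is shown below. These are meant as methods of the learning agent; the method layout follows the Udacity starter code and the details may differ in your copy.

import random

def createQ(self, state):
    # Add a new state to the Q-table with all Q-values initialized to 0.0.
    if self.learning and state not in self.Q:
        self.Q[state] = {action: 0.0 for action in self.valid_actions}

def get_maxQ(self, state):
    # Highest Q-value among the actions available in this state.
    return max(self.Q[state].values())

def choose_action(self, state):
    # Epsilon-greedy choice: explore with probability epsilon, otherwise pick
    # (one of) the actions with the maximal Q-value.
    if not self.learning or random.random() < self.epsilon:
        return random.choice(self.valid_actions)
    maxQ = self.get_maxQ(state)
    best_actions = [a for a, q in self.Q[state].items() if q == maxQ]
    return random.choice(best_actions)

def learn(self, state, action, reward):
    # Undiscounted update (no future rewards, i.e. gamma = 0):
    #   Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * reward
    if self.learning:
        self.Q[state][action] = (1 - self.alpha) * self.Q[state][action] + self.alpha * reward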

Q-Learning Simulation Results¶

To obtain results from the initial Q-Learning algorithm, we adjust the following flags and setup:

  • 'enforce_deadline' - We set this to True to force the driving agent to capture whether it reaches the destination in time.
  • 'update_delay' - We set this to a small value (such as 0.01) to reduce the time between steps in each trial.
  • 'log_metrics' - We set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
  • 'n_test' - We set this to '10' to perform 10 testing trials.
  • 'learning' - We set this to 'True' to tell the driving agent to use the Q-Learning implementation.

In addition, we use the following decay function for $\epsilon$:

$$ \epsilon_{t+1} = \epsilon_{t} - 0.05, \hspace{10px}\textrm{for trial number } t$$
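
In the default agent, this decay is applied once per trial inside the agent's reset() method; a minimal sketch (the signature follows the Udacity starter code and may differ in your copy):

def reset(self, destination=None, testing=False):
    # Called at the start of each trial. During training, decay epsilon linearly;
    # during testing, switch exploration and learning off.
    self.planner.route_to(destination)
    if testing:
        self.epsilon = 0.0
        self.alpha = 0.0
    else:
        self.epsilon -= 0.05   # epsilon_{t+1} = epsilon_t - 0.05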

It is recommended to set the 'verbose' flag to True to help debug any potential issues in the simulation. Flags that have been set here should be returned to their default setting when debugging. It is important to understand what each flag does and how it affects the simulation!

We run the code cell below to visualize the results once the initial Q-Learning simulation was successfully completed. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!

In [6]:
# Load the 'sim_default-learning' file from the default Q-Learning simulation
vs.plot_trials('sim_default-learning.csv')

Q-Table Analysis¶

Using the visualization above, resulting from the default Q-Learning simulation, we provide an analysis and make observations about the driving agent. To do this, we analyze the Q-table saved to a text file at the end of the simulation. Q-tables help in making observations about the agent's learning.

Some additional points to consider are:

  • As shown in the plot of "Relative Frequency of Bad Actions vs. Trial Number", the Total Frequency of Bad Actions is initially comparable (for the first 12 trials) to that resulting from the non-learning algorithm.
  • Notably, both the Frequency of Minor and Major accidents remain mostly constant and comparable to those resulting from the non-learning algorithm.
  • In the plot "Parameter vs Trial Number", we can see that $\epsilon$ decays to its threshold value in approximately 19 to 20 trials. This result is consistent with the linear decay function $\epsilon(t) = 1.0 - 0.05 t$. From the preceding equation, we can calculate the time needed to reach the threshold. For $t=19$ we have $\epsilon(t=19) = 1.0 - 0.05(19) = 0.05$.
  • $\epsilon$ decays linearly from its initial value of 1.0 down to its threshold of 0.05.
  • We can see that the "Total Frequency of Bad Actions" decreases with the number of trials, from an initial value of ~41% down to ~20%. This decay is also observed in the frequency of Minor and Major Violations. On the other hand, the 10-trial "Rolling Average Reward per Action" increases from approximately -4.5 up to ~1.5 during the course of ~20 trials.
  • Both the safety and reliability ratings correspond to a fail (F), similar to that observed for the non-learning initial driving agent.

Improving the Q-Learning Driving Agent¶

The final step to creating an optimized Q-Learning agent is to perform the optimization itself! Now that the Q-Learning algorithm is implemented and the driving agent is successfully learning, it is necessary to tune settings and adjust learning parameters so the driving agent learns both safety and efficiency.

This step requires a lot of trial and error, as some settings will invariably make the learning worse. One thing to keep in mind is the act of learning itself and the time that this takes.

In theory, we could allow the agent to learn for an incredibly long amount of time; however, another goal of Q-Learning is to transition from experimenting with unlearned behavior to acting on learned behavior. For example, always allowing the agent to perform a random action during training (if $\epsilon = 1$ and never decays) will certainly make it learn, but never let it act.

When improving on the Q-Learning algorithm, we consider the implications it creates and whether it is logistically sensible to make a particular adjustment.

Improved Q-Learning Simulation Results¶

To obtain results from the improved Q-Learning implementation, we adjust the following flags and setup:

  • 'enforce_deadline' - We set this to True to force the driving agent to capture whether it reaches the destination in time.
  • 'update_delay' - We set this to a small value (such as 0.01) to reduce the time between steps in each trial.
  • 'log_metrics' - We set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
  • 'learning' - We set this to 'True' to tell the driving agent to use the Q-Learning implementation.
  • 'optimized' - We set this to 'True' to tell the driving agent we are performing an optimized version of the Q-Learning implementation.

Additional flags that can be adjusted as part of optimizing the Q-Learning agent:

  • 'n_test' - We set this to some positive number (previously 10) to perform that many testing trials.
  • 'alpha' - We set this to a real number between 0 and 1 to adjust the learning rate of the Q-Learning algorithm.
  • 'epsilon' - We set this to a real number between 0 and 1 to adjust the starting exploration factor of the Q-Learning algorithm.
  • 'tolerance' - We set this to some small value larger than 0 (default was 0.05) to set the epsilon threshold for testing.

At this point we can use an arbitrary decaying function for $\epsilon$ (the exploration factor). The function must decay to 'tolerance' at a reasonable rate. The Q-Learning agent will not begin testing until this occurs. Some example decaying functions (for $t$, the number of trials):

$$ \epsilon = a^t, \textrm{for } 0 < a < 1 \hspace{50px}\epsilon = \frac{1}{t^2}\hspace{50px}\epsilon = e^{-at}, \textrm{for } 0 < a < 1 \hspace{50px} \epsilon = \cos(at), \textrm{for } 0 < a < 1$$

We may also use a decaying function for $\alpha$ (the learning rate); however, this is typically less common. We need to ensure that the inequality $0 \leq \alpha \leq 1$ holds.
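
For the improved agent I replaced the linear schedule with an exponential one, $\epsilon(t) = e^{-at}$ with $a = 0.001$. A minimal sketch of how this can be applied per trial is shown below; the trial counter self.trial is an assumption of this sketch (it is not part of the starter code), and the reset() signature follows the Udacity starter code.

import math

def reset(self, destination=None, testing=False):
    # Improved agent: exponential decay epsilon(t) = exp(-a * t), with a = 0.001.
    # self.trial is a per-agent trial counter assumed for this sketch.
    self.planner.route_to(destination)
    if testing:
        self.epsilon = 0.0
        self.alpha = 0.0
    else:
        self.trial += 1
        self.epsilon = math.exp(-0.001 * self.trial)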

We run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!

In [3]:
# Load the 'sim_improved-learning' file from the improved Q-Learning simulation
vs.plot_trials('sim_improved-learning.csv')

Improved Simulation Results¶

Using the visualization resulting from the improved Q-Learning simulation, the following aspects need to be analyzed:

Decaying function used for epsilon (the exploration factor):

I used an exponentially decaying exploration factor, $\epsilon(t) = e^{-\alpha t}$, with the learning rate $\alpha = 0.001$ serving as the decay constant.

Number of training trials needed by the agent before beginning testing:

We can compute explicitly the trial number, $t^{*}$, at which the exploration factor reaches its threshold value, $\epsilon_T = \epsilon(t^{*})$. In the simulation I set the threshold at $\epsilon_T = 0.01$; with these values we obtain

$t^{*} = -\frac{1}{\alpha}\ln{\epsilon_T} = -\frac{1}{0.001}\ln{0.01} = 4605.17$

Therefore, we need ~4605 training trials before starting the testing phase; this number agrees with the visualization results shown in the plot "Parameter Value vs. Trial Number".
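
The same figure can be checked numerically:

import math

a = 0.001          # exponential decay constant
tolerance = 0.01   # epsilon threshold at which testing begins
t_star = -math.log(tolerance) / a
print(round(t_star, 2))   # 4605.17, i.e. roughly 4605 training trials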

Epsilon-tolerance and alpha (learning rate):

  • There is no appreciable change in the safety/reliability ratings with increasing values of the tolerance up to values near $\epsilon_T = 0.01$. Therefore, I decided to keep the latter value.

  • The learning rate, $\alpha$ (also the exploration factor decay constant), considerably affects the rating values. For example, increasing the learning rate by an order of magnitude, from 0.001 to 0.01, results in failing safety ratings and mediocre reliability ratings.

  • In the language of Monte Carlo and TD methods for Q-learning, the performance decrease of the ratings with increasing $\alpha$ is the direct result of lowering the number of trials down to ~460, which is not enough to let the Markov process thermalize (relax to near-equilibrium). In other words, lowering the number of trials prevents the algorithm from reaching quasi-stationary action-driven transition rates between most of the possible state tuples that can be formed when the agent visits those states.

  • The Q-learning algorithm should let the agent discover and learn from as many previous mistakes as possible, i.e. let the agent learn enough to acquire the valuable experience it needs to develop an optimal policy. Therefore, we decided to keep the smaller value of $\alpha = 0.001$, which is adequate to perform the simulation in a reasonable amount of time.

Improvement compared to the default Q-Learner from the previous simulation:

  • The improvement is highly significant: from failing ratings up to A+ and A ratings on safety and reliability, respectively. The optimized agent commits no traffic violations, and always chooses the correct action according to the rules set in the environment module.

  • On the other hand, the agent reaches the destination on time for at least 90% of trips. As we did not include the deadline as a state component, this latter performance metric could be attributed to not letting the agent learn how to maximize the remaining trip time at each step.

Defining an Optimal Policy¶

Sometimes, the answer to the important question "what am I trying to get my agent to learn?" only has a theoretical answer and cannot be concretely described.

In the present case, we can concretely define the learning objective, i.e. the U.S. right-of-way traffic laws. Since these laws are known information, we can further define, for each state the Smartcab is occupying, the optimal action for the driving agent based on these laws.

In that case, we call the set of optimal state-action pairs an optimal policy. Hence, unlike some theoretical answers, it is clear whether the agent is acting "incorrectly", not only by the reward (penalty) it receives, but also by pure observation. If the agent drives through a red light, we both see it receive a negative reward and know that this is not the correct behavior. This can be used to the agent's advantage for verifying whether the policy the agent has learned is the correct one, or a suboptimal one.

Analysis:¶

  1. We need to determine the optimal policy for the smartcab in the given environment, i.e. the best set of instructions possible given what we know about the environment.

  2. Next, we investigate the 'sim_improved-learning.txt' text file to see the results of the improved Q-Learning algorithm. We need to answer the following questions: For each state that has been recorded from the simulation, is the policy (the action with the highest value) correct for the given state? Are there any states where the policy is different than what would be expected from an optimal policy?

  3. We need to find instances where the smartcab did not learn the optimal policy and find out why that is the case.

To answer the questions above, I created the following script:

In [1]:
# The following script creates a symbolic representation of the data found in the
# 'sim_improved-learning.txt' text file. It extracts a random sample from the data,
# illustrating examples from the recorded Q-table.
# author: @J.E. Rolon

import operator
import random
import matplotlib.pyplot as plt

# ******************* Extract a random sample from Q-table *******************
def qtable_sample(fname, nsamples):
    """Input filename of textfile storing qtable. Number of samples requested
       Returns a sample of observations from the q-table. Each observation is
       a dictionary specifying the observed state, the input percepts and the
       action that maximizes the Q value.
    """
    f = open(fname, 'r')
    span = len(f.readlines())
    f.seek(0)
    state_maxQ_ldicts = []
    for i in range(span):
        tmp_dict = {}
        str_line = f.readline()
        if "('" in str_line:
            lst1 = []
            lst2 = []
            for i in range(4):
                tmpl = f.readline()
                if '--' in tmpl:
                    pos1 = tmpl.find('--')
                    pos2 = tmpl.find(':')
                    act_def = tmpl[pos1 + 3:pos2 - 1]
                    act_val = tmpl[pos2 + 2:-1]
                    lst1.append(act_def)
                    lst2.append(act_val)
            # Compare Q-values numerically (they are stored as strings in the file)
            index, value = max(enumerate(map(float, lst2)), key=operator.itemgetter(1))
            state = eval(str_line.rstrip())
            wypt = state[0]
            inps = (state[1], state[2], state[3], state[4])
            acts = list(zip(lst1, lst2))  # materialize: zip() is a lazy iterator in Python 3
            maxq_act = lst1[index]
            tmp_dict['waypoint'] = wypt
            tmp_dict['inp_light'] = inps[0]
            tmp_dict['inp_oncoming'] = inps[1]
            tmp_dict['inp_left'] = inps[2]
            tmp_dict['inp_right'] = inps[3]
            tmp_dict['avail_actions'] = acts
            tmp_dict['maxQaction'] = maxq_act
            state_maxQ_ldicts.append(tmp_dict)

    f.close()
    return random.sample(state_maxQ_ldicts, nsamples)

# *************************** Plotting functions **********************************************
#
# The following set of functions encapsulate different objects representing the elements of the
# the Q-table. Together generate a figure representing the state, input percepts and action taken.
def create_canvas(axlabels):
    """Defines the figure frame dimensions and axis specs"""
    plt.plot()
    plt.xlim(-1.2, 1.2)
    plt.ylim(-0.8, 1.8)
    # show axis (off, on)
    plt.axis(axlabels)
    plt.xticks([])
    plt.yticks([])

def intersection_point():
    """Generates a circle specifying the intersection point location"""
    intersect = plt.Circle((0.05, 0.3), radius=0.05, fc='k', fill=False, linestyle='dashed')
    return plt.gca().add_patch(intersect)

def percept_traffic_light(light_color):
    """Creates a representation of the traffic light states. Input current light color"""
    ypos = 1.6
    xpos = -0.9
    dx = 0.35
    if light_color == 'green':
        # green traffic light
        circle = plt.Circle((xpos, ypos), radius=0.15, fc='lime')
        plt.gca().add_patch(circle)
        # red traffic light
        circle2 = plt.Circle((xpos+dx, ypos), radius=0.15, fc='gray', linestyle='dashed', fill=False)
        plt.gca().add_patch(circle2)
        plt.text(xpos, ypos-0.25, 'Traffic light', rotation='horizontal', fontsize='9')
    elif light_color == 'red':
        # green traffic light
        circle = plt.Circle((xpos, ypos), radius=0.15, fc='gray', linestyle='dashed', fill=False)
        plt.gca().add_patch(circle)
        # red traffic light
        circle2 = plt.Circle((xpos+dx, ypos), radius=0.15, fc='r')
        plt.gca().add_patch(circle2)
        plt.text(xpos, ypos - 0.25, 'Traffic light', rotation='horizontal', fontsize='9')
    else:
        raise Exception('Invalid traffic light state')

def percept_input_left(input):
    """Represents the input percept and direction of a potential vehicle approaching from the LEFT
       valid inputs: approaching vehicle turning right, turning left,  moving forward.
       Percepts as observed from agent. Turns represented respect to approaching vehicle.
    """
    if input == 'right':
        # input left
        plt.arrow(-0.7, -0.2, 0.15, 0.0, lw=2, head_width=0.04, head_length=0.05, fc='gray', ec='gray',
                  linestyle='dashed')
        plt.arrow(-0.5, -0.2, 0.0, -0.15, lw=2, head_width=0.04, head_length=0.05, fc='gray', ec='gray',
                  linestyle='dashed')
        plt.text(-0.75, -0.15, 'Input left: right', rotation='horizontal', fontsize='9', color='gray')
    elif input == 'left':
        # input right
        plt.arrow(-0.7, -0.2, 0.15, 0.0, lw=2, head_width=0.04, head_length=0.05, fc='gray', ec='gray',
                  linestyle='dashed')
        plt.arrow(-0.5, -0.2, 0.0, 0.15, lw=2, head_width=0.04, head_length=0.05, fc='gray', ec='gray',
                  linestyle='dashed')
        plt.text(-0.75, -0.30, 'Input left: left', rotation='horizontal', fontsize='9', color='gray')
    elif input == 'forward':
        # input forward
        plt.arrow(-0.75, -0.2, 0.35, 0.0, lw=2, head_width=0.04, head_length=0.05, fc='gray', ec='gray',
                  linestyle='dashed')
        plt.text(-0.8, -0.15, 'Input left: forward', rotation='horizontal', fontsize='9', color='gray')
    elif input is None or input == 'None':   # no vehicle detected on the left
        pass
    else:
        raise Exception('Invalid percept input')

def percept_input_right(input):
    """Represents the input percept and direction of a potential vehicle approaching from the RIGHT
       valid inputs: approaching vehicle turning right, turning left,  moving forward.
       Percepts as observed from agent. Turns represented respect to approaching vehicle.
    """
    if input == 'left':
        # input left
        plt.arrow(0.9, -0.2, -0.15, 0.0, lw=2, head_width=0.04, head_length=0.05, fc='gray', ec='gray',
                  linestyle='dashed')
        plt.arrow(0.70, -0.2, 0.0, -0.15, lw=2, head_width=0.04, head_length=0.05, fc='gray', ec='gray',
                  linestyle='dashed')
        plt.text(0.65, -0.15, 'Input right: left', rotation='horizontal', fontsize='9', color='gray')
    elif input == 'right':
        # input right

        plt.arrow(0.9, -0.2, -0.15, 0.0, lw=2, head_width=0.04, head_length=0.05, fc='gray', ec='gray',
                  linestyle='dashed')
        plt.arrow(0.7, -0.2, 0.0, 0.15, lw=2, head_width=0.04, head_length=0.05, fc='gray', ec='gray',
                  linestyle='dashed')
        plt.text(0.6, -0.30, 'Input right: right', rotation='horizontal', fontsize='9', color='gray')
    elif input == 'forward':
        # input forward
        plt.arrow(0.9, -0.2, -0.3, 0.0, lw=2, head_width=0.04, head_length=0.05, fc='gray', ec='gray',
                  linestyle='dashed')
        plt.text(0.5, -0.15, 'Input right: forward', rotation='horizontal', fontsize='9', color='gray')
    elif input is None or input == 'None':   # no vehicle detected on the right
        pass
    else:
        raise Exception('Invalid percept input')

def percept_input_oncoming(direction):
    """Represents the input percept of actual oncoming traffic approaching.
       Valid inputs: oncoming traffic from left, forward or right direction.
       Percepts taken from perspective of agent.
    """
    if direction == 'left':
        # oncoming left
        plt.arrow(-0.8, 0.3, 0.4, 0.0, head_width=0.04, head_length=0.05, fc='b', ec='b')
        plt.text(-0.8, 0.35, 'Oncoming left', rotation='horizontal', fontsize='9')
    elif direction == 'forward':
        # oncoming forward
        plt.arrow(0.05, 1.2, 0.0, -0.45, head_width=0.04, head_length=0.05, fc='b', ec='b')
        plt.text(-0.05, 1.45, 'Oncoming forward', rotation='vertical', fontsize='9')
    elif direction == 'right':
        # oncoming right
        plt.arrow(0.9, 0.3, -0.4, 0.0, head_width=0.04, head_length=0.05, fc='b', ec='b')
        plt.text(0.45, 0.35, 'Oncoming right', fontsize='9')
    elif direction is None or direction == 'None':   # no oncoming traffic
        pass
    else:
        raise Exception('Invalid oncoming traffic percept')

def waypoint(direction):
    """Represents the current waypoint of agent.
       Valid inputs: waypoint turning left, right or moving forward
       Percept taken from perspective of agent.
    """
    desc = 'Next Waypoint'
    # waypoint forward
    if direction == 'forward':
        plt.arrow(0.05, -0.5, 0.0, 0.3, head_width=0.04, head_length=0.05, fc='k', ec='k', linestyle='solid', linewidth=3.5)
        plt.text(-0.1, -0.6, desc, fontsize='9', fontweight='bold')
    # waypoint right
    elif direction == 'right':
        plt.arrow(0.05, -0.5, 0.0, 0.15, head_width=0.04, head_length=0.05, fc='k', ec='k', linestyle='solid', linewidth=3.5)
        plt.arrow(0.05, -0.3, 0.1, 0.0, head_width=0.04, head_length=0.05, fc='k', ec='k', linestyle='solid',
                  linewidth=3.5)
        plt.text(-0.1, -0.6, desc, fontsize='9', fontweight='bold')
    elif direction == 'left':
        plt.arrow(0.05, -0.5, 0.0, 0.15, head_width=0.04, head_length=0.05, fc='k', ec='k', linestyle='solid', linewidth=3.5)
        plt.arrow(0.05, -0.3, -0.1, 0.0, head_width=0.04, head_length=0.05, fc='k', ec='k', linestyle='solid',
                  linewidth=3.5)
        plt.text(-0.1, -0.6, desc, fontsize='9', fontweight='bold')

    elif direction is None or direction == 'None':   # no waypoint to draw
        pass
    else:
        raise Exception('Invalid waypoint percept')

def maxQ_action(action):
    """Writes a message text indicating the action that maximizes Q given the observed percepts"""
    # maxQ action enclosing rectangle
    maxq_rect = plt.Rectangle((0.45, 1.2), 0.55, 0.55, fill=False)
    plt.gca().add_patch(maxq_rect)
    fs = 9
    # maxQ action forward
    if action == 'forward':
        plt.arrow(0.7, 1.35, 0.0, 0.2, head_width=0.04, head_length=0.05, fc='darkgreen', ec='darkgreen', linestyle='solid',
                  linewidth=3.5)
        plt.text(0.5, 1.25, 'MaxQ action', fontsize=fs, fontweight='bold', color='darkgreen')
        plt.text(0.55, 1.65, 'Forward', fontsize=fs, fontweight='normal', color='darkgreen')

    # maxQ action left
    elif action == 'left':
        plt.arrow(0.7, 1.35, 0.0, 0.15, head_width=0.04, head_length=0.05, fc='darkgreen', ec='darkgreen', linestyle='solid',
                  linewidth=3.5)
        plt.arrow(0.7, 1.55, -0.1, 0.0, head_width=0.04, head_length=0.05, fc='darkgreen', ec='darkgreen',
                  linestyle='solid',
                  linewidth=3.5)
        plt.text(0.5, 1.25, 'MaxQ action', fontsize=fs, fontweight='bold', color='darkgreen')
        plt.text(0.6, 1.65, 'Left', fontsize=fs, fontweight='normal', color='darkgreen')

    # maxQ action right
    elif action == 'right':
        plt.arrow(0.7, 1.35, 0.0, 0.15, head_width=0.04, head_length=0.05, fc='darkgreen', ec='darkgreen', linestyle='solid',
                  linewidth=3.5)
        plt.arrow(0.7, 1.55, 0.1, 0.0, head_width=0.04, head_length=0.05, fc='darkgreen', ec='darkgreen',
                  linestyle='solid',
                  linewidth=3.5)
        plt.text(0.5, 1.25, 'MaxQ action', fontsize=fs, fontweight='bold', color='darkgreen')
        plt.text(0.6, 1.65, 'Right', fontsize=fs, fontweight='normal', color='darkgreen')

    # maxQ action None (idle)
    elif action is None or action == 'None':
        plt.text(0.63, 1.45, 'Idle', fontsize='12', fontweight='bold', color='darkgreen')
        plt.text(0.5, 1.25, 'MaxQ action', fontsize='9', fontweight='bold', color='darkgreen')
        plt.text(0.6, 1.65, 'None', fontsize=fs, fontweight='normal', color='darkgreen')
    else:
        raise Exception('Invalid maxQ action')

#State = (waypoint, inputs['light'], inputs['oncoming'], inputs['left'], inputs['right'])
#{'avail_actions': [('forward', '-0.44'), ('None', '0.19'), ('right', '-0.24'), ('left', '-0.24')],
# 'inp_light': 'red', 'inp_left': 'forward', 'inp_right': None, 'inp_oncoming': 'forward',
# 'waypoint': 'left', 'maxQaction': 'None'}

def legended_description(obs):
    fs = 8
    xpos = 0.4
    ypos = 1.1
    plt.text(xpos-0.1, ypos, '(Action, Q-value) chosen from:', rotation='horizontal', fontsize=fs,color='blue')
    for i, act in enumerate(obs['avail_actions']):
        plt.text(xpos+0.1, (ypos-0.1)-i*0.1, act, rotation='horizontal', fontsize=fs,color='blue')

    state_descr = ('waypoint', 'input_light', 'input_oncoming', 'input_left', 'input_right')
    state_components = (obs['waypoint'], obs['inp_light'], obs['inp_oncoming'], obs['inp_left'], obs['inp_right'])
    plt.text(xpos - 1.55, ypos, 'State description:', rotation='horizontal', fontsize=fs, color='blue')
    for j, component in enumerate(state_components):
        plt.text(xpos - 1.55, (ypos - 0.1) - j * 0.1, state_descr[j]+': '+str(component), rotation='horizontal', fontsize=fs, color='blue')
    
def act_policy_figure(observation):
    """Generates a figure representation of the Q-table entry or policy observation"""
    create_canvas('on')
    intersection_point()
    waypoint(observation['waypoint'])
    percept_input_oncoming(observation['inp_oncoming'])
    percept_input_left(observation['inp_left'])
    percept_input_right(observation['inp_right'])
    percept_traffic_light(observation['inp_light'])
    maxQ_action(observation['maxQaction'])
    legended_description(observation)

# Generate a set of policy observations extracted from Q-table
sample = qtable_sample('logs/sim_improved-learning.txt', 9)
plt.figure(1, figsize=(16, 13.5))
nrows = 3
ncols = 3
for fig_num, observation in enumerate(sample):
    plt.subplot(nrows, ncols, fig_num+1)
    act_policy_figure(observation)
    plt.tight_layout()
plt.show()

Figure description: Each subfigure symbolically represents the state of the environment, the actions available to the agent in the current state, and the action taken by the agent (the action with maximum Q-value). The thick black arrow indicates the next waypoint (forward, right or left) at the intersection (small dashed circle). The actual oncoming traffic direction (oncoming forward, left or right) is shown with solid blue arrows. The sensor inputs indicating the motion intention of traffic approaching the intersection are shown with gray arrows (vehicles moving forward, or turning right or left, as observed from the left or right). The action taken is indicated in the upper right box with solid green arrows. The environment and action descriptions are also available in blue text. The results are based on a 9-item random sample extracted from the Q-table found in the file 'sim_improved-learning.txt'.

Optimal Policy¶

  • The most important part of the (theoretical) optimal policy (the one the agent will discover or learn) is the one that complies with the U.S. Traffic and Right-of-Way Rules, that is, every action that the smartcab agent takes must comply with those rules.

  • In our particular implementation, this component of the optimal policy is the set of actions that avoid any of the traffic violations specified in lines 275-331 of the act() function found in the environment.py module.

  • It is assumed that perfect compliance with the above-mentioned rules produces trips free from traffic violations and accidents. The other component of this policy consists of the actions taken to meet a Smartcab trip's chronological deadline, when enforced.

  • The policy that the smartcab learns or discovers on its own must approximate the theoretical policy as closely as possible. In terms of safety, we aim for perfect equivalence to the U.S. right-of-way rules, which yields an A+ rating. Obviously, the agent does not know anything about U.S. traffic laws; however, the Q-learning algorithm effectively teaches the agent these rules.

  • In our specific case, the near-optimal policy is illustrated by the entries of the Q-table generated by the chosen reinforcement learning algorithm. We need to verify that each of the max-Q actions taken by the cab conforms to U.S. traffic rules.

We can look at the 'sim_improved-learning.txt' text file to see the results of the improved Q-Learning algorithm.

  • For each state that has been recorded from the simulation, we need to find out whether the policy (the action with the highest value) is correct for the given state.

  • In addition, we need to find out whether there are any states where the policy is different from what would be expected from an optimal policy.

To answer the above questions, I generated a script that translates each entry of the text file into a graphic representation of the Q-table entries (current env. state, available actions and chosen actions).

The script extracts a random sample from the data:

I extracted random samples to avoid cherry-picking data entries from the file. Please see the 3x3 array of figures shown above, as well as its caption.

Figure description:

(1st row, 1st column):

The next waypoint is to the right. The traffic light is green and there is no oncoming traffic from any direction. The agent detects two distant vehicles approaching from the left and right, each with intent to make a left turn. The agent decides to make a right turn. This action is Q-maximal and conforms to the rules set in the environment module and U.S. traffic rules. The action follows the optimal policy.

(1st row, 2nd column):

The next-waypoint is forward. There is a red light and no oncoming traffic. A vehicle is approaching from the left. The agent stays idle and decides to wait (for the green) as its next waypoint is forward. Therefore, the agent again follows the optimal policy.

(1st row, 3rd column):

The next waypoint is forward. Red traffic light. Oncoming vehicle from the left. A vehicle approaches from the right with intent to make a right turn. The agent stays idle. It follows the optimal policy.


------------------------------------------------------------------------------------------

(2nd row, 1st column):

The next waypoint is to the right. The traffic light is red. There is forward oncoming traffic, and vehicles approaching from the left and right, each with the intent of making a left turn. The agent decides to make a right turn. This indeed follows the enforced policy, and is a valid action.

(2nd row, 2nd column):

The next waypoint is to turn left. The traffic light is green and there is oncoming traffic from the left. Sensors also detect vehicles approaching from the left, and one from the right with intent of turning right. The agent makes the valid decision of making a left turn.

  • The above action is perhaps not optimal according to U.S. traffic rules. Upon further inspection it is allowed and rewarded positively according to the rules in environment.py. (Please see my comment at the end of this answer.)

(2nd row, 3rd column):

Similar situation to the previous one, except this time there is only oncoming traffic from the left. Again, the turn seems to be a valid action according to the rules in environment.py. (Please see my comment at the end of this answer.)


------------------------------------------------------------------------------------------

(3rd row, 1st and 2nd columns):

In both cases, the next waypoint is forward and the traffic light is red; regardless of approaching traffic, the agent stays idle at the red light, presumably waiting for the green to continue forward. The current actions are optimal.

(3rd row, 3rd column): This case is interesting, and possibly not optimal. The traffic light is green and there is a distant vehicle approaching from the left and oncoming traffic from the right. The next waypoint is to the left; however, the agent decides to make a valid right turn. The action taken is optimal from the safety perspective, but it perhaps compromises efficiency a bit.

Observation on the reward system enforced in environment.py:

Original line numbers 300 to 326 found in https://github.com/udacity/machine-learning/blob/master/projects/smartcab/smartcab/environment.py

In [ ]:
        # Agent wants to drive forward:
        if action == 'forward':
            if light != 'green': # Running red light
                violation = 2 # Major violation
                if inputs['left'] == 'forward' or inputs['right'] == 'forward': # Cross traffic
                    violation = 4 # Accident
        
        # Agent wants to drive left:
        elif action == 'left':
            if light != 'green': # Running a red light
                violation = 2 # Major violation
                if inputs['left'] == 'forward' or inputs['right'] == 'forward': # Cross traffic
                    violation = 4 # Accident
                elif inputs['oncoming'] == 'right': # Oncoming car turning right
                    violation = 4 # Accident
            else:# Green light
                if inputs['oncoming'] == 'right' or inputs['oncoming'] == 'forward': # Incoming traffic
                    violation = 3 # Accident
                else: # Valid move!
                    heading = (heading[1], -heading[0])

        # Agent wants to drive right:
        elif action == 'right':
            if light != 'green' and inputs['left'] == 'forward': # Cross traffic
                violation = 3 # Accident
            else: # Valid move!
                heading = (-heading[1], heading[0])

Lines 8-20 above define the conditions for traffic violations when the agent turns left. Lines 16-20 establish what happens when there is a green light. As we can see, there is no penalty when there is oncoming traffic coming from the left, i.e. the if statement in line 17 does not test for inputs['oncoming'] == 'left', nor does it test the values of inputs['right'] and inputs['left'].

Therefore the agent decisions based on the max-Q actions shown in the figure panels (2nd row, 2nd column) and (2nd row, 3rd column) are valid, since they fall into the else statement body in line 19. However, they seem to be ambiguous with respect to the actual traffic rules.

Similarly, it seems that additional testing conditions could be added in line 26, when the environment has a green light and the agent is making a right turn. Currently, there is no testing for oncoming traffic nor for other inputs from the left or right.

  • Everything would be fine if a green light for the agent implied a red light for vehicles approaching from the left or right.

  • However, we cannot assume that oncoming or approaching vehicles would stop at their red light.

So, in my opinion, the aforementioned testing conditions could be added to the code above as an extra layer of safety in further applications.
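
As an illustration only (this is not part of the original environment.py), the extra guard for a right turn might look something like the sketch below, which adds one check for oncoming traffic turning left across the agent's path on a green light:

def right_turn_violation(light, inputs):
    """Illustrative sketch only, not the original environment.py code: returns a
    violation code for a right turn, adding one extra guard for oncoming traffic
    turning left across the agent's path on a green light."""
    violation = 0
    if light != 'green' and inputs['left'] == 'forward':       # Cross traffic on red
        violation = 3                                          # Accident
    elif light == 'green' and inputs['oncoming'] == 'left':    # Extra check (not in the original)
        violation = 3                                          # Potential accident
    return violation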


Future Rewards - Discount Factor, 'gamma'¶

Including future rewards in the algorithm helps propagate positive rewards backwards from a future state to the current state. Essentially, if the driving agent is given the option to make several actions to arrive at different states, including future rewards will bias the agent towards states that could provide even more rewards.

An example of the above would be the driving agent moving towards a goal: With all actions and rewards equal, moving towards the goal would theoretically yield better rewards if there is an additional reward for reaching the goal.

In this project, the driving agent is trying to reach a destination in the allotted time, therefore, including future rewards will not benefit the agent. In fact, if the agent were given many trials to learn, it could negatively affect Q-values.
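
For reference, the general Q-learning update with a discount factor $\gamma$ is

$$ Q(s,a) \leftarrow Q(s,a) + \alpha\left[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,\right], $$

and setting $\gamma = 0$ recovers the undiscounted rule used in this project, $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,r$.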

Analysis:¶

There are two characteristics about the project that invalidate the use of future rewards in the Q-Learning algorithm. One characteristic has to do with the Smartcab itself, and the other has to do with the environment.

Agent characteristics:

The agent maintains an egocentric view and is only aware of the immediate next waypoint and its sensory percepts. This simplified agent lacks any geopositional inputs, such as real-time navigation information and traffic conditions within the grid enclosing the origin and destination coordinates. So there is no way the agent can anticipate or replan its route based on non-local percepts. Even if it could, the state space would become too large to handle with the current implementation.

Environment characteristics:

Aside from the origin location, the destination coordinates also change (they are chosen at random) between trials. Therefore, the values of future rewards along a particular path that seemed optimal in one trial will change (to arbitrary sub-optimal values) in the next trial. Future rewards in the next trial will be uncorrelated with those in the previous trial. In other words, what the agent learned in terms of future rewards during previous trials could be detrimental to learning valid actions in future trials. This situation would prevent the algorithm from converging towards an optimal policy.
