by Manfred Füllsack


This model demonstrates a partly heuristic, partly q-learning oriented kind of machine learning. An agent learns the shortest path through a labyrinth. See below how it works.


Imagine a robot looking for the optimal path to deliver a component from one end of a crowded shop floor to the other.

Pushing <setup> creates a shop floor scenario with an agent (red spot) situated on a yellow patch on one corner, a goal (green house) on one of the other corners and several obstacles (blue sqares) in between. By pushing <go> the agent starts searching for the goal. At first she has no idea where the goal is. She moves completely at random and it might take some time until she first time coincidentally hits the goal. But once she has found it, she remembers the direction from her origin (the yellow patch). She is moved back there and starts her next mission. This time she is geared by her knowledge of the right direction. However, there might be obstacles on the way. So whenever she encounters an obstacle she randomly tries to circumvent it either to the right or to the left. By doing this she marks and counts the patches she is moving on. When reaching her goal, she remembers the patch-count and starts her third mission from the yellow patch. She again orients on direction but randomly varies her way through the obstacles. When reaching goal, she compares the current patch-count with the one of her former mission. If the current patch-count is lower she erases her knowledge of the former path and memorizes the patches of her last mission as the current optimal path. From now on she follows this path with a certain probability. That is, corresponding to her <certainty> she chooses the patches of this path to move to her goal. Corresponding to her remaining uncertainty however she varies her way and when reaching goal compares patch-counts again. Whenever she finds a still shorter path she memorizes the corresponding patches and increases her certainty. In this way, she eventually might find an optimal solution. When certainty reaches 100% the agent stops moving and displays its solution.

Distributed learning

The <num-obstacle>-slider lets you vary the number of obstacles (the percentage is just an approximation). In scenarios with many obstacles, chances will be that the agent looses itself in a dead end. In most cases, she will find out again, but it might take some time. In this case, the <num-learners>-slider lets you increase the number of agents that will start each time anew from the yellow patch when a mission is accomplished. Each of them browses the scenario in the same way and when finding a better way communicates it to the others. Even in complex scenarios (with percentage 13 and higher), the agents thus can be brought on their way. However, in most cases the combined solution of many agents might be suboptimal compared to the solution of just one. Thats why the <change-to-one>-switch lets you change back to the one-agent-scenario whenever you think certainty might be high enough for the sole agent not to get lost in a dead end again.

The <accuracy>-slider increases or decreases the time the agents keep searching. The <restart same scenario>-button does exactly this.

Sometimes, agents coincidentally might find a good or even optimal solution on one of their first missions. Certainty then will not reach 100% and the simulation will not stop. In this case, you can push <stop and show solution>. If you consider this solution suboptimal just increase accuracy and restart the scenario. In some cases however, with a num-obstacles-percentage of 13 and higher, there might be no solution at all. The way might be blocked.


Coded by Manfred Füllsack (source code on demand), Jan. 2009  (to be improved and continued)


Sutton, Richard S. / Barto, Andrew G. (1998): Reinforcement Learning: An Introduction. MIT Press, Cambridge: Bradford. (An HTML verson of the book can be found here.)

Mitchell, Tom (1997): Machine Learning, New York: McGraw Hill.