Hello,
I worked on the SARSA algorithm as well as on the Q-learning algorithm, and each produced a different Q matrix (duh!). The difference lies in how each algorithm updates its values: Q-learning is off-policy, updating toward the best possible future reward regardless of which action the agent actually takes next, while SARSA is on-policy, taking an action under the current policy and updating the Q matrix with that action's value.
The grid-game example from the previous post showed different results when I implemented SARSA. It also involved some repetitive paths, whereas Q-learning didn't show any. Stepping through a single run showed that SARSA follows the path the agent actually takes, while Q-learning follows the optimal path.
To implement both, I keep the pseudocode in mind.
Q-Learning
initialize Q matrix
loop (episodes):
    choose an initial state (s)
    while (s is not the goal):
        choose the action (a) with the maximum Q value
        determine the next state (s')
        target -> immediate reward + discounted future reward (gamma * max(Q[s'][a']))
        update Q[s][a] toward the target
        s <- s'
    new episode
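To make the pseudocode concrete, here is a minimal Python sketch of the Q-learning loop. The 4x4 grid, the placeholder rewards, and the gamma/alpha/epsilon values are my assumptions for illustration; the actual code is in the gists linked further down.

import numpy as np

# Assumed setup: 16 states on a 4x4 grid, 4 actions (up, down, left, right).
n_states, n_actions, goal = 16, 4, 15
R = np.full((n_states, n_actions), -1.0)   # placeholder reward matrix
R[14, 3] = R[11, 1] = 100.0                # assumed moves into the goal box
Q = np.zeros((n_states, n_actions))        # initialize Q matrix
gamma, alpha, epsilon = 0.8, 0.1, 0.2      # typical values, not the gists'

def step(s, a):
    # Illustrative grid transition; moves off the grid leave s unchanged.
    row, col = divmod(s, 4)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    r2, c2 = row + dr, col + dc
    s2 = r2 * 4 + c2 if (0 <= r2 < 4 and 0 <= c2 < 4) else s
    return s2, R[s, a]

for episode in range(500):
    s = 0                                  # choose an initial state
    for _ in range(100):                   # cap steps so episodes always end
        # epsilon-greedy: usually the max-Q action, sometimes a random one
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # off-policy target: immediate reward + discounted max over s' actions
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
        if s == goal:
            break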
SARSA
initialize Q matrix
loop (episodes):
    choose an initial state (s)
    choose an action (a) from s (epsilon-greedy)
    while (s is not the goal):
        take action (a), observe the reward and the next state (s')
        get a' from s' (epsilon-greedy)
        TD error -> immediate reward + gamma * Q[s'][a'] - Q[s][a]
        update Q[s][a] by alpha * TD error
        s <- s', a <- a'
    new episode
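A matching sketch of the SARSA loop under the same assumed grid. The substantive difference from the Q-learning sketch is one line: the target uses Q[s'][a'] for the action actually chosen next, not the max over actions.

import numpy as np

# Same assumed 4x4 grid and constants as the Q-learning sketch above.
n_states, n_actions, goal = 16, 4, 15
R = np.full((n_states, n_actions), -1.0)   # placeholder reward matrix
R[14, 3] = R[11, 1] = 100.0                # assumed moves into the goal box
Q = np.zeros((n_states, n_actions))
gamma, alpha, epsilon = 0.8, 0.1, 0.2

def choose(s):
    # epsilon-greedy action selection
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def step(s, a):
    # Illustrative grid transition; moves off the grid leave s unchanged.
    row, col = divmod(s, 4)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    r2, c2 = row + dr, col + dc
    s2 = r2 * 4 + c2 if (0 <= r2 < 4 and 0 <= c2 < 4) else s
    return s2, R[s, a]

for episode in range(500):
    s = 0
    a = choose(s)                          # on-policy: pick a before the loop
    for _ in range(100):                   # cap steps so episodes always end
        s_next, r = step(s, a)
        a_next = choose(s_next)            # get a' from s'
        # on-policy TD error: uses Q[s'][a'], not the max over actions
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
        if s == goal:
            break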
Here are the outputs from Q-learning and SARSA.
The output above is from Q-learning.
This one is from SARSA.
There is a clear difference between the two Q matrices. I worked on another example using both Q-learning and SARSA. It might look similar to the classic cliff-walking problem to some readers, so bear with me.
The code for Naruto-Q-Learning is below
https://gist.github.com/Shashi18/59becd86a03694550e4a690084cc2d55.js
Here is Hinata trying to find her way to her goal using SARSA.
The code for Hinata's SARSA learning is below:
https://gist.github.com/Shashi18/272cf8aeef9fcd52ec39e832839212f4.js
I used the epsilon-greedy method for action selection. I generate a random floating-point number between 0 and 1 and set epsilon to 0.2. If the generated number is greater than 0.2, I select the action with the maximum Q value (argmax); if it is less than 0.2, I select a permitted action at random. With each passing episode I decrease the value of epsilon (epsilon decay). This ensures that as the agent learns its way, it follows the learned path rather than continuing to explore: exploration is at its maximum at the start of the simulation and gradually decreases as episodes pass. A sketch of this selection rule follows below.
This is the decay of epsilon.
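Here is a minimal sketch of that selection rule with decay; the decay factor and floor are my assumptions, not necessarily the values used in the gists.

import numpy as np

epsilon, decay, eps_min = 0.2, 0.99, 0.01     # decay schedule is an assumption

def epsilon_greedy(q_row, allowed, eps):
    # random permitted action with probability eps, otherwise argmax over Q
    if np.random.rand() < eps:
        return int(np.random.choice(allowed))
    return max(allowed, key=lambda a: q_row[a])

# demo with a single Q row and a set of permitted actions
q_row = np.array([0.1, 0.5, -0.2, 0.3])
allowed = [0, 1, 3]
for episode in range(100):
    a = epsilon_greedy(q_row, allowed, epsilon)
    epsilon = max(eps_min, epsilon * decay)   # exploration fades each episode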
The path followed in the above simulation is 0 – 4 – 8 – 9 – 10 – 11 – 7. Sometimes the agent also follows the same path it found during Q-learning. I am continuing my exploration of this topic and will post more details as I learn more about RL.
Till then, bye
Nice update, thank you. Please can you explain how you generate your matrices for reward and state? Thanks.
The first row specifies TOP, BOTTOM, LEFT, RIGHT. The rewards are: 0 points if the move goes out of the box; 100 points if the agent lands on the green box; -1 if the agent lands on a box other than green or red; and -10 if the agent lands on the red box by any move specified above. Each box has its own state number, starting from the top left and going horizontally from 0 to 15. So -1 is an impossible state.
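For readers following along, here is a minimal sketch of building such a reward matrix; the positions of the green and red boxes are assumptions for illustration, not the values from the post.

import numpy as np

# 4x4 grid, states 0..15 numbered left to right from the top-left corner.
n = 4
green, red = 15, 10                      # assumed box positions
moves = {0: -n, 1: n, 2: -1, 3: 1}       # TOP, BOTTOM, LEFT, RIGHT

R = np.zeros((n * n, 4))
for s in range(n * n):
    row, col = divmod(s, n)
    for a, d in moves.items():
        # moves that leave the grid score 0 points (out of the box)
        if ((a == 0 and row == 0) or (a == 1 and row == n - 1)
                or (a == 2 and col == 0) or (a == 3 and col == n - 1)):
            continue                     # R[s, a] stays 0
        s_next = s + d
        if s_next == green:
            R[s, a] = 100                # landing on the green box
        elif s_next == red:
            R[s, a] = -10                # landing on the red box
        else:
            R[s, a] = -1                 # any other box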
Thanks, I got it from your previous post. Please can you share your email address with me?
It's working. Can it be ported to hardware?