Update Rule in Temporal difference
Posted by Betamoo on Stack Overflow
Published on 2010-05-28T12:45:08Z
Indexed on 2010/05/29 0:52 UTC
Tags: artificial-intelligence | machine-learning | markov-models | q-learning | temporal-difference
The update rule for TD(0) Q-learning is:
Q(t-1) = (1-alpha) * Q(t-1) + alpha * (Reward(t-1) + gamma * Max( Q(t) ) )
where Max( Q(t) ) is the maximum Q value obtainable in the next state.
Then take either the current best action (to optimize) or a random action (to explore).
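For concreteness, here is a minimal Python sketch of that one-step update together with the "best action or random action" choice. The table layout (a dict of dicts), the function names `q_update` and `epsilon_greedy`, and the default values for `alpha`, `gamma` and `epsilon` are all assumptions made for illustration; they are not from any particular library.

```python
import random

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One-step Q-learning update on a tabular Q stored as Q[state][action].

    Hypothetical helper: mirrors
    Q(t-1) = (1-alpha) * Q(t-1) + alpha * (Reward(t-1) + gamma * Max(Q(t))).
    """
    # Max over the actions available in the next state
    max_next_q = max(Q[next_state].values())
    target = reward + gamma * max_next_q
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * target
    return Q[state][action]

def epsilon_greedy(Q, state, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the current best one."""
    if rng.random() < epsilon:
        return rng.choice(list(Q[state]))        # explore
    return max(Q[state], key=Q[state].get)       # optimize

# Tiny illustrative table (made-up values)
Q = {"s0": {"a": 0.0, "b": 0.0}, "s1": {"a": 1.0, "b": 2.0}}
updated = q_update(Q, "s0", "a", reward=1.0, next_state="s1", alpha=0.5, gamma=0.9)
greedy_choice = epsilon_greedy(Q, "s1", epsilon=0.0)
```

Note that the Max in the target is taken over the Q table, independently of which action `epsilon_greedy` ends up selecting next.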
But for TD(1), I think the update rule will be:
Q(t-2) = (1-alpha) * Q(t-2) + (alpha) * (Reward(t-2) + gamma * Reward(t-1) + gamma * gamma * Max( Q(t) ) )
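The two-step rule above can be written out the same way. This is only a sketch of the formula as stated in the question; the function name `two_step_update`, the argument names, and the dict-of-dicts table are hypothetical. Here `reward_1` stands for Reward(t-2) and `reward_2` for Reward(t-1), that is, whatever reward was actually observed at each step; the code itself makes no assumption about how the action at t-1 was chosen.

```python
def two_step_update(Q, state, action, reward_1, reward_2, state_after_two,
                    alpha=0.1, gamma=0.9):
    """Two-step update on a tabular Q stored as Q[state][action].

    Hypothetical helper: mirrors
    Q(t-2) = (1-alpha) * Q(t-2)
             + alpha * (Reward(t-2) + gamma * Reward(t-1) + gamma^2 * Max(Q(t))).
    """
    # Max over the actions available in the state reached after two steps
    max_q = max(Q[state_after_two].values())
    target = reward_1 + gamma * reward_2 + gamma ** 2 * max_q
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * target
    return Q[state][action]

# Tiny illustrative table (made-up values)
Q2 = {"s0": {"a": 0.0, "b": 0.0}, "s2": {"a": 1.0, "b": 2.0}}
updated_two_step = two_step_update(Q2, "s0", "a", reward_1=1.0, reward_2=0.5,
                                   state_after_two="s2", alpha=0.5, gamma=0.9)
```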
My question:
The term gamma * Reward(t-1) seems to mean that I will always take my best action at t-1, which I think will prevent exploring.
Can someone give me a hint?
Thanks
© Stack Overflow or respective owner