This is a quick advert for the excellent online course “Reinforcement Learning” by David Silver.
David Silver’s ‘teaching’ home page includes links to the slides (which are particularly helpful for Lecture 7, where there’s no video available, just sound).
These notes are mainly for my own consumption…
Musings : Eligibility Traces
Several things that bubbled to the surface during the Reinforcement Learning course:

Eligibility traces, which in \(TD(\lambda)\) are a way of keeping track of how rewards should be attributed to the states/actions that led to them, seem very reminiscent of activation potentials

Aiming for minimum fit error is the same as optimising for least surprise, which is the same as making predictions that are as unsurprising as possible

Distributing the amount of ‘surprise’ (positive or negative) to exponentially decaying eligibility nodes within a neural network seems like a very natural reinterpretation of the \(TD(\lambda)\) framework

But current neural networks are densified, with multidimensional spaces embedding knowledge in a way that is entangled far more thoroughly than the sparse representations (apparently) found in the brain

Perhaps the ‘attribution problem’ that sparse & eligibility solves is equivalent to dense & backprop

But there appears to be a definite disconnect between the rationales behind each of the two methods that requires more justification
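For my own reference, here is a minimal sketch of backward-view \(TD(\lambda)\) with accumulating eligibility traces, showing how each step's TD error (the ‘surprise’) gets shared out among recently visited states. The 5-state random-walk world, the function name `td_lambda_random_walk` and all the parameter values are my own assumptions for illustration, not anything taken from the course materials.

```python
import random


def td_lambda_random_walk(n_states=5, episodes=5000, alpha=0.01,
                          gamma=1.0, lam=0.8, seed=0):
    """Tabular backward-view TD(lambda) on a toy random walk (assumed setup).

    States are 0..n_states-1. Walking off the left end terminates with
    reward 0; walking off the right end terminates with reward 1.
    """
    rng = random.Random(seed)
    v = [0.5] * n_states                 # value estimates
    for _ in range(episodes):
        e = [0.0] * n_states             # eligibility traces, reset each episode
        s = n_states // 2                # start in the middle
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:
                reward, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:
                reward, v_next, done = 1.0, 0.0, True
            else:
                reward, v_next, done = 0.0, v[s_next], False

            delta = reward + gamma * v_next - v[s]   # TD error: the 'surprise'
            e[s] += 1.0                              # mark current state as eligible
            for i in range(n_states):
                v[i] += alpha * delta * e[i]         # share surprise by eligibility
                e[i] *= gamma * lam                  # exponential decay of the trace
            if done:
                break
            s = s_next
    return v


if __name__ == "__main__":
    print([round(x, 2) for x in td_lambda_random_walk()])
```

For this particular toy world the true values are 1/6, 2/6, …, 5/6, so the estimates should end up hovering around those numbers (with some noise, since the step size is constant).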
Musings : Exploration vs Exploitation

Interesting that \( (\mu + n\sigma) \) is an effective action-chooser, since that is what I was doing in the 1990s, but using it as a placeholder heuristic until I found something that was ‘more correct’

But \([0,1]\)-bounded i.i.d. variables having a decent confidence bound (Hoeffding’s inequality) was a new thing for me

Also liked Thompson sampling (from the 1930s): implicitly creating samples according to a distribution, merely by using samples from the underlying factors, is very elegant, and it makes me think about the heuristic in Genetic Algorithms for ‘just sampling’ as potentially solving a (very complex) probability problem, without actually doing any computation
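To make the two action-choosers concrete, here is a rough sketch of both on a Bernoulli bandit. Note that the first one is UCB1, whose bonus \( \sqrt{2 \ln t / n} \) comes from Hoeffding’s inequality, rather than the \( (\mu + n\sigma) \) rule itself; the second is Thompson sampling with Beta(1,1) priors. The helper `make_bandit`, the arm probabilities and the step counts are all assumptions I made up for the example.

```python
import math
import random


def make_bandit(probs, rng):
    """Hypothetical test bandit: arm i pays 1 with probability probs[i]."""
    def pull(arm):
        return 1 if rng.random() < probs[arm] else 0
    return pull


def ucb1(pull, n_arms, steps):
    """UCB1: choose the arm maximising mean + sqrt(2 ln t / n)."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1                              # try every arm once first
        else:
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean
    return counts


def thompson(pull, n_arms, steps, rng):
    """Thompson sampling: sample each arm's posterior, act on the best sample."""
    wins = [1] * n_arms                               # Beta(1, 1) prior
    losses = [1] * n_arms
    counts = [0] * n_arms
    for _ in range(steps):
        samples = [rng.betavariate(wins[a], losses[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        r = pull(arm)
        counts[arm] += 1
        if r:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    pull = make_bandit([0.2, 0.5, 0.8], rng)
    print("UCB1 pulls:    ", ucb1(pull, 3, 5000))
    print("Thompson pulls:", thompson(pull, 3, 5000, rng))
```

Nothing in `thompson` ever computes the probability that an arm is best; it only draws one sample per arm and takes the maximum, which is exactly the ‘just sampling’ flavour noted above.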
Musings : Games and State-of-the-Art

The tree optimisations seem ‘obvious’ in retrospect, but clearly each was a major ‘aha!’ when it was first proposed. Very interesting.

Almost all of the State-of-the-Art methods use binary features, linear approximations to \( v_*() \), search and self-play.

Perhaps deeper models will work instead of the linear ones, but it’s interesting that binary (~sparse?) features are basically powerful enough (when there’s some tree-search for trickier strategy planning)
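One reason the binary-feature-plus-linear combination is so cheap: the value estimate is just the sum of the weights of whichever features are ‘on’, and a TD update only touches those same weights. A tiny sketch, assuming a position is represented as a list of active feature indices (the names `value` and `td0_update` are mine, and this is plain TD(0), not any particular program’s learning rule):

```python
def value(weights, active):
    """Linear value with binary features: sum the weights of active features."""
    return sum(weights[i] for i in active)


def td0_update(weights, active, reward, next_active,
               alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) step; the gradient w.r.t. each active weight is exactly 1."""
    bootstrap = 0.0 if terminal else gamma * value(weights, next_active)
    delta = reward + bootstrap - value(weights, active)
    for i in active:
        weights[i] += alpha * delta      # only the active features are updated
    return delta


if __name__ == "__main__":
    w = [0.0] * 4
    td0_update(w, active=[0, 2], reward=1.0, next_active=[], terminal=True)
    print(w)
```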
Musings : Gimbal Learning
Lecture 8 of the Reinforcement Learning Course by David Silver is super-relevant to the Gimbal control problem that I’ve been considering.
It’s also interesting that the things I had already assumed would have been obvious are currently considered state-of-the-art (but isn’t that always the way? Hindsight, etc).
In summary :
 Learn model from real world by observing state transitions, and then learning {state, action}-to-state mapping
 Also learn {State, Action}-to-Reward mapping (almost a separate model)

Apply model-free methods to model-simulated world

Once ‘correct’ action has been selected, actually perform it
 Now we have new real-world learning with which to fine-tune the world model
To apply this to the gimbal, it seems like one could present ‘target trajectories’ to the controller in turn, letting it learn a common world-model, with different reward-models for each goal. And let it self-play…
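The learn-model / plan-in-simulation / act / refine loop summarised above is essentially the Dyna idea, so here is a minimal tabular Dyna-Q sketch of it. Everything concrete is an assumption for illustration: the toy `chain_step` world stands in for the real system (it is nothing like a gimbal), the learned model is just a lookup table that stores the transition and the reward together, and the parameter values are arbitrary.

```python
import random


def chain_step(s, a, n=6):
    """Hypothetical 'real world': action 1 moves right, action 0 moves left.

    Reaching the right-hand end (state n-1) gives reward 1 and terminates.
    """
    s2 = max(0, s - 1) if a == 0 else s + 1
    if s2 == n - 1:
        return 1.0, s2, True
    return 0.0, s2, False


def dyna_q(step, n_states, n_actions, start, episodes=50, planning_steps=20,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q: Q-learning on real steps plus replay from a learned model."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}          # learned model: (state, action) -> (reward, next_state, done)

    def greedy(s):
        best = max(q[s])
        return rng.choice([a for a in range(n_actions) if q[s][a] == best])

    def update(s, a, r, s2, done):
        target = r + (0.0 if done else gamma * max(q[s2]))
        q[s][a] += alpha * (target - q[s][a])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = rng.randrange(n_actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = step(s, a)          # actually perform the chosen action
            update(s, a, r, s2, done)         # learn directly from the real step
            model[(s, a)] = (r, s2, done)     # fine-tune the world + reward model
            for _ in range(planning_steps):   # model-free updates in the simulated world
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q


if __name__ == "__main__":
    for s, row in enumerate(dyna_q(chain_step, n_states=6, n_actions=2, start=0)):
        print(s, [round(x, 3) for x in row])
```

For the multi-goal version, one could presumably keep a single transition table and swap in a separate reward table per target trajectory, but that is speculation on my part rather than something the sketch implements.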
Apropos of Nothing
DARPA Robotics Fast Track: are they doing anything?