Where do rewards come from?
This is actually several blog entries from my old website stitched together. I hate to lose these things when I migrate software, so I’m trying to keep them alive.
These are some pretty random thoughts, btw. My opinions have likely changed since writing this :)
Entry 1:
So, I'm reading this book "How the Mind Works" by Steven Pinker. It's great. It describes how evolution, over time, has produced very specialized mental functions in the brain. The idea is that we take for granted that complex physical structures have evolved, but we think of the mind as some general-purpose thinking machine. Pinker's view is that the mind has evolved in much the same way as the rest of the body. So this got me thinking about the reward function in reinforcement learning...
So, in reinforcement learning, we generally have some states and a reward function, and we want to find a policy that maximizes the discounted sum of future rewards generated by this function. We have decent solutions to finding such a policy in fairly complex domains.
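To make "discounted sum of future rewards" concrete, here is a minimal sketch for a finite episode. The function name and the discount value are my own choices for illustration, not anything from a particular library.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a finite list of rewards."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (everything after).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps at -1 each, with gamma = 0.5: -1 - 0.5 - 0.25
print(discounted_return([-1, -1, -1], gamma=0.5))  # -1.75
```

The policy we are after is the one whose expected value of this quantity is largest.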
But... it takes a long time. And really, in real life we don't have a long time. Take animals for example. I know, I know - trying to relate a new idea to something I hardly understand from nature is a farce, but just hear out this illustrative example. Animals know bad tastes from good tastes. They have a natural aversion to things that taste bitter and a natural attraction to things that taste sweet. This isn't something that they learn, it is something that they are born with. Why? Why not learn it? Because if animals had to learn everything from scratch, they would die. Extinction. Animals run from loud noises. Same deal. Evolution programmed some things in to help animals survive.
Ok, but how does this apply to learning? Well, animals learn to associate things. They can learn to associate people with loud noises for example - so stay away from people. Or, maybe they will associate people with food (don't feed the wildlife) - so people become a secondary reinforcer, and approaching people becomes a good thing.
Maybe (and just maybe) if we want our agents to learn quickly and generalize well, we need to tailor their reward function more than we have been. I mean, look at a big maze. You can give a reward of -1 for every state-action pair except escape, and then marvel that the agent learns the fastest way out. The problem is that it learns this optimal policy in the limit, which can take a long time. When people first learn of reinforcement learning, they almost always say "Can't we give positive rewards for going near the exit and negative rewards for being far from it?" The common answer from the classical viewpoint is "no". The reason is that then you are doing all of the work: analyzing the domain and crafting a reward function that helps the agent. Better, or so we're told, is to just tell the agent the end result of what you want it to do: exit the maze. Make everything else bad, and eventually the agent will work things out.
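Here is a rough sketch of the two reward schemes side by side. Everything in it is an assumption for illustration: cells are (row, column) tuples, "near the exit" is measured with Manhattan distance, and the tailored version rewards getting closer rather than absolute closeness.

```python
def step_penalty(next_state, exit_cell):
    """Classical reward: -1 for every move, 0 for the move that reaches the exit."""
    return 0.0 if next_state == exit_cell else -1.0


def manhattan(a, b):
    """Grid distance between two (row, col) cells, ignoring walls."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def distance_tailored(state, next_state, exit_cell):
    """Hand-tailored reward: +1 for moving closer to the exit, -1 for moving away."""
    before = manhattan(state, exit_cell)
    after = manhattan(next_state, exit_cell)
    if after < before:
        return 1.0
    if after > before:
        return -1.0
    return 0.0


exit_cell = (3, 3)
print(step_penalty((0, 1), exit_cell))               # -1.0
print(distance_tailored((0, 0), (0, 1), exit_cell))  # 1.0
```

Note that the tailored version ignores walls, so in a real maze it can pull the agent into dead ends. That is exactly the kind of domain analysis the classical answer says you shouldn't have to do by hand.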
All those points are valid. But what if we have additional constraints? Say, the agent is a failure if it does not exit the maze within a fixed limit of time. Even if the agent is given the task over and over, it may take a huge amount of time before it finds a way out. But... if we had a reward function that rewards subgoal behaviour, perhaps this agent could learn its way out quickly, and on the first try. Wouldn't that be neat? I think so.
So, what does it take to tailor a reward function? Work. You have to try a bunch of them, and do some sort of local search to get better ones. The good news: you can do it in parallel, which saves some time.
I think the big win here actually is going to be with function approximation. What we often find is that we have a function approximator which doesn't discriminate well along the lines that are necessary to maximize some reward function. Like, we want the robot to get out of its pen, but the pen is round and our function approximator uses squares. So, maybe the agent needs to do some bumping into walls and zigzagging because some squares are "good sometimes" and "bad other times". This is a bit wishy-washy, but stay with me. Maybe with an evolving reward function, we can make the task easier to learn. Maybe we can provide rewards in such a way that the overall task (escaping the pen) is made easiest given the function approximation. Maybe the reward function can evolve to exploit regularities in the function approximator. Heck, maybe we can evolve the reward function and the feature set in parallel and find interesting features that give us discrimination and generalization exactly where we need it. Maybe we could even evolve a starting policy at the same time and build in instinctive behaviour.
Anyways, these are some ideas. I don't know if they've been done. I'm about to read Geoffrey Hinton's paper on "How learning can guide evolution". I think maybe it's backwards and we should use evolution to guide learning... but maybe not.
If this is new, it's going to be some sort of search in reward space; maybe I can bundle it up into a neat paper.
Entry 2:
This is a bit of an extension to the story above... I did some more thinking and read Geoffrey Hinton's paper.
So, we're talking about crafting a reward function. But, this makes the bottom fall out of our barrel. If agents are supposed to maximize their reward, and we are learning a reward function to help the agent succeed, the obvious degenerate case is for the agent to get high reward for doing nothing (or doing anything).
How does nature deal with this problem? Nature doesn't even consider the problem, because the goal and the rewards are distinct. It doesn't matter how happy I am in my life, or how much reward I accumulate, if I do not reproduce, then my genes have failed in their goal, which is to propagate themselves. One distinct, simple goal. Survive.
We can see this in many different aspects of human nature (I think - I'm no psychologist). Why is getting better than having? Why is the thrill in the chase? Why do rich people gamble? Why take the smaller payout instead of the larger one spread over time? People like to get.
Where am I going with this?
I'm going to postulate that people like getting because there is a reward for getting. I'll come back to this if I can make it more clear.
In the maze example, we can see what we need to do. The reward evolution decides how the agent is rewarded, but (like in nature - sheesh I'm doing it again) the agent needs to be evaluated by an external process. Did it get out of the maze? Did it get out of the maze fast? It doesn't really matter how much reward (fun) the agent had running around the maze, it matters whether it got out. That is the fitness function that guides the reward evolution and eventually evaluates the agent. Will this work with more complex tasks?
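Here is a toy sketch of that split between inner reward and outer fitness. Everything in it is an assumption made for illustration: the "maze" is a straight corridor, the reward function has just two evolved numbers (a per-step term and a bonus for moving toward the exit), the learner is one episode of tabular Q-learning, and the "evolution" is a simple hill climb rather than a real population.

```python
import random

N = 8            # corridor cells 0..N-1, exit at N-1 (hypothetical toy task)
STEP_LIMIT = 30  # the agent fails if it is not out by then


def run_agent(reward_params, seed=0):
    """One episode of tabular Q-learning; returns steps to exit (STEP_LIMIT on failure)."""
    step_term, progress_term = reward_params
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N)]  # actions: 0 = left, 1 = right
    state = 0
    for t in range(1, STEP_LIMIT + 1):
        if rng.random() < 0.1:  # occasional random exploration
            action = rng.randrange(2)
        else:  # greedy, ties broken at random
            best = max(q[state])
            action = rng.choice([a for a in (0, 1) if q[state][a] == best])
        next_state = min(N - 1, max(0, state + (1 if action == 1 else -1)))
        # Inner reward: what the agent "feels". This is the evolved part.
        reward = step_term + progress_term * (next_state - state)
        target = reward + 0.9 * max(q[next_state])
        q[state][action] += 0.5 * (target - q[state][action])
        state = next_state
        if state == N - 1:
            return t
    return STEP_LIMIT


def fitness(reward_params):
    """Outer evaluation: only escape time counts, never the reward collected."""
    trials = [run_agent(reward_params, seed=s) for s in range(5)]
    return -sum(trials) / len(trials)


def evolve(generations=40, seed=1):
    """Hill climb in reward space, starting from the classical -1 per step."""
    rng = random.Random(seed)
    best = (-1.0, 0.0)
    best_fit = fitness(best)
    for _ in range(generations):
        candidate = tuple(p + rng.gauss(0, 0.5) for p in best)
        candidate_fit = fitness(candidate)
        if candidate_fit > best_fit:  # keep only strict improvements
            best, best_fit = candidate, candidate_fit
    return best, best_fit


best_params, best_fit = evolve()
print(best_params, best_fit)
```

Because fitness never looks at the inner reward, a reward function that hands out big numbers for doing nothing gains no advantage: it still has to get the agent out of the corridor.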
Maybe it'll work better. (Maybe not). We really need to provide input here, which I would prefer we didn't, but for now we will, and we'll keep it simple. Say we are making a robot that walks. If it falls, it fails. If it moves forward some distance, it passes. There is the fitness function. Pass/fail. Maybe.
What about something like playing blackjack? This is harder. Rationally, it seems that people are bad at gambling. People get addicted. People chase their losses with more good money. If leaving with less than you started with is failing, and leaving with much more than you started with is winning, maybe our gambling reward function does just the right thing? The expected value of gambling is a loss, so perhaps a few big bets are better than many smaller bets. If you are down a bunch of money, the only way to not be down a bunch of money is to win. Maybe chasing lost money is actually a good thing, given the goal of not being down a bunch of money.
Anyways, so this *is* the hard part, I won't deny it. By going a level up from the reward function, we have to come up with a simpler fitness function, something that is almost braindead. If not, then my whole argument can be applied recursively to some higher goal. Maybe that's not a terrible idea, but it's not the one I want to explore. What do we do? Maybe standard RL goals are ok. Playing a game - winning the game is good, losing is bad. If we are using a large population of agents, then the stochasticity of games and different opponents works itself out. If we are playing with a single agent, this doesn't work so well. But, would evolution work with a small population? Nope.
I'm trying to think of something with a really complicated reward function. An example at ICML '04 where they did inverse reinforcement learning was this car driving task. You wanted to stay on the road, not hit people, not get hit by people, go fast, etc, etc. Their argument (if I recall) for inverse RL was that people can perform this task well, but have a hard time constructing the reward function for an agent to do as well. If the penalty for going off the road is too weak, then the agent will drive off road to avoid the stochastic nature of traffic. If this penalty is too strong, then the agent will crash into other cars instead of veering into the shoulder. What could we do here? I'm not quite sure. Maybe I'll come back and edit this. Otherwise, send me an e-mail if you have some idea.
Entry 3:
So, previously - I was all about reward function shaping. I read some work by Andrew Ng, and he showed that reward shaping can be a little dangerous, and that perhaps we should instead shape with what he describes as a potential function. Many of the benefits, with less risk. Then I looked at Eric Wiewiora's research note showing that this potential function scheme is the same as just setting the initial value function to the potential function.
Maybe this is a win? Setting the initial value function is the same as using a potential function, which is safer and has all the benefits of changing the reward function.
So - I thought about evolving the value function. What I decided was that evolving the value function has little benefit over starting the agent over multiple times with random value functions and then taking some of what was learned in each one and combining it. Is this, then, the same as learning off-policy with a few good exploratory policies?
So, is this whole direction a waste? Perhaps. I want to speak further with Vadim about it and see what he thinks.
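For reference, the potential function scheme adds F(s, s') = gamma * Phi(s') - Phi(s) to the ordinary reward. A small sketch of why that is the "safer" kind of shaping: along any trajectory the added terms telescope, so the shaped return differs from the original one only by a quantity that depends on the start and end states, not on the path taken. The corridor, the potential, and the trajectory below are all made up for illustration.

```python
GAMMA = 0.9
EXIT = 5  # hypothetical corridor with cells 0..5


def potential(state):
    """Assumed potential: zero at the exit, more negative further away."""
    return -abs(EXIT - state)


def shaped_reward(reward, state, next_state, gamma=GAMMA):
    """Ordinary reward plus the shaping term gamma * Phi(s') - Phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)


def discounted(rewards, gamma=GAMMA):
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# A wandering trajectory that eventually exits, with -1 per step.
states = [0, 1, 0, 1, 2, 3, 4, 5]
rewards = [-1.0] * (len(states) - 1)
shaped = [shaped_reward(r, s, s2) for r, s, s2 in zip(rewards, states, states[1:])]

steps = len(rewards)
difference = discounted(shaped) - discounted(rewards)
# Telescoping sum: gamma**T * Phi(s_T) - Phi(s_0), whatever happened in between.
expected = GAMMA ** steps * potential(states[-1]) - potential(states[0])
print(difference, expected)
```

Since the offset is the same for every path between the same two endpoints, the shaping can't make a detour look better than the direct route, which an arbitrary hand-crafted bonus can.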