Saturday, April 14, 2007

reinforcement learning with function approximation

This is much harder than I thought. I've got a series of tests (not very modular at all - they're split into only two source files and mix test-first unit tests with empirical performance measures) that each examine some component of the system. The neural network seems to be functioning reasonably, although it's less accurate than I'd expected when it outputs a pure numeric value; the accuracy only comes back if I format the output as a binary string again... annoying.

Not sure whether to have the TD net represent its estimated values as a binary string like this, or to keep them as an expanded (scaled) sigmoid for now.
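
For my own reference, the two encodings I'm weighing up look roughly like this. This is just a minimal Python sketch, not the real code, and the value range and bit width are placeholder assumptions:

def encode_scaled(value, v_max=100.0):
    # single output unit: the value squashed into the sigmoid's (0, 1) range
    return [value / v_max]

def encode_binary(value, bits=7, v_max=100.0):
    # one output unit per bit: the value quantised and written out as a binary string
    level = int(round(value / v_max * (2 ** bits - 1)))
    return [(level >> i) & 1 for i in reversed(range(bits))]

def decode_binary(outputs, bits=7, v_max=100.0):
    # threshold each unit at 0.5 and read the bits back as a number
    level = 0
    for o in outputs:
        level = (level << 1) | (1 if o >= 0.5 else 0)
    return level / (2 ** bits - 1) * v_max

The binary version presumably trades resolution for easier per-unit targets; the scaled version keeps a single continuous output, which is the one that seems to lose accuracy.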

Here's its performance just now on a 30-room random walk test. There are two actions: 0 is left and 1 is right. All rewards are 0 except on entering the terminal state (the furthest room to the right), which gives a reward of +100. I expected to have much better performance ages ago, but I guess these systems are harder to get right than I thought. Although it completes the task in 15 moves a few times (the minimum), it often goes wrong. (A rough sketch of the environment follows the run log below.)
learningRate: 0.5
steps taken on run 0: 697556
action 0 chosen: 91.2195%, action 1 chosen 8.78051%
steps taken on run 1: 67
action 0 chosen: 38.806%, action 1 chosen 61.194%
steps taken on run 2: 99
action 0 chosen: 42.4242%, action 1 chosen 57.5758%
steps taken on run 3: 121
action 0 chosen: 43.8017%, action 1 chosen 56.1983%
steps taken on run 4: 115
action 0 chosen: 43.4783%, action 1 chosen 56.5217%
steps taken on run 5: 27
action 0 chosen: 22.2222%, action 1 chosen 77.7778%
steps taken on run 6: 15
action 0 chosen: 0%, action 1 chosen 100%
steps taken on run 7: 35
action 0 chosen: 28.5714%, action 1 chosen 71.4286%
steps taken on run 8: 19
action 0 chosen: 10.5263%, action 1 chosen 89.4737%
steps taken on run 9: 17
action 0 chosen: 5.88235%, action 1 chosen 94.1176%
steps taken on run 10: 17
action 0 chosen: 5.88235%, action 1 chosen 94.1176%
steps taken on run 11: 15
action 0 chosen: 0%, action 1 chosen 100%
steps taken on run 12: 15
action 0 chosen: 0%, action 1 chosen 100%
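
For completeness, the random walk environment is roughly this (a Python sketch; the start room and the left-boundary behaviour are guesses on my part - only the right end is terminal here, and starting in room 14 is what makes 15 the minimum number of moves):

class RandomWalk:
    # 30 rooms in a row; action 0 moves left, action 1 moves right.
    # All rewards are 0 except entering the rightmost room, which gives +100 and ends the episode.
    def __init__(self, n_rooms=30, start=14):
        self.n_rooms = n_rooms
        self.start = start
        self.pos = start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        self.pos = max(self.pos, 0)            # assumed: bounce off the left wall
        done = (self.pos == self.n_rooms - 1)  # rightmost room is terminal
        reward = 100.0 if done else 0.0
        return self.pos, reward, done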

Sometimes it gets stuck in some weird state (a local optimum?) and never really learns anything. At the moment the action selection policy is epsilon-greedy: with probability 0.08 it chooses a random action, otherwise it takes the action with the best estimated value, falling back to a 50/50 toss-up when the two actions appear equally good.
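
In code, the current policy amounts to something like this (a sketch; q_values here stands in for the net's estimated value of each action, which isn't how my actual code is structured):

import random

def select_action(q_values, epsilon=0.08):
    # with probability epsilon, explore with a uniformly random action
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # otherwise be greedy, breaking exact ties with a coin toss
    best = max(q_values)
    candidates = [a for a, q in enumerate(q_values) if q == best]
    return random.choice(candidates)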

I'm thinking the next step is to implement a softmax action selection method and maybe another test... or, I dunno, actually carry on with the functional requirements of the thing. Time runneth short.
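
If I do go the softmax route, it would be something along these lines (a sketch, assuming a Boltzmann distribution over the estimated values with a temperature parameter I'd still have to tune):

import math
import random

def softmax_action(q_values, temperature=1.0):
    # higher-valued actions get picked more often, but every action keeps some probability
    m = max(q_values)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    probs = [e / total for e in exps]
    # sample an action from the resulting distribution
    r = random.random()
    cumulative = 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return action
    return len(q_values) - 1  # guard against floating-point round-off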
