Not sure now whether to have the TD net represent its estimated values like this or keep it as an expanded sigmoid for now.
Here's its performance just now on a 30-room random walk test. There are two actions: 0 is left and 1 is right. All rewards are 0 except on entering the terminal state (the furthest room to the right), which gives a reward of +100. I kind of expected much better performance ages ago, but I guess it's harder to get these systems right than I thought. Although it completes the task in 15 moves (the minimum number) a few times, it often goes wrong.
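For reference, the test environment described above could be sketched roughly like this. This is a minimal sketch and not the actual test code: the names `N_ROOMS`, `START` and `step` are made up for illustration, the start room is an assumption chosen so that the shortest path is 15 moves, and I'm assuming the leftmost wall just blocks movement rather than ending the run.

```python
N_ROOMS = 30               # rooms are numbered 0..29
TERMINAL = N_ROOMS - 1     # furthest room to the right
START = TERMINAL - 15      # assumption: start 15 rooms away from the goal
LEFT, RIGHT = 0, 1         # action codes as used in the log output


def step(state, action):
    """Apply one action and return (next_state, reward, done)."""
    if action == RIGHT:
        next_state = state + 1
    else:
        # Assumption: walking left from room 0 leaves the agent in room 0.
        next_state = max(0, state - 1)
    done = next_state == TERMINAL
    # All rewards are 0 except on entering the terminal room.
    reward = 100.0 if done else 0.0
    return next_state, reward, done
```

Under these assumptions, always choosing `RIGHT` from `START` reaches the terminal room in 15 steps and collects the single +100 reward.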
learningRate: 0.5
Sometimes it gets stuck in some weird state (a local optimum?) and never really learns anything. At this point, the action selection policy is epsilon-greedy: with probability 0.08 it chooses a random action; otherwise it takes the action with the best estimated value, with a 50/50 toss-up if the two actions appear to be worth the same.
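The selection rule just described can be sketched as follows. Again this is only an illustration, not the code that produced the log: `select_action` and its parameters are hypothetical names, and `values` is assumed to hold the net's current estimates for action 0 and action 1.

```python
import random

EPSILON = 0.08  # probability of taking a random exploratory action


def select_action(values, epsilon=EPSILON, rng=random):
    """Epsilon-greedy choice between two actions.

    values: [estimate for action 0 (left), estimate for action 1 (right)]
    rng:    any object with random() and randrange(), e.g. random.Random(seed)
    """
    # Explore: ignore the estimates entirely.
    if rng.random() < epsilon:
        return rng.randrange(2)
    # Tie: both actions look equally good, so toss a coin.
    if values[0] == values[1]:
        return rng.randrange(2)
    # Exploit: take the action with the higher estimated value.
    return 0 if values[0] > values[1] else 1
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which helps when trying to reproduce one of the stuck runs.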
steps taken on run 0: 697556
action 0 chosen: 91.2195%, action 1 chosen 8.78051%
steps taken on run 1: 67
action 0 chosen: 38.806%, action 1 chosen 61.194%
steps taken on run 2: 99
action 0 chosen: 42.4242%, action 1 chosen 57.5758%
steps taken on run 3: 121
action 0 chosen: 43.8017%, action 1 chosen 56.1983%
steps taken on run 4: 115
action 0 chosen: 43.4783%, action 1 chosen 56.5217%
steps taken on run 5: 27
action 0 chosen: 22.2222%, action 1 chosen 77.7778%
steps taken on run 6: 15
action 0 chosen: 0%, action 1 chosen 100%
steps taken on run 7: 35
action 0 chosen: 28.5714%, action 1 chosen 71.4286%
steps taken on run 8: 19
action 0 chosen: 10.5263%, action 1 chosen 89.4737%
steps taken on run 9: 17
action 0 chosen: 5.88235%, action 1 chosen 94.1176%
steps taken on run 10: 17
action 0 chosen: 5.88235%, action 1 chosen 94.1176%
steps taken on run 11: 15
action 0 chosen: 0%, action 1 chosen 100%
steps taken on run 12: 15
action 0 chosen: 0%, action 1 chosen 100%
I'm thinking the next step is to implement a softmax action selection method and maybe another test... or, I dunno, actually carry on with the functional requirements of the thing. Time runneth short.
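For what it's worth, softmax (Boltzmann) action selection might look something like the sketch below. Nothing here is implemented yet: the function names and the `temperature` parameter are assumptions for illustration, and the temperature would need tuning to the scale of the value estimates (rewards here go up to +100).

```python
import math
import random


def softmax_probs(values, temperature=1.0):
    """Turn value estimates into selection probabilities."""
    # Subtract the maximum before exponentiating to avoid overflow.
    highest = max(values)
    exps = [math.exp((v - highest) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_softmax(values, temperature=1.0, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]
```

Unlike epsilon-greedy, this explores more when the two estimates are close and less when one action is clearly better; a higher temperature pushes the choice towards 50/50, a lower one towards pure greedy.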