I had an epiphany (love that word) last night while trying to sleep, and came upon an idea that seems incredibly obvious now, so I guess I was stuck in a mental rut for a while. Need to raise the temperature when that happens, or something...
Anyway, my agent was spending all its time choosing illegal moves and being punished by the environment, which simply kept rejecting those moves until the agent returned a valid one. However, since I'm using a neural network as a function approximator for the TD(λ) agent, making 1000 illegal moves followed by one valid one isn't a great way to learn. The backpropagation for the many illegal moves not only slows the whole thing down but, much worse, adds so much noise to the network that the reward for a good move seems to get drowned out, as one of my lecturers in DCU helpfully pointed out a few days ago.
So last night I realised that not only was training on illegal moves a waste of time, but more importantly, since the rules define exactly which moves are illegal, I could just have the environment tell the learning agent which moves are legal (by passing in an array of booleans, one for each action... false for illegal and true for legal). Then the agent devalues the illegal moves and pays attention to the valid moves instead... although this could have been implemented better. In fact, everything could do with a lot of refactoring - the code is too tightly coupled to add proper unit tests for some of the behaviour.
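To make the idea concrete, here's a minimal sketch of picking a move with a legality mask. It isn't my actual code: the function name `select_action` and the assumption that the network gives one value estimate per action are just for illustration. Rather than devaluing illegal moves, this version simply skips them, which has the same effect on the choice.

```python
def select_action(values, legal):
    """Return the index of the highest-valued legal action.

    values: one estimated value per action (assumed network output layout)
    legal:  booleans from the environment, True where the move is legal
    """
    if len(values) != len(legal):
        raise ValueError("values and legal must have the same length")

    best_index = None
    best_value = None
    for index, (value, is_legal) in enumerate(zip(values, legal)):
        if not is_legal:
            continue  # illegal moves are never considered, so never trained on
        if best_index is None or value > best_value:
            best_index, best_value = index, value

    if best_index is None:
        raise ValueError("no legal moves available")
    return best_index
```

So with values `[0.9, 0.5, 0.7]` and a mask of `[False, True, True]`, the agent picks action 2, even though the illegal action 0 has the highest estimate. An exploring agent would pick randomly among the legal indices instead of always taking the best one.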
Anyway, I implemented this and now the agents train against each other at a rate of almost one game per second on my machine. Not blazingly fast, but fast enough to get the following modest result in time for a demo, hopefully:
score before training: 76%
score after training: 88%
Success: 22 tests passed.
Test time: 886.09 seconds.
That was for 100 evaluation games against two random players, followed by 500 training games against other TD(λ) players, followed by another 100 evaluation games against randomers. The comparison probably isn't totally accurate, as the agent is learning even during the first evaluation games against the random players, and also because random players don't play to win, so it's almost a different game!
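The evaluate, train, evaluate procedure above can be sketched roughly like this. The names `win_rate` and `run_experiment` are made up for the example, the two callables stand in for whatever actually plays a game, and I'm assuming here that the score is simply the fraction of evaluation games won.

```python
def win_rate(play_game, games):
    """Play `games` games and return the fraction won.

    play_game: hypothetical callable that plays one game and returns
               True if the agent under test won
    """
    if games <= 0:
        raise ValueError("games must be positive")
    wins = sum(1 for _ in range(games) if play_game())
    return wins / games


def run_experiment(play_eval_game, play_training_game,
                   eval_games=100, training_games=500):
    """Score the agent, train it, then score it again."""
    before = win_rate(play_eval_game, eval_games)   # vs random players
    for _ in range(training_games):
        play_training_game()                        # vs other TD(λ) players
    after = win_rate(play_eval_game, eval_games)    # vs random players again
    return before, after
```

If the agent had a switch to turn learning off, flipping it inside `play_eval_game` would stop the evaluation games from also acting as training, which is the inaccuracy mentioned above.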
Still, it's an encouraging result... actually the first encouraging result. Bit of a happy birthday present really...