DQN reinforcement learning network won't train

I have a project that uses DQN/RL to search an n-dimensional space for the “best” solution, where “best” is defined by a single real-valued reward. The plan is that new but similar searches will need to be done from time to time, and if we can train a DQN on some general cases, the trained network should make the search for a new, related case faster. We are looking at perhaps 300,000 possible solutions in a 16-dimensional space, something like a:

4x3x3x2x2x2x2x2x2x2x2x2x2x2x2x2 space.
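(For concreteness, that dimension list works out to just under 300,000 states:)

import math

dims = [4, 3, 3] + [2] * 13   # the 16 dimension sizes listed above
print(math.prod(dims))        # 294912, i.e. roughly 300,000 possible solutions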

One important issue is that calculating the reward is very expensive (minutes per evaluation), so while we can afford to train the network by running many, many evaluations up front, we would then like to find the best solution to a new, similar problem much faster using the pre-trained network.

I’ve adapted a solution from the frozen lake example:

https://github.com/hamedmokazemi/DeepQLearning_FrozenLake_1

as it is also a space search. An important difference is that the input to the frozen lake network is one-hot encoded; I needed to make mine just binary, as there are ~300,000 possible solutions to my problem, while the frozen lake only has 16. The output of our DQN is, like the frozen lake's, the best action. A rough sketch of the binary encoding is below, followed by the model definition (for this simpler test model, I only have a 4-dimensional space, 3x3x3x3 = 81 possible solutions).
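Roughly, the encoding works like this (a simplified sketch; the helper name and exact bit layout are just illustrative, not my actual code):

import numpy as np

def encode_state(indices, dims):
    """Flatten per-dimension indices to a single state index, then
    express that index as a fixed-width binary vector for the network."""
    flat = np.ravel_multi_index(indices, dims)       # e.g. (2, 0, 1, 2) -> 59
    n_bits = int(np.ceil(np.log2(np.prod(dims))))    # bits needed to cover all states
    bits = [(flat >> k) & 1 for k in range(n_bits)]  # least-significant bit first
    return np.array(bits, dtype=np.float32)

# The 3x3x3x3 toy problem: 81 states fit in 7 bits instead of an 81-wide one-hot vector.
print(encode_state((2, 0, 1, 2), (3, 3, 3, 3)))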

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def buildmodel(self):
    # Weights initialized uniformly in [0, 1]
    initializer = tf.keras.initializers.RandomUniform(minval=0., maxval=1.)
    model = Sequential()
    # Input is the binary-encoded state vector (env.bin_state_size bits)
    model.add(Dense(50, input_dim=env.bin_state_size, activation='relu',
                    kernel_initializer=initializer))
    model.add(Dense(50, activation='linear', kernel_initializer=initializer))
    # One output per action: the predicted Q-value of taking that action
    model.add(Dense(self.action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
    return model

Pretty similar to the frozen lake model, but with two hidden layers, and I increased the number of nodes to 50 (as the problem is a little bigger).

My problem is that this refuses to train. After a few episodes, the prediction for any state is the same action. Which action it picks seems to depend on the initial conditions, but it will just keep selecting, e.g., action = 1 over and over. If I look at predictions for different states, they all predict the same action. The predicted values are different, but the argmax over actions is always the same.
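For illustration, the kind of check I'm doing looks roughly like this (a sketch using a stand-in model of the same shape as the 3x3x3x3 toy problem, not my actual training code):

import numpy as np
import tensorflow as tf

# Stand-in model with the same shape as my toy problem:
# 7-bit binary state in, one Q-value per action out.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(7,)),
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(50, activation='linear'),
    tf.keras.layers.Dense(4, activation='linear'),
])

# Every state index 0..80 encoded as a 7-bit vector.
states = np.array([[int(b) for b in format(i, '07b')] for i in range(81)],
                  dtype=np.float32)

q_values = model.predict(states, verbose=0)   # shape (81, 4)
actions = np.argmax(q_values, axis=1)
# With my trained network, this array ends up containing the same
# action index for every one of the 81 states.
print(actions)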

I'm really stuck here. I don't think it is overfitting (but I tried dropout anyway, and it didn't help), and I have tried more nodes, more layers, and linear and tanh activations.

Interestingly, this all works just fine for 3 dimensions, but then fails at 4 dimensions. Note that frozen_lake is 2-dimensional.

Any suggestions would be greatly appreciated. I'd be happy to put the code on GitHub if some generous person would like to look at it.

thanks

Mark