Hi there, I’m taking my first steps in ML. I’m completely new to this world and just want to start understanding the basics. My goal is to train a model to play a card game, so I started by defining a good state representation and, with the help of GPT, set up a minimal working prototype. After a few trials, my impression is that the model is not learning. I’m pasting the relevant parts here in the hope that some of you can point me in the right direction.

First I create the model and define its hyperparameters with:

```
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_model():
    input_shape = (20,)  # each state is a vector of 20 integers
    num_actions = 3      # three possible actions
    model = Sequential()
    # Single hidden layer; adjust input_shape and add more layers if needed
    model.add(Dense(128, input_shape=input_shape, activation='relu'))
    # Output layer with one softmax unit per action
    model.add(Dense(num_actions, activation='softmax'))
    # Pass the optimizer object itself so the learning rate set here is the one used
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```

I generate a set of training states, training_actions and training_rewards.

Each state is an array of 20 integers. The action is a value from 0 to 2 (there are only 3 possible actions) and the reward is either 0 or 1. The system gives a reward of 1 if the model chooses the position (0, 1 or 2) that holds the largest of the first three integers of the state.
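To make the reward rule concrete, here is a minimal sketch of it. The function name `compute_reward` is only illustrative and is not the name used in my project, and resolving ties in favour of the lowest index is an assumption.

```python
import numpy as np

def compute_reward(state, action):
    """Return 1 if `action` is the index of the largest of state[0..2], else 0."""
    # np.argmax returns the first index of the maximum,
    # so ties go to the lowest index (assumption)
    best_action = int(np.argmax(state[:3]))
    return 1 if action == best_action else 0
```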

Action for each state is predicted with:

```
action_probabilities = model.predict(np.expand_dims(state, axis=0),verbose = 0)[0]
```

I then normalize the probabilities and choose the action with the highest predicted probability. Once I have the predicted action, I calculate the reward: if the predicted action is the index of the greatest value among state[0], state[1] and state[2], then reward=1; otherwise reward=0.
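A sketch of that selection step, assuming `action_probabilities` is the 1-D array returned by the prediction call above (the helper name `choose_action` is hypothetical):

```python
import numpy as np

def choose_action(action_probabilities):
    """Normalize the probabilities and return the index of the largest one."""
    probs = np.asarray(action_probabilities, dtype=float)
    # Renormalize; a softmax output should already sum to roughly 1
    probs = probs / probs.sum()
    # Greedy choice: always the most probable action, no exploration
    return int(np.argmax(probs))
```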

For example:

```
Predicted Actions: [2, 2, 0]
Input States: [[4, 3, 2, -1, -1, -1, 8, -1, -1, 1, -1, -1, 5, -1, -1, 0, 1, 0, 1, 0], [1, 8, 9, -1, -1, -1, 11, -1, -1, 3, -1, -1, 13, -1, -1, 0, 1, 0, 1, 0], [8, 2, 15, -1, -1, -1, 1, -1, -1, 1, -1, -1, 2, -1, -1, 0, 1, 0, 1, 0]]
Rewards: [0, 1, 0]
```

As you can see, the predicted actions should have been [0, 2, 2], because the greatest of the first 3 elements of each state array is at index 0 (value 4 in the first state), index 2 (value 9 in the second state, correctly predicted) and index 2 (value 15 in the third state), respectively.
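The same check can be run on the whole batch at once. This is a small standalone sketch using the three example states; the variable names are only illustrative.

```python
import numpy as np

# The three example states and the actions the model predicted for them
states = np.array([
    [4, 3, 2, -1, -1, -1, 8, -1, -1, 1, -1, -1, 5, -1, -1, 0, 1, 0, 1, 0],
    [1, 8, 9, -1, -1, -1, 11, -1, -1, 3, -1, -1, 13, -1, -1, 0, 1, 0, 1, 0],
    [8, 2, 15, -1, -1, -1, 1, -1, -1, 1, -1, -1, 2, -1, -1, 0, 1, 0, 1, 0],
])
predicted_actions = np.array([2, 2, 0])

# Index of the largest of the first three entries of every state
expected_actions = np.argmax(states[:, :3], axis=1)
# Reward is 1 where the prediction matches the expected action, else 0
rewards = (predicted_actions == expected_actions).astype(int)

print(expected_actions.tolist())  # [0, 2, 2]
print(rewards.tolist())           # [0, 1, 0]
```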

Even after feeding hundreds of thousands of samples to the model, it is unable to predict good actions.

These are the functions used to train the model:

```
import numpy as np

# `log` and `config` are defined elsewhere in the project (not shown here)

def train_model(model, states, actions, rewards):
    # Log the data passed to train_model to a file
    log(f"Actions: {actions}")
    log(f"States: {states}")
    log(f"Rewards: {rewards}")
    if states:
        # Convert training data to NumPy arrays
        X_train = np.vstack(states)
        action_indices = np.array(actions)
        rewards = np.array(rewards)
        # Calculate Q-values for the chosen actions
        q_values = calculate_q_values(len(states), action_indices, rewards)
        # Train the model
        model.fit(X_train, q_values, epochs=20, verbose=0)


def calculate_q_values(num_states, action_indices, rewards):
    # In this simplified case, we assume the rewards apply to the chosen actions
    q_values = np.zeros((num_states, len(config['commands'])))
    for i in range(len(action_indices)):
        action_index = action_indices[i]
        reward = rewards[i]
        q_values[i, action_index] = reward  # Update Q-value for the chosen action
    return q_values
```
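To make the training targets concrete, here is a small standalone sketch of what `calculate_q_values` builds for the three example states above. I have replaced `len(config['commands'])` with a hard-coded `num_actions = 3`, which is an assumption based on the model definition.

```python
import numpy as np

num_actions = 3  # assumption: stands in for len(config['commands'])
action_indices = np.array([2, 2, 0])  # predicted actions from the example
rewards = np.array([0, 1, 0])         # rewards from the example

# Same logic as calculate_q_values, written without the loop:
# row i gets rewards[i] in column action_indices[i], zeros elsewhere
targets = np.zeros((len(action_indices), num_actions))
targets[np.arange(len(action_indices)), action_indices] = rewards

print(targets.tolist())
# [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```

Every sample with reward 0 ends up with an all-zero target row, so only the rewarded samples carry a non-zero label. I am not sure whether that is a sensible target for a softmax output trained with categorical crossentropy.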

I really don’t know where to start optimizing the model. Any help would be very much appreciated.

Many thanks in advance,

Jaume.