Averaging Weights of Identical LSTM Models for a Unified Global Model

I’m currently working on a project where I have several pre-trained LSTM models, all with the same architecture. My goal is to combine these models into a single global model by averaging their weights. However, I’m encountering an issue where the output of the global model does not make sense, even though the trend captured by the model seems accurate.

Has anyone here dealt with a similar challenge, and if so, could you share how you managed to average the weights while keeping the outputs realistic?

I would be grateful for any advice.

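```python
def average_weights(model_list):
    # Start from the first model's weights (a list of NumPy arrays)
    avg_weights = model_list[0].get_weights()

    # Add the corresponding arrays of every remaining model
    for model in model_list[1:]:
        for i, layer_weights in enumerate(model.get_weights()):
            avg_weights[i] += layer_weights

    # Divide by the number of models to get the element-wise mean
    for i in range(len(avg_weights)):
        avg_weights[i] /= len(model_list)

    return avg_weights
```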


@albi_z8 Instead of a simple weight average, a more robust approach is to ensemble the models or to use knowledge distillation. Ensembling averages the models' predictions; knowledge distillation trains a new model to mimic the behavior of the ensemble.

```python
import tensorflow as tf

def ensemble_models(models, input_shape):
    # One shared input is fed to every pre-trained model
    inputs = tf.keras.layers.Input(shape=input_shape)
    outputs = [model(inputs) for model in models]
    # Element-wise mean of the individual predictions
    avg = tf.keras.layers.average(outputs)
    ensemble_model = tf.keras.models.Model(inputs=inputs, outputs=avg)
    return ensemble_model

# Example usage
model_list = [...]  # List of your pre-trained LSTM models
input_shape = [...]  # Define the input shape of your models

ensemble_model = ensemble_models(model_list, input_shape)
```

This approach creates an ensemble model that takes the average of the predictions from individual models rather than directly averaging their weights. It provides a more stable way to combine the knowledge from multiple models without risking the loss of meaningful information.

Caveat: adjust input_shape to match your models' input shape.
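For the knowledge distillation route mentioned above, the first step is usually to build the targets the new model will learn from. Here is a minimal sketch, assuming regression-style outputs and that each teacher exposes a `predict(x)` method (as Keras models do). The names `ensemble_soft_targets`, `x_transfer` and `student` are hypothetical and only used for illustration.

```python
import numpy as np

def ensemble_soft_targets(teachers, x):
    """Return the mean prediction of all teachers on x.

    teachers: objects with a predict(x) method (assumed; Keras models qualify).
    x: input batch that every teacher accepts.
    """
    # Collect one prediction array per teacher
    preds = [np.asarray(teacher.predict(x)) for teacher in teachers]
    # Average across teachers; the result keeps the shape of a single prediction
    return np.mean(preds, axis=0)

# Hypothetical usage with Keras models (not executed here):
# soft_targets = ensemble_soft_targets(model_list, x_transfer)
# student.fit(x_transfer, soft_targets, epochs=10)
```

The student then sees only inputs and averaged predictions, so it can have any architecture, but this does require some data on which all teachers can be evaluated.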

My research focuses on ways to combine pre-trained models into a single model that can be applied to previously unsupported locations. The motivation comes from a Federated Learning perspective, but instead of averaging weights during individual training steps, my goal is to do this effectively after the models are fully trained.
An essential constraint is that I cannot transmit entire models, mainly due to privacy concerns. Instead, I am limited to sending only weights or gradients.
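If only weights can be sent, one option in the spirit of federated averaging is to weight each site's parameters by the number of training samples behind them, rather than taking a plain mean. The sketch below assumes each site sends the list returned by `get_weights()` together with a sample count. The names `weighted_average_weights`, `weight_sets` and `sample_counts` are hypothetical. It does not by itself guarantee sensible outputs: models trained independently from different initializations may not be aligned unit by unit, so their averaged weights can still behave poorly.

```python
import numpy as np

def weighted_average_weights(weight_sets, sample_counts):
    """Average per-layer weight arrays, weighting each model by its sample count.

    weight_sets: one list of NumPy arrays per model (e.g. from get_weights()).
    sample_counts: training-set size behind each model (assumed to be known).
    """
    if len(weight_sets) != len(sample_counts):
        raise ValueError("need exactly one sample count per model")

    # Normalize the counts so the coefficients sum to 1
    coeffs = np.asarray(sample_counts, dtype=np.float64)
    coeffs = coeffs / coeffs.sum()

    averaged = []
    # zip(*weight_sets) groups the i-th array of every model together
    for layer_arrays in zip(*weight_sets):
        stacked = np.stack(layer_arrays, axis=0)  # shape: (n_models, ...)
        # Contract the model axis against the coefficients
        averaged.append(np.tensordot(coeffs, stacked, axes=1))
    return averaged
```

The result can be loaded into a model of the same architecture with `set_weights`. With equal sample counts this reduces to the plain average.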