About Feature Fusion and Model Fusion

Recently, I fed vectors of two different dimensions into random forest models. Both performed well individually, each reaching 95% classification accuracy. However, after combining the two RF models, the results became very poor. Is my model-fusion setup wrong? What is the problem?
Attached is the model diagram. I hope experienced partners can help answer my questions and clear up my doubts, thank you~

Hello Urnotcoward,

It seems you are trying to combine Random Forests with Neural Networks. This is possible, but there are some caveats. The two most important ones are:

  1. Random Forests do not propagate loss gradients (at least not in the native version). This means that Neural Networks located before a random forest (“dense” and “dense_1” in your example) cannot be trained with backpropagation. The solution is generally to pre-train those Neural Networks or to add a secondary path for the gradient to propagate.

You mention that the Random Forests work well, so I assume you may be pre-training those parts (the input looks like an image). A third option is to rely on the random initialization of the Neural Networks: those are essentially random projections, and they can sometimes produce usable results (though there are better ways to do it).

  2. Random Forests do not train with loss gradients. This does not seem to be an issue in your diagram, but it is important to be aware of it.

Note that those two points are discussed in detail in the Model composition colab.
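To make the first caveat concrete, here is a minimal sketch of the "random projection" fallback: an untrained dense layer in front of a random forest is just a fixed random projection, since no gradient ever reaches it. All names, shapes, and data below are made up for illustration, and scikit-learn/NumPy stand in for whatever framework the thread is using:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))          # toy raw features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels

# A random, never-trained projection: a stand-in for an untrained Dense
# layer whose weights can never be updated through the forest.
W = rng.normal(size=(768, 64))
X_proj = np.maximum(X @ W, 0.0)          # ReLU(X W), no training involved

# The forest trains normally on the projected features.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_proj, y)
```

Pre-training the dense layers on an auxiliary objective, instead of leaving them random like this, is usually the better of the two options.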

About your model specifically, there might be different things happening, and different solutions. Let me go through some initial ideas:

  1. Always start simple (e.g., just the Random Forests + one layer) and add complexity progressively (one layer at a time). Keep in mind that the more complex your networks are (e.g., with attention mechanisms), the harder they are to train.

  2. Make sure you are pre-training “dense” and “dense_1”. If not, remove them.

  3. Each random forest outputs a single value (likely a probability?). All the networks after the random forests depend only on those two values. This is a very strong data bottleneck, and it is unlikely that those networks will learn much. The correct solution here depends on what you are trying to achieve (the diagram looks a little bit like a two-towers model). Can you give more details?


Let me add a couple of suggestions on top of Mathieu’s:

  • Pre-train the two random forests, as you are likely already doing, and then concatenate their outputs with the original inputs (shaped (None, 150) and (None, 768)). That gives you a tensor shaped (None, 2 + 150 + 768), which you can feed into an FNN.
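The concatenation above can be sketched like this (shapes are taken from the thread; the data is synthetic and scikit-learn is used purely as a stand-in for whatever library is actually in play):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 300
X1 = rng.normal(size=(n, 150))   # first encoding, shape (None, 150)
X2 = rng.normal(size=(n, 768))   # second encoding, shape (None, 768)
y = (X1[:, 0] + X2[:, 0] > 0).astype(int)  # toy labels

# Pre-train one forest per encoding.
rf1 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X1, y)
rf2 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X2, y)

# Positive-class probability from each forest: one column each.
p1 = rf1.predict_proba(X1)[:, [1]]
p2 = rf2.predict_proba(X2)[:, [1]]

# Concatenate forest outputs WITH the original inputs:
# shape (None, 2 + 150 + 768) = (None, 920).
fused = np.concatenate([p1, p2, X1, X2], axis=1)

# Feed the fused tensor into a feed-forward network.
fnn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                    random_state=0).fit(fused, y)
```

Keeping the original inputs alongside the two probabilities is what avoids the two-value data bottleneck mentioned earlier in the thread.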

If I’m not misreading your plot, I think there is a problem with the m3_vec and m4_vec dense layers: their input has a dimension of 1, and expanding it to 200 won’t really do anything … there is not “enough information” in a single float (a probability) to fill a vector of 200 floats, so to speak (it’s not as if the probability had many modes that the various Dense units could separate).

Also, I don’t think the attention layer will help here: attention is good for sequences (text) or unordered sets of things, but not necessarily for things with a fixed position (where something learned for position 1 differs from something learned for position 2). You can just use normal dense layers.

Good luck, sounds like an interesting project.

Btw, we have a differentiable DF implementation in the works, but it will still take a couple of months.

Hello Mathieu,
Thank you very much for your attention and suggestions. I didn’t expect my question to get your attention so quickly, and I’m really happy.
I did refer to the fusion of random forests and neural networks that you mentioned. My goal is to encode the same set of data into two kinds of vectors and then use random forest models for feature fusion, to achieve better results.
Following your suggestion, I started simple: I first fused the two sets of vectors and then fed them into a random forest for training. The accuracy is about 1.6% higher than the previous single-vector setup, without adding an attention mechanism or other complicated things. I was really surprised!
In addition, I found that my original error came from fusing at the decision layer, which is not a valid method here. Unlike a neural network, a random forest directly outputs classification results, and fusing those outputs is not meaningful.
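For completeness, the working early-fusion setup described above can be sketched like this (scikit-learn and synthetic data are used purely for illustration; the feature dimensions are the ones mentioned in the thread):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
X1 = rng.normal(size=(n, 150))             # encoding A of the samples
X2 = rng.normal(size=(n, 768))             # encoding B of the same samples
y = (X1[:, 0] + X2[:, 0] > 0).astype(int)  # toy labels

# Early fusion: concatenate the two encodings FIRST, then train ONE forest.
# (Fusing the forests' output probabilities afterwards is the
# "decision-layer" fusion that did not work.)
X_fused = np.concatenate([X1, X2], axis=1)  # shape (n, 150 + 768)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fused, y)
```

This way the forest can learn interactions between the two encodings, rather than seeing only each model's final verdict.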
Thank you again, and I hope we can exchange and discuss ideas again in the future.

Hello Jan,
Thank you very much for your additions. You raised a very important point: it doesn’t make sense to expand the output of a random forest from 1 to 200 dimensions. That’s true, and I hadn’t figured it out before.
My solution is to fuse the features before they enter the random forest. There are already some improvements in the initial stage, which is really amazing!
It is indeed a difficult and complicated job. We have been working hard on it for a long time, and I hope everything goes well from here on.
The differentiable DF implementation you mentioned you are working on sounds great! I wish you all the best and look forward to seeing the results of your research soon!