How does Keras' RMSProp fit sooner (in fewer epochs) than an implementation of the algorithm?

What I mean by the title is: what is the difference between my implementation, which I believe is close to the original RMSProp, and Keras' implementation?

Keras' fit method gets the loss below 0.01 within 10 epochs, while my naive implementation slows down considerably after the 0.1 mark and never reaches 0.01, even after 100 epochs.

KERAS:

Epoch 1/10
1000/1000 [==============================] - 3s 3ms/step - loss: 0.7463 - accuracy: 0.7680
Epoch 2/10
1000/1000 [==============================] - 3s 3ms/step - loss: 0.3919 - accuracy: 0.8980
Epoch 3/10
1000/1000 [==============================] - 2s 2ms/step - loss: 0.2604 - accuracy: 0.9420
Epoch 4/10
1000/1000 [==============================] - 2s 2ms/step - loss: 0.2144 - accuracy: 0.9550
Epoch 5/10
1000/1000 [==============================] - 2s 2ms/step - loss: 0.1174 - accuracy: 0.9730
Epoch 6/10
1000/1000 [==============================] - 2s 2ms/step - loss: 0.0748 - accuracy: 0.9840
Epoch 7/10
1000/1000 [==============================] - 2s 2ms/step - loss: 0.0459 - accuracy: 0.9870
Epoch 8/10
1000/1000 [==============================] - 2s 2ms/step - loss: 0.0196 - accuracy: 0.9940
Epoch 9/10
1000/1000 [==============================] - 2s 2ms/step - loss: 0.0107 - accuracy: 0.9970
Epoch 10/10
1000/1000 [==============================] - 2s 2ms/step - loss: 0.0061 - accuracy: 0.9970

NAIVE IMPLEMENTATION:

epoch: 0 loss: 2.351585176975065 acc: 0.073 // no training before this
epoch: 1 loss: 1.3441321235522397 acc: 0.738
epoch: 2 loss: 0.8169945169854006 acc: 0.822
epoch: 3 loss: 0.5958281075546342 acc: 0.849
epoch: 4 loss: 0.48944367167816155 acc: 0.868
epoch: 5 loss: 0.4215665684223718 acc: 0.879
epoch: 6 loss: 0.3784920125976777 acc: 0.891
epoch: 7 loss: 0.3491281265388871 acc: 0.899
epoch: 8 loss: 0.3267179105908342 acc: 0.902
epoch: 9 loss: 0.29812428350751236 acc: 0.917
epoch: 10 loss: 0.2816787871403444 acc: 0.918
epoch: 11 loss: 0.26484099127320365 acc: 0.922
epoch: 12 loss: 0.25452064445405986 acc: 0.928
epoch: 13 loss: 0.24510738403612584 acc: 0.925
epoch: 14 loss: 0.2313635832678401 acc: 0.934
epoch: 15 loss: 0.2263060342080479 acc: 0.934
epoch: 16 loss: 0.2141716029870379 acc: 0.937
epoch: 17 loss: 0.20585278438998866 acc: 0.941
epoch: 18 loss: 0.2008210906589328 acc: 0.939
epoch: 19 loss: 0.1914643542433619 acc: 0.943
epoch: 20 loss: 0.18526972395147176 acc: 0.95
epoch: 21 loss: 0.17977634057074943 acc: 0.953
epoch: 22 loss: 0.17659312338459698 acc: 0.95
epoch: 23 loss: 0.1730095272271334 acc: 0.951
epoch: 24 loss: 0.16751114601528463 acc: 0.956
epoch: 25 loss: 0.1622696005318275 acc: 0.956
epoch: 26 loss: 0.1584748890775764 acc: 0.96
epoch: 27 loss: 0.1541660845336603 acc: 0.963
epoch: 28 loss: 0.15366931850717738 acc: 0.962
epoch: 29 loss: 0.14791503167922196 acc: 0.964
epoch: 30 loss: 0.14370620733897888 acc: 0.965
epoch: 31 loss: 0.1394299197253062 acc: 0.967
epoch: 32 loss: 0.13728666423867625 acc: 0.968
epoch: 33 loss: 0.13302949320498997 acc: 0.968
epoch: 34 loss: 0.13020528536144707 acc: 0.968
epoch: 35 loss: 0.13069805470523502 acc: 0.969
epoch: 36 loss: 0.1276391479303531 acc: 0.97
epoch: 37 loss: 0.12226603405970293 acc: 0.969
epoch: 38 loss: 0.12101554054811792 acc: 0.969
epoch: 39 loss: 0.11890066609131254 acc: 0.969
epoch: 40 loss: 0.11783996723830573 acc: 0.971
epoch: 41 loss: 0.1132539505108236 acc: 0.97
epoch: 42 loss: 0.11192076822162904 acc: 0.973
epoch: 43 loss: 0.10894143290231988 acc: 0.972
epoch: 44 loss: 0.10717285655939912 acc: 0.974
epoch: 45 loss: 0.10487730744173353 acc: 0.974
epoch: 46 loss: 0.10197636382729229 acc: 0.973
epoch: 47 loss: 0.0991876673474291 acc: 0.973
epoch: 48 loss: 0.099348139794124 acc: 0.975
epoch: 49 loss: 0.09520582580655605 acc: 0.975
epoch: 50 loss: 0.0969406397611115 acc: 0.976
epoch: 51 loss: 0.09059255100317501 acc: 0.976
epoch: 52 loss: 0.09316977521888427 acc: 0.976
epoch: 53 loss: 0.08938247631626035 acc: 0.975
epoch: 54 loss: 0.08811868742037693 acc: 0.978
epoch: 55 loss: 0.08625686202783996 acc: 0.978
epoch: 56 loss: 0.08538771887459436 acc: 0.979
epoch: 57 loss: 0.08185635352133913 acc: 0.978
epoch: 58 loss: 0.08170590581949694 acc: 0.979
epoch: 59 loss: 0.07719183930779538 acc: 0.979
epoch: 60 loss: 0.07684899163736585 acc: 0.979
epoch: 61 loss: 0.07588368055325305 acc: 0.98
epoch: 62 loss: 0.07507468920577076 acc: 0.979
epoch: 63 loss: 0.07207410799842978 acc: 0.979
epoch: 64 loss: 0.07040289474188392 acc: 0.979
epoch: 65 loss: 0.0703488063887447 acc: 0.98
epoch: 66 loss: 0.06929370403706761 acc: 0.979
epoch: 67 loss: 0.0657441442659503 acc: 0.979
epoch: 68 loss: 0.06729136911819426 acc: 0.981
epoch: 69 loss: 0.06414198279278469 acc: 0.983
epoch: 70 loss: 0.0615023553909231 acc: 0.983
epoch: 71 loss: 0.06018738520679154 acc: 0.982
epoch: 72 loss: 0.05918258034472605 acc: 0.982
epoch: 73 loss: 0.056588497296133494 acc: 0.983
epoch: 74 loss: 0.059048515572232146 acc: 0.984
epoch: 75 loss: 0.054759201485826324 acc: 0.983
epoch: 76 loss: 0.052757782436277205 acc: 0.985
epoch: 77 loss: 0.05371287689539768 acc: 0.983
epoch: 78 loss: 0.05071757667213161 acc: 0.984
epoch: 79 loss: 0.04923249682405242 acc: 0.986
epoch: 80 loss: 0.0493799899097154 acc: 0.985
epoch: 81 loss: 0.04733707437497998 acc: 0.985
epoch: 82 loss: 0.04974538426387033 acc: 0.986
epoch: 83 loss: 0.04644481612691435 acc: 0.987
epoch: 84 loss: 0.04487185519164782 acc: 0.986
epoch: 85 loss: 0.045398671290498294 acc: 0.989
epoch: 86 loss: 0.04399450836766221 acc: 0.989
epoch: 87 loss: 0.0420689016845811 acc: 0.989
epoch: 88 loss: 0.04011364587751942 acc: 0.991
epoch: 89 loss: 0.039923482281579464 acc: 0.989
epoch: 90 loss: 0.03845436789415447 acc: 0.992
epoch: 91 loss: 0.03867456867975187 acc: 0.991
epoch: 92 loss: 0.03779903707806368 acc: 0.989
epoch: 93 loss: 0.0373895581236269 acc: 0.991
epoch: 94 loss: 0.03760058053188023 acc: 0.99
epoch: 95 loss: 0.03589038914134395 acc: 0.993
epoch: 96 loss: 0.03561026022566131 acc: 0.993
epoch: 97 loss: 0.03553280058120401 acc: 0.993
epoch: 98 loss: 0.03513232650415162 acc: 0.994
epoch: 99 loss: 0.03376538200643341 acc: 0.993
epoch: 100 loss: 0.03136016808749243 acc: 0.995

The implementation is in this repo: https://github.com/o-clipe/mykeraslike

Thank you in advance!

When comparing a custom implementation of an optimization algorithm like RMSProp to the Keras built-in version and observing significant differences in performance, several factors could be at play. These differences can stem from subtle implementation details, initialization values, or additional optimizations present in the Keras version. Here are some potential reasons why Keras’ RMSProp might outperform a naive implementation:

  1. Initialization Parameters:

    • Learning Rate: Keras might use a different default learning rate (0.001 for RMSprop) or a learning rate schedule that adjusts the learning rate over time.
    • Rho: The decay rate of the moving average of squared gradients (0.9 by default in Keras); a different value changes how quickly the denominator adapts.
    • Epsilon: A small constant added to the denominator to improve numerical stability. Differences in this value can significantly affect the algorithm’s behavior.

  2. Momentum:

Keras’ RMSProp optimizer supports a momentum term (it defaults to momentum=0.0); when enabled, it accelerates updates along consistent gradient directions and can lead to faster convergence.

  3. Weight Decay:

Some implementations include weight decay (closely related to L2 regularization) directly in the optimizer, which can help prevent overfitting and may lead to better generalization.

  4. Numerical Stability:

Keras implementations often include tweaks to improve numerical stability, such as guarding against division by zero or by very small numbers, which might not be present in a naive implementation.

  5. Gradient Clipping:

Keras optimizers can apply gradient clipping (via the clipnorm or clipvalue arguments) to prevent exploding gradients, which is particularly useful in deep or complex networks; note that clipping is off unless explicitly requested.

  6. Batch Normalization and Regularization:

If your Keras model uses batch normalization or any form of regularization, it could impact training dynamics, leading to faster convergence not directly related to the optimizer itself.

  7. Precision of Computations:

The underlying data type (e.g., float32 vs. float64) can affect the precision of computations and, consequently, the training outcome.

  8. Vectorization and Parallelization:

Keras and TensorFlow are highly optimized for performance, including vectorized operations and parallel execution, which might not be fully leveraged in a custom implementation.

  9. Update Rule Subtleties:

There might be subtle differences in the RMSProp update rule between your implementation and Keras’, even if they look similar at first glance (see the sketch after this list).
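
To make several of these points concrete (the parameters in 1, momentum in 2, and the update rule in 9), here is a minimal textbook-style RMSProp step written in NumPy. This is only a reference sketch, not Keras’ actual code; the function name and arguments are placeholders, and the comments point out the knobs that most often differ between implementations:

```python
import numpy as np

def rmsprop_step(param, grad, cache, mom,
                 lr=0.001, rho=0.9, epsilon=1e-7, momentum=0.0):
    """One textbook-style RMSProp update; all names here are placeholders."""
    # Exponential moving average of the squared gradient.
    cache = rho * cache + (1.0 - rho) * grad ** 2
    # Scale the gradient by the root of the averaged squared gradients.
    # Note: some implementations use sqrt(cache + epsilon) instead of
    # sqrt(cache) + epsilon; this detail alone can change early training.
    update = lr * grad / (np.sqrt(cache) + epsilon)
    # Optional heavy-ball momentum on the scaled update (inactive at 0.0).
    mom = momentum * mom + update
    param = param - mom
    return param, cache, mom
```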

Debugging Steps:

To pinpoint the exact cause, consider the following debugging steps:

•	Parameter Alignment: Ensure that all parameters (learning rate, rho, epsilon, momentum, etc.) are identical between your implementation and Keras’ (see the snippet after this list).
•	Verbose Output: Add verbose output to your implementation to track the learning rate, gradients, and parameter updates at each step, and compare these to what you might expect from Keras.
•	Simplify the Model: Test both implementations on a simpler model to rule out issues related to model complexity or specific layers.
•	Gradual Testing: Start with a very basic version of RMSProp (e.g., without momentum) and gradually add features to see at which point the performance diverges significantly.
•	Consult Keras Source Code: Review the Keras RMSProp source code to understand the exact implementation details and ensure your version matches it as closely as possible.
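
For the Parameter Alignment step above, it helps to construct the Keras optimizer with every hyperparameter spelled out and print its configuration, then mirror exactly those values in the custom implementation. A small sketch, assuming TensorFlow’s Keras; the values shown are the documented defaults:

```python
import tensorflow as tf

# Spell out every hyperparameter instead of relying on defaults, so the exact
# same values can be copied into the custom implementation.
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=0.001,
    rho=0.9,
    momentum=0.0,
    epsilon=1e-07,
    centered=False,
)

# Print the full configuration Keras will actually use during training.
print(optimizer.get_config())
```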

By systematically comparing your implementation against Keras’ and adjusting for these factors, you can identify what’s causing the performance difference and potentially improve your custom optimizer’s efficiency.
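
As a concrete way to combine the Gradual Testing and Verbose Output suggestions, you can also compare a single optimizer update from Keras against one computed by hand on a tiny tensor; if even the first step disagrees, the discrepancy lies in the update rule or the hyperparameters rather than in the model. A minimal sketch, assuming TensorFlow’s Keras and arbitrary example values:

```python
import numpy as np
import tensorflow as tf

lr, rho, eps = 0.001, 0.9, 1e-07
w0 = np.array([1.0, -2.0, 3.0], dtype=np.float32)
g0 = np.array([0.1, -0.2, 0.3], dtype=np.float32)

# One update with Keras' RMSprop (momentum disabled to match the hand version).
w = tf.Variable(w0)
opt = tf.keras.optimizers.RMSprop(learning_rate=lr, rho=rho,
                                  momentum=0.0, epsilon=eps)
opt.apply_gradients([(tf.constant(g0), w)])

# The same update by hand, starting from a zero-initialized cache. If the two
# results do not match, try moving epsilon inside the square root.
cache = (1.0 - rho) * g0 ** 2
w_manual = w0 - lr * g0 / (np.sqrt(cache) + eps)

print("keras :", w.numpy())
print("manual:", w_manual)
```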