GPU much slower than CPU for sequential networks on M1 Pro

I am trying to run the notebook "Text classification with an RNN" from the TensorFlow website.
The code has LSTM and Bidirectional layers.

Although I get the message “Plugin optimizer for device_type GPU is enabled”, the GPU is not used.

When the GPU is “enabled” (although not actually used), training takes 56 minutes/epoch.
When I use only the CPU, it takes 264 seconds/epoch.
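
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two runs can be set up. It assumes TensorFlow 2.x; the helper names `slowdown` and `select_device` are my own and are not part of the notebook. Hiding the GPU has to happen before any model is built or any op runs.

```python
def slowdown(gpu_seconds_per_epoch, cpu_seconds_per_epoch):
    """Return how many times slower the GPU run is than the CPU run."""
    return gpu_seconds_per_epoch / cpu_seconds_per_epoch


def select_device(use_gpu):
    """Hypothetical helper: hide the GPU when use_gpu is False.

    Call this before building any model or running any op.
    """
    import tensorflow as tf  # imported lazily so slowdown() works without it

    # Log the device each op is placed on, to check whether the GPU is really used.
    tf.debugging.set_log_device_placement(True)
    if not use_gpu:
        # Make no GPU visible to the runtime, leaving only the CPU.
        tf.config.set_visible_devices([], "GPU")
    return tf.config.get_visible_devices()


# Numbers from this post: 56 minutes/epoch with the GPU, 264 seconds/epoch on CPU.
print(f"GPU run is {slowdown(56 * 60, 264):.1f}x slower than the CPU run")
```

Calling `select_device(False)` at the top of the notebook gives the CPU-only timing, and `select_device(True)` keeps the GPU visible.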

I face the same issue when trying to run the "Transformer model for language understanding" tutorial from the TensorFlow website.

I am using tensorflow-macos 2.8.0 with tensorflow-metal 0.5.0. The Python version is 3.8.13. I face the same problem with tensorflow-macos 2.9.2 too.

My device is a MacBook Pro 14 (10 CPU cores, 16 GPU cores).

When I am using CNNs, the GPU is fully used and training is 3-4 times faster than when only using the CPU.

I am afraid there is a problem with tensorflow-metal.
Is anyone else facing the same issue?

In general, a model needs to be “big enough” to profit from GPU acceleration: training data has to be transferred to the GPU and the updated weights copied back from it, and for a small model this overhead can outweigh the faster computation, making training slower overall.
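
To make the trade-off concrete, here is a rough sketch of that reasoning as a toy cost model. All names and numbers are hypothetical and chosen only for illustration; they are not measurements from an M1 Pro or from any real model.

```python
def step_time(compute_seconds, speedup=1.0, overhead_seconds=0.0):
    """Time for one training step: accelerated compute plus fixed overhead."""
    return compute_seconds / speedup + overhead_seconds


def gpu_is_faster(cpu_compute_seconds, gpu_speedup, gpu_overhead_seconds):
    """True when the GPU step (with its overhead) beats the plain CPU step."""
    gpu = step_time(cpu_compute_seconds, gpu_speedup, gpu_overhead_seconds)
    cpu = step_time(cpu_compute_seconds)
    return gpu < cpu


# Hypothetical numbers: 10x faster compute on the GPU, 5 ms overhead per step.
small_model = gpu_is_faster(0.002, gpu_speedup=10.0, gpu_overhead_seconds=0.005)
large_model = gpu_is_faster(0.200, gpu_speedup=10.0, gpu_overhead_seconds=0.005)
print("small model faster on GPU:", small_model)
print("large model faster on GPU:", large_model)
```

With these made-up numbers the small model loses on the GPU because the fixed overhead dominates, while the large model wins because the compute savings dwarf it.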

Thanks for your comment. However, when I run the same code on an NVIDIA GPU, I do get acceleration.

I observe this issue only on the Apple M1 Pro GPU.