Accelerating TensorFlow using Apple M1 Max?

Part of this could be related to the TF Addons dependency availability for M1.

See https://github.com/tensorflow/addons/pull/2559#issuecomment-918738096



TensorFlow will continue to provide a plug-in solution for integration with 3rd party devices in the new runtime stack. There might be changes in the API, but we don’t expect this to affect end users (plug-in users).



I see. Thank you for clarifying!
Yes, they are not automatically updated. We rely on our partners to release new versions. And yes, TensorFlow-Metal is not open source.


Thanks for the clarifications, @penporn !

This thread seems relevant to my question below, but I can open a new thread if it is warranted.

I posted this question over on the Apple TF-metal developer board:

https://developer.apple.com/forums/thread/693982

In a nutshell, the question is whether the C API for TensorFlow can be built with the TF-metal plugin, given that most instructions I see on the web focus on building the Python libraries for TensorFlow. The answer over there was probably not, because the TF-metal plugin has a dependency on the Python packaging location, and I was asked to bring the question up here.

We have a C++ application that was linked against TensorFlow 2.3; on macOS there has of course been no GPU acceleration available up to this point. The dylibs we packaged contain AVX2 and FMA instructions that are not supported by Rosetta 2, so for the moment the only way to run this application on M1 CPUs is to use an older dylib that does not contain these instructions. It works, but it seems we now have the opportunity (at least on Monterey) to get GPU acceleration somehow with the TF-metal plugin.

Of course we might need to tweak something, since I think TF-metal requires TF 2.5 or later because it is a plugin, but that is probably OK.

Thanks for any help on this.
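This doesn't answer the C API question, but a quick way to check whether the Metal plugin registered a GPU device in the Python build — a minimal sketch, assuming tensorflow-macos and tensorflow-metal are installed:

```python
import tensorflow as tf

# With tensorflow-metal installed, the Metal PluggableDevice should
# appear as a GPU alongside the CPU.
gpus = tf.config.list_physical_devices("GPU")
print("GPU devices:", gpus)

if gpus:
    # Run a small op explicitly on the plugged-in GPU to confirm it works.
    with tf.device("/GPU:0"):
        x = tf.random.normal((4, 4))
        y = tf.linalg.matmul(x, x)
    print("matmul ran on:", y.device)
```

On a machine without the plugin, the GPU list is simply empty and the op runs on the CPU.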

If all GPU cores appear as a single device, does it mean I cannot use distributed training with TensorFlow? Please help clarify this.

It can still use them all. For example, if your Mac has 32 GPU cores, TensorFlow uses all of them for training through that single device.
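To illustrate: because all the cores are exposed as one logical GPU, a plain Keras fit() call already uses them, with no tf.distribute strategy required. A minimal sketch, with a toy model and random data purely for illustration:

```python
import tensorflow as tf

# All GPU cores show up as one logical device, so a plain fit() call
# already uses them; there is nothing for tf.distribute to split across.
print(tf.config.list_logical_devices("GPU"))  # at most one GPU entry on M1

# Toy model and random data, just for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
x = tf.random.normal((256, 32))
y = tf.random.normal((256, 1))
model.fit(x, y, epochs=1, batch_size=32, verbose=0)  # uses the GPU if present
```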

I have just installed TF 2.6 using Apple’s instructions with Miniforge, which includes tensorflow-metal as well. I ran some Keras examples and can confirm it detects and maxes out the GPU on my 16” MBP with M1 Max. I checked utilization via Activity Monitor.

Performance is absurd for a machine this size, and it doesn’t even generate much heat. The fans are barely audible.


I did a bunch of testing across Google Colab, Apple’s M1 Pro and M1 Max as well as a TITAN RTX GPU.

Turns out the M1 Max and M1 Pro are faster than Google Colab (the free version with K80s).

And though not as fast as a TITAN RTX, the M1 Max still puts in a pretty epic performance for a laptop (about 50% the speed).

I used tensorflow-macos and tensorflow-metal across all Macs and found them to work fantastically together.

Here was one of the experiments using transfer learning with an EfficientNetB0 from tf.keras.applications and the Food101 dataset (~100,000 images) from TensorFlow Datasets, batch size 32.

You can see the code and rest of the results on GitHub/YouTube.


That’s a very interesting benchmark! Thanks for sharing!

That’s very interesting, thank you @mrdbourke :tada:

Thanks for sharing!


If anyone is interested, I found some older comparisons, though they aren’t about the Max or Pro models:

https://towardsdatascience.com/benchmark-m1-vs-xeon-vs-core-i5-vs-k80-and-t4-e3802f27421c

https://towardsdatascience.com/benchmark-m1-part-2-vs-20-cores-xeon-vs-amd-epyc-16-and-32-cores-8e394d56003d

Hi! I followed Apple’s instructions on installing the tensorflow-metal PluggableDevice on an M1 Mac mini. I did some comparisons, and the vanilla CPU version of TensorFlow is about 7 times faster than the new tensorflow-metal on, for example, MNIST. Anybody else with this problem?

Hello, just found this forum post and wanted to share an article I wrote some days ago:

This comparison is not exhaustive, but a desktop workstation card is significantly faster than the laptop Apple processors. They are indeed pretty capable, but not comparable to an actual 200+ watt GPU.
Also, with the bigger cards, you can push the batch size much higher.

Has anyone tried comparing the M1 Pro with 16 GPU cores to Google Colab Pro’s T4 GPU? How does it compare? Do you think it will get much better in later updates because of optimizations? Thank you.

Could anyone who knows about the PluggableDevice implementation let me know what, if any, are the implications of the following message on my M1 MBA:

>>> tf.config.list_logical_devices()
Metal device set to: Apple M1

systemMemory: 8.00 GB
maxCacheSize: 2.67 GB

2022-01-09 15:57:38.705551: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-01-09 15:57:38.706074: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
[LogicalDevice(name='/device:CPU:0', device_type='CPU'), LogicalDevice(name='/device:GPU:0', device_type='GPU')]

Can you try pip install asitop and sudo asitop to see the memory bandwidth while you are training?
When fine-tuning a BERT model, my M1 Max uses around 115 GB/s of bandwidth. I wonder what the number is on the M1, so we can find out whether it is the bandwidth or the GPU that limits the performance.

Tagging @kulin_seth for this question.
Just to confirm if I understand the question correctly: You mean running MNIST with tensorflow-macos is 7x faster than tensorflow-macos + tensorflow-metal on your M1 Mac mini, right?
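For anyone wanting to reproduce such a CPU-vs-GPU comparison, a minimal timing sketch (hypothetical model and data sizes, not the original script) could look like this:

```python
import time
import tensorflow as tf

def train_once(device, n=6000):
    """Time one epoch of a small MNIST-sized model on the given device."""
    with tf.device(device):
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
        # Random stand-in data with MNIST's shape (subset for brevity).
        x = tf.random.normal((n, 28, 28))
        y = tf.random.uniform((n,), maxval=10, dtype=tf.int32)
        start = time.perf_counter()
        model.fit(x, y, epochs=1, batch_size=32, verbose=0)
        return time.perf_counter() - start

print("CPU:", train_once("/CPU:0"))
if tf.config.list_physical_devices("GPU"):  # only if tensorflow-metal is active
    print("GPU:", train_once("/GPU:0"))
```

With a model this small, per-step kernel dispatch overhead can dominate, which is one plausible reason the GPU path comes out slower on tiny workloads.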

The NUMA node message is just saying that TensorFlow cannot decide which NUMA node (e.g., CPU socket) the GPU PluggableDevice is associated with. It could be because the M1 processor only has one socket or reports NUMA affinity in a different way from what TensorFlow expects (or because TensorFlow isn’t built for NUMA support). This is just for further optimization purposes. It doesn’t mean that there was any error preventing the creation or usage of the GPU PluggableDevice. You can safely ignore the message.

Edited to add: I’ll try to make the NUMA message clearer. Sorry for the confusion!

The second part with “0 MB memory” could be because the memory is less than 1 MB. (The error message only prints an integer, so anything less than 1MB gets rounded down). Or it could be that TensorFlow is reading the memory size wrong on M1. @kulin_seth, what are the usual memory sizes you saw?

FYI

An update on M1 guidance! Should be on https://www.tensorflow.org/install shortly


Yeah, I see the same issue. A couple of times the training even got stuck mid-epoch and froze.
I am using tensorflow-macos 2.8.0, tensorflow-metal 0.4.0, both latest available versions from pypi, and a Miniforge environment w/ Python 3.9.12.
Example with GPU:

$ time python tfdocs/mnist-fashion.py 
training:  (60000, 28, 28)  images translate to  60000  labels

training:  (10000, 28, 28)  images translate to  10000  labels

Metal device set to: Apple M1 Max

systemMemory: 64.00 GB
maxCacheSize: 24.00 GB

2022-05-17 16:37:15.827223: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-05-17 16:37:15.827348: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-05-17 16:37:15.974899: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/50
2022-05-17 16:37:16.096858: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
1875/1875 [==============================] - 8s 4ms/step - loss: 0.5018 - accuracy: 0.8246
...
real	6m19,561s
user	4m49,345s
sys	4m8,490s

In another environment w/o GPU support, the same model is trained 7x faster:

$ time python tfdocs/mnist-fashion.py 
training:  (60000, 28, 28)  images translate to  60000  labels

training:  (10000, 28, 28)  images translate to  10000  labels

2022-05-16 19:02:31.719596: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/50
1875/1875 [==============================] - 1s 550us/step - loss: 0.5003 - accuracy: 0.8250
...
real	0m54,642s
user	1m27,664s
sys	0m33,717s
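For completeness: instead of keeping a separate environment, the Metal GPU can also be hidden within the same environment so small models like this fall back to the CPU. A minimal sketch:

```python
import tensorflow as tf

# Hide the Metal GPU before TensorFlow initializes its devices, so this
# same environment falls back to the CPU for small models.
# (Must run before any op creates the device context.)
tf.config.set_visible_devices([], "GPU")

print(tf.config.list_logical_devices())  # only CPU devices remain
```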