Kudos to PluggableDevices team; question about AMD GPU

Hi all, I just wanted to show my appreciation for the PluggableDevices implementation. I think it was a good compromise between expanding availability of GPU acceleration and not touching the core CUDA kernels, which would probably need a complete re-development.

In particular, I have been using the macOS/Metal implementation and liking it very much. One question I have, to take this one step further: what guidelines on memory usage could the experts share? For example, in my setting I have an 8 GB AMD Radeon Pro 5500. When I’m setting the buffer size for training my TF models, is there a rule of thumb or any other rough guideline on how I could get the most bang for the buck (in other words, the most GPU acceleration relative to the CPU overhead of sending data to and fetching it from the GPU)?

Many thanks,


Hi Doug,

Thank you very much for the kind words! We are excited to hear that you are liking it!

@kulin_seth and his team develop the Metal plug-in. (They have worked really hard for the release – Kudos!) He can help answer your question. :slight_smile:


P.S. We have changed the tag of this thread from tfdata to pluggable_device just so it’s easier to look up all PluggableDevice-related posts.


Many thanks, @penporn! I’ll be looking forward to input from @kulin_seth and his team.
Happy that you corrected the tag, too.


1 Like

Hi Doug,

Could you please clarify what you mean by setting buffer size? (Maybe it’s just me who doesn’t know.)
Do you mean to ask what size/shape tensors you can create? Or whether there are any alignment restrictions? Or something else?

If it is one of my guesses: you should be able to create tensors of any size/shape. Individual tensors are backed by MTLBuffers, so the only restrictions that apply are the size limits on MTLBuffers (which I believe allow >= 1 GB, usually plenty).
And there are no alignment/shape restrictions, with MPS supporting up to 16 dimensions.
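The check described above can be sketched in a few lines. This is a hypothetical illustration, not plug-in code: the ~1 GB per-buffer limit and 16-dimension cap are taken from this post, the names `fits_in_buffer`, `MAX_BUFFER_BYTES`, and `MAX_DIMS` are my own.

```python
# Hypothetical sketch: does a dense tensor fit in one backing buffer,
# given an assumed ~1 GB per-MTLBuffer limit and MPS's 16-dimension cap?
from math import prod

MAX_BUFFER_BYTES = 1 << 30  # assumed ~1 GB per-buffer limit
MAX_DIMS = 16               # MPS supports up to 16 dimensions

def fits_in_buffer(shape, dtype_bytes=4):
    """Return True if a dense tensor of `shape` fits in one buffer."""
    if len(shape) > MAX_DIMS:
        return False
    return prod(shape) * dtype_bytes <= MAX_BUFFER_BYTES

print(fits_in_buffer((1024, 1024, 64)))   # 256 MiB of float32 -> True
print(fits_in_buffer((1024, 1024, 512)))  # 2 GiB of float32 -> False
```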

I have also informed kulin and others in our team so they will chime in soon.



Hi Dhruv,
Many thanks for following up on this. Only on reading your question did I notice my typo: it wasn’t supposed to be “buffer size” but rather “batch size”. Apologies for the confusion, everyone!

Ultimately, what I am looking to know is how I can optimise the amount of data flowing into my GPU so as to maximise GPU usage while not overusing the CPU to transmit the data. What I have noted is that, for the same batch size on the same dataset, normal TensorFlow on CUDA and tensorflow-metal on my (AMD GPU) Mac lead to different GPU usages. This is expected, of course. But what I wanted to know is whether there is any guideline or rule of thumb that can help us users set a batch size that utilises resources more efficiently.

Please let me know if I can further clarify the question. And thanks again for the follow up, this is much appreciated!
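To make the question concrete, here is a hypothetical way I could probe this empirically: sweep candidate batch sizes and keep the one with the highest measured throughput. `dummy_train_step`, `best_batch_size`, and the candidate list are all illustrative stand-ins; in practice the step function would be a real tensorflow-metal training step.

```python
# Hypothetical harness: empirically pick the batch size that maximises
# training throughput (samples/sec). The dummy step simulates per-batch
# cost; replace it with a real model's training step.
import time

def dummy_train_step(batch_size):
    # Stand-in for a real train step; cost grows with batch size.
    time.sleep(batch_size * 1e-5)

def best_batch_size(train_step, candidates, steps=5):
    """Return (batch_size, samples_per_sec) with the highest throughput."""
    results = {}
    for bs in candidates:
        start = time.perf_counter()
        for _ in range(steps):
            train_step(bs)
        elapsed = time.perf_counter() - start
        results[bs] = bs * steps / elapsed
    best = max(results, key=results.get)
    return best, results[best]

bs, throughput = best_batch_size(dummy_train_step, [32, 64, 128, 256])
print(f"best batch size: {bs} ({throughput:.0f} samples/sec)")
```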

Hello, your post would have been relevant ten or twelve years ago, but:
  1. Apple counts for nothing in the machine-learning market.
  2. Apple no longer uses AMD architecture; everything has become proprietary.
  3. If you want to complain about CUDA, subscribe to the Nvidia forum and report your problems there.

Sorry, Remy, but I think you misunderstood my post, or I expressed myself completely wrong. Kindly let me clarify below.

  1. I am not saying who counts or does not count on ML. My notebook is a Mac and this is what I prefer to code on.
  2. AMD architecture is the one I have on my notebook. I am well aware of the move to M1.
  3. Where did I complain about CUDA? Frankly I think CUDA is an engineering marvel and I use it a lot in cloud instances.

Following up on my earlier post, just a clarification: I know batch size depends on the size of the underlying dataset. I just wondered whether the Metal experts had any views (or experience/intuition/rules of thumb) on how and whether to adjust the batch size of the data coming into a model, to make the most of parallelisation on the GPU compared to, say, a CUDA benchmark of similar compute power.
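For what it’s worth, the only rule of thumb I have seen stated generically (not Metal-specific) is a memory-budget upper bound: the batch size times the per-sample footprint, with headroom for weights and activations, should fit comfortably in VRAM. A hypothetical sketch, where every constant (the 8 GB card, the 50% budget, the 4x activation multiplier) is an illustrative assumption:

```python
# Hypothetical rule-of-thumb: largest power-of-two batch size whose memory
# footprint fits an assumed fraction of GPU memory. All constants are
# illustrative assumptions, not measured values.
GPU_BYTES = 8 * 1024**3        # e.g. an 8 GB Radeon Pro 5500
BUDGET_FRACTION = 0.5          # assume half the VRAM is free for batches

def max_batch_size(sample_bytes, activation_multiplier=4):
    """Largest power-of-two batch whose footprint fits the budget."""
    budget = GPU_BYTES * BUDGET_FRACTION
    per_sample = sample_bytes * activation_multiplier
    bs = 1
    while bs * 2 * per_sample <= budget:
        bs *= 2
    return bs

# 224x224 RGB float32 images: 224*224*3*4 bytes per sample
print(max_batch_size(224 * 224 * 3 * 4))  # -> 1024
```

In practice this only gives an upper bound; the throughput sweet spot would still have to be found by measurement.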

My apologies. So for AMD particularities, you have to check on their side with ROCm (why complicate things when you want them simple?).

Thanks. Actually, ROCm is only for Linux. Macs with AMD also use Metal. Again, I am not discussing the merits of each framework. I just came here to ask for practical insights on how to better use the GPU I am primarily using.

Have you tried the Xcode/Metal forums? I know, silly question. This is why I say that Cupertino is out of touch with our professions; it does not listen to the devs.

Thank you for the suggestion. My question is not about Xcode or Metal, it’s about TensorFlow with PluggableDevices running on a Metal backend. Beyond that, I would encourage you to post your views on what the best company or framework is on another topic, so that this exchange remains on topic. Thanks.

Have you already tried the tips at:


Thanks, Bhack! I wasn’t aware of this page, will definitely try those performance tips out! I appreciate the pointer.

Check also:

And If available on Metal


Very nice pointers, Bhack! Thanks very much!

@penporn As for ROCm profiler support, it is just landing now with:

I suppose the profiler is still not part of the PluggableDevice project, right?

@Bhack Many thanks for the performance links! :slight_smile:

No, it’s not. But there’s an ongoing RFC about this: Pluggable Profiler

1 Like

Now I know where the tfdata tag came from. Added it back to your post. Sorry! :slight_smile:


Is there a sample app example for PluggableDevices?

1 Like