Kudos to PluggableDevices team; question about AMD GPU

Doug · June 15, 2021, 12:46am

Hi all, I just wanted to show my appreciation for the PluggableDevices implementation. I think it was a good compromise between expanding availability of GPU acceleration and not touching the core CUDA kernels, which would probably need a complete re-development.

In particular, I have been using the macOS/Metal implementation and liking it very much. One question I have, to take this one step further: what guidelines on memory usage could the experts share? For example, in my setting I have an 8GB AMD Radeon Pro 5500. When I’m setting buffer size for training my TF models, is there a rule of thumb or any other rough guideline on how I could get the most bang for the buck (in other words, the most GPU acceleration for the CPU workload that sending / fetching data to and from the GPU entails)?

Many thanks,
Doug

penporn · June 15, 2021, 11:04pm

Hi Doug,

Thank you very much for the kind words! We are excited to hear that you are liking it!

@kulin_seth and his team develop the Metal plug-in. (They have worked really hard for the release – Kudos!) He can help answer your question.

Best,
Penporn

P.S. We have changed the tag of this thread from tfdata to pluggable_device just so it’s easier to look up all PluggableDevice-related posts.

Doug · June 16, 2021, 6:32am

Many thanks @penporn! I’ll be looking forward to input from @kulin_seth and his team.
Happy that you corrected the tag also.

Best,
Doug

Dhruv_Saksena · June 19, 2021, 3:31pm

Hi Doug,

Could you please clarify what you mean by setting buffer size? (Maybe it’s just me who doesn’t know.)
Do you mean to ask what size/shape of tensors you can create? Or whether there are any alignment restrictions? Or something else?

If it is one of my guesses: you should be able to create tensors of any size or shape. Individual tensors are backed by MTLBuffers, so the only restrictions that apply are the size restrictions on MTLBuffers (the maximum, I believe, is at least 1 GB, which should usually be plenty).
And there are no alignment/shape restrictions, with MPS supporting up to 16 dimensions.
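
As a rough sanity check, the footprint of a single dense tensor is its element count times the element size. The sketch below is illustrative only: the 1 GiB limit is just the figure I recalled above (the real limit depends on the device), and `tensor_bytes` / `fits_in_buffer` are hypothetical helpers, not part of TensorFlow or Metal.

```python
import math

# Assumed values for illustration only: the per-buffer limit is the ~1 GB
# figure mentioned above; the real limit is device-dependent.
ASSUMED_MAX_BUFFER_BYTES = 1 << 30
MAX_DIMS = 16  # MPS dimension limit mentioned above

DTYPE_BYTES = {"float16": 2, "float32": 4, "float64": 8, "int32": 4}


def tensor_bytes(shape, dtype="float32"):
    """Return the size in bytes of a dense tensor with the given shape."""
    return math.prod(shape) * DTYPE_BYTES[dtype]


def fits_in_buffer(shape, dtype="float32", limit=ASSUMED_MAX_BUFFER_BYTES):
    """Hypothetical helper: True if one tensor fits within the assumed limits."""
    return len(shape) <= MAX_DIMS and tensor_bytes(shape, dtype) <= limit


# A batch of 256 RGB images at 224x224 in float32:
batch_shape = (256, 224, 224, 3)
print(tensor_bytes(batch_shape) / 2**20, "MiB")
print(fits_in_buffer(batch_shape))
```

This only covers one tensor at a time; weights, activations, and gradients all take memory during training, so it is not an estimate of total usage.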

I have also informed kulin and others in our team so they will chime in soon.

thanks

Doug · June 21, 2021, 10:45pm

Hi Dhruv,
Many thanks for following up on this. Only on reading your question did I notice my typo: it wasn’t supposed to be “buffer size” but rather “batch size”. Apologies for the confusion, everyone!

Ultimately, what I am looking to know is how I can tune the amount of data flowing into my GPU so as to maximise GPU usage without overusing the CPU to transmit the data. What I have noted is that, for the same batch size on the same dataset, regular TensorFlow on CUDA and tensorflow-metal on my (AMD GPU) Mac lead to different GPU usage. This is expected, of course. But what I wanted to know is whether there is any guideline or rule of thumb that can help us users set a batch size that utilises resources more efficiently.

Please let me know if I can further clarify the question. And thanks again for the follow up, this is much appreciated!

Remy_Wehrung · June 21, 2021, 11:15pm

Hello, your post would have been relevant ten / twelve years ago, but:
1 / Apple counts for nothing in the machine learning market
2 / Apple no longer uses AMD architecture, everything has become proprietary
3 / if you want to complain about CUDA: subscribe to the Nvidia forum and report your problems

Doug · June 21, 2021, 11:24pm

Sorry, Remy, but I think you misunderstood my post, or I expressed myself completely wrong. Kindly let me clarify below.

I am not saying who counts or does not count in ML. My notebook is a Mac and this is what I prefer to code on.
AMD architecture is the one I have on my notebook. I am well aware of the move to M1.
Where did I complain about CUDA? Frankly I think CUDA is an engineering marvel and I use it a lot in cloud instances.

Doug · June 21, 2021, 11:27pm

Dhruv,
Following up on my earlier post, just a clarification: I know batch size depends on the size of the underlying dataset. I just wondered if the Metal experts had any views (or experience/intuition/rule of thumb) on whether and how to adjust the batch size of the data coming into a model to make the most of parallelisation on the GPU compared to, say, a CUDA benchmark of similar compute power.
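
To make the question concrete, absent a rule of thumb the fallback would be to measure throughput for a few candidate batch sizes. Here is a minimal sketch of that idea, not a recommended method: `measure_throughput`, `sweep_batch_sizes`, and `fake_step` are hypothetical names, and `fake_step` only simulates a training step with a fixed overhead plus a per-sample cost. With a real model, the step function would wrap something like Keras `model.train_on_batch` on a batch of the given size.

```python
import time


def measure_throughput(step_fn, batch_size, n_steps=20, warmup=3):
    """Return samples/second for `step_fn(batch_size)`, excluding warm-up steps."""
    for _ in range(warmup):
        step_fn(batch_size)
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * n_steps / elapsed


def sweep_batch_sizes(step_fn, batch_sizes, **kwargs):
    """Measure each candidate batch size; return (best_size, results dict)."""
    results = {bs: measure_throughput(step_fn, bs, **kwargs) for bs in batch_sizes}
    best = max(results, key=results.get)
    return best, results


def fake_step(batch_size):
    """Stand-in for a real training step (assumption, for illustration only).

    Simulates a fixed per-step overhead plus a small per-sample cost.
    """
    time.sleep(0.002 + 0.00001 * batch_size)


best, results = sweep_batch_sizes(fake_step, [16, 32, 64, 128], n_steps=5, warmup=1)
for bs, rate in results.items():
    print(f"batch_size={bs:4d}  {rate:10.1f} samples/s")
print("best:", best)
```

Throughput alone does not say whether a larger batch still trains as well, so this would only narrow down the candidates.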

Remy_Wehrung · June 21, 2021, 11:38pm

My apologies. So for AMD specifics, you would have to look on their side, at ROCm (why make things complicated when you want something simple).

Doug · June 21, 2021, 11:42pm

Thanks. Actually, ROCm is only for Linux. Macs with AMD also use Metal. Again, I am not discussing the merits of each framework. I just came here to ask for practical insights on how to better use the GPU I am primarily using.

Remy_Wehrung · June 21, 2021, 11:54pm

Have you tried the Xcode / Metal forums? I know, silly question. This is why I say that Cupertino is out of touch with our profession: it does not listen to the devs.

Doug · June 22, 2021, 12:00am

Thank you for the suggestion. My question is not about Xcode or Metal, it’s about TensorFlow with PluggableDevices running on a Metal backend. Beyond that, I would encourage you to post your views on what the best company or framework is on another topic, so that this exchange remains on topic. Thanks.

Bhack · June 22, 2021, 12:01am

Have you already tried the tips at:

Doug · June 22, 2021, 12:03am

Thanks, Bhack! I wasn’t aware of this page, will definitely try those performance tips out! I appreciate the pointer.

Bhack · June 22, 2021, 12:06am

Check also:

And, if available on Metal

Doug · June 22, 2021, 12:07am

Very nice pointers, Bhack! Thanks very much!

Bhack · June 22, 2021, 1:48am

@penporn As ROCm profiler support is just landing now with:

I suppose that the profiler is still not part of the pluggable device project, right?

penporn · June 22, 2021, 3:44am

@Bhack Many thanks for the performance links!

No, it’s not. But there’s an ongoing RFC about this: Pluggable Profiler

penporn · June 22, 2021, 3:46am

Now I know where the tag tfdata came from. Added it back to your post. Sorry!

Subin · June 23, 2021, 3:22pm

Is there a sample app for a pluggable device?