Computational cost of tensor operations

TensorFlow is a great library for tensor operations, covering most of the functions needed for effective tensor manipulation. The code is extremely well optimized, and it automatically scales to take advantage of the available hardware.

For these reasons it’s an appealing choice for writing parallel programs, but (as a software engineer) I would need more information on the computational cost of each operation in order to use it effectively in my programs.

Let me give an example. I’m currently enrolled in a master’s degree in artificial intelligence. A few weeks ago we were given a reinforcement learning problem where we had to write an agent able to play a certain board game. To get all the data for training our agent we had to play a lot of games, but the simulation engine provided was too slow. For this reason we decided to completely rewrite the simulation, parallelizing it in TensorFlow, and we were able to run hundreds of thousands of simulations in parallel in approximately the same time. Writing all the optimized code was not easy, but it could be done.
Doing this project we realized that some of the operations were much slower than others. For example, all the segment operations (segment_sum, segment_max, …) were much slower than creating a ragged tensor and applying a reduce_sum on the ragged dimension, while still doing approximately the same thing.
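
For concreteness, here is a minimal sketch (toy data, my own illustration) of the two equivalent formulations we compared:

```python
import tensorflow as tf

data = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0])
segment_ids = tf.constant([0, 0, 1, 1, 1])  # sorted, as segment_sum requires

# Formulation 1: the dedicated segment operation
seg = tf.math.segment_sum(data, segment_ids)             # [3.0, 12.0]

# Formulation 2: build a ragged tensor, then reduce the ragged dimension
ragged = tf.RaggedTensor.from_value_rowids(data, segment_ids)
rag = tf.reduce_sum(ragged, axis=1)                      # [3.0, 12.0]
```

Both compute the same per-segment sums, yet in our experiments the performance differed noticeably, and nothing in the documentation tells you which one to prefer.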

In another case, trying to implement a sparsely connected neural network, we were not able to figure out whether it was faster to compute the product using a sparse tensor, or to use gather to extract the slices we needed and apply the multiplication manually on the relevant neurons, or something else entirely.
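
Here is a small sketch of the two formulations (the shapes, connectivity, and weights are made up for illustration):

```python
import tensorflow as tf

x = tf.random.normal([8])  # input activations for one example

# Option A: encode the connectivity as a sparse weight matrix
indices = tf.constant([[0, 1], [0, 5], [1, 2], [1, 7]], dtype=tf.int64)
values = tf.constant([0.5, -1.0, 2.0, 0.3])
w_sparse = tf.sparse.SparseTensor(indices, values, dense_shape=[2, 8])
y_sparse = tf.sparse.sparse_dense_matmul(w_sparse, tf.reshape(x, [8, 1]))

# Option B: gather only the inputs each output neuron is connected to
conn = tf.constant([[1, 5], [2, 7]])             # input index per connection
w_dense = tf.constant([[0.5, -1.0], [2.0, 0.3]])
y_gather = tf.reduce_sum(tf.gather(x, conn) * w_dense, axis=1)
```

Both produce the same two output activations; without cost documentation, choosing between them comes down to benchmarking.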

In my opinion, complete documentation for a library, especially one that offers such low-level operations, should not omit the computational cost of the basic operations.

I’m aware that running the same code on different architectures will result in completely different behaviours (running on a CPU is different from running on a GPU or TPU), but I still think a complete description of the behaviour should be available for advanced users. The two main points of interest are, of course, the time complexity and the space complexity of each basic TensorFlow operation (because if the GPU crashes with an out-of-memory error, fast code is of no use), highlighting the differences among the architectures.

Maybe such a resource is already available somewhere, but in all my searches I wasn’t able to find it. If it doesn’t exist, I think it would be a very useful addition, even if only for a few specialized developers and researchers.

I don’t think we have this documented, but you can always check FLOPS and memory details at runtime with:

https://tensorflow-prod.ospodiscourse.com/t/how-to-find-out-keras-model-memory-size/5249/3?u=bhack
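
For example, a minimal way to capture a runtime trace with the TensorFlow profiler (viewable in TensorBoard’s Profile tab) looks like this:

```python
import tensorflow as tf

# Everything that runs between start() and stop() is traced;
# TensorBoard then shows per-op time and memory details.
tf.profiler.experimental.start('logdir')
x = tf.random.normal([1024, 1024])
y = tf.matmul(x, x)  # the ops you want to measure
tf.profiler.experimental.stop()
```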


Thanks for the reply. I didn’t know about the profiler, and it’s definitely a very useful tool for understanding the performance of the code.


Still, I think that trying to understand the behaviour of a program only from performance measured at runtime on one machine is not ideal: it gives no information on how the program would scale when the tensors or the machine change.

I just think that having some theoretical knowledge of the behaviour of these basic functions would be very useful for writing efficient algorithms, without having to start coding to get feedback.

If no such documentation exists yet, I would dare to say that it could even be helpful to have pseudocode for these basic operations. Even if it would still require quite a lot of technical knowledge to understand the bottlenecks on different architectures (estimating the constants of the time complexity), an accurate analysis would at least make it possible to derive some theoretical information on the asymptotic behaviour of these functions.
(By the way, I have no idea whether these operations are based on open-source code or on licensed code.)
Even just saying ‘on the CPU this function follows its BLAS/LAPACK implementation’, or information of that sort, could be useful.
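
To illustrate the kind of pseudocode I mean, here is a reference-semantics sketch of segment_sum (my own illustration, not TensorFlow’s actual kernel), from which one can read off O(n) time and O(k) extra space for n elements and k segments on a single CPU thread:

```python
def segment_sum(data, segment_ids, num_segments):
    # Reference semantics only: one pass over the data,
    # O(n) time, O(num_segments) extra space.
    out = [0.0] * num_segments
    for value, seg in zip(data, segment_ids):
        out[seg] += value
    return out

# segment_sum([1.0, 2.0, 3.0, 4.0, 5.0], [0, 0, 1, 1, 1], 2) -> [3.0, 12.0]
```

Even this level of detail would tell a user more about scaling behaviour than the current documentation does.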