Deformable convolution and other custom ops

Recently we had a refresh of the deformable convolution WIP PR in Addons.

I’ve cherry-picked this as an example because it would require us to maintain almost 3k lines of new code in the repository.

This maintainership overhead is quite similar to what we have with other custom-kernel PRs.

As Addons is one of the few Ecosystem repositories that supports custom (C++) ops and the related CI infrastructure, it is quite normal that we receive this kind of PR.

But since the code ownership of these components is generally not very stable over time, we would like to avoid merging these custom-op PRs where possible, also to achieve broader hardware coverage.

What are the alternatives? How could we collaborate when a compositional implementation has huge performance gaps?
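
To make the gap concrete, here is a minimal, purely compositional sketch (plain TF Python, written only for illustration and not taken from the PR) of the bilinear sampling that sits at the core of deformable convolution. Expressing this with gathers and broadcasting is exactly what a fused CUDA kernel avoids, which is roughly where the performance gap comes from:

```python
import tensorflow as tf

def bilinear_sample(feature_map, y, x):
    """Sample feature_map [H, W, C] at fractional coordinates y, x (both [N]).

    Illustrative only: in a compositional deformable convolution every kernel
    tap is shifted by a learned offset and read back like this, via four
    gather_nd calls plus interpolation weights, instead of one fused kernel.
    """
    h = tf.cast(tf.shape(feature_map)[0], y.dtype)
    w = tf.cast(tf.shape(feature_map)[1], x.dtype)
    y = tf.clip_by_value(y, 0.0, h - 1.0)
    x = tf.clip_by_value(x, 0.0, w - 1.0)

    y0 = tf.floor(y)
    x0 = tf.floor(x)
    y1 = tf.minimum(y0 + 1.0, h - 1.0)
    x1 = tf.minimum(x0 + 1.0, w - 1.0)

    def gather(yy, xx):
        idx = tf.stack([tf.cast(yy, tf.int32), tf.cast(xx, tf.int32)], axis=1)
        return tf.gather_nd(feature_map, idx)  # [N, C]

    # Bilinear interpolation weights for the four neighbouring pixels.
    wy1 = y - y0
    wy0 = 1.0 - wy1
    wx1 = x - x0
    wx0 = 1.0 - wx1
    return (gather(y0, x0) * (wy0 * wx0)[:, None] +
            gather(y0, x1) * (wy0 * wx1)[:, None] +
            gather(y1, x0) * (wy1 * wx0)[:, None] +
            gather(y1, x1) * (wy1 * wx1)[:, None])
```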

Often these kinds of issues are shared across the “extend” ecosystem, e.g. for EmbeddingBag:

EmbeddingBag op and layer by Rocketknight1 · Pull Request #2352 · tensorflow/addons · GitHub (1k lines)

Thanks,
Stefano

@kristen Is the MLIR team registered on this Discourse instance, or are they only on the LLVM MLIR Discourse instance?

Because generally we don’t have TF-specific threads in the LLVM MLIR instance.

They have been invited here too

OK, I’ve cross-posted in the LLVM MLIR forum instance.

I hope that at least some TF-MLIR team members are subscribed to the relevant tags and subcategory.

/cc @Jacques_Pienaar let me know if you want to move this to another category or if you want to use only the XLA tag.

Hey Stefano,

Here is fine, thanks (it has all the necessary tags). I’m pinging a couple of folks who have been looking at interfacing/third-party backends, as I don’t think they’ve seen this yet.

Best,

Jacques

[I’ll speculate based on previous conversations while we wait]

One of the parts we have discussed is “keeping” multiple levels of abstraction around, enabling backends to hook/match at the appropriate level to enable the “mega” op while exposing the decomposed forms where there is no support. It is also true that the compositional representation has been too rigid and hasn’t composed as well (“just rewrite your computation as convolutions if you want performance” being, in effect, the indirect suggestion) and should be revised (which is happening, albeit slowly). These are great examples to highlight: a common problem is that folks find a case where the compositional form does poorly, special-case a transformation, and then move on; without such overarching examples it is easy to miss that the underlying problem isn’t being addressed.
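
To sketch the idea very roughly (my own illustration, not a committed design): tf.function’s experimental_implements attribute already lets a pure-TF, decomposed implementation advertise which “mega” op it computes, so a backend with a fused kernel can pattern-match and substitute it, while everything else simply runs the decomposition. The attribute string and the toy op below are hypothetical:

```python
import tensorflow as tf

# Hypothetical composite: the function body is the decomposed, portable form;
# the experimental_implements string tells a compiler/backend which fused
# "mega" op this graph is equivalent to, so it can be swapped in when supported.
@tf.function(experimental_implements="addons:EmbeddingBagSum")
def embedding_bag_sum(params, ids, weights):
    gathered = tf.gather(params, ids)                       # [batch, bag, dim]
    weighted = gathered * tf.expand_dims(weights, axis=-1)  # apply per-id weights
    return tf.reduce_sum(weighted, axis=1)                  # [batch, dim]
```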

IMHO this is exactly the point.
And I think it is why some specific reusable components (keras-nlp, keras-cv, tf-addons) that serve e2e models, including our selected models in the Model Garden, could be one of the drivers for understanding what we expect from the compiler stack.

Just take a look at our current threshold in TF Addons:
we require more than 50 citations for the paper behind a proposed feature, so it is not something totally brand new.

If we need a custom C++ op to reach good-enough performance for a new layer, but then the code owner disappears after one or two months, or people ask to use it in Colab / Google Cloud TPU, isn’t it better to engage with the compiler stack team directly on these use cases, to understand how to handle our end-to-end performance requests and to evaluate alternatives to maintaining a large custom op with only partial hardware coverage?

Just my 2¢

We could see the same in Keras, now that it is again a Python-only repo:

@Jacques_Pienaar Any news? I would like to keep this thread alive :wink:

/cc @yarri-oss @thea

Not yet (I have a meeting soon that is semi-relevant but higher level, and a couple next week where I could raise it again). There are a few efforts I’m aware of, but they are at various stages.

I do like driving these with specific components. I would also ideally have it be such that the compiler team need not be a bottleneck here as that also doesn’t scale. And I believe separable convolutions have been on your list for a long time :slight_smile:

Thank you, please help me keep this thread alive.

Just a keep-alive message for this thread.

Can we find someone on the TF or MLIR team who can give us some feedback, a roadmap, or just a rough outlook on this topic?

Thanks

@markdaoust Could you help us find someone on the TF side who could give us an overview, in this thread, of the custom ops roadmap with the new compiler infra and TF runtime?

Thanks

I’ll see if I can find someone.


Aside: For embedding-bag, the docs describe this as merging “embedding lookup” and “reduce”. But for the sum and mean combiners, isn’t it sufficient to implement this as a sparse tensor (the ids and weights) times a dense matrix (the embedding vectors)? Doesn’t that cover all the cases except combiner=max? I think it would be possible to implement an efficient combiner=max if the sparse_segment_* series were complete and included a sparse_segment_max.
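
Roughly, something like this sketch (purely illustrative; the helper name and shapes are mine, and the weights are assumed to be the same float dtype as the embedding table):

```python
import tensorflow as tf

def embedding_bag_sum(ids, weights, embedding_matrix):
    """combiner='sum' expressed as (sparse weights) @ (dense embedding table).

    ids:              int   [batch, bag]  indices into the embedding table
    weights:          float [batch, bag]  per-id weights
    embedding_matrix: float [vocab, dim]
    """
    ids = tf.cast(ids, tf.int64)
    batch = tf.cast(tf.shape(ids)[0], tf.int64)
    bag = tf.cast(tf.shape(ids)[1], tf.int64)
    vocab = tf.cast(tf.shape(embedding_matrix)[0], tf.int64)

    # Scatter the weights into a [batch, vocab] sparse matrix where
    # entry (b, ids[b, j]) holds weights[b, j].
    rows = tf.repeat(tf.range(batch), bag)
    cols = tf.reshape(ids, [-1])
    sparse_weights = tf.sparse.SparseTensor(
        indices=tf.stack([rows, cols], axis=1),
        values=tf.reshape(weights, [-1]),
        dense_shape=tf.stack([batch, vocab]))
    sparse_weights = tf.sparse.reorder(sparse_weights)  # canonical ordering

    # combiner='sum' is then one sparse-dense matmul; 'mean' would divide the
    # result by the per-row weight sums afterwards. combiner='max' is the case
    # that doesn't fit this formulation.
    return tf.sparse.sparse_dense_matmul(sparse_weights, embedding_matrix)
```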

Thanks,
yes, the topic is more generally about what the perspective is when the compositional path doesn’t perform well.

Do we need to interact more closely with the compiler team on the TF side before introducing a custom op (often it is hard to collect feedback)? I think new ops are interesting use cases to stress-test the compositional approach and the compiler stack transformations.

Will we have a new way to use the new compiler and runtime infra to write more portable, high-level custom ops?

If we are in a Python-only ecosystem repo, like keras*, where do we need to contribute these “missing pieces”?

P.S.
For the embedding bag case (Addons, PyTorch TPU, JAX) at some point we had a sparse proposal at:

But then the custom op was merged in Addons (+1,100 lines for CPU/CUDA).

I want to refresh this topic for the new year.

Can we put together a somewhat clearer vision on this topic?

This may be one where we could set up an impromptu virtual meeting to discuss. Some folks aren’t back yet, but let me see.

We have a new MLIR paper out:

Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction

It is still not clear how we are going to interface with these compiler technologies/infra when we need to write custom ops, without asking the average contributor to have compiler-developer skills.

I see that recently some Python DSLs have been emerging in the MLIR community:

Do you suppose that we are going to write TF custom ops in a Python DSL?