Deformable convolution and other custom ops

Recently we had a refresh of the deformable convolution WIP PR in Addons.

I’ve cherry-picked this as an example because it would require us to maintain almost 3k lines of new code in the repository.

This maintainership overhead is quite similar to what we have with other custom-kernel PRs.

As Addons is one of the few Ecosystem repositories that supports custom (C++) ops and the related CI infrastructure, it is quite normal that we receive this kind of PR.

But since the code ownership of these components is generally not very stable over time, we would like to avoid merging these custom-op PRs where possible, also to achieve broader hardware coverage.

What are the alternatives? How could we collaborate when a compositional implementation has huge performance gaps?
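
To make the gap concrete, here is a minimal, purely compositional sketch (plain TF Python, written only for illustration and not taken from the PR) of the bilinear sampling that sits at the core of deformable convolution. Expressing this with gathers and broadcasting is exactly what a fused CUDA kernel avoids, which is roughly where the performance gap comes from:

```python
import tensorflow as tf

def bilinear_sample(feature_map, y, x):
    """Sample feature_map [H, W, C] at fractional coordinates y, x (both [N]).

    Illustrative only: in a compositional deformable convolution every kernel
    tap is shifted by a learned offset and read back like this, via four
    gather_nd calls plus interpolation weights, instead of one fused kernel.
    """
    h = tf.cast(tf.shape(feature_map)[0], y.dtype)
    w = tf.cast(tf.shape(feature_map)[1], x.dtype)
    y = tf.clip_by_value(y, 0.0, h - 1.0)
    x = tf.clip_by_value(x, 0.0, w - 1.0)

    y0 = tf.floor(y)
    x0 = tf.floor(x)
    y1 = tf.minimum(y0 + 1.0, h - 1.0)
    x1 = tf.minimum(x0 + 1.0, w - 1.0)

    def gather(yy, xx):
        idx = tf.stack([tf.cast(yy, tf.int32), tf.cast(xx, tf.int32)], axis=1)
        return tf.gather_nd(feature_map, idx)  # [N, C]

    # Bilinear interpolation weights for the four neighbouring pixels.
    wy1 = y - y0
    wy0 = 1.0 - wy1
    wx1 = x - x0
    wx0 = 1.0 - wx1
    return (gather(y0, x0) * (wy0 * wx0)[:, None] +
            gather(y0, x1) * (wy0 * wx1)[:, None] +
            gather(y1, x0) * (wy1 * wx0)[:, None] +
            gather(y1, x1) * (wy1 * wx1)[:, None])
```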

Often these kinds of issues are shared across the “extend” ecosystem, e.g. for EmbeddingBag:

EmbeddingBag op and layer by Rocketknight1 · Pull Request #2352 · tensorflow/addons · GitHub (1k lines)

Thanks,
Stefano

@kristen Is the MLIR team registered on this Discourse instance, or are they only on the LLVM MLIR Discourse instance?

Because generally we don’t have TF-specific threads in the LLVM MLIR instance.

They have been invited here too

OK, I’ve cross-posted in the LLVM MLIR forum instance.

I hope that at least some TF-MLIR team members are subscribed to the relevant tags and subcategory.

/cc @Jacques_Pienaar let me know if you want to move this to another category or if you want to use only the XLA tag.

Hey Stefano,

Here is fine, thanks (it has all the necessary tags). I’m pinging a couple of folks who have been looking at interfacing/third-party backends, as I don’t think they’ve seen this yet.

Best,

Jacques

[I’ll speculate based on previous conversations while we wait]

One of the parts we have discussed is “keeping” multiple levels of abstraction around, enabling backends to hook/match at the appropriate level to enable the “mega” op while exposing the decomposed forms where there is no support. It is also true that the compositional representation has been too rigid and hasn’t composed as well (“just rewrite your computation as convolutions if you want performance” being, in effect, the indirect suggestion) and should be revised (which is happening, albeit slowly). These are great examples to highlight: a common problem is that folks find a case where the compositional form does poorly, special-case a transformation, and then move on; without such overarching examples it is easy to miss that the underlying problem isn’t being addressed.
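
To sketch the idea very roughly (my own illustration, not a committed design): tf.function’s experimental_implements attribute already lets a pure-TF, decomposed implementation advertise which “mega” op it computes, so a backend with a fused kernel can pattern-match and substitute it, while everything else simply runs the decomposition. The attribute string and the toy op below are hypothetical:

```python
import tensorflow as tf

# Hypothetical composite: the function body is the decomposed, portable form;
# the experimental_implements string tells a compiler/backend which fused
# "mega" op this graph is equivalent to, so it can be swapped in when supported.
@tf.function(experimental_implements="addons:EmbeddingBagSum")
def embedding_bag_sum(params, ids, weights):
    gathered = tf.gather(params, ids)                       # [batch, bag, dim]
    weighted = gathered * tf.expand_dims(weights, axis=-1)  # apply per-id weights
    return tf.reduce_sum(weighted, axis=1)                  # [batch, dim]
```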

IMHO this is exactly the point.
And I think it is why some specific reusable components (keras-nlp, keras-cv, tf-addons) that serve e2e models, including our selected models in the Model Garden, could be one of the drivers for understanding what we expect from the compiler stack.

Just take a look at our current threshold in TF Addons:
we require more than 50 citations for the paper behind a proposed feature, so it is not something totally brand new.

If we need a custom C++ op to reach good-enough performance for a new layer, but then the code owner disappears after one or two months, or people ask to use it in Colab / Google Cloud TPU, isn’t it better to engage with the compiler stack team directly on these use cases, to understand how to handle our end-to-end performance requests and to evaluate alternatives to maintaining a large custom op with only partial hardware coverage?

Just my 2¢

We could see the same in Keras, now that it is again a Python-only repo:

@Jacques_Pienaar Any news? I would like to keep this thread alive :wink:

/cc @yarri-oss @thea

Not yet (I have a meeting soon that is semi-relevant but higher level, and a couple next week where I could raise it again). There are a few efforts I’m aware of, but they are at various stages.

I do like driving these with specific components. I would also ideally have it be such that the compiler team need not be a bottleneck here as that also doesn’t scale. And I believe separable convolutions have been on your list for a long time :slight_smile:

Thank you, please help me keep this thread alive.

Just a keep-alive message for this thread.

Can we find someone on the TF or MLIR team who can give us some feedback, a roadmap, or just a rough outlook on this topic?

Thanks

@markdaoust Could you help us find someone on the TF side who could give us an overview, in this thread, of the custom ops roadmap with the new compiler infra and TF runtime?

Thanks

I’ll see if I can find someone.


Aside: For embedding-bag, the docs describe this as merging “embedding lookup” and “reduce”. But for the sum and mean combiners, isn’t it sufficient to implement this as a sparse tensor (the ids and weights) times a dense matrix (the embedding vectors)? Doesn’t that cover all the cases except combiner=max? I think it would be possible to implement an efficient combiner=max if the sparse_segment_* series were complete and included a sparse_segment_max.
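
Roughly, something like this sketch (purely illustrative; the helper name and shapes are mine, and the weights are assumed to be the same float dtype as the embedding table):

```python
import tensorflow as tf

def embedding_bag_sum(ids, weights, embedding_matrix):
    """combiner='sum' expressed as (sparse weights) @ (dense embedding table).

    ids:              int   [batch, bag]  indices into the embedding table
    weights:          float [batch, bag]  per-id weights
    embedding_matrix: float [vocab, dim]
    """
    ids = tf.cast(ids, tf.int64)
    batch = tf.cast(tf.shape(ids)[0], tf.int64)
    bag = tf.cast(tf.shape(ids)[1], tf.int64)
    vocab = tf.cast(tf.shape(embedding_matrix)[0], tf.int64)

    # Scatter the weights into a [batch, vocab] sparse matrix where
    # entry (b, ids[b, j]) holds weights[b, j].
    rows = tf.repeat(tf.range(batch), bag)
    cols = tf.reshape(ids, [-1])
    sparse_weights = tf.sparse.SparseTensor(
        indices=tf.stack([rows, cols], axis=1),
        values=tf.reshape(weights, [-1]),
        dense_shape=tf.stack([batch, vocab]))
    sparse_weights = tf.sparse.reorder(sparse_weights)  # canonical ordering

    # combiner='sum' is then one sparse-dense matmul; 'mean' would divide the
    # result by the per-row weight sums afterwards. combiner='max' is the case
    # that doesn't fit this formulation.
    return tf.sparse.sparse_dense_matmul(sparse_weights, embedding_matrix)
```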

Thanks,
yes, the topic is more generally about what the perspective is when the compositional path doesn’t perform well.

Do we need to interact more closely with the compiler team on the TF side before introducing a custom op (often it is hard to collect feedback)? I think new ops are interesting use cases to stress-test the compositional approach and the compiler stack transformations.

Will we have a new way to use the new compiler and runtime infra to write more portable, high-level custom ops?

If we are in a Python-only ecosystem repo, like keras*, where do we need to contribute these “missing pieces”?

P.S.
For the embedding bag case (Addons, PyTorch TPU, JAX) at some point we had a sparse proposal at:

But then the custom op was merged in Addons (+1,100 lines for CPU/CUDA).

I want to refresh this topic for the new year.

Can we put together a somewhat clearer vision on this topic?

This may be one where we could set up an impromptu virtual meeting to discuss. Some folks aren’t back yet, but let me see.

We have a new MLIR paper out:

Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction

It is still not clear how we are going to interface with these compiler technologies/infra when we need to write custom ops, without asking the average contributor to have compiler-developer skills.

I see that recently some Python DSLs have been emerging in the MLIR community:

Do you suppose that we are going to write TF custom ops in a Python DSL?