Block masking in TensorFlow

I’m trying to implement the masking generation function for BEiT:

The part I am struggling with is the assignment of EagerTensors.

I have consulted references that show how to approach such assignments, but this one does not seem to fit them.

Any particular approaches I should try out or look into for this case?

Is every single masked patch random inside a single image there?

P.S. I was looking at:


A single mask can be applied to a batch too.
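
For example, a minimal sketch (shapes and names are just illustrative, not from any particular codebase) of broadcasting one shared patch mask over a whole batch:

    import tensorflow as tf

    # Illustrative shapes: one (num_patches,) boolean mask shared by every image in the batch.
    batch, num_patches, dim = 8, 196, 768
    embeddings = tf.random.normal((batch, num_patches, dim))
    shared_mask = tf.random.uniform((num_patches,)) > 0.6   # True where a patch is masked
    mask_token = tf.zeros((dim,))                           # stand-in for a learned token

    # tf.where broadcasts the (1, num_patches, 1) condition against the whole batch.
    masked = tf.where(shared_mask[None, :, None], mask_token, embeddings)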

Thanks for sharing this. Will take a look.

@Bhack there’s actually no masking involved in the link you sent.

So, the question is pretty much still open.

Yes, as they are just reloading the Microsoft weights, so there is no training protocol there.

What is your specific issue? Isn’t it just the standard image tokenization used in many vision transformers, where some tokens are masked?

My issue is with the block-wise masking strategy, where apparently tensor assignment is needed (refer to my initial post). Had it been randomized, it would have been easier, and we implemented that a while back (here).
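
For comparison, a rough sketch of that simpler randomized variant (the sizes are hypothetical), where shuffled indices are enough and no assignment is needed:

    import tensorflow as tf

    # Hypothetical sizes: mask a fixed number of patches chosen uniformly at random.
    num_patches, num_masking_patches = 196, 75
    idx = tf.random.shuffle(tf.range(num_patches))[:num_masking_patches]
    # Build the 0/1 mask from the chosen indices without any in-place assignment.
    mask = tf.reduce_max(tf.one_hot(idx, num_patches, dtype=tf.int32), axis=0)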

To exactly mimic that implementation, are you looking for slice assignment?

Yes. Please take note of this part before sharing existing references:

I have consulted references that show how to approach such assignments, but this one does not seem to fit them.

If there’s no way other than doing something like this, then it’s a different choice.

Oh, in that case: historically we have plenty of slice assignment tickets. Just to mention a few that are still open:

I’ve not checked the paper in detail to see what kind of indices are selected to perform the masking. Couldn’t it be covered by tf.tensor_scatter_nd_update after populating those indices?
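
Something along these lines, a minimal sketch of setting one rectangular block in the mask with tf.tensor_scatter_nd_update once the indices have been populated (the block position and grid size are made up, and the overlap/retry logic of the reference implementation is not reproduced here):

    import tensorflow as tf

    height, width = 14, 14
    mask = tf.zeros((height, width), dtype=tf.int32)

    # Made-up block: an h x w rectangle starting at (top, left).
    top, left, h, w = 4, 5, 3, 6
    rows, cols = tf.meshgrid(tf.range(top, top + h), tf.range(left, left + w), indexing="ij")
    indices = tf.stack([tf.reshape(rows, [-1]), tf.reshape(cols, [-1])], axis=1)  # (h*w, 2)

    mask = tf.tensor_scatter_nd_update(mask, indices, tf.ones(h * w, dtype=tf.int32))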

The indexing conditions are in the source code I provided.

If you know a workaround with scatter, would you mind providing a minimal working example?

E.g. I think that the embedding in the Hugging Face transformers library, even though it uses PyTorch ops, does not require/use slice assignment:
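
Roughly the pattern I mean, sketched with TF ops (illustrative shapes, not the actual Hugging Face code):

    import tensorflow as tf

    batch, num_patches, dim = 2, 196, 768
    embeddings = tf.random.normal((batch, num_patches, dim))
    mask_token = tf.zeros((1, 1, dim))                                # stand-in for a learned token
    bool_masked_pos = tf.random.uniform((batch, num_patches)) > 0.6   # True where a patch is masked

    # Blend the mask token in wherever bool_masked_pos is True; no slice assignment involved.
    w = tf.cast(bool_masked_pos, embeddings.dtype)[..., None]         # (batch, num_patches, 1)
    embeddings = embeddings * (1.0 - w) + mask_token * w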

I think you’re mistaken then.

bool_masked_pos in forward() is nothing but the mask yielded by the class I showed in my initial post.

It is true that bool_masked_pos is only the “application” of the masking, but the responsibility for preparing the mask still lies with the external caller.

I don’t see all the details of the reference implementation in the paper, but with the concrete reference implementation you shared, with all those attempts, conditional loops, etc., you could try to use a tf.Variable to mimic it. You will probably need to refactor it further for graph mode/tf.function, though:
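
For instance, something like this (a rough sketch of the tf.Variable idea with made-up sizes, not the full graph-mode refactor):

    import tensorflow as tf

    height, width = 14, 14
    mask = tf.Variable(tf.zeros((height, width), dtype=tf.int32), trainable=False)

    # Variables support slice assignment, so a whole block can be set at once...
    top, left, h, w = 4, 5, 3, 6
    mask[top: top + h, left: left + w].assign(tf.ones((h, w), dtype=tf.int32))

    # ...or a single element, as in the per-patch loop of the reference implementation.
    mask[0, 0].assign(1)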

Absolutely. And in case no other reference implementations are available, I guess the implementation done by the actual authors comes to the rescue.

There isn’t much about it in the paper apart from the figure on block-wise masking, which is why the original implementation is an important reference point.

Thanks for sharing your implementation. Will check it out.

Having a tf.function/graph version is quite trivial with a few changes/substitutions using TF ops.

But a jit_compile=True version will require a new design and probably some compromises.

Let me know if you have a jit_compile=True version.

What’s trivial for you may not be trivial to someone else :slight_smile:

Let me know when you have the same thing I’ve posted but with TF ops instead of NumPy ops.
I will help you make the required changes for tf.function.

This is already working with tf.function with minimal changes:

    # `mask` is expected to be a 2D tf.Variable of shape (self.height, self.width)
    # so that the element-wise .assign below is valid; `random` and `math` are the
    # Python standard-library modules imported at module level.
    @tf.function
    def _mask(self, mask, max_mask_patches):
        delta = 0
        for attempt in tf.range(10):
            # Sample a candidate block size and aspect ratio, as in the reference implementation.
            target_area = random.uniform(self.min_num_patches, max_mask_patches)
            aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio))
            h = int(round(math.sqrt(target_area * aspect_ratio)))
            w = int(round(math.sqrt(target_area / aspect_ratio)))
            if w < self.width and h < self.height:
                top = random.randint(0, int(self.height - h))
                left = random.randint(0, int(self.width - w))
                num_masked = tf.math.count_nonzero(mask[top: top + h, left: left + w])
                # Accept the block only if it adds new patches without exceeding
                # the remaining budget (overlap check).
                if 0 < h * w - num_masked <= max_mask_patches:
                    for i in range(top, top + h):
                        for j in range(left, left + w):
                            if mask[i, j] == 0:
                                mask[i, j].assign(1)
                                delta += 1
                if delta > 0:
                    break
        return delta
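
And a possible way to drive it, mirroring the driver loop of the original MaskingGenerator (`masking_generator`, `num_masking_patches`, and the 14x14 grid below are just illustrative names/values, assuming an instance of the class that defines `_mask` above):

    import tensorflow as tf

    # Hypothetical setup: the patch grid lives in a tf.Variable so that .assign works.
    num_masking_patches = 75
    mask = tf.Variable(tf.zeros((14, 14), dtype=tf.int32), trainable=False)

    masked_so_far = 0
    while masked_so_far < num_masking_patches:
        delta = masking_generator._mask(mask, num_masking_patches - masked_so_far)
        if delta == 0:            # 10 attempts made no progress, stop early
            break
        masked_so_far += int(delta)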