Improving the state of NLP for TFLite (Mobile / Flutter)

Intro

Hi, a few months ago I was working on a prototype application that uses a custom BERT-extending model, built with Flutter, to summarize long-form text. Over the course of building that prototype I ran into a ton of issues and managed to find workarounds for a few of them (as described below).
This summer I was thinking about working on these issues for the community in general, and about whether they are big and viable enough to be a GSoC project; if not, I will still work on them as an extension so people can use it if they want. Obviously, as a GSoC project I would be able to learn more and gain experience too. This post is just a discussion starter with people in the community who are more experienced than I am. I also don't know if posting this in the forum is a good strategy, but I would rather have an open discussion about the issues.

Also, it would be nice if there were a GSoC tag in the tags section, if that's possible.

Note: While researching this, I found out that the current am15h/tflite_flutter_plugin package actually started out as a 2020 GSoC project, which was cool.

Problems and maybe-solutions

  • Preprocessing: The first thing to do with any NLP model is tokenization. There are many tokenization methods, but as far as my research goes, BERT models and related ones like ELECTRA, ALBERT, DistilBERT, etc. use one of two major approaches: SentencePiece, or a preprocessing layer like bert_en_uncased_preprocess that can easily be loaded and added in front of the embedding model as a layer (a small usage sketch of such a layer follows this list). The problem is that there is no generalized, built-in way to do this preprocessing in TFLite or its support packages as far as I can tell, nor in native Android or Flutter. (To be clear, I know preprocessing exists for certain implementations (tflite_flutter_helper/lib/src/task/text at master · am15h/tflite_flutter_helper), which just calls the native helper library.) And no, the preprocessing layer cannot be converted to TFLite without, perhaps, adding Flex delegate support for it.
    Possible solutions: 1: I faced this issue too while trying out different embedding models. For SentencePiece-based models like ALBERT, I compiled the SentencePiece library for Android and wrapped it with dart:ffi as a Flutter plugin (obviously a minimum viable product; I just needed the basics working).
    2: For the BERT preprocessing layer, I re-implemented in Dart the basic logic of TF Text, which already has a BERT tokenizer (the relevant parts start at text/bert_tokenizer.py at v2.8.0 · tensorflow/text). Essentially it strips characters from certain defined Unicode ranges and normalizes the text with NFD (canonical decomposition). My version currently skips normalization, because the package I was using three months ago, unorm_dart, turned out to be painfully slow, increasing processing time by about 10x-100x when converting sentences into 128-token arrays (this might have changed since), and my version only really handles English-ish text (plus Japanese and Chinese characters).
    3: Another possibility would be building support for TF Text's tokenizer into the Flex delegates or the TFLite builtins. I believe the blocker used to be tf.nn.embedding_lookup_sparse, but that might not be the whole story.
    3.5: There is already QA and NL-classifier support in the helper libraries. I haven't researched this enough, it's just a thought, but why not simply expose the function that does the preprocessing, if that's possible?

    The first two options could either live in their own repo or be added to the TFLite helper libraries.

  • TFHub BERT-based models: This is something I absolutely cannot achieve on my own fast enough and it needs more research, but it causes the next problem, which could easily be avoided if this one were improved even a little. Currently there is only ONE fully TFLite-compatible BERT model (meaning one that doesn't require Flex delegates). That is a problem, because downstream models that use BERT-based models as embedding layers are then forced to use Flex delegates too, even when their own code is entirely TFLite compatible.
    Possible solution: Looking at the source code of ALBERT, it achieves this by replacing the einsum op with a reshape + matmul combination (please correct me if I am wrong; see the einsum sketch after this list). If I remember correctly, other models I considered, like Small BERT, use einsum too. Retraining these small embedding models (which are the most likely to be used on edge devices, since they are under 60 MB) could remove the space overhead of Flex delegate binaries and also let downstream models take advantage of NNAPI etc., improving performance.

    I am sorry for the lack of research on this front compared to the previous problem, but this is exactly why I wanted an open discussion with people more experienced than me (maybe from the TFHub team) to correct me on this, or to tell me whether it is possible.

  • Flutter TFLite plugin and binaries: There are a few changes I would like to make to the plugin itself, plus documentation and automation for fat binary generation and for slimmed, per-model Flex delegates (I could not generate a slimmed Flex delegate for the life of me). Currently the most recent binary shipped with the package (which supports GPU acceleration and NNAPI) isn't even compatible with TF 2.6. Also, there is no support for passing strings to models at the moment; I have a fork where I experimented with that.
    I have already discussed all of this with Amish Garg, whose GSoC project this package started as. He would like me to come up with a concrete method for automating fat binary generation with GitHub Actions or an alternative. As for string support, like he said, this package is used by a lot of people, so I would only add it if someone from the TFLite team (or another relevant member) reviewed the code to a production standard. I would like to think I am decent enough, but I have no baseline when it comes to writing production-level code. So string support basically hinges on whether this can become a GSoC project with someone overseeing the Flutter plugin code; otherwise, I will try to contribute it to the package individually.
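For reference, this is the kind of preprocessing layer I mean in the first point. A minimal sketch (the hub handle and version are just an example; any bert_*_preprocess model behaves the same way). The tokenization runs inside TF Text custom ops, which is exactly the part that does not convert to TFLite builtins:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401, registers the custom ops the layer needs

# Example handle; the exact model/version here is an assumption.
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")

encoder_inputs = preprocessor(tf.constant(["this is a test sentence"]))
print(encoder_inputs["input_word_ids"].shape)  # (1, 128) token ids, padded/truncated
```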
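And to illustrate the einsum replacement from the second point, a minimal self-contained check (not ALBERT's actual code, just the equivalence I mean; the dimensions are arbitrary examples):

```python
import tensorflow as tf

x = tf.random.normal([2, 128, 768])   # [batch, seq_len, hidden]
w = tf.random.normal([768, 312])      # projection weights

# Einsum-style dense projection, as used inside several TFHub BERT encoders.
y_einsum = tf.einsum('abc,cd->abd', x, w)

# The same projection as reshape + matmul, which maps onto TFLite builtins.
y_matmul = tf.reshape(tf.matmul(tf.reshape(x, [-1, 768]), w), [2, 128, 312])

# Identical up to floating point error.
print(tf.reduce_max(tf.abs(y_einsum - y_matmul)).numpy())
```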

I appreciate anyone who took the time to read all of this. I really tried to keep it short while still giving enough information to assess whether even some parts of it could be a GSoC project. In my small world these are big improvements to the current workflow, but obviously that depends on the priorities and goals of the TFHub, TFLite, and related teams. Any assessment of these problems and improvements is appreciated; my goal at the end of the day is to write code that helps others, and to learn from more experienced people along the way.

Thanks, Sid (side note: I wrote this right before going to sleep, without proofreading :man_facepalming:)

Hi Sid, thanks for your post

Let me add some insights to help you with your project.

BERT models + TFLite
The easiest way to use BERT models on mobile devices is using the Task Library

It has a tokenizer built in, which makes the preprocessing model unnecessary on mobile. You can still use the preprocessing model to fine-tune a model, but you don't need to save it into the final model.
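For example, with the Python flavor of the Task Library (tflite-support), usage looks roughly like this (the model path is a placeholder, and the exact result fields can differ a bit between versions):

```python
from tflite_support.task import text

# Placeholder path: any BERT classifier model built for the Task Library, with the
# vocab/tokenizer bundled in the model metadata, so no separate preprocessing
# model is needed on-device.
classifier = text.BertNLClassifier.create_from_file("bert_classifier.tflite")

result = classifier.classify("The movie was great, I would watch it again!")
print(result)  # categories with scores
```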

For BERT models for on-device use, I'd suggest you look into these: TensorFlow Hub

They are super optimized and can be a good start. As far as I remember from playing with them, you can fine-tune them the same way as the other BERT models from TFHub (with the preprocessing layers), or if not exactly the same way, a very similar one.

I don't know about the Flutter plugin; maybe @khanhlvg or @Wei_Wei might know more about it.


Thanks for adding more insight into this. I actually didn't find this model when I was researching three months ago; could it have been added later? Either way, thanks for this.

About the preprocessing layer: I already saw this method, but it only works for classification applications, and please correct me if I am wrong, but it doesn't expose any general tokenization method, right? Also, for example, when my goal is to generate a summary, the fastest method I found was to use an embedding model (fine-tuned for the use case) and then feed the generated embeddings into a TextRank-style algorithm. I am not sure I could achieve that with this method. As I said in point 3.5 of the problem, if we could just expose the preprocessing, that would eliminate the issue entirely, since we can already call the model ourselves after that.
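To make the summarization use case concrete, here is a rough sketch of what I mean (not my actual implementation; the function name is made up, and it assumes the sentence embeddings come from whatever on-device encoder you run first):

```python
import numpy as np

def textrank_summary(sentence_embeddings, sentences, top_k=3, d=0.85, iters=50):
    """Pick top_k sentences via a TextRank-style power iteration over
    cosine similarities of precomputed sentence embeddings."""
    emb = np.asarray(sentence_embeddings, dtype=np.float32)
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-9)

    sim = np.clip(emb @ emb.T, 0.0, None)                 # cosine similarity, non-negative
    np.fill_diagonal(sim, 0.0)                            # no self-links
    sim = sim / (sim.sum(axis=1, keepdims=True) + 1e-9)   # row-stochastic transitions

    n = len(sentences)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):                                # damped PageRank iteration
        scores = (1 - d) / n + d * (sim.T @ scores)

    best = np.argsort(scores)[::-1][:top_k]
    return [sentences[i] for i in sorted(best)]           # keep original sentence order
```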

But I can see this being a niche enough problem that it isn't that important, since most people seem to be implementing classifiers and QA-type models anyway.

Again, thanks for taking the time to read that long post.

Just as an update, the model here, TensorFlow Hub (tfhub.dev), is not TFLite compatible without Flex delegates if you extend it after using it as a layer.

It seems we need to load the pretrained model as the documentation describes here: models/official/projects/edgetpu/nlp at master · tensorflow/models (github.com). I will update this after verifying whether that works, in case anyone ever reads up on it.

For Flex: I think that depends on the layers you add to your model in addition to MobileBERT.

Can you post which ops are missing?

It's just einsum, as in the other models. I simply added an input layer plus the MobileBERT model and retrieved the output of the encoder's second-to-last layer; there is no other layer added after that in the Keras model.

This was my testing Colab notebook, if you want to take a look. Please note that it was made just for testing rather than for documenting anything; the error is in the last cell.
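In case it helps anyone reproduce the behaviour without the notebook, this is roughly the check I ran, but with a tiny stand-in model instead of the MobileBERT encoder (the model below is hypothetical; whether the builtins-only pass succeeds depends on whether the converter can legalize the particular einsum pattern, and for the encoder above it could not):

```python
import tensorflow as tf

# Hypothetical stand-in: a Keras model whose only non-trivial op is an einsum,
# similar in spirit to the dense projections inside the TFHub BERT encoders.
inputs = tf.keras.Input(shape=(128, 768))
w = tf.random.normal([768, 312])
outputs = tf.keras.layers.Lambda(lambda x: tf.einsum('abc,cd->abd', x, w))(inputs)
model = tf.keras.Model(inputs, outputs)

converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Attempt 1: builtins only. This is where the "op not supported" error shows up
# whenever the converter cannot rewrite the einsum into builtin ops.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
try:
    converter.convert()
    print("converted with builtins only")
except Exception as err:
    print("builtins-only conversion failed:", err)

# Attempt 2: allow select TF ops. Conversion goes through, but the app then has
# to ship the Flex delegate, which is the size/performance cost discussed above.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
flex_model = converter.convert()
```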