Well, I'll try my best to describe my current method, though it's probably not optimal. I hope it gives you a better gauge of the effort needed compared to the C++ NDK SentencePiece route.
I was using this and this as references for Bert tokenizer.
Based on what I observed, the BERT tokenizer consists of two general steps: a basic tokenizer followed by a wordpiece tokenizer. The basic tokenizer strips whitespace, casefolds, and splits on special characters such as punctuation and Chinese characters. The wordpiece tokenizer then takes the preprocessed splits from the basic tokenizer step and converts each word to integer tokens by looking it up in the vocab.txt dictionary. Since this is the wordpiece method, a single word can map to multiple integer tokens, and the tokenizer handles that too.
Ops or methods that needed to be modified or offloaded to mobile are (the op names may not be exact):
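To make the wordpiece step concrete, here is a minimal plain-Python sketch of the greedy longest-match-first lookup, loosely following the google-research bert tokenization.py logic. The function name, the `max_chars` default, and the toy vocabulary are assumptions for illustration only; a real vocab.txt has tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    """Split one pre-tokenized word into wordpiece ids (greedy, longest match first)."""
    if len(word) > max_chars:
        return [vocab[unk_token]]
    ids = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring from the right until it is in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matched at this position: the whole word becomes [UNK].
            return [vocab[unk_token]]
        ids.append(vocab[match])
        start = end
    return ids


# Hypothetical toy vocabulary, standing in for vocab.txt.
TOY_VOCAB = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3, "play": 4, "##ing": 5}
```

With this toy vocabulary, "unaffable" comes out as three ids (un, ##aff, ##able), which is the "multiple integer tokens per word" case.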
In my case, I rewrote most of the basic tokenizer steps in Flutter. There are casefolding and NFD normalization libraries available for Flutter. As for regex, the expressions in the original Python code are Perl-based, while the Flutter RegExp library is based on JS, so the regex syntax conversion is a little ugly. I used this tool to help me. You can see that a simple punctuation regex in Perl turns into a long string of expressions in JS. Once the splitting was done, I recombined the pieces into a simple space-separated string and sent it to the TF model. There, the standard TF string split op can handle space-separated strings without relying on the tf-text regex ops.
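The preprocessing described above can be sketched roughly as follows. This is Python rather than the Dart code used in the app, and it checks Unicode categories per character instead of using a regex. The helper names are hypothetical, and the CJK check only covers the main CJK Unified Ideographs block, whereas the reference BERT code tests several more ranges.

```python
import unicodedata


def _is_punctuation(ch):
    cp = ord(ch)
    # Treat every non-alphanumeric printable ASCII symbol as punctuation,
    # as the reference BERT tokenizer does.
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def _is_cjk(ch):
    # Assumption: main CJK Unified Ideographs block only.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def basic_tokenize(text):
    """Casefold, strip accents, and isolate punctuation/CJK characters.

    Returns one space-separated string that a plain whitespace split can handle.
    """
    text = unicodedata.normalize("NFD", text.casefold())
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            continue  # drop combining marks left over from NFD
        if _is_punctuation(ch) or _is_cjk(ch):
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(out).split())
```

The output is the space-separated string that gets passed to the model, so the model side only needs a whitespace split.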
For the wordpiece tokenizer, the main issue was the failed initialization of the lookup table. I reimplemented the lookup table as a standard tensor of string elements. During TFLite conversion, the tensor-based dictionary is already loaded as constant values and gets converted accordingly. The methods that accessed the lookup table were modified to search for strings in the tensor instead. This is especially convoluted for wordpiece_tokenize_with_offsets. I had to base it on the google-research bert tokenization.py Python code and reimplement it in TF style.
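The idea of replacing a hash-table lookup with a search over a constant string tensor can be illustrated in plain Python. This is only a sketch of the logic, not the TF ops actually used: an element-wise equality mask over the vocabulary, then the index of the first match, with a fallback to the unknown id. `lookup_id`, `VOCAB_LIST`, and the `unk_id` default are hypothetical names for illustration.

```python
def lookup_id(token, vocab_list, unk_id=0):
    """Return the index of token in vocab_list, or unk_id if it is absent."""
    # Equality mask over the whole vocabulary, like comparing a scalar
    # string against a 1-D string tensor.
    mask = [entry == token for entry in vocab_list]
    # Index of the first True, like taking an argmax over the mask.
    return mask.index(True) if any(mask) else unk_id


# Hypothetical vocabulary in vocab.txt line order; the line index is the token id.
VOCAB_LIST = ["[UNK]", "un", "##aff", "##able"]
```

This is linear in the vocabulary size for each lookup, which is the price of avoiding the lookup table initialization.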
Lastly, the current Flutter TFLite libraries don't support text input/output and lack good support for select ops too. I had to modify the existing Flutter libraries to work with strings and select ops for my case. See my fork if you want to try. I have only made modifications to suit my needs and I don't guarantee it will work for all cases. I was able to make it work on both iOS and Android simulators, but I am still having issues, such as having to resort to large monolithic binaries. With so many modifications needed to make it work, I highly suspect I have hidden bugs along the way too. Frankly, I am still a newbie to Flutter. I am only using it because the developer I am helping uses Flutter to develop apps.