NLP: Questions while working with ALBERT

Sid · August 4, 2021, 8:06am

Hi, I have recently been researching BERT (ALBERT specifically) and related work, and while working with these models a few questions came up. I have tried to find the answers myself, but I probably have some knowledge gaps.

How is the preprocessing done for BERT and ALBERT?
So far I have been able to preprocess the text using albert_en_preprocess and the SentencePiece tokenizer, but it feels like a genie in a bottle that I don’t really understand: I call a function and boom, it’s done. I skimmed through the ALBERT paper (arXiv:1909.11942) and still didn’t find an explanation. It works, but I don’t get it.
Yes, I did briefly try reading SentencePiece’s source code, but even its code structure went over my head at Mach 5.
Output vectors

ALBERT returns two outputs: pooled_output and sequence_output. The pooled_output is the sentence embedding, with shape 1x768, and the sequence_output is the token-level embedding, with shape 1x(token_length)x768.

It is pretty clear which is which. But I couldn’t find a reason for the fixed 768 dimension; that is probably down to my lack of research at this point.

Other than that I have no problems working with the models. It would be awesome if someone with more experience could explain the why. I am pretty sure the 768 will turn out to have a short “oh, I see” answer.
Thanks

lgusm · August 4, 2021, 1:32pm

Hi Sid, regarding the preprocessing model: to understand what’s going on behind the scenes, I’d look at how to use a BERT model without the preprocessing helper (Fine-tuning a BERT model | Text | TensorFlow). As you can see, there’s a lot of boilerplate code to transform text into the proper input. That’s all wrapped up in the preprocessing models using regular TF operations (mainly from the tensorflow_text package).
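To make that a bit more concrete, here is a rough plain-Python sketch of the packing step that the preprocessing model does for you. It is only an illustration: the toy vocabulary, the whitespace "tokenizer" and the toy_preprocess helper are made up for this example (the real model uses a trained SentencePiece vocabulary and TF ops), and the short seq_length is chosen just so the output is easy to read. The three output keys are the ones the TF Hub preprocessing models return.

```python
# Toy illustration of how text is packed into model inputs.
# TOY_VOCAB and the whitespace split are hypothetical stand-ins for the
# trained SentencePiece vocabulary and tokenizer used by the real model.
TOY_VOCAB = {"<pad>": 0, "<unk>": 1, "[CLS]": 2, "[SEP]": 3,
             "hello": 4, "world": 5}


def toy_preprocess(text, seq_length=8):
    """Return fixed-length inputs for a single sentence."""
    # 1. Tokenize (here: lowercase + whitespace split) and truncate,
    #    leaving room for the [CLS] and [SEP] special tokens.
    tokens = text.lower().split()[: seq_length - 2]

    # 2. Map tokens to ids and wrap them in [CLS] ... [SEP].
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in tokens]
    ids += [TOY_VOCAB["[SEP]"]]

    # 3. Pad to a fixed length and record which positions are real tokens.
    n_real = len(ids)
    n_pad = seq_length - n_real
    return {
        "input_word_ids": ids + [TOY_VOCAB["<pad>"]] * n_pad,
        "input_mask": [1] * n_real + [0] * n_pad,
        # Single-sentence input: every position belongs to segment 0.
        "input_type_ids": [0] * seq_length,
    }


features = toy_preprocess("Hello world")
print(features["input_word_ids"])  # [2, 4, 5, 3, 0, 0, 0, 0]
print(features["input_mask"])      # [1, 1, 1, 1, 0, 0, 0, 0]
print(features["input_type_ids"])  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The real preprocessing model does the same three things (tokenize, add special tokens, pad and mask), just with a learned subword vocabulary and a longer sequence length.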

Regarding the output, the 768 is one of the model's hyperparameters (its hidden size), and my guess (citation needed) is that it was chosen to keep the model's memory usage under some constraints. The 768 specifically is because the ALBERT you’re using is based on a BERT with the same output size. If you need an ALBERT with a bigger output size (for more accuracy), you can choose another one from this collection

Of course, the larger the output (and the model), the more resources you’ll need to fine-tune it and use it later.
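As a small sketch of how the output size follows from the variant you pick: the hidden sizes below are the ones listed in the ALBERT paper (arXiv:1909.11942), while the output_shapes helper is just a hypothetical function written for this example, not part of any library. The 12 x 64 check reflects how BERT-base splits its 768 dimensions across 12 attention heads of 64 dimensions each.

```python
# Hidden sizes of the ALBERT variants as listed in the ALBERT paper.
ALBERT_HIDDEN_SIZE = {
    "base": 768,
    "large": 1024,
    "xlarge": 2048,
    "xxlarge": 4096,
}


def output_shapes(variant, batch_size, seq_length):
    """Hypothetical helper: expected output shapes for a given variant."""
    hidden = ALBERT_HIDDEN_SIZE[variant]
    return {
        # One vector per input sentence.
        "pooled_output": (batch_size, hidden),
        # One vector per token position (including padding positions).
        "sequence_output": (batch_size, seq_length, hidden),
    }


shapes = output_shapes("base", batch_size=1, seq_length=128)
print(shapes["pooled_output"])    # (1, 768)
print(shapes["sequence_output"])  # (1, 128, 768)

# In BERT-base, 768 is 12 attention heads times 64 dimensions per head.
print(12 * 64 == ALBERT_HIDDEN_SIZE["base"])  # True
```

So the last dimension is fixed per model variant, while the middle dimension of sequence_output depends on the sequence length used during preprocessing.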

Does that make sense?

Sid · August 4, 2021, 5:42pm

Thanks for the pointer on the preprocessing part, I will definitely look into it. Also thanks for explaining the 768 thingy; before, I was treating it as something I had missed, without realising it might simply be a design choice.
I really appreciate you answering.