Classify words by meaning in booking documents

Hi there!

I am looking for an approach that takes in booking documents (with fields such as booking date, sender address, receiver address, price, …) and returns structured output (e.g. JSON). The documents are fairly structured, but they differ significantly from company to company.

The OCR part is no problem: I can accurately get the raw text (and its coordinates on the page) from the input file. The problem I face now is turning that raw text into a structured form. For instance, I need to be able to detect which text is the sender address or the delivery date.

I tried some manual rules based on word position in the document, e.g. “sender name” is followed by the name of the sender. Suffice it to say, this approach does not generalize at all.
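A minimal sketch of such a position-based rule, assuming the OCR step yields one left/top coordinate per word. The `Word` class, the `group_lines` and `value_after_label` helpers, the `y_tol` tolerance and the sample data are all hypothetical and only illustrate the idea:

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge on the page (assumed OCR output)
    y: float  # top edge on the page (assumed OCR output)


def group_lines(words, y_tol=5.0):
    """Group words into text lines: words whose y is within y_tol of the
    first word of the current line are treated as being on that line."""
    lines = []
    for w in sorted(words, key=lambda w: w.y):
        if lines and abs(w.y - lines[-1][0].y) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    # Within each line, read left to right.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def value_after_label(words, label, y_tol=5.0):
    """Return the text to the right of `label` on the same line, or None."""
    label_tokens = label.lower().split()
    n = len(label_tokens)
    for line in group_lines(words, y_tol):
        # Compare case-insensitively and ignore a trailing colon.
        texts = [w.text.lower().rstrip(":") for w in line]
        for i in range(len(texts) - n + 1):
            if texts[i:i + n] == label_tokens:
                rest = [w.text for w in line[i + n:]]
                return " ".join(rest) or None
    return None


# Made-up sample words, as they might come out of OCR.
sample = [
    Word("Sender", 10, 100), Word("name:", 60, 101),
    Word("ACME", 120, 99), Word("GmbH", 170, 100),
    Word("Price:", 10, 130), Word("42.00", 60, 130),
]
print(value_after_label(sample, "sender name"))  # ACME GmbH
```

The rule only works when the label wording and the "value is to the right of the label" layout match, which is exactly why it breaks as soon as another company uses a different template.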

I was wondering whether there are models, for instance BERT, that could be trained to classify individual words by meaning.
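One way to frame "classifying individual words" is token classification with BIO tags, as used for named-entity recognition: each word gets a tag such as `B-SENDER_NAME` (begins a field), `I-SENDER_NAME` (continues it) or `O` (no field). The sketch below does not train or call any model. It only shows, with hand-written tags standing in for model predictions, how per-word labels could be collapsed into JSON. The tag names, the `tags_to_fields` helper and the sample data are hypothetical:

```python
import json


def tags_to_fields(words, tags):
    """Collapse per-word BIO tags into a {field: text} dict.

    A later B- tag for the same field overwrites the earlier value,
    and an I- tag that does not continue the current field is ignored.
    """
    fields = {}
    current = None  # name of the field currently being extended
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            current = tag[2:]
            fields[current] = word
        elif tag.startswith("I-") and current == tag[2:]:
            fields[current] += " " + word
        else:
            current = None
    return fields


# Made-up OCR words and the tags a trained classifier might assign.
words = ["Sender", "name:", "ACME", "GmbH", "Delivery", "date:", "2024-03-01"]
tags = ["O", "O", "B-SENDER_NAME", "I-SENDER_NAME", "O", "O", "B-DELIVERY_DATE"]

print(json.dumps(tags_to_fields(words, tags), indent=2))
```

Under this framing, the training data would be documents whose words have been annotated with such tags, and the model's job is only to predict the tag sequence.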

Any suggestion would be nice, thanks!

Hi @TimoKer

Welcome to the TensorFlow Forum!

Could you please tell us what format the files are currently in and which preprocessing steps you have applied? If possible, please share minimal reproducible code so that we can replicate and understand the issue. Thank you.