PDF Data Extraction powered by TensorFlow.js

Hi All,

I’m new to the ML world. I have worked through the Machine Learning for Web Developers (Web ML) tutorial series on YouTube up to chapter 5.3. After implementing the tasks from the tutorial, I would like to create a model that detects the segments of a PDF document page and returns their coordinates, which I plan to use to extract the text data. (This would remove the need for manual data extraction from 1000s of PDF documents of the same type.)

I’m not sure whether I can retrain the COCO-SSD model, or a similar one, so that it returns a list of objects with their coordinates on the page for the given segments.

I would appreciate it if @Jason or any other ML expert could guide me here: am I heading in the right direction, and is what I’ve described above possible with TF.js?

The bigger question on my mind is: can you access the PDF data in JS in the browser? Usually PDFs are rendered by the browser itself rather than as HTML (though I am not an expert on PDFs, so I may be wrong).

Or are you doing this in Node.js or similar, where you can convert the PDF into a set of images with a command-line tool and then use those with the proposed model?

The TFJS example COCO-SSD model is hard to retrain because of its frozen weights, among other things. I would recommend following @hugozanini’s tutorial on retraining YOLO instead. YOLO is also a pretty popular choice for object detection these days compared with COCO-SSD.

Update: I was talking to some friends on the Chrome team and they said to check out pdf.js for reading PDFs in the browser. Essentially, if you don’t use pdf.js, you would need to compile some other PDF reader to WebAssembly (or similar) to run it in the browser. Either way, the reader can render the pages of the PDF to an HTML canvas, which you can then sample and send to a TFJS model to do the object detection on that page.
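
Once a page has been rendered to a canvas and passed through a detector, the detections still have to be mapped back onto the canvas. The following is only a minimal sketch of that last step. It assumes the model returns boxes normalized to the 0..1 range in `[yMin, xMin, yMax, xMax]` order, which may not match the output of the model you end up training; `toCanvasRect` and the type names are hypothetical helpers, not part of TF.js or pdf.js.

```typescript
// Assumption: the detector returns a normalized box [yMin, xMin, yMax, xMax]
// with every value in 0..1. Check the real output format of your own model.
type NormalizedBox = [number, number, number, number];

interface PixelRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Scale a normalized box up to pixel coordinates on the rendered canvas.
function toCanvasRect(
  box: NormalizedBox,
  canvasWidth: number,
  canvasHeight: number
): PixelRect {
  const [yMin, xMin, yMax, xMax] = box;
  return {
    x: xMin * canvasWidth,
    y: yMin * canvasHeight,
    width: (xMax - xMin) * canvasWidth,
    height: (yMax - yMin) * canvasHeight,
  };
}

// Example: a detection on a page rendered to an 800 x 1000 px canvas.
const detectedRect = toCanvasRect([0.25, 0.5, 0.5, 0.75], 800, 1000);
console.log(detectedRect);
```

The resulting rectangle is in canvas pixels, so it can be drawn over the rendered page directly or used to crop the region of interest.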

Thanks for the reply @Jason,

Regarding pdf.js, which you mentioned in the update: I’m already using it to render the PDF on a webpage. I also built a tool where a user can define a template for a PDF document by drawing rectangles on top of the rendered PDF. Each rectangle gives the coordinates on the PDF page, and the user can also set other attributes for it, such as a data point name and constraints, so that we can recalculate the rectangle for documents of the same type where the data point is positioned slightly differently. So rendering is not an issue here.
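
A rectangle drawn on the rendered canvas is in pixels with the origin at the top-left, while PDF page coordinates are usually in points with the origin at the bottom-left. The conversion can be sketched as below, under the assumptions that the page is unrotated, its origin sits at the bottom-left corner, and it was rendered at a uniform `scale` (canvas pixels per PDF point). The function names are hypothetical and are not taken from the tool described above.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Canvas rect (pixels, origin top-left, y grows downward) to
// PDF rect (points, origin bottom-left, y grows upward).
// Assumes an unrotated page rendered at a uniform scale.
function canvasRectToPdfRect(rect: Rect, scale: number, pageHeightPts: number): Rect {
  const width = rect.width / scale;
  const height = rect.height / scale;
  return {
    x: rect.x / scale,
    // The rect's bottom edge on the canvas becomes its y origin in PDF space.
    y: pageHeightPts - rect.y / scale - height,
    width,
    height,
  };
}

// Inverse mapping, e.g. to redraw a stored template rect at a new zoom level.
function pdfRectToCanvasRect(rect: Rect, scale: number, pageHeightPts: number): Rect {
  return {
    x: rect.x * scale,
    y: (pageHeightPts - rect.y - rect.height) * scale,
    width: rect.width * scale,
    height: rect.height * scale,
  };
}

// Example: a US Letter page (792 pt tall) rendered at scale 2.
const drawn: Rect = { x: 100, y: 200, width: 300, height: 50 };
const onPage = canvasRectToPdfRect(drawn, 2, 792);
const roundTrip = pdfRectToCanvasRect(onPage, 2, 792);
console.log(onPage, roundTrip);
```

Storing template rectangles in page points rather than canvas pixels keeps them independent of the zoom level used when the template was drawn.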

I’m planning to create an interface where a PDF is rendered on a webpage. By drawing rectangles on the rendered page, I’ll get their coordinates on the loaded PDF page. After that, I’ll extract the image at those coordinates and use it to train the model.

So imagine 1000s of PDFs of the same type: I’ll tag the different data points on the rendered page for each PDF and give each one a label, then train the model on the extracted images.

Also, as mentioned above, I’ll go through @hugozanini’s tutorial on retraining YOLO for object detection.
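
For the tagging step, the labeled rectangles eventually need to become training annotations. As a rough sketch, the helper below turns one tagged rectangle into a line in the plain-text label format commonly used with YOLO models (`class_id x_center y_center width height`, all normalized to 0..1). Whether the tutorial expects exactly this format is an assumption to verify, and `toYoloLine`, `TaggedRect`, and the label names are hypothetical.

```typescript
interface TaggedRect {
  label: string;
  x: number; // left edge in image pixels
  y: number; // top edge in image pixels
  width: number;
  height: number;
}

// Build one annotation line: "<classId> <xCenter> <yCenter> <width> <height>",
// with all four box values normalized by the image size.
function toYoloLine(
  rect: TaggedRect,
  imageWidth: number,
  imageHeight: number,
  classNames: string[]
): string {
  const classId = classNames.indexOf(rect.label);
  if (classId === -1) {
    throw new Error(`Unknown label: ${rect.label}`);
  }
  const values = [
    (rect.x + rect.width / 2) / imageWidth,
    (rect.y + rect.height / 2) / imageHeight,
    rect.width / imageWidth,
    rect.height / imageHeight,
  ];
  return [String(classId), ...values.map((v) => v.toFixed(6))].join(" ");
}

// Hypothetical data point names for one document type.
const classNames = ["invoice_number", "total"];
const tagged: TaggedRect = { label: "total", x: 400, y: 250, width: 200, height: 250 };
const labelLine = toYoloLine(tagged, 800, 1000, classNames);
console.log(labelLine); // 1 0.625000 0.375000 0.250000 0.250000
```

With this approach the whole rendered page is the training image and each tagged rectangle becomes one line in that page's label file.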

Thanks again :pray:

You are welcome! Good luck!
