PDF Data Extraction powered by TensorFlow.js

smtc7648 · February 13, 2024, 1:42pm

Hi All,

I’m new to the ML world. I have completed the Machine Learning for Web Developers (Web ML) tutorial series on YouTube up to chapter 5.3. Having implemented the tasks from the tutorial, I would like to create a model that detects the segments of a PDF page and returns their coordinates, which I plan to use to extract the text data. (This would remove the need for manual data extraction from thousands of PDF documents of the same type.)

I’m not sure whether I can retrain the COCO-SSD model, or a similar one, so that it returns a list of objects with their page coordinates for the given segments.

I would appreciate it if @Jason or any other ML expert could tell me whether I’m heading in the right direction, and whether what I’ve described above is possible with TFJS.

Jason · February 13, 2024, 6:10pm

The bigger question on my mind is: can you access the PDF data in JS in the browser? Usually PDFs are rendered by the browser itself rather than as HTML (though I am not an expert on PDFs, so I may be wrong).

Or are you doing this in Node.js or similar, where you can convert the PDF into a set of images with a command-line tool and then use those with the proposed model?

The TFJS example COCO-SSD model is hard to retrain because its weights are frozen. I would recommend following @hugozanini’s tutorial on retraining YOLO instead, which is also a popular choice for object detection these days compared with COCO-SSD:

Jason · February 13, 2024, 6:21pm

Update: I was talking to some friends on the Chrome team and they said to check out pdf.js for reading PDFs in the browser.

Essentially, if you don’t use pdf.js, you would need to compile some other PDF reader to WebAssembly so it can run in the browser. Either way, the reader renders the pages of the PDF to an HTML canvas, which you can then sample and send to the TFJS model to run object detection on that page.
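Since the goal is to use the detected coordinates to pull text out of the PDF, a box found on the rendered canvas has to be mapped back to the page's own coordinate space. Below is a minimal sketch of that step, not a definitive implementation. It assumes the box is `[x, y, width, height]` in canvas pixels (the layout COCO-SSD uses for `bbox`), that the page was rendered unrotated at a known scale, and that PDF coordinates are in points with the origin at the bottom-left. The names `canvasBoxToPdfRect`, `CanvasBox`, and `PdfRect` are hypothetical.

```typescript
// Hypothetical helper: maps a detection box on the rendered canvas back to
// PDF page coordinates. Assumes an unrotated page with no crop-box offset.

/** [x, y, width, height] in canvas pixels, origin top-left, y grows downward. */
type CanvasBox = [number, number, number, number];

/** Rectangle in PDF points, origin bottom-left, y grows upward. */
interface PdfRect {
  left: number;
  bottom: number;
  right: number;
  top: number;
}

function canvasBoxToPdfRect(
  box: CanvasBox,
  scale: number, // canvas pixels per PDF point used when rendering the page
  pageHeightPts: number, // unscaled page height in PDF points
): PdfRect {
  if (scale <= 0) {
    throw new RangeError("scale must be positive");
  }
  const [x, y, width, height] = box;
  // Undo the render scale on the horizontal axis.
  const left = x / scale;
  const right = (x + width) / scale;
  // Flip the vertical axis: canvas y is measured from the top edge,
  // PDF y from the bottom edge.
  const top = pageHeightPts - y / scale;
  const bottom = pageHeightPts - (y + height) / scale;
  return { left, bottom, right, top };
}
```

For a 792-point-tall page rendered at scale 2, `canvasBoxToPdfRect([100, 200, 300, 50], 2, 792)` gives `{ left: 50, bottom: 667, right: 200, top: 692 }`. If the pages can be rotated, the pdf.js viewport's own `convertToPdfPoint` method is probably a safer choice than hand-rolled arithmetic like this.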

smtc7648 · February 14, 2024, 5:43am

Thanks for the reply @Jason,

As for pdf.js, which you mentioned in the update: I’m already using it to render the PDF on a webpage. I also built a tool where the user defines a template for a PDF document by drawing rectangles on top of the rendered page. Each rectangle gives the coordinates on the PDF page, and the user can attach other attributes to it, such as a data point name and constraints, so that we can recalculate the rectangle for documents of the same type where the data point is positioned slightly differently. So rendering is not an issue here.

I’m planning to create an interface where a PDF is rendered on a webpage. Drawing rectangles on the rendered page gives me the coordinates on the loaded PDF page, and I’ll then extract the image at those coordinates to use for training the model.

So imagine thousands of PDFs of the same type: I’ll tag the different data points on the rendered page for each PDF and give each one a label, then train the model on the extracted images.
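The tagging step described above could export each drawn rectangle as a training label. Here is a minimal sketch, assuming a YOLO-style text format (`class cx cy w h`, all normalized to the 0..1 range). That format is an assumption, and the tutorial's training pipeline may expect a different annotation format. The names `DrawnRect` and `toYoloLabelLine` are hypothetical.

```typescript
// Hypothetical helper: turns a rectangle drawn on the rendered page into one
// YOLO-style label line. The label format itself is an assumption.

/** Rectangle drawn on the rendered page, in canvas pixels (top-left origin). */
interface DrawnRect {
  x: number;
  y: number;
  width: number; // may be negative if the user dragged right-to-left
  height: number; // may be negative if the user dragged bottom-to-top
}

function clamp(value: number, min: number, max: number): number {
  return Math.max(min, Math.min(value, max));
}

function toYoloLabelLine(
  rect: DrawnRect,
  classId: number, // integer index of the data point's label
  canvasWidth: number,
  canvasHeight: number,
): string {
  if (canvasWidth <= 0 || canvasHeight <= 0) {
    throw new RangeError("canvas dimensions must be positive");
  }
  // Normalize the drag direction, then clamp to the page so a rectangle
  // dragged slightly past the edge still yields values in [0, 1].
  const x0 = clamp(Math.min(rect.x, rect.x + rect.width), 0, canvasWidth);
  const x1 = clamp(Math.max(rect.x, rect.x + rect.width), 0, canvasWidth);
  const y0 = clamp(Math.min(rect.y, rect.y + rect.height), 0, canvasHeight);
  const y1 = clamp(Math.max(rect.y, rect.y + rect.height), 0, canvasHeight);

  const cx = (x0 + x1) / 2 / canvasWidth; // normalized box center x
  const cy = (y0 + y1) / 2 / canvasHeight; // normalized box center y
  const w = (x1 - x0) / canvasWidth; // normalized box width
  const h = (y1 - y0) / canvasHeight; // normalized box height

  const fields = [cx, cy, w, h].map((v) => v.toFixed(6));
  return [String(classId), ...fields].join(" ");
}
```

Because the values are normalized by the canvas size, the same labels stay valid if the page is later rendered at a different scale, as long as the aspect ratio is unchanged.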

Also, as suggested above, I’ll go through @hugozanini’s tutorial on retraining YOLO for object detection.

Thanks again

Jason · February 14, 2024, 5:41pm

You are welcome! Good luck!

smtc7648 · March 20, 2024, 10:24am

@hugozanini, could you, @Jason, or someone from the TensorFlow team please help me run the model builder test cell? I’m getting the following error.

Traceback (most recent call last):
  File "/Users/smitparmar/Desktop/ML/custom-object-detection/models/research/object_detection/builders/model_builder_tf2_test.py", line 24, in <module>
    from object_detection.builders import model_builder
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/object_detection/builders/model_builder.py", line 26, in <module>
    from object_detection.builders import hyperparams_builder
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/object_detection/builders/hyperparams_builder.py", line 27, in <module>
    from object_detection.core import freezable_sync_batch_norm
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/object_detection/core/freezable_sync_batch_norm.py", line 20, in <module>
    class FreezableSyncBatchNorm(tf.keras.layers.experimental.SyncBatchNormalization
AttributeError: module 'keras._tf_keras.keras.layers' has no attribute 'experimental'

I’m using an Apple MacBook Pro with an M1 Pro CPU, and I’ve also installed TensorFlow Metal. I tried running the TensorFlow Hub Object Detection Colab tutorial, and it works with a few changes. However, I’m not able to run the Model Builder Test from Real-Time SKUs Detection (Google Colab).

“Modified by moderator”

smtc7648 · April 1, 2024, 6:38pm

@hugozanini, could you please help here?