Extract data/snippets from text

pomputer · June 18, 2021, 12:38am

Hi!
I am a complete beginner in TensorFlow, so please excuse my noob question.

I am trying to extract snippets of text from a larger text file. E.g., extract "Stanford University" from "Jim is a smart guy. He studied at Stanford University in his 20s." The solution must also work with languages other than English (at least Finnish).

I searched online, but wasn’t able to find any examples that fit my requirements. Could somebody give me an example or help me get started with this? I already have a dataset with which I managed to train a text classification model which worked well. Now I just need to implement it in a way that allows me to extract snippets similar to those in the dataset.

Thanks in advance!

lgusm · June 18, 2021, 9:54am

As I understand it, you are trying to implement Named Entity Recognition (NER), also known as entity extraction.

I don’t think there’s a tutorial on tensorflow.org but I found some available from the community.

Any insights, @markdaoust?

markdaoust · June 18, 2021, 11:38am

It could also be thought of as a text generation (summarization) task.

Or, if you know you want a snippet that exists in the input, you could run some sort of attention over the input to choose the start/end tokens.

Bhack · June 18, 2021, 2:14pm

We were talking about a tagger at

https://tensorflow-prod.ospodiscourse.com/t/is-there-an-android-equivalent-to-the-apple-word-tagger-model/1503

casolorz · June 18, 2021, 2:29pm

Yeah, this is similar to what I want. I need to be able to assign categories to different words or sets of words, then feed it a large piece of text and have it figure out which category each word in that text belongs to. It is working fairly well with the Apple Word Tagger on iOS, even with a small set of training data.
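For what it's worth, once a tagger gives you one category per word, turning those tags into snippets is plain bookkeeping. Below is a rough sketch assuming the common BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside). The tags are written by hand here as a stand-in for model predictions, and `decode_bio` is a hypothetical helper name, not part of any library.

```python
def decode_bio(tokens, tags):
    """Group per-token BIO tags into (snippet, label) pairs."""
    spans = []
    current_tokens, current_label = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open entity.
            flush()
            current_tokens, current_label = [], None
    flush()
    return spans


tokens = ["Jim", "studied", "at", "Stanford", "University", "."]
# Hand-written tags (assumption): a trained tagger would predict these.
tags = ["B-PER", "O", "O", "B-ORG", "I-ORG", "O"]

print(decode_bio(tokens, tags))
```

This sketch silently drops a stray `I-` tag that has no matching `B-` before it; some implementations treat it as the start of a new entity instead.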

lgusm · June 25, 2021, 9:29pm

Following up on this: a tutorial that does what you want was published on keras.io today:
Named Entity Recognition using Transformers

Very good timing!

Fawaz_Ahmed · July 27, 2023, 4:50am

Also, this tutorial by Hugging Face seems to be easier to understand.