How will a text prediction model be affected by removing rare words from the dataset?

Let’s say you want to build a model for text prediction and your dataset contains 5M words, with 59k unique words (the vocab size). But if I remove all words that appear fewer than 4 times in the whole dataset, the number of unique words drops to 20k. What I want to know is: does removing these rare words affect my model performance in any way?
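The frequency cutoff described above can be sketched in a few lines of plain Python. This is a minimal illustration, not your actual pipeline: the toy corpus and the `<unk>` placeholder token are assumptions for the example.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the 5M-word dataset.
corpus = "the cat sat on the mat and the cat saw the aardvark".split()

# Count word frequencies across the whole dataset.
counts = Counter(corpus)

# Keep only words that appear at least 4 times; map the rest to <unk>.
MIN_COUNT = 4
vocab = {w for w, c in counts.items() if c >= MIN_COUNT}
filtered = [w if w in vocab else "<unk>" for w in corpus]

print(vocab)     # only "the" appears 4+ times in this toy corpus
print(filtered)
```

In a real pipeline the rare words are usually replaced by a single unknown-word token like this rather than deleted outright, so sentence lengths and positions are preserved.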


In my opinion, removing the rare words will affect the performance of the model, because text classification models are often based on transfer learning, where the words in a sentence play a very important role. If you remove those rare words, it might change the overall context of the sentence, which may hurt model performance. So there is a possibility that when you start testing your model, it will fail to classify inputs correctly if any of those rare words appear in the future.
I hope this answer helps you.

It will have an effect, but not much.
Removing rare words is a common technique; some of them might even be typos or simply wrong words.

The problem with using the 59k vocab size is that your model will need to deal with an embedding layer of size 59k (e.g. when one-hot encoding your input). This can create memory pressure and make your model too big.
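Some rough arithmetic makes the memory point concrete. The embedding dimension of 300 below is an assumption chosen for illustration; the parameter count of an embedding layer is simply vocab_size × embedding_dim.

```python
# Rough parameter count for an embedding layer: vocab_size * embedding_dim.
EMBEDDING_DIM = 300  # assumed dimension, for illustration only

params_59k = 59_000 * EMBEDDING_DIM  # 17.7M parameters
params_20k = 20_000 * EMBEDDING_DIM  # 6.0M parameters

print(params_59k, params_20k)
```

So cutting the vocabulary from 59k to 20k shrinks this one layer by almost a factor of three, before counting the rest of the model.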

For example, BERT models have a vocab size close to 30k, so even though they don’t cover every possible word, they still achieve very good results.

So how can I reduce the vocabulary size? Right now I’m using the number of unique words as the vocab size.

There are some tips here: Load text | TensorFlow Core
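Besides the frequency cutoff, another common way to reduce vocabulary size is to keep only the N most frequent words (this mirrors the `max_tokens` idea used by tokenizers such as Keras's `TextVectorization` layer). A stdlib-only sketch, with the toy corpus and the cap of 4 chosen purely for illustration:

```python
from collections import Counter

# Hypothetical toy corpus.
corpus = "to be or not to be that is the question".split()

MAX_TOKENS = 4  # assumed cap on vocabulary size

# Keep the MAX_TOKENS most frequent words; reserve id 0 for out-of-vocab.
most_common = [w for w, _ in Counter(corpus).most_common(MAX_TOKENS)]
word_to_id = {w: i + 1 for i, w in enumerate(most_common)}
ids = [word_to_id.get(w, 0) for w in corpus]

print(word_to_id)
print(ids)
```

Everything outside the top-N list is mapped to the single out-of-vocab id 0, so the vocabulary size is fixed regardless of how many unique words the raw data contains.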