Points to ponder before building a Text Classification model

Prashant Kumar
3 min read · Jul 5, 2021

Overview

There is a vast range of approaches and models available in the NLP domain for any given task, and it can be overwhelming to settle on one, even for seemingly simple problems like text classification or sentiment analysis. So here I've jotted down some points that can help narrow down the options and choose a suitable model for the job.

Analyze the problem statement

This is the basic requirement: get to know the problem in detail before anything else. That is,

  • How many classes are we targeting: binary, multi-class, or multi-label classification?
  • If we are dealing with sentiment analysis, there can be more than two target classes involved.
  • Is the training data present in roughly equal proportions for each class, or is it imbalanced? (A quick way to check this is shown below.)
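For instance, a quick class-frequency check makes any imbalance obvious. This is a minimal sketch; the DataFrame and column names are hypothetical stand-ins for the real training set.

```python
import pandas as pd

# Hypothetical labelled data; in practice this would be the real training set.
df = pd.DataFrame({
    "text": ["great product", "terrible service", "okay experience", "loved it"],
    "label": ["positive", "negative", "neutral", "positive"],
})

# Absolute and relative class frequencies reveal whether the data is imbalanced.
print(df["label"].value_counts())
print(df["label"].value_counts(normalize=True))
```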

Based on these parameters we can move to the next part, which is analyzing the data.

Analyzing data

The first thing to consider for text data is whether text cleaning is required. That is, are we interested only in the relevant text, and not in numbers, HTTP links, accented characters, etc.?

Text cleaning

If the texts we are dealing with are short sentences such as tweets and we need to predict the sentiment, then it doesn't make much sense to remove stop words. Most of the stop words included in libraries like NLTK, Gensim, and spaCy are actually helpful for predicting the right sentiment (think of "not" or "no"), and removing them can lead to poor model performance. Therefore it is best practice to inspect the stop-word lists shipped with these libraries before removing anything.
E.g., below is the list of stop words contained in the NLTK corpus.
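A minimal sketch for printing it (the snippet downloads the NLTK stop-word corpus on first run):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stop-word corpus

english_stopwords = stopwords.words("english")
print(len(english_stopwords))   # size of the list
print(english_stopwords[:20])   # a sample of the entries

# Negations such as "not" and "no" appear in this list, yet they can flip the
# sentiment of a sentence, so removing them blindly may hurt the model.
```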

Highlighting keywords

Similarly, removing all punctuation and non-alphanumeric characters can sometimes hurt the context available to the model. For example, #hashtags can carry extra signal: they can be kept as tokens, given extra weight, or passed through NER to help the model classify better.
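A minimal sketch of such a cleanup, which drops links and stray punctuation but keeps hashtag tokens intact (the regex patterns here are illustrative, not a prescription):

```python
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)     # drop HTTP links
    text = re.sub(r"[^A-Za-z0-9#\s]", " ", text)  # drop punctuation, keep '#'
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("Loved the new update!! #awesome https://example.com"))
# -> "loved the new update #awesome"
```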

Reducing Vocabulary size

If the texts are long, the same word is likely to appear in several forms that carry essentially the same meaning.
E.g., in the code below, a list of 5 word forms is reduced to 1 stem, which preserves the relevant meaning of the word while shrinking the model's vocabulary.
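A minimal sketch with NLTK's PorterStemmer (the word list is illustrative):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connect", "connected", "connecting", "connection", "connections"]

stems = {stemmer.stem(w) for w in words}
print(stems)  # {'connect'} -- five surface forms reduced to one token
```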

As the example above shows, stemming and lemmatization can reduce the vocabulary (and hence the word-vector size) significantly while preserving context. A custom stemmer is often preferable for correctly preserving the meaning of the sentences, and likewise a custom stop-word list tailored to the data at hand usually works better than a general-purpose one.

Type of Text and Target

Analyzing the type of text data we are dealing with helps in picking the right text-representation technique for the problem.
E.g., for document classification, TF-IDF vectors are usually a better choice than plain count vectors.
On the other hand, if we want to capture the meaning of the words and their relationship to the target, say for a movie review, then Word2Vec-style embeddings will be a better choice than a hashing vectorizer.
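A minimal sketch contrasting the two representations with scikit-learn (the toy corpus is made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the invoice is attached to the email",
    "the meeting is moved to Friday",
    "the invoice for the meeting room is overdue",
]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

# TF-IDF down-weights terms like "the" that occur in every document,
# which usually helps when classifying longer documents.
print(counts.toarray())
print(tfidf.toarray().round(2))
```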

Type of Model

In most cases, models are deployed on cloud servers, and we don't want to spend unnecessary computing power and resources.
Therefore, if the feature vectors are not too large after all the text pre-processing, we can start with scikit-learn classifiers like logistic regression, Naive Bayes, random forest, or SVM, as in the sketch below.
If we need a deeper understanding of sentence meaning, we can move to deep learning methods that feed the sequence of text and its context into word embeddings, using architectures like RNNs and LSTMs, or state-of-the-art models like BERT.
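A minimal sketch of the lightweight route, assuming scikit-learn and a tiny made-up dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; real training data would be much larger.
texts = [
    "the battery life is fantastic",
    "screen cracked within a week",
    "absolutely love this phone",
    "worst purchase I have made",
]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features feeding a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["the screen is already cracked"]))
```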

Conclusion

We have walked through various checks that can help narrow down the type of model or algorithm needed to achieve good accuracy on the problem.

If you have any thoughts, ideas, and suggestions, do share them and let me know. Thanks for reading!
