As a human, when you read a text, you remember only the important words.

You keep the key information in mind, not the text word for word.

Similarly, we want the model to process only the important words and skip the ones that do not help with prediction.

The words we skip are called stop words.

Removing stop words is one step of text pre-processing.
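To make the idea concrete, here is a minimal sketch of stop word removal in Python. The `STOP_WORDS` set and the `remove_stop_words` function are hypothetical names made up for this illustration, and the list is a tiny hand-picked sample; real projects usually rely on a much longer list from an NLP library.

```python
import re
from typing import List

# Hypothetical, hand-picked stop word list (illustration only, far from complete).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "it", "that", "this"}


def remove_stop_words(text: str) -> List[str]:
    """Lowercase the text, split it into simple word tokens, and drop stop words."""
    # Very simple tokenizer: runs of lowercase letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(remove_stop_words("The cat is sitting on the mat in the kitchen"))
# ['cat', 'sitting', 'mat', 'kitchen']
```

Notice that the words left over still carry the meaning of the sentence, which is exactly the point of this step.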

Let me give you a pipeline that will help you understand the whole pre-processing of text, before feeding it into any language model.

Text Pre-Processing Pipeline

We will start with raw text data, and at the end of the process you will see what the data looks like.

As you already know, LLMs such as GPT were trained on internet data.

You cannot just feed that raw internet text directly to the model.