It is a simple pipeline with three stages →

  1. Text Normalization
  2. Tokenization
  3. Tokens to IDs
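
As a rough sketch of how the three stages chain together, the snippet below wires up placeholder versions of each one. The function names (`normalize`, `tokenize`, `tokens_to_ids`), the tiny vocabulary, and the `<unk>` fallback entry are all assumptions made for illustration, not part of any particular library.

```python
def normalize(text: str) -> str:
    # Placeholder normalization: lowercase and collapse repeated whitespace only.
    return " ".join(text.lower().split())


def tokenize(text: str) -> list[str]:
    # Placeholder tokenization: split on whitespace.
    return text.split()


def tokens_to_ids(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    # Tokens missing from the vocabulary fall back to the hypothetical "<unk>" ID.
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]


# Hypothetical toy vocabulary; a real one is built from a training corpus.
vocab = {"<unk>": 0, "hello": 1, "world": 2}

ids = tokens_to_ids(tokenize(normalize("Hello   WORLD again")), vocab)
print(ids)  # [1, 2, 0]
```

Each stage only consumes the output of the previous one, so any of them can be swapped for a more capable version without touching the others.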

1 → Text Normalization

Consider the following sentence →

"Celebration         of Worldcup winning !!"

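You are going to apply normalization steps to it. Comparing the input with the output, these are: lowercasing, collapsing the repeated whitespace, removing punctuation, and reducing each word to a base form.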

When you apply them, the output will look like this →

"celebrate of worldcup win"
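
The transformation above can be reproduced with a minimal sketch like the one below. It assumes four steps (lowercase, strip punctuation, collapse whitespace, reduce to a base form). The `BASE_FORMS` table is a hypothetical stand-in for a real stemmer or lemmatizer, which would generally produce different base forms than the ones shown in this example.

```python
import re

# Hypothetical lookup table standing in for a real stemmer or lemmatizer.
BASE_FORMS = {"celebration": "celebrate", "winning": "win"}


def normalize(text: str) -> str:
    text = text.lower()                    # 1. lowercase
    text = re.sub(r"[^\w\s]", " ", text)   # 2. replace punctuation with spaces
    words = text.split()                   # 3. split on whitespace, dropping extra spaces
    words = [BASE_FORMS.get(word, word) for word in words]  # 4. map to base form
    return " ".join(words)


print(normalize("Celebration         of Worldcup winning !!"))
# celebrate of worldcup win
```

The order matters here: lowercasing first means the lookup table only needs lowercase keys, and removing punctuation before splitting keeps stray symbols such as "!!" from becoming words of their own.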

2 → Tokenization

You are going to apply tokenization steps to the normalized sentence →