Sunday, March 06, 2022

Ruminating on Text Normalization

Text normalization is the process of converting text to a standard form before we use it to train NLP models. The following techniques are typically used to normalize text.

Tokenization: Tokenization is the process of breaking sentences down into words. In many Latin-derived languages, the space character is treated as the word delimiter, but there are special cases such as 'New York' and 'Rock-n-Roll'. Chinese and Japanese, on the other hand, do not put spaces between words at all. We may also want to tokenize emoticons and hashtags as single tokens.
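
A quick sketch of this with NLTK (the library choice and sample text are mine; the 'punkt' models must be downloaded first via nltk.download('punkt')):

```python
# Tokenization sketch using NLTK. TweetTokenizer is handy for
# social-media text because it preserves hashtags and emoticons.
from nltk.tokenize import word_tokenize, TweetTokenizer

text = "I love New York! #nyc :-)"

# The standard tokenizer splits on whitespace and punctuation,
# so it breaks '#nyc' and the emoticon into separate pieces.
print(word_tokenize(text))

# TweetTokenizer keeps '#nyc' and ':-)' as single tokens.
print(TweetTokenizer().tokenize(text))
```

Note that neither tokenizer keeps 'New York' together as one token; handling multi-word expressions like that needs a separate step.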

Lemmatization: In this process, we check whether words share the same 'root' (lemma) - e.g. sings and sang both map to sing. We then normalize each word to that common root word.
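
As a minimal sketch with NLTK's WordNet lemmatizer (assumes nltk.download('wordnet') has been run):

```python
# Lemmatization sketch using NLTK's WordNet lemmatizer.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The part-of-speech hint matters: pos='v' tells WordNet
# to treat these words as verbs.
print(lemmatizer.lemmatize("sings", pos="v"))  # -> sing
print(lemmatizer.lemmatize("sang", pos="v"))   # -> sing
```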

Stemming: Stemming can be considered a cruder form of lemmatization in which we just strip suffixes from the end of the word - e.g. 'troubled' and 'troubles' are both stemmed to 'troubl'.
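
The same idea as a sketch with NLTK's Porter stemmer:

```python
# Stemming sketch using NLTK's Porter stemmer. Note that the
# output ('troubl') is not a dictionary word - stemming only
# chops suffixes, it does not look anything up.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["trouble", "troubled", "troubles"]:
    print(stemmer.stem(word))  # all three -> troubl
```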

Lemmatization is more computationally intensive than stemming because it actually looks the word up in a dictionary to find the root word, whereas stemming uses a crude heuristic that chops off the ends of words in the hope of arriving at the root. Stemming is thus much faster when you are dealing with a large corpus of text. The following examples, and the short sketch after them, make the difference clear.

  • The word "better" has "good" as its lemma. Stemming misses this link, because finding it requires a dictionary look-up.
  • The word "walk" is the base form of "walking", and both stemming and lemmatization recover it.
  • If you lemmatize the word 'caring', it returns 'care'. An aggressive stemmer such as the Lancaster stemmer chops it to 'car', which is erroneous.
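
Putting the two side by side in one sketch (again with NLTK; the 'wordnet' corpus is assumed to be downloaded):

```python
# Stemming vs. lemmatization on the three examples above.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# pos tags: 'a' = adjective, 'v' = verb.
for word, pos in [("better", "a"), ("walking", "v"), ("caring", "v")]:
    print(f"{word}: stem={stemmer.stem(word)}, "
          f"lemma={lemmatizer.lemmatize(word, pos=pos)}")

# better:  stem=better, lemma=good  (stemming misses the link)
# walking: stem=walk,   lemma=walk
# caring:  stem=care,   lemma=care  (Porter is gentle here; the more
#          aggressive Lancaster stemmer reduces 'caring' to 'car')
```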

Sentence Segmentation: This entails breaking a long run of text into individual sentences using punctuation such as '.', '!', '?' and sometimes ';'. The period is the tricky one, since it also appears in abbreviations and decimal numbers.
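
A sketch with NLTK's punkt sentence tokenizer, which is trained to handle tricky cases like abbreviations:

```python
# Sentence segmentation sketch using NLTK (assumes the 'punkt'
# models have been downloaded).
from nltk.tokenize import sent_tokenize

text = "Dr. Smith lives in New York. He moved there in 2020! Does he like it?"
for sentence in sent_tokenize(text):
    print(sentence)

# Dr. Smith lives in New York.
# He moved there in 2020!
# Does he like it?
# Note: the period after 'Dr.' did not trigger a split.
```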

Spelling Correction and UK/US differences: As part of the normalization activity, we may also want to correct common spelling mistakes and normalize the spelling differences between UK and US English, like neighbour/neighbor.
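
A minimal sketch of the UK-to-US normalization with a hand-built mapping (the table and helper name here are illustrative; a real pipeline would use a much larger word list):

```python
# UK -> US spelling normalization via a simple lookup table.
UK_TO_US = {
    "neighbour": "neighbor",
    "colour": "color",
    "organise": "organize",
}

def normalize_spelling(tokens):
    """Replace known UK spellings with their US equivalents."""
    return [UK_TO_US.get(token.lower(), token) for token in tokens]

print(normalize_spelling(["My", "neighbour", "loves", "colour"]))
# -> ['My', 'neighbor', 'loves', 'color']
```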
