Sunday, March 06, 2022

Ruminating on n-gram models

N-grams are a fundamental concept in NLP and are used in many language models. In simple terms, an N-gram is nothing but a sequence of N words - e.g. "San Francisco" is a 2-gram and "The Three Musketeers" is a 3-gram.
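To make the definition concrete, here is a minimal sketch of pulling word-level N-grams out of a piece of text. The function name `ngrams` and the plain whitespace tokenization are my own simplifying assumptions; a real tokenizer would also deal with punctuation and casing.

```python
def ngrams(text, n):
    """Return all word-level n-grams in text as a list of tuples."""
    words = text.split()  # naive whitespace tokenization (assumption)
    # Slide a window of n words across the text; if the text has fewer
    # than n words, the range is empty and so is the result.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = ngrams("The Three Musketeers", 2)
trigrams = ngrams("The Three Musketeers", 3)
print(bigrams)
print(trigrams)
```

A three-word title contains two 2-grams and exactly one 3-gram, which is why "The Three Musketeers" as a whole counts as a 3-gram.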

N-grams are very useful because they can be used for predicting the next word and for correcting spelling or grammar. Ever wondered how Gmail is able to suggest auto-completion of sentences? This is possible because Google has created a language model that can predict the next words.

N-grams are also used for correcting spelling errors - e.g. “drink cofee” could be corrected to “drink coffee” because the language model can predict that 'drink' and 'coffee' have a high probability of occurring together. Also, the 'edit distance' between 'cofee' and 'coffee' is 1 (a single inserted letter), which suggests that 'cofee' is a typo of 'coffee'.
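The two signals mentioned here (edit distance and how likely two words are to appear together) can be combined in a rough sketch like the one below. The edit distance is the standard Levenshtein distance; the vocabulary, the bigram counts and the `correct` helper are made-up illustrations, not data or code from any real spell checker.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# Toy data, invented purely for illustration.
VOCAB = {"drink", "coffee", "coffer", "fee", "tea"}
BIGRAM_COUNTS = Counter({("drink", "coffee"): 50, ("drink", "tea"): 40})


def correct(prev_word, word, max_distance=2):
    """Pick the vocabulary word close to `word` that most often follows `prev_word`."""
    if word in VOCAB:
        return word
    candidates = [w for w in VOCAB if edit_distance(word, w) <= max_distance]
    if not candidates:
        return word  # nothing close enough, leave the word unchanged
    return max(candidates, key=lambda w: BIGRAM_COUNTS[(prev_word, w)])


print(edit_distance("cofee", "coffee"))
print(correct("drink", "cofee"))
```

With this toy data, 'coffee', 'coffer' and 'fee' are all within two edits of 'cofee', and it is the bigram count for ('drink', 'coffee') that settles the choice.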

Thus N-grams are used to create probabilistic language models called n-gram models. An N-gram model predicts the occurrence of a word based on the N - 1 words that precede it.
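For N = 2 (a bigram model) this boils down to counting: the probability of a word given the previous word is estimated as count(previous word, word) / count(previous word followed by anything). Below is a small sketch of that idea on a three-sentence toy corpus. The corpus and the function names are assumptions made for illustration, and real models add smoothing so that unseen word pairs do not get zero probability.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(model, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev, anything)."""
    followers = model.get(prev, Counter())
    total = sum(followers.values())
    return followers[word] / total if total else 0.0


def predict_next(model, prev):
    """Return the most frequent follower of prev, or None if prev was never seen."""
    followers = model.get(prev)
    return followers.most_common(1)[0][0] if followers else None


corpus = ["i drink coffee", "i drink tea", "you drink coffee"]  # toy corpus
model = train_bigram_model(corpus)
print(predict_next(model, "drink"))
print(next_word_probability(model, "drink", "coffee"))
```

In this corpus 'drink' is followed by 'coffee' twice and by 'tea' once, so the model prefers 'coffee' with an estimated probability of 2/3.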

The choice of 'N' depends on the type of analysis we want to do - e.g. research has shown that trigrams and 4-grams work best for spam filtering.

Some good info on N-grams is available at the Stanford University site - https://web.stanford.edu/~jurafsky/slp3/slides/LM_4.pdf

Google Books also has an "Ngram Viewer" that displays a graph showing how often the phrases you enter have occurred in a corpus of books (e.g., "British English", "English Fiction", "French") over the selected years. I found this to be useful in understanding which topics were popular in which years: https://books.google.com/ngrams
