N-gram




An ''n''-gram is a contiguous sub-sequence of ''n'' items from a given sequence. ''n''-grams are used in various areas of statistical natural language processing and genetic sequence analysis. Depending on the application, the items can be letters, words, or base pairs.

An ''n''-gram of size 1 is a "unigram"; size 2 is a "bigram" (or, more etymologically sound but less commonly used, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "''n''-gram". Some language models built from ''n''-grams are "(''n'' − 1)-order Markov models".
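The definition above can be illustrated with a minimal Python sketch. The function name `ngrams` and the sample inputs are illustrative assumptions, not part of any particular library; the same function handles letters (a string) or words (a list of tokens).

```python
def ngrams(items, n):
    """Return every contiguous run of n items from `items`, as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 n-grams (none if L < n).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Letter-level bigrams: the items are the characters of a string.
letter_bigrams = ngrams("to be", 2)

# Word-level trigrams: the items are whitespace-separated tokens.
word_trigrams = ngrams("to be or not to be".split(), 3)
```

Splitting on whitespace is only a stand-in for real tokenization, which varies by application.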


EXAMPLES


Here are examples of ''word''-level 3-grams and 4-grams (with the number of times each appeared) from the Google n-gram corpus.

3-grams

  • ceramics collectables collectibles (55)

  • ceramics collectables fine (130)

  • ceramics collected by (52)

  • ceramics collectible pottery (50)

  • ceramics collectibles cooking (45)


4-grams

  • serve as the incoming (92)

  • serve as the incubator (99)

  • serve as the independent (794)

  • serve as the index (223)

  • serve as the indication (72)

  • serve as the indicator (120)
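Counts like those listed above could be tallied along the following lines. This is a rough sketch on a made-up toy sentence, not the procedure actually used to build the Google corpus; `count_ngrams` is a hypothetical helper name.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Map each word-level n-gram (as a tuple) to its number of occurrences."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy input (invented for illustration), split on whitespace.
tokens = ("serve as the index and serve as the indicator "
          "and serve as the index").split()
counts = count_ngrams(tokens, 4)

index_count = counts[("serve", "as", "the", "index")]
indicator_count = counts[("serve", "as", "the", "indicator")]
```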



''N''-GRAM MODELS

An ''n''-gram model models sequences, notably natural languages, using the statistical properties of ''n''-grams.

This idea can be traced to an experiment in Claude Shannon's work on information theory. His question was: given a sequence of letters (for example, the sequence "for ex"), what is the likelihood of the next letter? From training data, one can derive a probability distribution for the next letter given a history of size ''n'': ''a'' = 0.4, ''b'' = 0.00001, ''c'' = 0, ...; where the probabilities of all possible "next letters" sum to 1.0.
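As a hedged illustration of deriving such a distribution, the sketch below counts which letter follows each history of ''n'' letters in a tiny invented training string and normalizes the counts into probabilities. The function names and the training text are assumptions made for this example; real models use far larger corpora and usually some form of smoothing, which is omitted here.

```python
from collections import Counter, defaultdict

def train(text, n):
    """Count which letter follows each history of n letters in `text`."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n):
        history, next_letter = text[i:i + n], text[i + n]
        counts[history][next_letter] += 1
    return counts

def next_letter_distribution(counts, history):
    """Relative frequency of each letter seen after `history`.

    Letters never seen after `history` are simply absent (probability 0).
    """
    followers = counts.get(history)
    if not followers:
        return {}
    total = sum(followers.values())
    return {letter: c / total for letter, c in followers.items()}

# Invented training text; the history "for ex" has size n = 6.
model = train("for example, for exams, for experts", 6)
dist = next_letter_distribution(model, "for ex")
```

In this toy text "for ex" is followed twice by ''a'' and once by ''p'', so those are the only letters with non-zero probability, and the probabilities sum to 1.0.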