Information About N-gram |
| CATEGORIES ABOUT N-GRAM | |
| natural language processing | |
| computational linguistics | |
| speech recognition | |
| corpus linguistics | |
|
An ''n''-gram is a contiguous subsequence of ''n'' items from a given sequence. ''n''-grams are used in various areas of statistical natural language processing and genetic sequence analysis. The items in question can be letters, words or base pairs, according to the application. An ''n''-gram of size 1 is a "unigram"; size 2 is a "bigram" (or, more etymologically sound but less commonly used, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "''n''-gram". Some language models built from ''n''-grams are "(''n'' − 1)-order Markov models".

EXAMPLES

Here are examples of ''word''-level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google N-gram Corpus.
4-grams
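As a small illustration of how word-level and letter-level ''n''-grams and their counts can be obtained, here is a minimal Python sketch. The function name `ngrams` and the sample sentence are assumptions made for this example; they are not taken from the Google N-gram Corpus or from any particular library.

```python
from collections import Counter


def ngrams(items, n):
    """Return every contiguous window of n items as a tuple.

    `items` can be any sequence: a list of words, a string of letters,
    or a string of base pairs. (Hypothetical helper for illustration.)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows of size n (none if L < n).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word-level 3-grams with their counts, from a made-up sample sentence.
words = "to be or not to be".split()
trigram_counts = Counter(ngrams(words, 3))

# Letter-level 2-grams (bigrams) from a short string.
letter_bigrams = ngrams("to_be", 2)

for gram, count in trigram_counts.items():
    print(" ".join(gram), count)
```

The same function works for words, letters or base pairs because it relies only on slicing, which is why the items are left as a generic sequence.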
''N''-GRAM MODELS

An ''n''-gram model models sequences, notably natural languages, using the statistical properties of ''n''-grams. This idea can be traced to an experiment in Claude Shannon's work in information theory. His question was: given a sequence of letters (for example, the sequence "for ex"), what is the likelihood of the next letter? From training data, one can derive a probability distribution for the next letter given a history of size ''n'': ''a'' = 0.4, ''b'' = 0.00001, ''c'' = 0, ...; where the probabilities of all possible "next-letters" sum to 1.0.
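The next-letter question can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the tiny training string and the use of plain relative frequencies (no smoothing) are choices made for this example, not a description of Shannon's experiment or of any particular toolkit.

```python
from collections import Counter, defaultdict


def train_next_letter_model(text, history_size):
    """Count which letter follows each history of `history_size` letters."""
    counts = defaultdict(Counter)
    for i in range(history_size, len(text)):
        history = text[i - history_size:i]
        counts[history][text[i]] += 1
    return counts


def next_letter_distribution(counts, history):
    """Turn the raw counts for one history into relative frequencies.

    Letters never seen after this history are simply absent, which
    corresponds to probability 0. An unseen history yields an empty dict.
    """
    seen = counts.get(history)
    if not seen:
        return {}
    total = sum(seen.values())
    return {letter: c / total for letter, c in seen.items()}


# Hypothetical training data, chosen so that "for ex" occurs several times.
training_text = "for example, for exams, for extra"
model = train_next_letter_model(training_text, history_size=6)
dist = next_letter_distribution(model, "for ex")

print(dist)                # probabilities for each observed next letter
print(sum(dist.values()))  # the probabilities sum to 1.0 (up to rounding)
```

Because the estimate is a simple relative frequency, any letter that never followed the history in the training data receives probability 0, as with ''c'' in the example above.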