Information About

N-gram




More concisely, an n-gram is a model of predicting x_{i} based on x_{i-1}, x_{i-2}...x_{i-n}. In application to Language Modeling , because of computational limitations and the open nature of language (there are infinite possible words), independence assumptions are made so that each word depends only on the last n words, making it a good Markov Model .

An n-gram of size 1 is a "unigram"; size 2 is a " Bigram " (or, more appropriately but less commonly, a digram); size 3 is a " Trigram "; and size 4 or more is simply called an "n-gram" or "n-1 order Markov Model".

N-grams are a popular technique in statistical Natural Language Processing . In Speech Recognition , Phonemes are modeled using a n-gram distribution. For parsing, words are modeled such that each n-gram is composed of n words. For a sequence of words, (for example "the dog smelled like a skunk"), the trigrams would be: "the dog smelled", "dog smelled like", "smelled like a", and "like a skunk". For sequences of characters, the 3-grams (sometimes spelled "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth. Some practitioners preprocess strings to remove spaces, others do not. In almost all cases, punctuation is removed by preprocessing. N-grams can also be used for sequences of words or, in fact, for almost any type of data. They have been used for example for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from.

By converting an original sequence of items to N-grams, it can be embedded in a Vector Space (in other words, represented as a Histogram ), thus allowing the sequence to be compared to other sequences in an efficient manner. For example, if we convert strings with only letters in the English alphabet into 3-grams, we get a 26^3-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Note that using this representation we lose information about the string. For example, both the strings "abcba" and "bcbab" give rise to exactly the same 2-grams. However, we know empirically that if two strings of real text have a similar vectorial representation (such as a small Cosine Distance ) then they are likely to be similar. Other metrics have also been applied to vectors of N-grams with varying, sometimes better, results. For example Z-scores have been used to compare documents by examining how many standard deviations each N-gram differs from its mean occurrence in a large collection, or Corpus , of documents (which form the "background" vector).

N-grams find use in several areas of computer science, computational linguistics, and applied mathematics. They are a commonly used technique to design Kernels that allow Machine Learning algorithms such as Support Vector Machine s to learn from string data. N-grams can also be used to find likely candidates for the correct spelling of a misspelled word. Also in compression algorithms where a small area of data requires N-grams of greater length to improve compression. N-grams are often used in pattern recognition systems in order to assess the probability of a given word sequence appearing in text of the language of interest. This capability can be useful in Speech Recognition , OCR ( Optical Character Recognition ), Intelligent Character Recognition ( ICR ), Machine Translation and similar applications where a system is trying to choose what the next item (letter, word, phoneme, etc.) might be from among a list of the best candidates. They are also used in information retrieval when it is necessary to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents.


BIAS VS. VARIANCE TRADE-OFF

What goes into picking the n for the n-gram?

There are problems of assigning too much weight to ''infrequent grams'' (for example, if a proper name appeared in the training data) and too much weight to ''frequent grams''. Also, items not seen in the training data will be given a probability of 0.0 without smoothing


SMOOTHING TECHNIQUES



SEE ALSO



EXTERNAL LINKS