Fuzzy String Searching Article Index for
Fuzzy
Website Links For
Fuzzy
 

Information About

Fuzzy String Searching




Finding Strings that approximately match some given pattern string. It may also be known as approximate or inexact matching.

Approximate string searching has two different flavors:
:finding one or more matching substrings of a text, and
:finding similar strings in a string set, often referred to as a dictionary.
Approximate string searching has many application areas including
Information Retrieval , Spellchecking and Computational Biology .


SIMILARITY FUNCTIONS

The cornerstone of any approximate searching method is a Similarity function or ''' String Metric '''. Among the most commonly used similarity functions are Levenshtein Distance (a type of Edit Distance ) and N-gram Distance . The latter is based on counting of the number of common N-gram s, and is used mostly for filtering. In contrast to n-gram distance, Levenshtein Distance is a de-facto standard similarity function. It has several extensions. One well known extension is Damerau-Levenshtein Distance that counts Transposition as a single edit operation. Another extension is the so-called generalized or weighted Levenshtein distance. It assigns different costs to elementary edit operations. Ukkonen described even more sophisticated similarity function where edit operations go beyond single-character insertions, deletions and substitutions and include substitutions of arbitrary-length strings.


ON-LINE VS. OFF-LINE

Traditionally, approximate string matching algorithms
are classified into two categories: on-line and off-line. With on-line algorithms
the pattern can be preprocessed before searching but the text cannot.
In other words, on-line techniques do searching without indexation. Early
algorithms for on-line approximate matching were suggested by Wagner and
Fisher and by Sellers. Both algorithms are based on
Dynamic Programming but solve different problems. Sellers' algorithm
searches approximately for a substring in a text while the algorithm of Wagner
and Fisher calculates Levenshtein Distance , being appropriate for dictionary fuzzy search only.

On-line searching techniques were repeatedly improved. Perhaps the most
famous improvement is the Bitap Algorithm (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. Bitap Algorithm is the heart of Unix searching Utility Agrep . An excellent review of on-line searching algorithms was done by G. Navarro.

Although very fast on-line techniques exist, their
performance on large data is unacceptable.
Text preprocessing or Indexing makes searching dramatically faster.
Today, a variety of indexing algorithms are presented. Among them are Suffix Tree s, Metric Tree s and N-gram methods.
For a detailed list of indexing techniques see the paper of Navarro ''et al.''


SEE ALSO



REFERENCES