| Fuzzy String Searching |
Article Index for Fuzzy |
Website Links For Fuzzy |
Information AboutFuzzy String Searching |
| CATEGORIES ABOUT FUZZY STRING SEARCHING | |
| searching | |
| algorithms on strings | |
|
Finding Strings that approximately match some given pattern string. It may also be known as approximate or inexact matching. Approximate string searching has two different flavors: :finding one or more matching substrings of a text, and :finding similar strings in a string set, often referred to as a dictionary. Approximate string searching has many application areas including Information Retrieval , Spellchecking and Computational Biology . SIMILARITY FUNCTIONS The cornerstone of any approximate searching method is a Similarity function or ''' String Metric '''. Among the most commonly used similarity functions are Levenshtein Distance (a type of Edit Distance ) and N-gram Distance . The latter is based on counting of the number of common N-gram s, and is used mostly for filtering. In contrast to n-gram distance, Levenshtein Distance is a de-facto standard similarity function. It has several extensions. One well known extension is Damerau-Levenshtein Distance that counts Transposition as a single edit operation. Another extension is the so-called generalized or weighted Levenshtein distance. It assigns different costs to elementary edit operations. Ukkonen described even more sophisticated similarity function where edit operations go beyond single-character insertions, deletions and substitutions and include substitutions of arbitrary-length strings. ON-LINE VS. OFF-LINE Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line. With on-line algorithms the pattern can be preprocessed before searching but the text cannot. In other words, on-line techniques do searching without indexation. Early algorithms for on-line approximate matching were suggested by Wagner and Fisher and by Sellers. Both algorithms are based on Dynamic Programming but solve different problems. Sellers' algorithm searches approximately for a substring in a text while the algorithm of Wagner and Fisher calculates Levenshtein Distance , being appropriate for dictionary fuzzy search only. On-line searching techniques were repeatedly improved. Perhaps the most famous improvement is the Bitap Algorithm (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. Bitap Algorithm is the heart of Unix searching Utility Agrep . An excellent review of on-line searching algorithms was done by G. Navarro. Although very fast on-line techniques exist, their performance on large data is unacceptable. Text preprocessing or Indexing makes searching dramatically faster. Today, a variety of indexing algorithms are presented. Among them are Suffix Tree s, Metric Tree s and N-gram methods. For a detailed list of indexing techniques see the paper of Navarro ''et al.'' SEE ALSO
REFERENCES |
|
|