String metric
In mathematics and computer science, a string metric is a metric that measures distance between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close. A string metric provides a number indicating an algorithm-specific indication of distance.
The most widely known string metric is a rudimentary one called the Levenshtein distance. It operates between two input strings, returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.
String metrics are used heavily in information integration and are currently used in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database data deduplication, data mining, incremental search, data integration, and semantic knowledge integration.
List of string metrics
- Levenshtein distance, or its generalization edit distance
- Damerau–Levenshtein distance
- Sørensen–Dice coefficient
- Block distance or L1 distance or City block distance
- Hamming distance
- Jaro–Winkler distance
- Simple matching coefficient
- Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
- Tversky index
- Overlap coefficient
- Variational distance
- Hellinger distance or Bhattacharyya distance
- Information radius
- Skew divergence
- Confusion probability
- Tau metric, an approximation of the Kullback–Leibler divergence
- Fellegi and Sunters metric
- Maximal matches
- Grammar-based distance
- TFIDF distance metric
Selected string measures examples
Name | Example |
Hamming distance | "'" and "'" is 3. |
Levenshtein distance and Damerau–Levenshtein distance | and have a distance of 3.
|
Jaro–Winkler distance | JaroWinklerDist =
|
Most frequent k characters | MostFreqKeySimilarity = 2 |