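For sets ''X'' and ''Y'' of keywords used in information retrieval, the coefficient may be defined as (van Rijsbergen, 1979):
:<math>s = \frac{2 |X \cap Y|}{|X| + |Y|}</math>
When taken as a string similarity measure, the coefficient may be calculated for two strings ''x'' and ''y'' using character bigrams:
:<math>s = \frac{2 n_t}{n_x + n_y}</math>
where <math>n_t</math> is the number of character bigrams found in both strings, <math>n_x</math> is the number of bigrams in string ''x'' and <math>n_y</math> is the number of bigrams in string ''y''. For example, to calculate the similarity between: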
:night
:nacht
we would find the set of bigrams in each word:
:{ni, ig, gh, ht}
:{na, ac, ch, ht}
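Each set has 4 elements, and the intersection of these two sets has only one element: ht. Taking twice the size of the intersection over the sum of the two set sizes, the coefficient is
:<math>s = \frac{2 \cdot 1}{4 + 4} = 0.25</math>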
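The worked example above can be sketched in Python. This is a minimal illustration, not a reference implementation: it assumes the coefficient has the form 2·n_t / (n_x + n_y), that bigrams are collected as sets (so a repeated bigram counts once, as in the example), and that two strings with no bigrams at all score 0.0. The function names `bigrams` and `bigram_similarity` are hypothetical.

```python
def bigrams(word: str) -> set:
    """Return the set of adjacent character pairs in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def bigram_similarity(x: str, y: str) -> float:
    """Compute 2 * n_t / (n_x + n_y) over the bigram sets of x and y."""
    bx, by = bigrams(x), bigrams(y)
    if not bx and not by:
        # Assumption: strings too short to contain any bigram score 0.0,
        # which avoids dividing by zero.
        return 0.0
    shared = bx & by  # bigrams found in both strings
    return 2 * len(shared) / (len(bx) + len(by))


print(sorted(bigrams("night")))             # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))             # ['ac', 'ch', 'ht', 'na']
print(bigram_similarity("night", "nacht"))  # 0.25
```

Using sets mirrors the example, where each word contributes four distinct bigrams. A variant that counts repeated bigrams would need multisets (for instance `collections.Counter`) and can give different values for words with repeated character pairs.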
- van Rijsbergen, C. J. (1979) ''Information Retrieval'', London: Butterworths
- Kondrak, G., Marcu, D. and Knight, K. (2003) "Cognates Can Improve Statistical Translation Models" in ''Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics'', pp. 46–48