Text Normalization




Examples of text normalization:

  • Unicode normalization

  • --- Unicode NFC (Normalization Form Canonical Composition), where a base character and its combining accents are canonically composed into a single precomposed codepoint wherever one exists.

  • --- Unicode NFD (Normalization Form Canonical Decomposition), where precomposed characters are canonically decomposed, usually into a base character followed by its combining accents as separate codepoints.

  • --- Unicode NFKC (Normalization Form Compatibility Composition), which is NFC plus compatibility mappings: some lookalike characters, such as half-width and full-width variants of Kana characters, are mapped together.

  • --- Unicode NFKD (Normalization Form Compatibility Decomposition), which is NFD plus the same compatibility mappings: some lookalike characters, such as half-width and full-width variants of Kana characters, are mapped together.

  • converting all letters to lower or upper case

  • removing punctuation

  • removing accent marks and other diacritics from letters

  • expanding abbreviations
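
The four Unicode forms listed above can be compared with Python's standard unicodedata module. The following is a minimal sketch, not a definitive implementation: the helper name normalization_forms and the sample strings are illustrative assumptions, while unicodedata.normalize is the standard library call.

```python
import unicodedata


def normalization_forms(text):
    """Return the text in each of the four Unicode normalization forms.

    Hypothetical helper for illustration only.
    """
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


# "é" written as one precomposed codepoint, and as "e" plus a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"

# NFC composes and NFD decomposes, so both spellings meet in either form.
print(normalization_forms(decomposed)["NFC"] == composed)    # True
print(normalization_forms(composed)["NFD"] == decomposed)    # True

# Half-width Katakana KA is left alone by NFC but mapped to the
# ordinary Katakana KA by the compatibility form NFKC.
halfwidth_ka = "\uff76"
katakana_ka = "\u30ab"
print(normalization_forms(halfwidth_ka)["NFC"] == halfwidth_ka)   # True
print(normalization_forms(halfwidth_ka)["NFKC"] == katakana_ka)   # True
```

Comparing strings only after converting both to the same form avoids treating visually identical text as different.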


While this may be done manually, and usually is in the case of ad hoc and personal documents, many programming languages provide mechanisms for text normalization.
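
As one illustration of such mechanisms, the sketch below chains several of the normalization steps from this article using only the Python standard library. It is a simplified example under stated assumptions: the function names and the tiny ABBREVIATIONS table are hypothetical, only ASCII punctuation is removed, and real abbreviation expansion needs context (for example, "St." may mean "street" or "saint").

```python
import string
import unicodedata

# Hypothetical lookup table, for illustration only.
ABBREVIATIONS = {"dr": "doctor", "st": "street"}


def strip_diacritics(text):
    """Decompose with NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_text(text):
    """Apply the steps in a fixed order and return the normalized text."""
    text = unicodedata.normalize("NFKC", text)   # Unicode normalization
    text = text.casefold()                       # aggressive lower-casing
    text = strip_diacritics(text)                # remove accent marks
    # Remove ASCII punctuation only (an assumption of this sketch).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Expand abbreviations word by word using the hypothetical table.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(normalize_text("Dr. José lives on Main St."))
# doctor jose lives on main street
```

The order of the steps matters here: punctuation is removed before abbreviations are looked up so that "Dr." matches the table entry "dr".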

