| Basic Multilingual Plane |
Unicode's Universal Character Set potentially supports over 1 million code points (1,114,112 = 2^20 + 2^16, ''or'' 17 × 2^16, hexadecimal 110000). As of Unicode 5.0.0, 102,012 (9.2%) of these code points are assigned, with another 137,468 (12.3%) reserved for private use, 2,048 for surrogates, and 66 designated noncharacters, leaving 872,582 (78.3%) unassigned. The number of assigned code points is made up as follows:
(See the Summary Table for a more detailed breakdown.)

Unicode characters can be categorized in many ways. Every character is assigned a ''script'' (though many are assigned the Common or Inherited script, in which case they take on the script of the adjacent character). In Unicode, a script is a coherent writing system that includes letters but may also include script-specific punctuation, diacritics and other marks, numerals, and symbols. A single script supports one or more languages.

Characters are assigned in ''blocks''. These blocks are usually groups of code points in some multiple of eight: many, for example, contain 128 or 256 code points. Every character is also assigned a ''general category'' and subcategory. The general categories are letter, mark, number, punctuation, symbol, separator, and other (control, formatting, and other non-graphical characters).

The blocks of characters are assigned to various ''planes''. Most characters are currently assigned to the first plane, the ''Basic Multilingual Plane''. This eases the transition for legacy software, since the Basic Multilingual Plane is addressable with just two octets. Characters outside the first plane are usually very specialized or seldom used.

The first 256 code points correspond to those of ISO 8859-1, the most popular 8-bit character encoding in the Western world. As a result, the first 128 characters are also identical to ASCII. Though Unicode refers to these as Latin script blocks, the two blocks contain many characters that are commonly useful outside the Latin script.

PLANES

Unicode code points can be logically divided into 17 ''planes'', each with 65,536 (= 2^16) code points, although currently only a few planes are used:
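The plane arithmetic is easy to check by hand. The following Python sketch is illustrative only: the helper names `plane_of` and `describe` are made up for this example, and the general category comes from the standard `unicodedata` module, whose data reflects whatever Unicode version ships with the interpreter rather than Unicode 5.0.

```python
import unicodedata

PLANE_SIZE = 1 << 16                   # 65,536 code points per plane
PLANE_COUNT = 17
CODE_SPACE = PLANE_COUNT * PLANE_SIZE  # 1,114,112 code points in total


def plane_of(code_point):
    """Return the plane number (0 to 16) that contains a code point."""
    if not 0 <= code_point < CODE_SPACE:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    # The plane is everything above the low 16 bits.
    return code_point >> 16


def describe(ch):
    """Return (plane, general category) for a single character."""
    return plane_of(ord(ch)), unicodedata.category(ch)


if __name__ == "__main__":
    for ch in ("A", "\u0301", "\U0001D11E", "\U00020000"):
        print("U+%04X" % ord(ch), describe(ch))
```

Plane 0 is the BMP, so `plane_of(ord(ch)) == 0` is one way to test whether a character fits in two octets.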
Currently, about ten percent of the potential space is used. Furthermore, ranges of characters have been tentatively blocked out for every current and ancient writing system (script) the Unicode Consortium has been able to identify. While Unicode may eventually need to use another of the spare 11 planes for ideographic characters, other planes remain available if previously unknown scripts with tens of thousands of characters are discovered. This 20-bit limit is therefore unlikely to be reached in the near future.

Basic Multilingual Plane

The first plane (plane 0), the ''Basic Multilingual Plane'' (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters. Most of the allocated code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters. The graphic on the right is a visual roadmap to the Basic Multilingual Plane. The colours in use are:
As of Unicode 5.0, the BMP includes the following scripts:

Future additions

Several scripts are expected to be included in the BMP in the next revision of Unicode. These scripts, and their proposed code point ranges, are the following:
Several other scripts are proposed for inclusion in the BMP, including:
Supplementary Multilingual Plane

Plane 1, the ''Supplementary Multilingual Plane'' (SMP), is mostly used for historic scripts such as Linear B, but is also used for musical and mathematical symbols.

Supplementary Ideographic Plane

Plane 2, the ''Supplementary Ideographic Plane'' (SIP), is used for about 40,000 unified Han ideographs that have seldom been used in daily written communication.

Unused planes

Unicode has not yet assigned any characters to planes 3 through 13. The current study of written language has not yet identified any need for these planes. However, symbol characters that arise outside the script-based writing systems could offer potentially limitless candidates for encoding. The UCS and Unicode take requests for symbols on a case-by-case basis.

Supplementary Special-purpose Plane

Plane 14 (''E'' in hexadecimal), the ''Supplementary Special-purpose Plane'' (SSP), currently contains non-graphical characters in two blocks of 128 and 240 characters. The first block is for language tag characters for use when language cannot be indicated through other protocols (such as the [...]

Besides the Unihan ideographs, Han unification also provides Han unified punctuation, symbols, numerals, ideograph stroke characters and ideographic description characters.

Phonetic characters

See Also: Unicode Phonetic Symbols

Unicode includes letters and marks from the International Phonetic Alphabet (IPA) and those supporting other phonetic writing systems. Essentially, these characters are used as graphemes for phonemes. In terms of script or writing system, these phonetic alphabets are basically one writing system; what distinguishes the various phonetic alphabets is their glyphs. However, as with numerals, the UCS character names often focus more on the presentational forms or glyphs given to these phonemes by the various phonetic alphabets.
This is in contrast to the alternate names of these characters provided by the Unicode NamesList property, which typically reflect the common phoneme semantics shared by those writing systems regardless of the glyphs used. These differences therefore show up in the two kinds of names given to these characters: the canonical UCS name and the NamesList property names. Similarly, Unicode assigns the value “Latin” to the script property of many of these characters. However, the primary purpose of these characters' inclusion in the character set is to support the various phonetic writing systems. In many ways these phonetic writing systems constitute a single unified writing system of their own, despite borrowing glyphs from the Latin, Greek and Cyrillic scripts.

Numerals

See Also: Unicode numerals

Numerals (often called numbers in Unicode) are characters that denote a number. The same Arabic-Indic numerals are used widely in various writing systems throughout the world, and all share the same semantics for denoting numbers. However, the glyphs representing these numerals differ widely from one writing system to another. To support these glyph differences, Unicode includes duplicate encodings of these numerals within many of the script blocks. These digits are repeated in 23 separate blocks, twice in Arabic. Six additional blocks contain the digits again as rich text or legacy software compatibility characters. Unicode also includes several less common numerals: Roman numerals, counting rod numerals, Cuneiform numerals and ancient Greek numerals.

Numerals invariably involve composition of glyphs, as a limited number of characters are composed to make other numerals. For example, the sequence 9 - 9 - 0 in Arabic-Indic numerals composes the numeral for nine hundred and ninety (990). In Roman numerals, the same number can be expressed by the composed numeral Ⅹↀ or ⅩⅯ (ten written before one thousand and read subtractively; the conventional spelling is ⅭⅯⅩⅭ). Each of these is a distinct numeral representing the same abstract number.
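The shared semantics of the duplicated digits, and the different way Roman numerals compose, can be sketched in Python. This is only an illustration: `decimal_value` and `roman_value` are hypothetical helpers written for this example, and the numeric values come from the standard `unicodedata` module.

```python
import unicodedata


def decimal_value(text):
    """Positional (base 10) value of a run of decimal digits from any script."""
    value = 0
    for ch in text:
        # unicodedata.decimal raises ValueError for non-decimal characters.
        value = value * 10 + unicodedata.decimal(ch)
    return value


def roman_value(text):
    """Sign-value reading of Roman numeral characters.

    A sign placed before a larger sign is subtracted, otherwise added.
    """
    values = [int(unicodedata.numeric(ch)) for ch in text]
    total = 0
    for i, v in enumerate(values):
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v
        else:
            total += v
    return total


if __name__ == "__main__":
    print(decimal_value("990"))                 # ASCII digits
    print(decimal_value("\u0669\u0669\u0660"))  # Arabic-Indic digits
    print(decimal_value("\u096F\u096F\u0966"))  # Devanagari digits
    print(roman_value("\u2169\u216F"))          # ROMAN NUMERAL TEN, ONE THOUSAND
```

All four calls denote the same abstract number even though the characters, and the rules for combining them, differ.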
The semantics of the numerals differ in particular in their composition. The Arabic-Indic decimal digits are positional-value compositions, while the Roman numerals are sign-value compositions, additive or subtractive depending on the order of the signs.

Punctuation and diacritics

Unicode includes several blocks for unified diacritics and other combining marks, and also blocks for unified punctuation. However, when a mark or punctuation character is intended primarily for use within a particular script, the character is assigned to that script's blocks. Authors will therefore find these types of characters throughout the Unicode character database. Unicode categorizes them as:
Symbols

Unicode has dozens of blocks dedicated to symbols that are useful regardless of one’s writing system. Other script-specific symbols are often included within a particular script’s blocks. Symbols are categorized as:
Music notation

Unicode now includes characters for music notation. Remarkably, this makes musical notation largely as transferable as plain text. Few basic text systems are likely to handle these music notation characters correctly; however, an interoperable interchange of musical notation is possible using Unicode alone.

COMPATIBILITY CHARACTERS

See Also: Unicode compatibility characters

In discussing Unicode and the UCS, many often refer to compatibility characters. Compatibility characters are graphical characters whose use is discouraged by the Unicode Consortium. As the Unicode Consortium says:
However, the definition is more complicated than the glossary reveals. One of the properties given to characters by the Unicode Consortium is the character's decomposition, or '''compatibility decomposition'''. Most characters have no value for this property, but over five thousand characters do have a compatibility decomposition, mapping that compatibility character to one or more other characters. By setting a character's decomposition property, Unicode establishes that character as a compatibility character. The reasons for these compatibility designations are varied and are discussed in further detail below. The term decomposition can sometimes confuse because a character's decomposition can, in some cases, be a singleton. In these cases the decomposition of one character is simply another equivalent or approximately equivalent character.

Canonical and Non-canonical

The compatibility decomposition property for the 5,402 Unicode compatibility characters includes a keyword that divides the compatibility characters into 17 logical groups. Those without a keyword are termed canonical equivalent or canonical decomposable characters. These characters have the closest relationship. The keywords are <initial>, <medial>, <final>, <isolated>, <wide>, <narrow>, <small>, <square>, <vertical>, <circle>, <noBreak>, <fraction>, <sub>, <super>, <font>, and <compat>. These keywords provide some indication of the relation between the compatibility character and its compatibility decomposition character sequence.
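The decomposition property and its keyword can be inspected with Python's standard `unicodedata` module. The sketch below is illustrative only: `decomposition_info` is a hypothetical helper, and the label "canonical" for mappings without a keyword is this example's own convention.

```python
import unicodedata


def decomposition_info(ch):
    """Return (keyword, code points) for a character's decomposition mapping.

    keyword is None when the character has no mapping at all, and
    "canonical" when the mapping carries no <keyword> tag.
    """
    raw = unicodedata.decomposition(ch)  # e.g. "<circle> 0031" or "0065 0301"
    if not raw:
        return None, []
    parts = raw.split()
    if parts[0].startswith("<"):
        keyword = parts[0].strip("<>")
        parts = parts[1:]
    else:
        keyword = "canonical"
    return keyword, [int(p, 16) for p in parts]


if __name__ == "__main__":
    for ch in ("A", "\u00E9", "\u00A0", "\u00B2", "\u2460", "\uFB01"):
        print("U+%04X" % ord(ch), decomposition_info(ch))
```

A singleton decomposition shows up as a list with one code point, for example the no-break space (U+00A0) mapping to the plain space.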
However, the compatibility characters, whether canonical or not, fall into three basic categories:
# characters corresponding to multiple alternate glyph forms and precomposed diacritics, to support software and font implementations that do not include complete Unicode text layout capabilities;
# characters included from other character sets or otherwise added to the UCS that constitute rich text rather than the plain text Unicode aims for;
# some other characters that are semantically distinct, but visually similar.

Because these semantically distinct characters may be displayed with glyphs similar to the glyphs of other characters, text processing software should try to address possible confusion for the sake of end users. When comparing and collating (sorting) text strings, different forms and rich text variants of characters should not alter the text processing results. For example, users may be confused when they search a page for a capital Latin letter ‘I’ and their software fails to find the visually similar Roman numeral ‘Ⅰ’.

Compatibility Blocks

Several blocks of Unicode characters consist entirely or almost entirely of compatibility characters. These compatibility blocks contain none of the semantically distinct compatibility characters, and so they fall unambiguously into the set of discouraged characters. Unicode recommends that authors use the plain text compatibility decomposition equivalents instead and complement those characters with rich text markup. This approach is much more flexible and open-ended than using, to give just one example, the finite set of circled or enclosed alphanumerics. Unfortunately, a small number of characters even within the compatibility blocks are not themselves compatibility characters and may therefore confuse authors. The “Enclosed CJK Letters and Months” block contains a single non-compatibility character: the ‘Korean Standard Symbol’ (㉿ U+327F).
This symbol and 12 other characters have been included in these blocks for no known reason. The “CJK Compatibility Ideographs” block contains these non-compatibility unified Han ideographs:
# (U+FA0E): 﨎
# (U+FA0F): 﨏
# (U+FA11): 﨑
# (U+FA13): 﨓
# (U+FA14): 﨔
# (U+FA1F): 﨟
# (U+FA21): 﨡
# (U+FA23): 﨣
# (U+FA24): 﨤
# (U+FA27): 﨧
# (U+FA28): 﨨
# (U+FA29): 﨩

These thirteen characters are not compatibility characters, nor is their use discouraged in any way. Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support:

Alphabetic Presentation Forms (1)
# Hebrew Point Judeo-Spanish Varika (U+FB1E): ﬞ. This is a glyph variant of Hebrew Point Rafe (U+05BF): ֿ, though Unicode provides no compatibility mapping.

Arabic Presentation Forms (4)
# “Ornate Left Parenthesis” (U+FD3E): ﴾. A glyph variant for U+0029 ‘)’
# “Ornate Right Parenthesis” (U+FD3F): ﴿. A glyph variant for U+0028 ‘(’
# “Ligature Bismillah Ar-Rahman Ar-Raheem” (U+FDFD): ﷽. Bismillah Ar-Rahman Ar-Raheem is a ligature for Teh Marbuta (U+0629), Lam (U+0644), Meem (U+0645), Seen (U+0633), Beh (U+0628) (بسملة)
# “Arabic Tail Fragment” (U+FE73): ﹳ, for supporting text systems without contextual glyph handling

CJK Compatibility Forms (2, both related to CJK Unified Ideograph U+4E36 丶)
# Sesame Dot (U+FE45): ﹅
# White Sesame Dot (U+FE46): ﹆

Enclosed Alphanumerics (21 rich text variants)
# 11 Negative Circled Numbers (0 and 11 through 20) (U+24FF and U+24EB through U+24F4): ⓿, ⓫ – ⓴
# 10 Double Circled Numbers (1 through 10) (U+24F5 through U+24FE): ⓵ – ⓾

Compatibility characters and normalization

See Also: Unicode normalization

Normalization is the process by which Unicode-conforming software first performs compatibility decomposition (or, in the stricter normalization forms, only canonical decomposition) before making comparisons or collating text strings. This is similar to other operations needed when, for example, a user performs a case-insensitive or diacritic-insensitive search within some text.
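A compatibility-aware search of this kind can be sketched in Python with `unicodedata.normalize`. The helpers `compat_key` and `compat_find` are hypothetical names for this example; combining NFKC with case folding is one reasonable choice of comparison key, not the only one.

```python
import unicodedata


def compat_key(text):
    """Comparison key: compatibility normalization (NFKC), then case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


def compat_find(haystack, needle):
    """True if needle occurs in haystack once both are reduced to keys.

    Only the temporary keys are normalized; the stored text is untouched.
    """
    return compat_key(needle) in compat_key(haystack)


if __name__ == "__main__":
    stored = "Chapter \u2163"            # ends in ROMAN NUMERAL FOUR
    print(compat_find(stored, "iv"))     # matches via the key
    print("IV" in stored)                # a plain substring test does not
```

Because the keys are computed on the fly, this corresponds to the lossless use of normalization: the original string still contains the Roman numeral character.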
In such cases software must equate or ignore characters it would not otherwise equate or ignore. Typically normalization is performed without altering the underlying stored text data (lossless). However, some software may make permanent changes to text that eliminate the canonical or even non-canonical compatibility character differences from text storage (lossy).

NON-GRAPHICAL CHARACTERS

See Also: Unicode control characters

Many characters are used to control the interpretation or display of text, but have no visual or spatial representation themselves. For example, the null character (U+0000) is used in C programming environments to indicate the end of a string of characters. In this way, these programs only require a single starting memory address for a string; the string ends once the program reads the null character.

Legacy control characters

The legacy control characters come from the ASCII and ISO 8859-1 character sets and are sometimes referred to as C0 and C1 respectively. Many of these characters play no explicit role in Unicode text handling, though they are still used in mainframe computing environments. Others, like the null character and many whitespace characters, are still commonly used in text processing. Other common control characters are tabulation or tab (U+0009), line feed (U+000A), carriage return (U+000D) and next line (U+0085). These are included among the whitespace characters because, though they have no visual glyph, they insert vertical or horizontal spacing between displayed characters.

Unicode introduced separators

In an attempt to simplify the several new line characters used in legacy text, the UCS introduces its own characters to separate either lines or paragraphs: the line separator (U+2028) and paragraph separator (U+2029).

Language tags

Unicode includes 128 characters as language tags.
The characters essentially mirror the 128 ASCII characters, except that when used they identify the subsequent text as belonging to a particular language according to BCP 47. For example, to mark subsequent text as English as written in the United States, one uses the initiating ‘Language Tag’ character (U+E0001) followed by the sequence ‘Tag Small Letter e’ (U+E0065), ‘Tag Small Letter n’ (U+E006E), ‘Tag Hyphen-minus’ (U+E002D), ‘Tag Small Letter u’ (U+E0075) and ‘Tag Small Letter s’ (U+E0073). These language tag characters would not be displayed themselves. However, they would provide information for text processing or even for the display of other characters. For example, the display of Unihan ideographs might substitute different glyphs if the language tags indicated Korean than if the tags indicated Japanese. As another example, the decimal digits 0 through 9 might be displayed differently depending on the language they appear in.

Interlinear annotation

Three formatting characters provide support for interlinear annotation (U+FFF9, U+FFFA, U+FFFB). These may be used for providing notes that would typically be displayed between the lines of other text. Unicode considers such annotation to be rich text and recommends using other protocols for it. The W3C Ruby Markup recommendation is an example of an alternate protocol supporting more advanced interlinear annotation.

Bidirectional text control

Unicode supports standard bidirectional text without any special characters. In other words, Unicode-conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسملة”) right alongside English, and the Arabic letters will flow from right to left and the Latin letters from left to right.
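The per-character property that drives this behaviour is exposed in Python as `unicodedata.bidirectional`. The sketch below is a simplified illustration, not an implementation of the Unicode bidirectional algorithm: `base_direction` is a hypothetical helper that only looks for the first strongly directional character, and falling back to left-to-right when there is none is this example's assumption.

```python
import unicodedata


def base_direction(text):
    """Guess a paragraph's base direction from its first strong character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":            # strong left-to-right (e.g. Latin letters)
            return "ltr"
        if bidi in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic letters)
            return "rtl"
    return "ltr"                   # assumption: default when nothing is strong


if __name__ == "__main__":
    for ch in ("A", "\u05D0", "\u0628", "1", " "):
        print("U+%04X" % ord(ch), unicodedata.bidirectional(ch))
    print(base_direction("\u0628\u0633\u0645\u0644\u0629 in English"))
```

Digits and spaces report weak or neutral classes, which is why they take their direction from the surrounding letters.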
However, support for bidirectional text becomes more complicated when text flowing in opposite directions is embedded hierarchically, for example if one quotes an Arabic phrase that in turn quotes an English phrase. Other situations may complicate this further, for example when an author wants left-to-right characters overridden so that they flow from right to left. While these situations are fairly rare, Unicode provides seven characters (U+200E, U+200F, U+202A, U+202B, U+202C, U+202D, U+202E) to help control these embedded bidirectional text levels up to 61 levels deep.

Variation Selectors

Many characters map to alternate glyphs depending on the context. For example, Arabic and Latin cursive characters substitute different glyphs to connect glyphs together depending on whether the character is the initial character in a word, the final character, a medial character or an isolated character. These types of glyph substitution are easily handled from the context of the character, with no other authoring input involved. Authors may also use special-purpose characters such as joiners and non-joiners to force an alternate form of glyph where it would not otherwise appear. Ligatures are a similar case, where glyphs may be substituted simply by turning ligatures on or off as a rich text attribute.

However, for other glyph substitutions, the author's intent may need to be encoded with the text and cannot be determined contextually. This is the case with the characters or glyphs referred to as Gaiji, where different glyphs are used for the same character, either historically or in ideographs for family names. This is one of the gray areas in distinguishing between a glyph and a character. If a family name differs slightly from the ideograph character it derives from, is that a simple glyph variant or a character variant?
As of Unicode 3.2 and 4.0, the character set includes 256 variation selectors, so that these combining mark characters can select from 256 possible character or glyph variations for the preceding character. Unicode does not as yet provide any registry for these variations, so the issue of interoperable variation registration is left to other parties.

OTHER SPECIAL-PURPOSE CHARACTERS

Several characters fall between the non-graphical control and formatting characters and full-fledged graphical characters.

Joiners and Non-joiners

These include Word Joiner (U+2060), Zero-width Joiner (U+200D), Zero-width Non-joiner (U+200C), Zero-width Space (U+200B) and Combining Grapheme Joiner (U+034F).

Invisible Separator

Primarily for mathematics, the Invisible Separator (U+2063) provides a separator between characters where punctuation or space may be omitted, such as in a two-dimensional index like ij.

Invisible Times and Function Application

Invisible Times (U+2062) and Function Application (U+2061) are useful in mathematical text where the multiplication of terms or the application of a function is implied without any glyph indicating the operation.

Spaces

The space character (U+0020), typically input with the space bar on a keyboard, serves semantically as a word separator in many languages. For legacy reasons, the UCS also includes spaces of varying sizes that are compatibility equivalents of the space character. These spaces include:
# Space (U+0020)
# En Quad (U+2000)
# Em Quad (U+2001)
# En Space (U+2002)
# Em Space (U+2003)
# Three-Per-Em Space (U+2004)
# Four-Per-Em Space (U+2005)
# Six-Per-Em Space (U+2006)
# Figure Space (U+2007)
# Punctuation Space (U+2008)
# Thin Space (U+2009)
# Hair Space (U+200A)
# Medium Mathematical Space (U+205F)

Aside from the original ASCII space, the other spaces are all compatibility characters. In this context this means that they effectively add no semantic content to the text, but instead provide styling control.
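The compatibility relationship between the sized spaces and the plain space can be checked with compatibility normalization in Python. This is a minimal sketch: the list `SIZED_SPACES` and the helper `is_space_variant` are names invented for this example.

```python
import unicodedata

# The sized spaces listed above: U+2000 through U+200A, plus U+205F.
SIZED_SPACES = list(range(0x2000, 0x200B)) + [0x205F]


def is_space_variant(ch):
    """True if ch is not U+0020 itself but normalizes (NFKC) to U+0020."""
    return ch != " " and unicodedata.normalize("NFKC", ch) == " "


if __name__ == "__main__":
    for cp in SIZED_SPACES:
        ch = chr(cp)
        print("U+%04X" % cp, unicodedata.name(ch), is_space_variant(ch))
```

The zero-width space (U+200B) has no decomposition, so the same check returns False for it.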
Within Unicode, this non-semantic styling control is often referred to as rich text and is outside the thrust of Unicode’s goals. Rather than using different spaces in different contexts, this styling could instead be handled through intelligent text layout software.

Line-break control characters

Several characters are designed to help control line breaks, either by discouraging them (no-break characters) or by suggesting them, such as the soft or shy hyphen (U+00AD). Such characters, though designed for styling, are probably indispensable for the intricate types of line-breaking they make possible.
# Shy Hyphen (U+00AD)
# Non-breaking Hyphen (U+2011)
# No-break Space (U+00A0)
# Narrow No-break Space (U+202F)
# Zero-width Space (U+200B)

WHITESPACE CHARACTERS

Whitespace characters are not a separate group of characters; instead, Unicode provides a list of characters it deems whitespace characters for interoperability support. Software implementations and other standards may use the term to denote a slightly different set of characters. Whitespace characters are typically designated for programming environments. Often they have no syntactic meaning in such programming environments and are ignored by the machine interpreters. Unicode designates the legacy control characters U+0009 through U+000D and U+0085 as whitespace characters, as well as the Unicode-introduced line separator and paragraph separator. The core space character (U+0020) is also designated as a whitespace character, as are the other space characters such as the no-break space (U+00A0) and the sized spaces U+2000 through U+200A.

PRIVATE USE CHARACTERS

The UCS includes over 100,000 code points for private use. This means these code points can be assigned characters with specific properties by individuals, organizations and software vendors outside the ISO and the Unicode Consortium. A ''Private Use Area'' (PUA) is one of several ranges which are reserved for private use. For these ranges, the Unicode standard does not specify any characters.
The Basic Multilingual Plane includes a PUA in the range from U+E000 to U+F8FF (57344–63743). ''Plane Fifteen'' (U+F0000 to U+FFFFD) and ''Plane Sixteen'' (U+100000 to U+10FFFD) are completely reserved for private use as well. The use of the PUA was a concept inherited from certain Asian encoding systems. These systems had private use areas to encode Japanese Gaiji (rare personal name characters) in application-specific ways. Similarly, the ConScript Unicode Registry aims to coordinate the mapping in the PUAs of scripts not yet encoded in, or rejected by, Unicode. The Medieval Unicode Font Initiative uses the PUA to encode various ligatures, precomposed characters, and symbols found in medieval texts. One example of usage of the Private Use Area is Apple Computer's use of U+F8FF for the Apple logo.

SPECIAL CODE POINTS

At the simplest level, each character in the UCS represents a code point and a particular semantic function. For graphical characters, the semantic function is often implied by the character's name and by the script or block it is included in. A graphical character may also have a recommended glyph that helps define the meaning of the character. Ideographs for languages in China, Japan, Korea and Vietnam include many other rich properties that participate in defining the semantic role of a character. However, the UCS and Unicode designate other code points for other purposes. Those code points may have few or no character properties associated with them.

Surrogates

The 2,048 surrogates are not characters, but instead serve as pairs to address code points outside the Basic Multilingual Plane. These surrogate pairs therefore provide support for text encoding formats that use fewer than the 21 bits necessary to encode every UCS code point in a single code unit. For example, UTF-16 uses surrogate pairs to address code points outside the BMP using two 16-bit code units.

Noncharacters

Unicode reserves several code points as noncharacters.
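The surrogate mechanism described under Surrogates can be sketched with the standard UTF-16 arithmetic. The function names `to_surrogate_pair` and `from_surrogate_pair` are invented for this example; the constants are the start of the high-surrogate range (0xD800), the start of the low-surrogate range (0xDC00) and the first code point beyond the BMP (0x10000).

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)   # MUSICAL SYMBOL G CLEF
    print("%04X %04X" % (high, low))
    # Python's own UTF-16 encoder produces the same two code units.
    print(chr(0x1D11E).encode("utf-16-be").hex())
```

The 1,024 high and 1,024 low surrogates give 1,024 × 1,024 = 1,048,576 combinations, which covers the 16 planes beyond the BMP.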
These code points are guaranteed never to have a character assigned to them. Software implementations are therefore free to use these code points internally. However, noncharacters should never be included in text interchanged between implementations. One inherently useful example of a noncharacter is the code point U+FFFE. This code point is the byte-swapped form of the byte order mark (U+FEFF). If a stream of text contains this noncharacter, this is a good indication that the text has been interpreted with the incorrect endianness.

SUMMARY TABLE OF UCS CHARACTER ASSIGNMENTS

Description of Table Columns and Rows

The following table lists all of the blocks assigned characters as of April 2007 (Unicode 5.0). Blocks are grouped according to their function.
Working backwards:
Though the table names unallocated blocks, those blocks could potentially be allocated for any purpose. For example, unused code point blocks within the general area of the BMP dedicated to Unihan ideographs could instead be allocated to modern scripts. The names merely indicate the general region of the plane in which they are situated.

Totals
Modern Scripts
Ancient Scripts
Phonetics
Unified Diacritics
Unified Punctuation
Unified Symbols
Music Notation
Unihan CJKV Blocks
Legacy Compatibility Blocks
Other Compatibility Blocks
Special-purpose characters
Surrogates
Private use characters
Unused Planes
