Glossary of Terms
This page gives an overview of the terms and concepts presented on this site. Many definitions are taken directly from the respective specification at the Unicode website and may remain a bit technical.
- Age
- This property shows when various code points were designated/assigned in successive versions of the Unicode Standard.
The Age property is normative in the sense that it is completely specified based on when a character is encoded in the standard. However, DerivedAge.txt is provided for information. The value of the Age property for a code point can be derived by analysis of successive versions of the UCD, and Age is not used normatively in the specification of any Unicode algorithm.
(Source: UAX44) - Alphabetic
- Characters with the Alphabetic property. For more information, see Chapter 4, Character Properties in [Unicode].
Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other Alphabetic
(Source: UAX44) - ASCII Hex Digit
- ASCII characters commonly used for the representation of hexadecimal numbers. (Source: UAX44)
- Bidi Class
- These are the categories required by the Unicode Bidirectional Algorithm. For the property values, see Bidirectional Class Values. For more information, see Unicode Standard Annex #9, "The Unicode Bidirectional Algorithm" [UAX9].
The default property values depend on the code point, and are explained in DerivedBidiClass.txt.
(Source: UAX44) - Bidi Control
- Format control characters which have specific functions in the Unicode Bidirectional Algorithm [UAX9]. (Source: UAX44)
- Bidi Mirrored
- If the character is a "mirrored" character in bidirectional text, this field has the value "Y"; otherwise "N". See Section 4.7, Bidi Mirrored—Normative of [Unicode]. Do not confuse this with the Bidi Mirroring Glyph property. (Source: UAX44)
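As a rough illustration of the Bidi Class and Bidi Mirrored entries above, both values can be looked up with Python's standard `unicodedata` module. This is only a sketch: the helper name `bidi_info` is made up for this example, and the values returned reflect whichever Unicode version the local Python build ships with.

```python
import unicodedata

def bidi_info(ch):
    """Return (Bidi_Class short name, Bidi_Mirrored as a bool) for one character."""
    # unicodedata.mirrored() returns 1 or 0, matching the "Y"/"N" field.
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))

# Latin letter, Hebrew letter, European digit, opening parenthesis
for ch in ("A", "\u05d0", "1", "("):
    print(f"U+{ord(ch):04X}", bidi_info(ch))
```

Note that this reports only whether a character is mirrored; the Bidi Mirroring Glyph mapping is not exposed by `unicodedata`.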
- Bidi Mirroring Glyph
- Informative mapping for substituting characters in an implementation of bidirectional mirroring. This maps a subset of characters with the Bidi_Mirrored property to other characters that normally are displayed with the corresponding mirrored glyph. When a character with the Bidi_Mirrored property has the default value for Bidi_Mirroring_Glyph, that means that no other character exists whose glyph is appropriate for character-based glyph mirroring. Implementations must then use other mechanisms to implement mirroring of those characters for the Unicode Bidirectional Algorithm. See Unicode Standard Annex #9, "The Unicode Bidirectional Algorithm" [UAX9]. Do not confuse this property with the Bidi Mirrored property itself. (Source: UAX44)
- Block
- List of block names, which are arbitrary names for ranges of code points. See the code charts in [Unicode]. (Source: UAX44)
- Canonical Combining Class
- The classes used for the Canonical Ordering Algorithm in the Unicode Standard. This property could be considered either an enumerated property or a numeric property: the principal use of the property is in terms of the numeric values. For the property value names associated with different numeric values, see DerivedCombiningClass.txt and Canonical Combining Class Values. (Source: UAX44)
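To make the numeric use of the Canonical Combining Class concrete, the sketch below reads the class with Python's `unicodedata.combining()` and shows how NFD normalization reorders adjacent combining marks by that value. The helper name `combining_classes` is an assumption made for this example.

```python
import unicodedata

def combining_classes(text):
    """List each character of text with its Canonical_Combining_Class value."""
    return [(f"U+{ord(c):04X}", unicodedata.combining(c)) for c in text]

# "a", COMBINING ACUTE ACCENT (class 230), COMBINING DOT BELOW (class 220)
original = "a\u0301\u0323"
reordered = unicodedata.normalize("NFD", original)

print(combining_classes(original))
print(combining_classes(reordered))  # the class-220 mark now precedes the class-230 mark
```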
- Case Folding
- Simple Case Folding
- Mapping from characters to their case-folded forms. This is an informative file containing normative derived properties.
Derived from UnicodeData and SpecialCasing.
Note: The case foldings are omitted in the data file if they are the same as the code point itself.
(Source: UAX44) - Case Ignorable
- Characters which are ignored for casing purposes. For more information, see D136 in Section 3.13, Default Case Algorithms in [Unicode].
Generated from: Mn + Me + Cf + Lm + Sk + Word Break=MidLetter + Word Break=MidNumLet
(Source: UAX44) - Cased
- Characters which are considered to be either uppercase, lowercase or titlecase characters. This property is not identical to the Changes_When_Casemapped property. For more information, see D135 in Section 3.13, Default Case Algorithms in [Unicode].
Generated from: Lowercase + Uppercase + Lt
(Source: UAX44) - Changes When Casefolded
- Characters whose normalized forms are not stable under case folding. For more information, see D142 in Section 3.13, Default Case Algorithms in [Unicode].
Generated from: toCasefold(toNFD(X)) != toNFD(X)
(Source: UAX44) - Changes When Casemapped
- Characters which may change when they undergo case mapping. For more information, see D143 in Section 3.13, Default Case Algorithms in [Unicode].
Generated from: Changes_When_Lowercased(X) or Changes_When_Uppercased(X) or Changes_When_Titlecased(X)
(Source: UAX44) - Changes When Lowercased
- Characters whose normalized forms are not stable under a toLowercase mapping. For more information, see D139 in Section 3.13, Default Case Algorithms in [Unicode].
Generated from: toLowercase(toNFD(X)) != toNFD(X)
(Source: UAX44) - Changes When NFKC Casefolded
- Characters which are not identical to their NFKC_Casefold mapping.
Generated from: (cp != NFKC_CaseFold(cp))
(Source: UAX44) - Changes When Titlecased
- Characters whose normalized forms are not stable under a toTitlecase mapping. For more information, see D141 in Section 3.13, Default Case Algorithms in [Unicode].
Generated from: toTitlecase(toNFD(X)) != toNFD(X)
(Source: UAX44) - Changes When Uppercased
- Characters whose normalized forms are not stable under a toUppercase mapping. For more information, see D140 in Section 3.13, Default Case Algorithms in [Unicode].
Generated from: toUppercase(toNFD(X)) != toNFD(X)
(Source: UAX44) - Code Point
- A number in the Unicode standard denoting one single character. A code point is different from a Glyph.
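The "Generated from" expressions given above for the Changes When Casefolded, Casemapped, Lowercased, Titlecased and Uppercased properties can be approximated for a single code point in Python. This is a sketch under assumptions: the function names are invented here, and Python's `str.lower()`, `str.upper()`, `str.title()` and `str.casefold()` stand in for the standard's toLowercase, toUppercase, toTitlecase and toCasefold operations, so results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def _nfd(ch):
    # toNFD(X) from the "Generated from" expressions
    return unicodedata.normalize("NFD", ch)

def changes_when_lowercased(ch):
    d = _nfd(ch)
    return d.lower() != d

def changes_when_uppercased(ch):
    d = _nfd(ch)
    return d.upper() != d

def changes_when_titlecased(ch):
    d = _nfd(ch)
    return d.title() != d

def changes_when_casefolded(ch):
    d = _nfd(ch)
    return d.casefold() != d

def changes_when_casemapped(ch):
    # Union of the three case-mapping properties, as in the glossary entry
    return (changes_when_lowercased(ch)
            or changes_when_uppercased(ch)
            or changes_when_titlecased(ch))

for ch in ("A", "a", "\u00df", "1"):
    print(repr(ch), changes_when_lowercased(ch), changes_when_uppercased(ch),
          changes_when_casefolded(ch), changes_when_casemapped(ch))
```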
- Composition Exclusion
- A property used in normalization. See Unicode Standard Annex #15: "Unicode Normalization Forms" [UAX15]. Unlike other files, CompositionExclusions.txt simply lists the relevant code points. (Source: UAX44)
- Dash
- Punctuation characters explicitly called out as dashes in the Unicode Standard, plus their compatibility equivalents. Most of these have the General_Category value Pd, but some have the General_Category value Sm because of their use in mathematics. (Source: UAX44)
- Decomposition Type
- Decomposition Mapping
- This field contains both values, with the type in angle brackets. The decomposition mappings exactly match the decomposition mappings published with the character names in the Unicode Standard. For more information, see Character Decomposition Mappings. (Source: UAX44)
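The combined field described above (an optional type in angle brackets followed by the mapping) can be split apart as follows. This is a minimal sketch using Python's `unicodedata.decomposition()`, which returns the raw field as a string; the helper name `split_decomposition` is hypothetical.

```python
import unicodedata

def split_decomposition(ch):
    """Return (decomposition type or None, list of code points).

    The type is None for canonical decompositions, and the list is
    empty when the character has no decomposition mapping at all.
    """
    field = unicodedata.decomposition(ch)
    if not field:
        return None, []
    parts = field.split()
    dtype = None
    if parts[0].startswith("<"):
        dtype = parts[0].strip("<>")  # e.g. "<fraction>" -> "fraction"
        parts = parts[1:]
    return dtype, [int(p, 16) for p in parts]

print(split_decomposition("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
print(split_decomposition("\u00bd"))  # VULGAR FRACTION ONE HALF
```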
- Default Ignorable Code Point
- For programmatic determination of default ignorable code points. New characters that should be ignored in rendering (unless explicitly supported) will be assigned in these ranges, permitting programs to correctly handle the default rendering of such characters when not otherwise supported. For more information, see the FAQ Display of Unsupported Characters, and Section 5.21, Default Ignorable Code Points in [Unicode].
Generated from:
Other Default Ignorable Code Point
+ Cf (format characters)
+ Variation_Selector
- White_Space
- FFF9..FFFB (annotation characters)
- 0600..0604, 06DD, 070F, 110BD (exceptional Cf characters that should be visible)
(Source: UAX44) - Deprecated
- For a machine-readable list of deprecated characters. No characters will ever be removed from the standard, but the usage of deprecated characters is strongly discouraged. (Source: UAX44)
- Diacritic
- Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics. (Source: UAX44)
- East Asian Width
- Properties for determining the choice of wide versus narrow glyphs in East Asian contexts. Property values are described in Unicode Standard Annex #11, "East Asian Width" [UAX11]. (Source: UAX44)
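A common use of East Asian Width is estimating how many terminal columns a string occupies. The sketch below is one simplified approach, not the method prescribed by UAX #11: it assumes that wide (W) and fullwidth (F) characters take two columns and everything else takes one, and it ignores combining marks and ambiguous-width (A) characters. The function name `display_width` is made up for this example.

```python
import unicodedata

def display_width(text):
    """Estimate column width: 2 for East_Asian_Width W or F, otherwise 1."""
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
               for c in text)

for ch in ("A", "\uff21", "\u4e2d"):  # ASCII A, fullwidth A, CJK ideograph
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch))

print(display_width("A\u4e2d"))
```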
- Expands On NFC (deprecated)
- Expands On NFD (deprecated)
- Expands On NFKC (deprecated)
- Expands On NFKD (deprecated)
- Characters that expand to more than one character in the specified normalization form. (Source: UAX44)
- Extender
- Characters whose principal function is to extend the value or shape of a preceding alphabetic character. Typical of these are length and iteration marks. (Source: UAX44)
- FC NFKC Closure (deprecated)
- Characters that require extra mappings for closure under Case Folding plus Normalization Form KC.
The mapping is listed in Field 2.
(Source: UAX44) - Full Composition Exclusion
- Characters that are excluded from composition: those listed explicitly in CompositionExclusions.txt, plus the derivable sets of Singleton Decompositions and Non-Starter Decompositions, as documented in that data file. (Source: UAX44)
- General Category
- This is a useful breakdown into various character types which can be used as a default categorization in implementations. For the property values, see General Category Values. (Source: UAX44)
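As an example of using General Category as a default categorization, the sketch below tallies the two-letter category values of the characters in a string with Python's `unicodedata.category()`. The helper name `category_counts` is an assumption made for illustration.

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count how often each General_Category value occurs in text."""
    return Counter(unicodedata.category(c) for c in text)

# Uppercase letter, lowercase letter, space, decimal digit, punctuation
print(category_counts("Ab 1!"))
```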
- Glyph
- The representation of a code point under certain circumstances. For example, the letter “A” looks quite different in Latin and blackletter fonts. Both are different glyphs for the same underlying character.
- Grapheme Base
- Property used together with the definition of Standard Korean Syllable Block to define "Grapheme base". See D58 in Chapter 3, Conformance in [Unicode].
Generated from: [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp - Grapheme Extend
Note: Grapheme_Base is a property of individual characters. That usage contrasts with "grapheme base", which is an attribute of Unicode strings; a grapheme base may consist of a Korean syllable which is itself represented by a sequence of conjoining jamos.
(Source: UAX44) - Grapheme Cluster Break
- See Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29] (Source: UAX44)
- Grapheme Extend
- Property used to define "Grapheme extender". See D59 in Chapter 3, Conformance in [Unicode].
Generated from: Me + Mn + Other Grapheme Extend
Note: The set of characters for which Grapheme_Extend=Yes is equivalent to the set of characters for which Grapheme_Cluster_Break=Extend.
(Source: UAX44) - Grapheme Link (deprecated)
- Formerly proposed for programmatic determination of grapheme cluster boundaries.
Generated from: Canonical_Combining_Class=Virama
(Source: UAX44) - Hangul Syllable Type
- The values L, V, T, LV, and LVT used in Chapter 3, Conformance in [Unicode]. (Source: UAX44)
- Hex Digit
- Characters commonly used for the representation of hexadecimal numbers, plus their compatibility equivalents. (Source: UAX44)
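The difference between ASCII Hex Digit and Hex Digit can be sketched with explicit ranges. The ranges below are written out by hand on the assumption that the compatibility equivalents are the fullwidth forms of 0-9, A-F and a-f; check them against PropList.txt before relying on them. Both function names are hypothetical.

```python
# ASCII 0-9, A-F, a-f
ASCII_HEX_RANGES = [(0x30, 0x39), (0x41, 0x46), (0x61, 0x66)]
# Fullwidth 0-9, A-F, a-f (assumed compatibility equivalents)
FULLWIDTH_HEX_RANGES = [(0xFF10, 0xFF19), (0xFF21, 0xFF26), (0xFF41, 0xFF46)]

def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)

def is_ascii_hex_digit(ch):
    return _in_ranges(ch, ASCII_HEX_RANGES)

def is_hex_digit(ch):
    return _in_ranges(ch, ASCII_HEX_RANGES + FULLWIDTH_HEX_RANGES)

for ch in ("f", "g", "\uff21"):
    print(repr(ch), is_ascii_hex_digit(ch), is_hex_digit(ch))
```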
- Hyphen (deprecated, stabilized)
- Dashes which are used to mark connections between pieces of words, plus the Katakana middle dot. The Katakana middle dot functions like a hyphen, but is shaped like a dot rather than a dash. (Source: UAX44)
- ID Start
- ID Continue
- XID Start
- XID Continue
- Used to determine programming identifiers, as described in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" [UAX31]. (Source: UAX44)
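Python's own identifier syntax is based on XID_Start and XID_Continue, with the underscore additionally allowed as a first character, so `str.isidentifier()` can serve as a rough probe for these properties. The sketch below rests on that assumption about Python's grammar; the function names are invented, and the results may differ in edge cases from the property values in the Unicode Character Database.

```python
def looks_like_xid_start(ch):
    """Approximate XID_Start: valid as the first character of a Python identifier.

    The underscore is excluded because Python allows it as a start
    character in addition to XID_Start.
    """
    return ch != "_" and ch.isidentifier()

def looks_like_xid_continue(ch):
    """Approximate XID_Continue: valid after a known start character."""
    return ("a" + ch).isidentifier()

for ch in ("a", "1", "_", "-", "\u00e9"):
    print(repr(ch), looks_like_xid_start(ch), looks_like_xid_continue(ch))
```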
- Ideographic
- Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs. This property roughly defines the class of "Chinese characters" and does not include characters of other logographic scripts such as Cuneiform or Egyptian Hieroglyphs. (Source: UAX44)
- IDS Binary Operator
- Used in Ideographic Description Sequences. (Source: UAX44)
- IDS Trinary Operator
- Used in Ideographic Description Sequences. (Source: UAX44)
- Indic Positional Category
- A property defining the placement categories for dependent vowels, viramas, combining marks, and other characters used in Indic scripts. (Source: UAX44)
- Indic Syllabic Category
- A property defining the structural categories of syllabic components in Indic scripts. (Source: UAX44)
- ISO Comment (deprecated, stabilized)
- ISO 10646 comment field. It was used for notes that appeared in parentheses in the 10646 names list, or contained an asterisk to mark an Annex P note.
As of Unicode 5.2.0, this field no longer contains any non-null values.
(Source: UAX44) - Join Control
- Format control characters which have specific functions for control of cursive joining and ligation. (Source: UAX44)
- Joining Type
- Joining Group
- Basic Arabic and Syriac character shaping properties, such as initial, medial and final shapes. See Section 8.2, Arabic in [Unicode]. (Source: UAX44)
- Jamo Short Name
- The Hangul Syllable names are derived from the Jamo Short Names, as described in Chapter 3, Conformance in [Unicode]. (Source: UAX44)
- kAccountingNumeric
- The value of the character when used in the writing of accounting numerals.
Accounting numerals are used in East Asia to prevent fraud. Because a number like ten (十) is easily turned into one thousand (千) with a stroke of a brush, monetary documents will often use an accounting form of the numeral ten (such as 拾) in their place.
The three numeric-value fields should have no overlap; that is, characters with a kAccountingNumeric value should not have a kPrimaryNumeric or kOtherNumeric value as well.
(Source: UAX38) - kBigFive
- The Big Five mapping for this character in hex; note that this does not cover any of the Big Five extensions in common use, including the ETEN extensions. (Source: UAX38)
- kCangjie
- The cangjie input code for the character. This incorporates data from the file cangjie-table.b5 by Christian Wittern. (Source: UAX38)
- kCantonese
- The Cantonese pronunciation(s) for this character using the jyutping romanization.
A full description of jyutping can be found at <https://www.lshk.org/cantonese.php>. The main differences between jyutping and the Yale romanization previously used are:
1) Jyutping always uses tone numbers and does not distinguish the high falling and high level tones.
2) Jyutping always writes a long a as “aa”.
3) Jyutping uses “oe” and “eo” for the Yale “eu” vowel.
4) Jyutping uses “c” instead of “ch”, “z” instead of “j”, and “j” instead of “y” as initials.
5) A non-null initial is always explicitly written (thus “jyut” in jyutping instead of Yale’s “yut”).
Cantonese pronunciations are sorted alphabetically, not in order of frequency. (Source: UAX38) - kCCCII
- The CCCII mapping for this character in hex. (Source: UAX38)
- kCheungBauer
- Data regarding the character in Cheung Kwan-hin and Robert S. Bauer, _The Representation of Cantonese with Chinese Characters_, Journal of Chinese Linguistics, Monograph Series Number 18, 2002. Each data value consists of three pieces, separated by semicolons: (1) the character’s radical-stroke index as a three-digit radical, slash, two-digit stroke count; (2) the character’s cangjie input code (if any); and (3) a comma-separated list of Cantonese readings using the jyutping romanization in alphabetical order. (Source: UAX38)
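The three-part kCheungBauer value described above could be split along these lines. This is a sketch only: the sample value is made up for illustration and is not taken from the Unihan data, the assumption that cangjie codes consist of uppercase Latin letters is mine, and the names `CheungBauer` and `parse_cheung_bauer` are hypothetical.

```python
import re
from typing import NamedTuple

class CheungBauer(NamedTuple):
    radical: int      # three-digit radical number
    strokes: int      # two-digit stroke count
    cangjie: str      # may be empty ("if any")
    readings: list    # jyutping readings, in the order given

_PATTERN = re.compile(r"^(\d{3})/(\d{2});([A-Z]*);(.+)$")

def parse_cheung_bauer(value):
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a kCheungBauer value: {value!r}")
    radical, strokes, cangjie, readings = match.groups()
    return CheungBauer(int(radical), int(strokes), cangjie, readings.split(","))

# Made-up sample value in the documented shape
print(parse_cheung_bauer("030/08;RMMV;aa1,aa2"))
```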
- kCheungBauerIndex
- The position of the character in Cheung Kwan-hin and Robert S. Bauer, _The Representation of Cantonese with Chinese Characters_, Journal of Chinese Linguistics, Monograph Series Number 18, 2002. The format is a three-digit page number followed by a two-digit position number, separated by a period. (Source: UAX38)
- kCihaiT
- The position of this character in the Cihai (辭海) dictionary, single volume edition, published in Hong Kong by the Zhonghua Bookstore, 1983 (reprint of the 1947 edition), ISBN 962-231-005-2.
The position is indicated by a decimal number. The digits to the left of the decimal are the page number. The first digit after the decimal is the row on the page, and the remaining two digits after the decimal are the position on the row. (Source: UAX38) - kCNS1986
- The CNS 11643-1986 mapping for this character in hex. (Source: UAX38)
- kCNS1992
- The CNS 11643-1992 mapping for this character in hex. (Source: UAX38)
- kCompatibilityVariant
- The compatibility decomposition for this ideograph, derived from the UnicodeData.txt file. (Source: UAX38)
- kCowles
- The index or indices of this character in Roy T. Cowles, A Pocket Dictionary of Cantonese, Hong Kong: University Press, 1999.
Approximately 100 characters from Cowles which are not currently encoded are being submitted to the IRG by Unicode for inclusion in future versions of the standard. (Source: UAX38) - kDaeJaweon
- The position of this character in the Dae Jaweon (Korean) dictionary used in the four-dictionary sorting algorithm. The position is in the form “page.position” with the final digit in the position being “0” for characters actually in the dictionary and “1” for characters not found in the dictionary and assigned a “virtual” position in the dictionary.
Thus, “1187.060” indicates the sixth character on page 1187. A character not in this dictionary but assigned a position between the 6th and 7th characters on page 1187 for sorting purposes would have the code “1187.061”.
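The “page.position” convention just described can be decoded with a few lines of Python. This is a minimal sketch with a hypothetical function name; it assumes the value always contains exactly one period and that the final digit of the position is the virtual-position flag.

```python
def parse_dictionary_position(value):
    """Split a "page.position" value into (page, position on page, is_virtual)."""
    page, position = value.split(".")
    # The last digit is the flag: "0" means the character is really in the
    # dictionary, anything else marks an assigned "virtual" position.
    return int(page), int(position[:-1]), position[-1] != "0"

print(parse_dictionary_position("1187.060"))
print(parse_dictionary_position("1187.061"))
```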
The edition used is the first edition, published in Seoul by Samseong Publishing Co., Ltd., 1988. (Source: UAX38) - kDefinition
- An English definition for this character. Definitions are for modern written Chinese and are usually (but not always) the same as the definition in other Chinese dialects or non-Chinese languages. In some cases, synonyms are indicated. Fuller variant information can be found using the various variant fields.
Definitions specific to non-Chinese languages or Chinese dialects other than modern Mandarin are marked, e.g., (Cant.) or (J).
Major definitions are separated by semicolons, and minor definitions by commas. Any valid Unicode character (except for tab, double-quote, and any line break character) may be used within the definition field. (Source: UAX38) - kEACC
- The EACC mapping for this character in hex. (Source: UAX38)
- kFenn
- Data on the character from The Five Thousand Dictionary (aka Fenn’s Chinese-English Pocket Dictionary) by Courtenay H. Fenn, Cambridge, Mass.: Harvard University Press, 1979.
The data here consists of a decimal number followed by a letter A through K, the letter P, or an asterisk. The decimal number gives the Soothill number for the character’s phonetic, and the letter is a rough frequency indication, with A indicating the 500 most common ideographs, B the next five hundred, and so on.
P is used by Fenn to indicate a rare character included in the dictionary only because it is the phonetic element in other characters.
An asterisk is used instead of a letter in the final position to indicate a character which belongs to one of Soothill’s phonetic groups but is not found in Fenn’s dictionary.
Characters which have a frequency letter but no Soothill phonetic group are assigned group 0. (Source: UAX38) - kFennIndex
- The position of this character in _Fenn’s Chinese-English Pocket Dictionary_ by Courtenay H. Fenn, Cambridge, Mass.: Harvard University Press, 1942. The position is indicated by a three-digit page number followed by a period and a two-digit position on the page. (Source: UAX38)
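The kFenn value format described above (a Soothill number followed by a frequency letter, P, or an asterisk) might be decoded as follows. This sketch assumes the number is a plain run of digits and that each letter from A to K covers a band of 500 characters, extending the "and so on" in the description; the function name `parse_fenn` is hypothetical.

```python
import re

_FENN = re.compile(r"^(\d+)([A-KP*])$")

def parse_fenn(value):
    """Return (Soothill group, marker, frequency band or None)."""
    match = _FENN.match(value)
    if match is None:
        raise ValueError(f"not a kFenn value: {value!r}")
    group, marker = int(match.group(1)), match.group(2)
    if marker in ("P", "*"):
        band = None  # rare phonetic element, or not found in Fenn's dictionary
    else:
        rank = ord(marker) - ord("A")          # A -> 0, B -> 1, ...
        band = (rank * 500 + 1, (rank + 1) * 500)
    return group, marker, band

print(parse_fenn("12A"))
print(parse_fenn("0P"))
```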
- kFourCornerCode
- The four-corner code(s) for the character. This data is derived from data provided in the public domain by Hartmut Bohn, Urs App, and Christian Wittern.
The four-corner system assigns each character a code of four digits, each from 0 through 9. Each digit is derived from the “shape” of one of the four corners of the character (upper-left, upper-right, lower-left, lower-right). An optional fifth digit can be used to further distinguish characters; the fifth digit is derived from the shape in the character’s center or region immediately to the left of the fourth corner.
The four-corner system is now used only rarely. Full descriptions are available online, e.g., at <https://en.wikipedia.org/wiki/Four_corner_input>.
Values in this field consist of four decimal digits, optionally followed by a period and fifth digit for a five-digit form. (Source: UAX38) - kFrequency
- A rough frequency measurement for the character based on analysis of traditional Chinese USENET postings; characters with a kFrequency of 1 are the most common, those with a kFrequency of 2 are less common, and so on, through a kFrequency of 5. (Source: UAX38)
- kGB0
- The GB 2312-80 mapping for this character in ku/ten form. (Source: UAX38)
- kGB1
- The GB 12345-90 mapping for this character in ku/ten form. (Source: UAX38)
- kGB3
- The GB 7589-87 mapping for this character in ku/ten form. (Source: UAX38)
- kGB5
- The GB 7590-87 mapping for this character in ku/ten form. (Source: UAX38)
- kGB7
- The GB 8565-89 mapping for this character in ku/ten form. (Source: UAX38)
- kGB8
- The GB 8565-89 mapping for this character in ku/ten form. (Source: UAX38)
- kGradeLevel
- The primary grade in the Hong Kong school system by which a student is expected to know the character; this data is derived from 朗文初級中文詞典, Hong Kong: Longman, 2001. (Source: UAX38)
- kGSR
- The position of this character in Bernhard Karlgren’s Grammata Serica Recensa (1957).
This dataset contains a total of 7,405 records. References are given in the form DDDDa('), where “DDDD” is a set number in the range [0001..1260] zero-padded to 4 digits, “a” is a letter in the range [a..z] (excluding “w”), optionally followed by apostrophe ('). The data from which this mapping table is extracted contains a total of 10,023 references. References to inscriptional forms have been omitted. (Source: UAX38) - kHangul
- The modern Korean pronunciation(s) for this character in Hangul. (Source: UAX38)
- kHanYu
- The position of this character in the Hanyu Da Zidian (HDZ) Chinese character dictionary (bibliographic information below).
The first character assigned a given virtual position has an index ending in 1; the second assigned the same virtual position has an index ending in 2; and so on. (Source: UAX38) - kHanyuPinlu
- The pronunciations and frequencies of this character, based in part on those appearing in 《現代漢語頻率詞典》 <Xiandai Hanyu Pinlu Cidian> (XDHYPLCD) [Modern Standard Beijing Chinese Frequency Dictionary]. (Source: UAX38)
- kHanyuPinyin
- The 漢語拼音 Hànyǔ Pīnyīn reading(s) appearing in the edition of 《漢語大字典》 Hànyǔ Dà Zìdiǎn (HDZ) specified in the “kHanYu” property description (q.v.). Each location has the form “ABCDE.XYZ” (as in “kHanYu”); multiple locations for a given pīnyīn reading are separated by “,” (comma). The list of locations is followed by “:” (colon), followed by a comma-separated list of one or more pīnyīn readings. Where multiple pīnyīn readings are associated with a given mapping, these are ordered as in HDZ (for the most part reflecting relative commonality). The following are representative records.
| U+34CE | 㓎 | 10297.260: qīn,qìn,qǐn |
| U+34D8 | 㓘 | 10278.080,10278.090: sù |
| U+5364 | 卤 | 10093.130: xī,lǔ 74609.020: lǔ,xī |
| U+5EFE | 廾 | 10513.110,10514.010,10514.020: gǒng |
For example, the “kHanyuPinyin” value for 卤 U+5364 is “10093.130: xī,lǔ 74609.020: lǔ,xī”. This means that 卤 U+5364 is found in “kHanYu” at entries 10093.130 and 74609.020. The former entry has the two pīnyīn readings xī and lǔ (in that order), whereas the latter entry has the readings lǔ and xī (reversing the order).
Multiple Value Order: Individual entries are in same order as they are found in the Hanyu Da Zidian. This is true both for the locations and the individual readings. While this is generally in the order of utility for modern Chinese, such is not invariably the case, as the example above illustrates.
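The record layout explained above (locations, a colon, then readings, with several such groups per value) can be parsed with a short regular expression. This sketch works on the values as displayed on this page, where a space follows each colon; the function name `parse_hanyu_pinyin` is made up, and order is preserved for both locations and readings.

```python
import re

# One group: comma-separated locations, a colon, optional space, readings
_ENTRY = re.compile(r"([\d.,]+):\s*(\S+)")

def parse_hanyu_pinyin(value):
    """Return a list of (locations, readings) pairs in their original order."""
    return [(locations.split(","), readings.split(","))
            for locations, readings in _ENTRY.findall(value)]

for locations, readings in parse_hanyu_pinyin("10093.130: xī,lǔ 74609.020: lǔ,xī"):
    print(locations, readings)
```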
This data was originally input by 井作恆 Jǐng Zuòhéng, proofed by 聃媽歌 Dān Māgē (Magda Danish, using software donated by 文林 Wénlín Institute, Inc. and tables prepared by 曲理查 Qū Lǐchá), and proofed again and prepared for the Unicode Consortium by 曲理查 Qū Lǐchá (2008-01-14) (Source: UAX38) - kHDZRadBreak
- Indicates that 《漢語大字典》 Hanyu Da Zidian has a radical break beginning at this character’s position. The field consists of the radical (with its Unicode code point), a colon, and then the Hanyu Da Zidian position as in the kHanYu field. (Source: UAX38)
- kHKGlyph
- The index of the character in 常用字字形表 (二零零零年修訂本),香港: 香港教育學院, 2000, ISBN 962-949-040-4. This publication gives the “proper” shapes for 4759 characters as used in the Hong Kong school system. The index is an integer, zero-padded to four digits. (Source: UAX38)
- kHKSCS
- Mappings to the Big Five extended code points used for the Hong Kong Supplementary Character Set. (Source: UAX38)
- kIBMJapan
- The IBM Japanese mapping for this character in hexadecimal. (Source: UAX38)
- kIICore
- A boolean indicating that a character is in IICore, the IRG-produced minimal set of required ideographs for East Asian use. A character is in IICore if and only if it has a value for the kIICore field.
The only value currently in this field is “2.1”, which is the identifier of the version of IICore used to populate this field. (Source: UAX38) - kIRGDaeJaweon
- The position of this character in the Dae Jaweon (Korean) dictionary used in the four-dictionary sorting algorithm. The position is in the form “page.position” with the final digit in the position being “0” for characters actually in the dictionary and “1” for characters not found in the dictionary and assigned a “virtual” position in the dictionary.
Thus, “1187.060” indicates the sixth character on page 1187. A character not in this dictionary but assigned a position between the 6th and 7th characters on page 1187 for sorting purposes would have the code “1187.061”.
This field represents the official position of the character within the Dae Jaweon dictionary as used by the IRG in the four-dictionary sorting algorithm.
The edition used is the first edition, published in Seoul by Samseong Publishing Co., Ltd., 1988.
(Source: UAX38) - kIRGDaiKanwaZiten
- The index of this character in the Dai Kanwa Ziten, aka Morohashi dictionary (Japanese) used in the four-dictionary sorting algorithm.
This field represents the official position of the character within the DaiKanwa dictionary as used by the IRG in the four-dictionary sorting algorithm. The edition used is the revised edition, published in Tokyo by Taishuukan Shoten, 1986. (Source: UAX38) - kIRGHanyuDaZidian
- The position of this character in the Hanyu Da Zidian (PRC) dictionary used in the four-dictionary sorting algorithm. The position is in the form “volume page.position” with the final digit in the position being “0” for characters actually in the dictionary and “1” for characters not found in the dictionary and assigned a “virtual” position in the dictionary.
This field represents the official position of the character within the Hanyu Da Zidian dictionary as used by the IRG in the four-dictionary sorting algorithm.
The edition of the Hanyu Da Zidian used is the first edition, published in Chengdu by Sichuan Cishu Publishing, 1986. (Source: UAX38) - kIRGKangXi
- The official IRG position of this character in the 《康熙字典》 Kang Xi Dictionary used in the four-dictionary sorting algorithm. The position is in the form “page.position” with the final digit in the position being “0” for characters actually in the dictionary and “1” for characters not found in the dictionary but assigned a “virtual” position in the dictionary.
Thus, “1187.060” indicates the sixth character on page 1187. A character not in this dictionary but assigned a position between the 6th and 7th characters on page 1187 for sorting purposes would have the code “1187.061”.
The edition of the Kang Xi Dictionary used is the 7th edition published by Zhonghua Bookstore in Beijing, 1989.
(Source: UAX38) - kIRG_GSource
- The IRG “G” source mapping for this character in hex. The IRG G source consists of data from the following national standards, publications, and lists from the People’s Republic of China and Singapore. The versions of the standards used are those provided by the PRC to the IRG and may not always reflect published versions of the standards generally available.
G1 GB12345-90 with 58 Hong Kong and 92 Korean “Idu” characters
G3 GB7589-87 unsimplified forms
G5 GB7590-87 unsimplified forms
G7 General Purpose Hanzi List for Modern Chinese Language, and General List of Simplified Hanzi
GS Singapore Characters
G8 GB8565-88
G9 GB18030-2000
GE GB16500-95
G4K Siku Quanshu (四庫全書)
GBK Chinese Encyclopedia (中國大百科全書)
GCH Ci Hai (辞海)
GCY Ci Yuan (辭源)
GCYY Chinese Academy of Surveying and Mapping Ideographs (中国测绘科学院用字)
GFZ Founder Press System (方正排版系统)
GGH Gudai Hanyu Cidian (古代汉语词典)
GHC Hanyu Dacidian (漢語大詞典)
GHZ Hanyu Dazidian ideographs (漢語大字典)
GIDC ID system of the Ministry of Public Security of China, 2009
GJZ Commercial Press Ideographs (商务印书馆用字)
GKX Kangxi Dictionary ideographs (康熙字典), 9th edition (1958), including the addendum (康熙字典)補遺
GXC Xiandai Hanyu Cidian (现代汉语词典)
GZFY Hanyu Fangyan Dacidian (汉语方言大辞典)
GZH ZhongHua ZiHai (中华字海)
GZJW Yinzhou Jinwen Jicheng Yinde (殷周金文集成引得)
(Source: UAX38) - kIRG_HSource
- The IRG “H” source mapping for this character in hex. The IRG “H” source consists of data from the Hong Kong Supplementary Character Set – 2008. (Source: UAX38)
- kIRG_JSource
- The IRG “J” source mapping for this character in hex. The IRG “J” source consists of data from the following national standards and lists from Japan.
J0 JIS X 0208-1990
J1 JIS X 0212-1990
JA Unified Japanese IT Vendors Contemporary Ideographs, 1993
JH Hanyo-Denshi Program (汎用電子情報交換環境整備プログラム), 2002-2009
JK Japanese KOKUJI Collection
JARIB Association of Radio Industries and Businesses (ARIB) ARIB STD-B24 Version 5.1, March 14 2007 (Source: UAX38) - kIRG_KPSource
- The IRG “KP” source mapping for this character in hex. The IRG “KP” source consists of data from the following national standards and lists from the Democratic People’s Republic of Korea (North Korea).
KP0 KPS 9566-97
KP1 KPS 10721-2000 (Source: UAX38) - kIRG_KSource
- The IRG “K” source mapping for this character in hex. The IRG “K” source consists of data from the following national standards and lists from the Republic of Korea (South Korea).
K0 KS X 1001:2004 (formerly KS C 5601-1987)
K1 KS X 1002:2001 (formerly KS C 5657-1991)
K2 PKS C 5700-1 1994
K3 PKS C 5700-2 1994
K4 PKS 5700-3:1998
K5 Korean IRG Hanja Character Set 5th Edition: 2001
Note that the K4 source is expressed in hexadecimal, but unlike the other sources, it is not organized in row/column. The content of the repertoire covered by the K2, K3, K4, and K5 sources is in the process of being reedited in new Korean standards.
(Source: UAX38) - kIRG_MSource
- The IRG “M” source mapping for this character. The IRG “M” source consists of data from the Macao Information System Character Set (澳門資訊系統字集). (Source: UAX38)
- kIRG_TSource
- The IRG “T” source mapping for this character in hex. The IRG “T” source consists of data from the following national standards and lists from the Republic of China (Taiwan).
T1 TCA-CNS 11643-1992 1st plane
T2 TCA-CNS 11643-1992 2nd plane
T3 TCA-CNS 11643-1992 3rd plane with some additional characters
T4 TCA-CNS 11643-1992 4th plane
T5 TCA-CNS 11643-1992 5th plane
T6 TCA-CNS 11643-1992 6th plane
T7 TCA-CNS 11643-1992 7th plane
TB TCA-CNS Ministry of Education, Hakka dialect, May 2007
TC TCA-CNS 11643-1992 12th plane
TD TCA-CNS 11643-1992 13th plane
TE TCA-CNS 11643-1992 14th plane
TF TCA-CNS 11643-1992 15th plane (Source: UAX38) - kIRG_USource
- The IRG “U” source mapping for this character. U-source references are a reference into the U-source ideograph database; see UTR #45. These consist of “UTC” or “UCI” followed by a hyphen and a five-digit, zero-padded index into the database. (Source: UAX38)
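The U-source reference format described above is simple enough to validate with a regular expression. This is a sketch with a hypothetical function name, and the sample references are made up to match the documented shape rather than taken from the U-source database.

```python
import re

# "UTC" or "UCI", a hyphen, then a five-digit zero-padded index
_U_SOURCE = re.compile(r"(UTC|UCI)-(\d{5})")

def parse_u_source(reference):
    """Return (prefix, index) for a well-formed reference, else None."""
    match = _U_SOURCE.fullmatch(reference)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

for reference in ("UTC-00123", "UCI-00001", "UTC-123"):
    print(reference, parse_u_source(reference))
```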
- kIRG_VSource
- The IRG “V” source mapping for this character in hex. The IRG “V” source consists of data from the following national standards and lists from Vietnam.
V0 TCVN 5773:1993
V1 TCVN 6056:1995
V2 VHN 01:1998
V3 VHN 02:1998
V4 Dictionary on Nom 2006, Dictionary on Nom of Tay ethnic 2006, Lookup Table for Nom in the South 1994 (Source: UAX38) - kJapaneseKun
- The Japanese pronunciation(s) of this character. (Source: UAX38)
- kJapaneseOn
- The Sino-Japanese pronunciation(s) of this character. (Source: UAX38)
- kJis0
- The JIS X 0208-1990 mapping for this character in ku/ten form. (Source: UAX38)
- kJIS0213
- The JIS X 0213-2000 mapping for this character in min/ku/ten form. (Source: UAX38)
- kJis1
- The JIS X 0212-1990 mapping for this character in ku/ten form. (Source: UAX38)
- kKangXi
- The position of this character in the 《康熙字典》 Kang Xi Dictionary used in the four-dictionary sorting algorithm. The position is in the form “page.position” with the final digit in the position being “0” for characters actually in the dictionary and “1” for characters not found in the dictionary but assigned a “virtual” position in the dictionary.
Thus, “1187.060” indicates the sixth character on page 1187. A character not in this dictionary but assigned a position between the 6th and 7th characters on page 1187 for sorting purposes would have the code “1187.061”.
The edition of the Kang Xi Dictionary used is the 7th edition published by Zhonghua Bookstore in Beijing, 1989.
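As a rough illustration of the “page.position” form, the sketch below splits a value into its page, its position on the page, and the “virtual” flag carried by the final digit. The helper name `parse_kangxi` is hypothetical, and the sketch assumes the position part is always exactly three digits, as in the examples above.

```python
import re

# Assumed layout: page digits, a period, a two-digit position,
# and a final flag digit (0 = in the dictionary, 1 = virtual position).
KANGXI_RE = re.compile(r"([0-9]+)\.([0-9]{2})([01])")

def parse_kangxi(value):
    """Split a kKangXi value into page, position and virtual flag."""
    match = KANGXI_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a page.position value: {value!r}")
    return {
        "page": int(match.group(1)),
        "position": int(match.group(2)),
        "virtual": match.group(3) == "1",
    }

print(parse_kangxi("1187.060"))
print(parse_kangxi("1187.061"))
```

Both sample values come from the text above: the first is the sixth character on page 1187, the second a virtual position directly after it.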
(Source: UAX38) - kKarlgren
- The index of this character in _Analytic Dictionary of Chinese and Sino-Japanese_ by Bernhard Karlgren, New York: Dover Publications, Inc., 1974.
If the index is followed by an asterisk (*), then the index is an interpolated one, indicating where the character would be found if it were to have been included in the dictionary. Note that while the index itself is usually an integer, there are some cases where it is an integer followed by an “A”. (Source: UAX38) - kKorean
- The Korean pronunciation(s) of this character, using the Yale romanization system. (See <https://en.wikipedia.org/wiki/Korean_romanization> for a discussion of the various Korean romanization systems.) (Source: UAX38)
- kKPS0
- The KPS 9566-97 mapping for this character in hexadecimal form. (Source: UAX38)
- kKPS1
- The KPS 10721-2000 mapping for this character in hexadecimal form. (Source: UAX38)
- kKSC0
- The KS X 1001:1992 (KS C 5601-1989) mapping for this character in ku/ten form. (Source: UAX38)
- kKSC1
- The KS X 1002:1991 (KS C 5657-1991) mapping for this character in ku/ten form. (Source: UAX38)
- kLau
- The index of this character in A Practical Cantonese-English Dictionary by Sidney Lau, Hong Kong: The Government Printer, 1977.
The index consists of an integer. Missing indices indicate unencoded characters which are being submitted to the IRG for inclusion in future versions of the standard. (Source: UAX38) - kMainlandTelegraph
- The PRC telegraph code for this character, derived from “Kanzi denpou koudo henkan-hyou” (“Chinese character telegraph code conversion table”), Lin Jinyi, KDD Engineering and Consulting, Tokyo, 1984. (Source: UAX38)
- kMandarin
- The most customary pinyin reading for this character; that is, the reading most commonly used in modern text, with some preference given to readings most likely to be in sorted lists.
Multiple Value Order: When there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both. (Source: UAX38) - kMatthews
- The index of this character in Mathews’ Chinese-English Dictionary by Robert H. Mathews, Cambridge: Harvard University Press, 1975.
Note that the field name is kMatthews instead of kMathews to maintain compatibility with earlier versions of this file, where it was inadvertently misspelled. (Source: UAX38) - kMeyerWempe
- The index/indices of this character in the Student’s Cantonese-English Dictionary by Bernard F. Meyer and Theodore F. Wempe (3rd edition, 1947). The index is an integer, optionally followed by a lower-case Latin letter if the listing is in a subsidiary entry and not a main one. In some cases where the character is found in the radical-stroke index, but not in the main body of the dictionary, the integer is followed by an asterisk (e.g., U+50E5, which is listed as 736* as well as 1185a). (Source: UAX38)
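The index syntax described above (an integer, optionally followed by a lower-case letter or an asterisk) can be sketched as a small parser. The name `parse_meyer_wempe` is hypothetical, and the accepted letter range `a-z` is an assumption; the specification may restrict it further.

```python
import re

# An integer, then optionally one lower-case Latin letter (subsidiary entry)
# or an asterisk (found only in the radical-stroke index).
MEYER_WEMPE_RE = re.compile(r"([0-9]+)([a-z]|\*)?")

def parse_meyer_wempe(value):
    """Return (number, subentry_letter, index_only) for one index."""
    match = MEYER_WEMPE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a Meyer-Wempe index: {value!r}")
    suffix = match.group(2)
    letter = suffix if suffix and suffix != "*" else None
    return int(match.group(1)), letter, suffix == "*"

# The two indices quoted above for U+50E5.
for index in "736* 1185a".split():
    print(parse_meyer_wempe(index))
```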
- kMorohashi
- The index/indices of this character in the Dai Kanwa Ziten, aka Morohashi dictionary (Japanese) used in the four-dictionary sorting algorithm.
The edition used is the revised edition, published in Tokyo by Taishuukan Shoten, 1986. (Source: UAX38) - kNelson
- The index of this character in The Modern Reader’s Japanese-English Character Dictionary by Andrew Nathaniel Nelson, Rutland, Vermont: Charles E. Tuttle Company, 1974. (Source: UAX38)
- kOtherNumeric
- The numeric value for the character in certain unusual, specialized contexts.
The three numeric-value fields should have no overlap; that is, characters with a kOtherNumeric value should not have a kAccountingNumeric or kPrimaryNumeric value as well.
(Source: UAX38) - kPhonetic
- The phonetic index for the character from _Ten Thousand Characters: An Analytic Dictionary_, by G. Hugh Casey, S.J. Hong Kong: Kelley and Walsh, 1980. (Source: UAX38)
- kPrimaryNumeric
- The value of the character when used in the writing of numbers in the standard fashion.
The three numeric-value fields should have no overlap; that is, characters with a kPrimaryNumeric value should not have a kAccountingNumeric or kOtherNumeric value as well.
(Source: UAX38) - kPseudoGB1
- A “GB 12345-90” code point assigned to this character for the purposes of including it within Unihan. Pseudo-GB1 codes were used to provide official code points for characters not already in national standards, such as characters used to write Cantonese, and so on. (Source: UAX38)
- kRSAdobe_Japan1_6
- Information on the glyphs in Adobe-Japan1-6 as contributed by Adobe. The value consists of a number of space-separated entries. Each entry consists of three pieces of information separated by a plus sign:
1) C or V. “C” indicates that the Unicode code point maps directly to the Adobe-Japan1-6 CID that appears after it, and “V” indicates that it is considered a variant form, and thus not directly encoded.
2) The Adobe-Japan1-6 CID.
3) Radical-stroke data for the indicated Adobe-Japan1-6 CID. The radical-stroke data consists of three pieces separated by periods: the KangXi radical (1-214), the number of strokes in the form the radical takes in the glyph, and the number of strokes in the residue. The standard Unicode radical-stroke form can be obtained by omitting the second value, and the total strokes in the glyph from adding the second and third values. (Source: UAX38) - kRSJapanese
- One or more Japanese radical/stroke counts for this character in the form “radical.additional strokes”. (Source: UAX38)
- kRSKangXi
- One or more KangXi radical/stroke counts for this character consistent with the value of the kKangXi field in the form “radical.additional strokes”. (Source: UAX38)
- kRSKanWa
- One or more Morohashi radical/stroke counts for this character in the form “radical.additional strokes”. (Source: UAX38)
- kRSKorean
- One or more Korean radical/stroke counts for this character in the form “radical.additional strokes”. (Source: UAX38)
- kRSUnicode
- One or more standard radical/stroke counts for this character in the form “radical.additional strokes”. The radical is indicated by a number in the range (1..214) inclusive. An apostrophe (') after the radical indicates a simplified version of the given radical. The “additional strokes” value is the residual stroke-count, the count of all strokes remaining after eliminating all strokes associated with the radical.
This field is also used for additional radical-stroke indices where either a character may be reasonably classified under more than one radical, or alternate stroke count algorithms may provide different stroke counts.
The first value is equal to the normative radical-stroke value defined in ISO/IEC 10646. (Source: UAX38) - kSBGY
- The position of this character in the Song Ben Guang Yun (SBGY) Medieval Chinese character dictionary (bibliographic and general information below).
The 25334 character references are given in the form “ABC.XY”, in which: “ABC” is the zero-padded page number [004..546]; “XY” is the zero-padded number of the character on the page [01..73]. For example, 364.38 indicates the 38th character on Page 364 (i.e. 澍). Where a given Unicode Scalar Value (USV) has more than one reference, these are space-delimited. (Source: UAX38) - kSemanticVariant
- The Unicode value for a semantic variant for this character. A semantic variant is an x- or y-variant with similar or identical meaning which can generally be used in place of the indicated character.
The basic syntax is a Unicode scalar value. It may optionally be followed by additional data. The additional data is separated from the Unicode scalar value by a less-than sign (<), and may be subdivided itself into substrings by commas, each of which may be divided into two pieces by a colon. The additional data consists of a series of field tags for another field in the Unihan database indicating the source of the information. If subdivided, the final piece is a string consisting of the letters T (for tòng, U+540C 同) B (for bù, U+4E0D 不), Z (for zhèng, U+6B63 正), F (for fán, U+7E41 繁), or J (for jiǎn U+7C21 簡/U+7B80 简).
T is used if the indicated source explicitly indicates the two are the same (e.g., by saying that the one character is “the same as” the other).
B is used if the source explicitly indicates that the two are used improperly one for the other.
Z is used if the source explicitly indicates that the given character is the preferred form. Thus, kHanYu indicates that U+5231 刱 and U+5275 創 are semantic variants and that U+5275 創 is the preferred form.
F is used if the source explicitly indicates that the given character is the traditional form.
J is used if the source explicitly indicates that the given character is the simplified form.
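The syntax above (a scalar value, an optional less-than sign, comma-separated source tags, and an optional colon followed by letters) can be taken apart with plain string operations. This is a minimal sketch; the function name `parse_variant` and the sample string are made up to match the described syntax and are not quoted from the Unihan data.

```python
def parse_variant(value):
    """Split one variant value into (code_point, [(source_tag, letters), ...])."""
    target, _, extra = value.partition("<")
    if not target.startswith("U+"):
        raise ValueError(f"expected a U+XXXX scalar value: {value!r}")
    code_point = int(target[2:], 16)
    sources = []
    if extra:
        for item in extra.split(","):
            # The part after the colon, if any, holds the letters T, B, Z, F or J.
            tag, _, letters = item.partition(":")
            sources.append((tag, letters))
    return code_point, sources

# Hypothetical sample value, constructed for illustration.
code_point, sources = parse_variant("U+5275<kHanYu:Z,kMatthews")
print(hex(code_point), sources)
```

A value without a less-than sign yields an empty source list, so the same helper also covers the bare form.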
Data on simplified and traditional variations can be included in this field to document cases where different sources disagree on the nature of the relationship between two characters. The kSemanticVariant and kSpecializedSemanticVariant fields need not be consulted when interconverting between traditional and simplified Chinese. (Source: UAX38) - kSimplifiedVariant
- The Unicode value(s) for the simplified Chinese variant(s) for this character. A full discussion of the kSimplifiedVariant and kTraditionalVariant fields is found in section 3.7.1 of UAX38.
Much of the data on simplified and traditional variants was graciously supplied by Wenlin Institute, Inc. <https://www.wenlin.com>. (Source: UAX38) - kSpecializedSemanticVariant
- The Unicode value for a specialized semantic variant for this character. The syntax is the same as for the kSemanticVariant field.
A specialized semantic variant is an x- or y-variant with similar or identical meaning only in certain contexts (such as accountants’ numerals).
(Source: UAX38) - kTaiwanTelegraph
- The Taiwanese telegraph code for this character, derived from “Kanzi denpou koudo henkan-hyou” (“Chinese character telegraph code conversion table”), Lin Jinyi, KDD Engineering and Consulting, Tokyo, 1984.
(Source: UAX38) - kTang
- The Tang dynasty pronunciation(s) of this character, derived from or consistent with _T’ang Poetic Vocabulary_ by Hugh M. Stimson, Far Eastern Publications, Yale Univ. 1976. An asterisk indicates that the word or morpheme represented in toto or in part by the given character with the given reading occurs more than four times in the seven hundred poems covered. (Source: UAX38)
- kTotalStrokes
- The total number of strokes in the character (including the radical), that is, the stroke count most commonly associated with the character in modern text using customary fonts.
Multiple Value Order: When there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both. (Source: UAX38) - kTraditionalVariant
- The Unicode value(s) for the traditional Chinese variant(s) for this character. A full discussion of the kSimplifiedVariant and kTraditionalVariant fields is found in section 3.7.1 of UAX38.
Much of the data on simplified and traditional variants was graciously supplied by Wenlin Institute, Inc. <https://www.wenlin.com>. (Source: UAX38) - kVietnamese
- The character’s pronunciation(s) in Quốc ngữ. (Source: UAX38)
- kXerox
- The Xerox code for this character. (Source: UAX38)
- kXHC1983
- One or more Hànyǔ Pīnyīn readings as given in the Xiàndài Hànyǔ Cídiǎn.
Each pīnyīn reading is preceded by the character’s location(s) in the dictionary, separated from the reading by “:” (colon); multiple locations for a given reading are separated by “,” (comma); multiple “location: reading” values are separated by “ ” (space). Each location reference is of the form /[0-9]{4}\.[0-9]{3}\*?/ . The number preceding the period is the page number, zero-padded to four digits. The first two digits of the number following the period are the entry’s position on the page, zero-padded. The third digit is 0 for a main entry and greater than 0 for a parenthesized variant of the main entry. A trailing “*” (asterisk) on the location indicates an encoded variant substituted for an unencoded character (see below). (Source: UAX38) - kZVariant
- The Unicode value(s) for known z-variants of this character.
The basic syntax is a Unicode scalar value. It may optionally be followed by additional data. The additional data is separated from the Unicode scalar value by a less-than sign (<), and may be subdivided itself into substrings by commas. The additional data consists of a series of field tags for another field in the Unihan database indicating the source of the information. (Source: UAX38) - Line Break
- Properties for line breaking. For more information, see Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [UAX14]. (Source: UAX44)
- Logical Order Exception
- A small number of spacing vowel letters occurring in certain Southeast Asian scripts such as Thai and Lao, which use a visual order display model. These letters are stored in text ahead of syllable-initial consonants, and require special handling for processes such as searching and sorting. (Source: UAX44)
- Lowercase
- Characters with the Lowercase property. For more information, see Chapter 4, Character Properties in [Unicode].
Generated from: Ll + Other Lowercase
(Source: UAX44) - Lowercase Mapping
- Titlecase Mapping
- Uppercase Mapping
- Data for producing (in combination with the simple case mappings from UnicodeData.txt) the full case mappings. (Source: UAX44)
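The difference between simple and full case mappings shows up when a mapping produces more than one character. The sketch below uses Python 3's built-in `str.upper()`, which returns multi-character results for such characters; treating this as an illustration of the full mappings is an assumption about the Python runtime, not a statement from UAX44.

```python
# U+00DF LATIN SMALL LETTER SHARP S and U+FB01 LATIN SMALL LIGATURE FI
# each uppercase to two characters, which a single-character (simple)
# mapping cannot express.
samples = ["\u00df", "\ufb01"]

for ch in samples:
    upper = ch.upper()
    print(f"U+{ord(ch):04X} -> {upper!r} ({len(upper)} characters)")
```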
- Math
- Characters with the Math property. For more information, see Chapter 4, Character Properties in [Unicode].
Generated from: Sm + Other Math
(Source: UAX44) - Name
- These names match exactly the names published in the code charts of the Unicode Standard. The derived Hangul Syllable names are omitted from this file; see Jamo.txt for their derivation. (Source: UAX44)
- Name Alias
- Normative formal aliases for characters with erroneous names, for control characters and some format characters, and for character abbreviations, as described in Chapter 4, Character Properties in [Unicode]. The aliases tagged with the type "correction" exactly match the formal aliases published in the Unicode Standard code charts. (Source: UAX44)
- NFKC Casefold
- A mapping designed for best behavior when doing caseless matching of strings interpreted as identifiers. (Abbreviated name: NFKC_CF)
For the definition of the related string transform toNFKC_Casefold() based on this mapping, see Section 3.13, Default Case Algorithms in [Unicode].
The mapping is listed in Field 2.
(Source: UAX44) - NFC Quick Check
- NFKC Quick Check
- NFD Quick Check
- NFKD Quick Check
- For property values, see Decompositions and Normalization. (Abbreviated names: NFD_QC, NFKD_QC, NFC_QC, NFKC_QC) (Source: UAX44)
- Noncharacter Code Point
- Code points permanently reserved for internal use. (Source: UAX44)
- Numeric Type
- Numeric Value
- If the character has the property value Numeric_Type=Decimal, then the Numeric_Value of that digit is represented with an integer value (limited to the range 0..9) in fields 6, 7, and 8. Characters with the property value Numeric_Type=Decimal are restricted to digits which can be used in a decimal radix positional numeral system and which are encoded in the standard in a contiguous ascending range 0..9. See the discussion of decimal digits in Chapter 4, Character Properties in [Unicode]. (Source: UAX44)
- If the character has the property value Numeric_Type=Digit, then the Numeric_Value of that digit is represented with an integer value (limited to the range 0..9) in fields 7 and 8, and field 6 is null. This covers digits that need special handling, such as the compatibility superscript digits. (Source: UAX44)
- If the character has the property value Numeric_Type=Numeric, then the Numeric_Value of that character is represented with a positive or negative integer or rational number in this field, and fields 6 and 7 are null. This includes fractions such as, for example, "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.
Some characters have these properties based on values from the Unihan data files. See Numeric Type Han.
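The three cases above can be observed with Python's standard `unicodedata` module, whose `decimal()`, `digit()` and `numeric()` functions roughly correspond to fields 6, 7 and 8. This is a minimal sketch; the helper name `numeric_fields` is hypothetical, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def numeric_fields(ch):
    """Return (decimal, digit, numeric) for a character, using None for null fields."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

# DIGIT SEVEN: Numeric_Type=Decimal, all three fields are set.
print(numeric_fields("7"))
# SUPERSCRIPT TWO: Numeric_Type=Digit, the decimal field is null.
print(numeric_fields("\u00b2"))
# VULGAR FRACTION ONE FIFTH: Numeric_Type=Numeric, only the last field is set.
print(numeric_fields("\u2155"))
```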
(Source: UAX44) - Numeric Type (Han)
- Numeric Value (Han)
- The characters tagged with either kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric are given the property value Numeric_Type=Numeric, and the Numeric_Value indicated in those tags.
Most characters have these numeric properties based on values from UnicodeData.txt. See Numeric_Type.
(Source: UAX44) - Other Alphabetic
- Used in deriving the Alphabetic property. (Source: UAX44)
- Other Default Ignorable Code Point
- Used in deriving the Default_Ignorable_Code_Point property. (Source: UAX44)
- Other Grapheme Extend
- Used in deriving the Grapheme_Extend property. (Source: UAX44)
- Other ID Continue
- Used to maintain backward compatibility of ID Continue. (Source: UAX44)
- Other ID Start
- Used to maintain backward compatibility of ID Start. (Source: UAX44)
- Other Lowercase
- Used in deriving the Lowercase property. (Source: UAX44)
- Other Math
- Used in deriving the Math property. (Source: UAX44)
- Other Uppercase
- Used in deriving the Uppercase property. (Source: UAX44)
- Pattern Syntax
- Pattern White Space
- Used for pattern syntax as described in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" [UAX31]. (Source: UAX44)
- Plane
- A Unicode plane is one of 17 contiguous sets of 65,536 code points each. Most assigned characters are in the first few planes (the Basic Multilingual Plane and the supplementary planes that follow it). The last two planes are reserved for private use.
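Because each plane holds 65,536 (2^16) code points, the plane number of a code point is simply its value shifted right by 16 bits. A minimal sketch, with a hypothetical helper name `plane_of`:

```python
def plane_of(code_point):
    """Return the plane number (0..16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane spans 0x10000 code points, so the plane is the high bits.
    return code_point >> 16

print(plane_of(ord("A")))   # Basic Multilingual Plane
print(plane_of(0x1F600))    # first supplementary plane
print(plane_of(0x10FFFF))   # last plane
```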
- Private Use
- So-called “private use” areas are ranges of Unicode code points that are deliberately not assigned to characters. Application developers can use these code points to add their own extensions to Unicode.
- Quotation Mark
- Punctuation characters that function as quotation marks. (Source: UAX44)
- Radical
- Used in Ideographic Description Sequences. (Source: UAX44)
- Script
- Script values for use in regular expressions and elsewhere. For more information, see Unicode Standard Annex #24, "Unicode Script Property" [UAX24]. (Source: UAX44)
- Script Extensions
- Enumerated sets of Script values for use in regular expressions and elsewhere. For more information, see Unicode Standard Annex #24, "Unicode Script Property" [UAX24]. (Source: UAX44)
- Sentence Break
- See Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29] (Source: UAX44)
- Simple Lowercase Mapping
- Simple lowercase mapping (single character result). (Source: UAX44)
- Simple Titlecase Mapping
- Simple titlecase mapping (single character result).
Note: If this field is null, then the Simple_Titlecase_Mapping is the same as the Simple_Uppercase_Mapping for this character.
(Source: UAX44) - Simple Uppercase Mapping
- Simple uppercase mapping (single character result).
If a character is part of an alphabet with case distinctions, and has a simple uppercase equivalent, then the uppercase equivalent is in this field. The simple mappings have a single character result, where the full mappings may have multi-character results. For more information, see Case and Case Mapping. (Source: UAX44) - Soft Dotted
- Characters with a "soft dot", like i or j. An accent placed on these characters causes the dot to disappear. An explicit dot above can be added where required, such as in Lithuanian. (Source: UAX44)
- STerm
- Sentence Terminal. Used in Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29]. (Source: UAX44)
- Terminal Punctuation
- Punctuation characters that generally mark the end of textual units. (Source: UAX44)
- Unicode
- A standard that maps characters to code points, that is, to numeric representations. The Unicode Standard is maintained by the Unicode Consortium and is kept synchronized with the international standard ISO/IEC 10646.
- Unicode 1 Name
- Old name as published in Unicode 1.0. This name is only provided when it is significantly different from the current name for the character. The value of field 10 for control characters does not always match the Unicode 1.0 names. Instead, field 10 contains ISO 6429 names for control functions, for printing in the code charts. (Source: UAX44)
- Unicode Radical Stroke
- The Unicode radical-stroke count, based on the tag kRSUnicode. (Source: UAX44)
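The kRSUnicode entry describes the value as “radical.additional strokes”, with a radical in the range 1..214 and an optional apostrophe marking a simplified radical. The sketch below parses that form; the name `parse_rs_unicode` and the sample string are hypothetical, and the pattern covers only what the glossary text describes.

```python
import re

# radical number, optional apostrophe (simplified radical), period,
# then the residual stroke count.
RS_RE = re.compile(r"([0-9]{1,3})('?)\.([0-9]+)")

def parse_rs_unicode(field):
    """Parse space-separated values into (radical, simplified, strokes) tuples."""
    results = []
    for item in field.split():
        match = RS_RE.fullmatch(item)
        if match is None:
            raise ValueError(f"not a radical.strokes value: {item!r}")
        radical = int(match.group(1))
        if not 1 <= radical <= 214:
            raise ValueError(f"radical out of range 1..214: {radical}")
        results.append((radical, match.group(2) == "'", int(match.group(3))))
    return results

# Illustrative value only; it is not quoted from the Unihan data.
print(parse_rs_unicode("120'.3 85.8"))
```

When several values are present, the first tuple corresponds to the normative radical-stroke value mentioned in the kRSUnicode entry.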
- Unified Ideograph
- A property which specifies the exact set of Unified CJK Ideographs in the standard. This set excludes CJK Compatibility Ideographs (which have canonical decompositions to Unified CJK Ideographs), as well as characters from the CJK Symbols and Punctuation block. The property is used in the definition of Ideographic Description Sequences. (Source: UAX44)
- Uppercase
- Characters with the Uppercase property. For more information, see Chapter 4, Character Properties in [Unicode].
Generated from: Lu + Other Uppercase
(Source: UAX44) - Variation Selector
- Indicates characters that are Variation Selectors. For details on the behavior of these characters, see StandardizedVariants.html, Section 16.4, Variation Selectors in [Unicode], and Unicode Technical Standard #37, "Unicode Ideographic Variation Database" [UTS37]. (Source: UAX44)
- White Space
- Spaces, separator characters and other control characters which should be treated by programming languages as "white space" for the purpose of parsing elements. See also Line Break, Grapheme Cluster Break, Sentence Break, and Word Break, which classify space characters and related controls somewhat differently for particular text segmentation contexts. (Source: UAX44)
- Word Break
- See Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29] (Source: UAX44)