Glossary of Terms

This page gives an overview of the terms and concepts presented on this site. Many definitions are taken directly from the respective specification at the Unicode website and may remain a bit technical.

Age
This property shows when various code points were designated/assigned in successive versions of the Unicode Standard.

The Age property is normative in the sense that it is completely specified based on when a character is encoded in the standard. However, DerivedAge.txt is provided for information. The value of the Age property for a code point can be derived by analysis of successive versions of the UCD, and Age is not used normatively in the specification of any Unicode algorithm.

(Source: UAX44)
Alphabetic
Characters with the Alphabetic property. For more information, see Chapter 4, Character Properties in [Unicode].

Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other Alphabetic

(Source: UAX44)
ASCII Hex Digit
ASCII characters commonly used for the representation of hexadecimal numbers. (Source: UAX44)
Bidi Class
These are the categories required by the Unicode Bidirectional Algorithm. For the property values, see Bidirectional Class Values. For more information, see Unicode Standard Annex #9, "The Unicode Bidirectional Algorithm" [UAX9].

The default property values depend on the code point, and are explained in DerivedBidiClass.txt

(Source: UAX44)
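As a small illustration, Python's standard `unicodedata` module exposes this property through `unicodedata.bidirectional()`. The sketch below only looks values up; the helper name `bidi_class` is an illustrative choice, and the values returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the short Bidi_Class value (such as 'L', 'R' or 'AL') of one character."""
    return unicodedata.bidirectional(ch)


# Latin letter, Hebrew letter, Arabic letter, European digit
for ch in ("A", "\u05d0", "\u0627", "1"):
    print(f"U+{ord(ch):04X} -> {bidi_class(ch)}")
```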
Bidi Control
Format control characters which have specific functions in the Unicode Bidirectional Algorithm [UAX9]. (Source: UAX44)
Bidi Mirrored
If the character is a "mirrored" character in bidirectional text, this field has the value "Y"; otherwise "N". See Section 4.7, Bidi Mirrored—Normative of [Unicode]. Do not confuse this with the Bidi Mirroring Glyph property. (Source: UAX44)
Bidi Mirroring Glyph
Informative mapping for substituting characters in an implementation of bidirectional mirroring. This maps a subset of characters with the Bidi_Mirrored property to other characters that normally are displayed with the corresponding mirrored glyph. When a character with the Bidi_Mirrored property has the default value for Bidi_Mirroring_Glyph, that means that no other character exists whose glyph is appropriate for character-based glyph mirroring. Implementations must then use other mechanisms to implement mirroring of those characters for the Unicode Bidirectional Algorithm. See Unicode Standard Annex #9, "The Unicode Bidirectional Algorithm" [UAX9]. Do not confuse this property with the Bidi Mirrored property itself. (Source: UAX44)
Block
List of block names, which are arbitrary names for ranges of code points. See the code charts in [Unicode]. (Source: UAX44)
Canonical Combining Class
The classes used for the Canonical Ordering Algorithm in the Unicode Standard. This property could be considered either an enumerated property or a numeric property: the principal use of the property is in terms of the numeric values. For the property value names associated with different numeric values, see DerivedCombiningClass.txt and Canonical Combining Class Values. (Source: UAX44)
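To make the numeric use of this property concrete, here is a minimal sketch based on Python's `unicodedata.combining()`, which returns the class as an integer. It also shows canonical ordering at work: NFD normalization sorts adjacent combining marks by ascending class. The helper name `combining_classes` is illustrative.

```python
import unicodedata


def combining_classes(text: str) -> list[int]:
    """Return the Canonical_Combining_Class of every code point in text."""
    return [unicodedata.combining(ch) for ch in text]


# 'a' followed by COMBINING ACUTE ACCENT (class 230) and COMBINING DOT BELOW (class 220)
unordered = "a\u0301\u0323"
# NFD applies canonical ordering, so the class-220 mark moves before the class-230 mark
ordered = unicodedata.normalize("NFD", unordered)

print(combining_classes(unordered))
print(combining_classes(ordered))
```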
Case Folding
Simple Case Folding
Mapping from characters to their case-folded forms. This is an informative file containing normative derived properties.

Derived from UnicodeData and SpecialCasing.

Note: The case foldings are omitted in the data file if they are the same as the code point itself.

(Source: UAX44)
Case Ignorable
Characters which are ignored for casing purposes. For more information, see D136 in Section 3.13, Default Case Algorithms in [Unicode].

Generated from: Mn + Me + Cf + Lm + Sk + Word Break=MidLetter + Word Break=MidNumLet

(Source: UAX44)
Cased
Characters which are considered to be either uppercase, lowercase or titlecase characters. This property is not identical to the Changes_When_Casemapped property. For more information, see D135 in Section 3.13, Default Case Algorithms in [Unicode].

Generated from: Lowercase + Uppercase + Lt

(Source: UAX44)
Changes When Casefolded
Characters whose normalized forms are not stable under case folding. For more information, see D142 in Section 3.13, Default Case Algorithms in [Unicode].

Generated from: toCasefold(toNFD(X)) != toNFD(X)

(Source: UAX44)
Changes When Casemapped
Characters which may change when they undergo case mapping. For more information, see D143 in Section 3.13, Default Case Algorithms in [Unicode].

Generated from: Changes_When_Lowercased(X) or Changes_When_Uppercased(X) or Changes_When_Titlecased(X)

(Source: UAX44)
Changes When Lowercased
Characters whose normalized forms are not stable under a toLowercase mapping. For more information, see D139 in Section 3.13, Default Case Algorithms in [Unicode].

Generated from: toLowercase(toNFD(X)) != toNFD(X)

(Source: UAX44)
Changes When NFKC Casefolded
Characters which are not identical to their NFKC_Casefold mapping.

Generated from: (cp != NFKC_CaseFold(cp))

(Source: UAX44)
Changes When Titlecased
Characters whose normalized forms are not stable under a toTitlecase mapping. For more information, see D141 in Section 3.13, Default Case Algorithms in [Unicode].

Generated from: toTitlecase(toNFD(X)) != toNFD(X)

(Source: UAX44)
Changes When Uppercased
Characters whose normalized forms are not stable under a toUppercase mapping. For more information, see D140 in Section 3.13, Default Case Algorithms in [Unicode].

Generated from: toUppercase(toNFD(X)) != toNFD(X)

(Source: UAX44)
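The "Generated from" formulas for the lowercase, uppercase and case-folding variants can be approximated in Python. This is a sketch, not the derivation used for the official data files: it assumes that `str.lower()`, `str.upper()` and `str.casefold()` are acceptable stand-ins for toLowercase, toUppercase and toCasefold, and its results depend on the Unicode version shipped with the interpreter. Titlecasing is left out because `str.title()` applies its own word-boundary logic.

```python
import unicodedata


def _nfd(ch: str) -> str:
    """Canonical decomposition, the toNFD(X) step of the formulas."""
    return unicodedata.normalize("NFD", ch)


def changes_when_lowercased(ch: str) -> bool:
    # toLowercase(toNFD(X)) != toNFD(X)
    decomposed = _nfd(ch)
    return decomposed.lower() != decomposed


def changes_when_uppercased(ch: str) -> bool:
    # toUppercase(toNFD(X)) != toNFD(X)
    decomposed = _nfd(ch)
    return decomposed.upper() != decomposed


def changes_when_casefolded(ch: str) -> bool:
    # toCasefold(toNFD(X)) != toNFD(X)
    decomposed = _nfd(ch)
    return decomposed.casefold() != decomposed
```

For example, "ß" is stable under lowercasing but changes under both uppercasing and case folding.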
Code Point
A number in the Unicode standard denoting one single character. A code point is different from a Glyph.
Composition Exclusion
A property used in normalization. See Unicode Standard Annex #15: "Unicode Normalization Forms" [UAX15]. Unlike other files, CompositionExclusions.txt simply lists the relevant code points. (Source: UAX44)
Dash
Punctuation characters explicitly called out as dashes in the Unicode Standard, plus their compatibility equivalents. Most of these have the General_Category value Pd, but some have the General_Category value Sm because of their use in mathematics. (Source: UAX44)
Decomposition Type
Decomposition Mapping
This field contains both values, with the type in angle brackets. The decomposition mappings exactly match the decomposition mappings published with the character names in the Unicode Standard. For more information, see Character Decomposition Mappings. (Source: UAX44)
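The combined field can be split into its two parts. The sketch below reads the raw field with Python's `unicodedata.decomposition()`, which returns a string such as `<compat> 0066 0069`, and separates the optional type tag from the mapping. The function name and the return shape are illustrative choices.

```python
import unicodedata


def parse_decomposition(ch: str):
    """Split the decomposition field into (type, [code points]).

    The type is None when the field carries no <tag>, which is how
    canonical mappings are written. Characters without a decomposition
    yield (None, []).
    """
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None, []
    parts = raw.split()
    dtype = None
    if parts[0].startswith("<"):
        dtype = parts[0].strip("<>")
        parts = parts[1:]
    return dtype, [int(p, 16) for p in parts]


print(parse_decomposition("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
print(parse_decomposition("\ufb01"))  # LATIN SMALL LIGATURE FI
```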
Default Ignorable Code Point
For programmatic determination of default ignorable code points. New characters that should be ignored in rendering (unless explicitly supported) will be assigned in these ranges, permitting programs to correctly handle the default rendering of such characters when not otherwise supported. For more information, see the FAQ Display of Unsupported Characters, and Section 5.21, Default Ignorable Code Points in [Unicode].

Generated from

Other Default Ignorable Code Point
+ Cf (format characters)
+ Variation_Selector
- White_Space
- FFF9..FFFB (annotation characters)
- 0600..0604, 06DD, 070F, 110BD (exceptional Cf characters that should be visible)

(Source: UAX44)
Deprecated
For a machine-readable list of deprecated characters. No characters will ever be removed from the standard, but the usage of deprecated characters is strongly discouraged. (Source: UAX44)
Diacritic
Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics. (Source: UAX44)
East Asian Width
Properties for determining the choice of wide versus narrow glyphs in East Asian contexts. Property values are described in Unicode Standard Annex #11, "East Asian Width" [UAX11]. (Source: UAX44)
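A short sketch of how this property is often consumed: Python's `unicodedata.east_asian_width()` returns the short value (for instance `W`, `F`, `Na`, `H`, `A`, `N`). The column-counting helper below assumes that wide and fullwidth characters take two terminal columns and everything else one. That is a common convention rather than something this glossary or UAX #11 prescribes, and it ignores combining marks and ambiguous-width characters.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Rough column count: 2 for Wide/Fullwidth characters, 1 for all others (assumption)."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


print(unicodedata.east_asian_width("A"))       # narrow Latin letter
print(unicodedata.east_asian_width("\u4e2d"))  # CJK ideograph
print(display_columns("abc"), display_columns("\u4e2d\u6587"))
```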
Expands On NFC (deprecated)
Expands On NFD (deprecated)
Expands On NFKC (deprecated)
Expands On NFKD (deprecated)
Characters that expand to more than one character in the specified normalization form. (Source: UAX44)
Extender
Characters whose principal function is to extend the value or shape of a preceding alphabetic character. Typical of these are length and iteration marks. (Source: UAX44)
FC NFKC Closure (deprecated)
Characters that require extra mappings for closure under Case Folding plus Normalization Form KC.

The mapping is listed in Field 2.

(Source: UAX44)
Full Composition Exclusion
Characters that are excluded from composition: those listed explicitly in CompositionExclusions.txt, plus the derivable sets of Singleton Decompositions and Non-Starter Decompositions, as documented in that data file. (Source: UAX44)
General Category
This is a useful breakdown into various character types which can be used as a default categorization in implementations. For the property values, see General Category Values. (Source: UAX44)
Glyph
The representation of a code point under certain circumstances. For example, the letter “A” looks quite different in Latin and blackletter fonts. Both are different glyphs for the same underlying character.
Grapheme Base
Property used together with the definition of Standard Korean Syllable Block to define "Grapheme base". See D58 in Chapter 3, Conformance in [Unicode].

Generated from: [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp - Grapheme Extend

Note: Grapheme_Base is a property of individual characters. That usage contrasts with "grapheme base", which is an attribute of Unicode strings; a grapheme base may consist of a Korean syllable which is itself represented by a sequence of conjoining jamos.

(Source: UAX44)
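The "Generated from" formula above can be sketched with General_Category lookups. Python's `unicodedata` module does not expose Grapheme_Extend, so this example approximates it as Me + Mn and leaves out Other_Grapheme_Extend. It will therefore disagree with the official property for a small number of characters and should be read as an approximation only.

```python
import unicodedata

# Categories subtracted directly in the formula
EXCLUDED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}
# Stand-in for Grapheme_Extend (assumption: Other_Grapheme_Extend is ignored)
APPROX_GRAPHEME_EXTEND = {"Me", "Mn"}


def approx_grapheme_base(ch: str) -> bool:
    """Approximate Grapheme_Base for a single character."""
    category = unicodedata.category(ch)
    return (
        category not in EXCLUDED_CATEGORIES
        and category not in APPROX_GRAPHEME_EXTEND
    )
```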
Grapheme Cluster Break
See Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29] (Source: UAX44)
Grapheme Extend
Property used to define "Grapheme extender". See D59 in Chapter 3, Conformance in [Unicode].

Generated from: Me + Mn + Other Grapheme Extend

Note: The set of characters for which Grapheme_Extend=Yes is equivalent to the set of characters for which Grapheme_Cluster_Break=Extend.

(Source: UAX44)
Grapheme Link (deprecated)
Formerly proposed for programmatic determination of grapheme cluster boundaries.

Generated from: Canonical_Combining_Class=Virama

(Source: UAX44)
Hangul Syllable Type
The values L, V, T, LV, and LVT used in Chapter 3, Conformance in [Unicode]. (Source: UAX44)
Hex Digit
Characters commonly used for the representation of hexadecimal numbers, plus their compatibility equivalents. (Source: UAX44)
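The difference between ASCII Hex Digit and Hex Digit can be shown with two small sets. The ranges below reflect my reading of the property data (the ASCII digits and letters, plus their fullwidth compatibility equivalents); treat them as an assumption and check PropList.txt before relying on them.

```python
def _chars(first: int, last: int) -> set[str]:
    """All characters in the inclusive code point range first..last."""
    return {chr(cp) for cp in range(first, last + 1)}


# 0-9, A-F, a-f
ASCII_HEX_DIGIT = _chars(0x30, 0x39) | _chars(0x41, 0x46) | _chars(0x61, 0x66)

# The ASCII set plus the fullwidth forms of the same digits and letters (assumed ranges)
HEX_DIGIT = (
    ASCII_HEX_DIGIT
    | _chars(0xFF10, 0xFF19)
    | _chars(0xFF21, 0xFF26)
    | _chars(0xFF41, 0xFF46)
)

print("\uff26" in HEX_DIGIT, "\uff26" in ASCII_HEX_DIGIT)
```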
Hyphen (deprecated, stabilized)
Dashes which are used to mark connections between pieces of words, plus the Katakana middle dot. The Katakana middle dot functions like a hyphen, but is shaped like a dot rather than a dash. (Source: UAX44)
ID Start
ID Continue
XID Start
XID Continue
Used to determine programming identifiers, as described in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" [UAX31]. (Source: UAX44)
Ideographic
Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs. This property roughly defines the class of "Chinese characters" and does not include characters of other logographic scripts such as Cuneiform or Egyptian Hieroglyphs. (Source: UAX44)
IDS Binary Operator
Used in Ideographic Description Sequences. (Source: UAX44)
IDS Trinary Operator
Used in Ideographic Description Sequences. (Source: UAX44)
Indic Positional Category
A property defining the placement categories for dependent vowels, viramas, combining marks, and other characters used in Indic scripts. (Source: UAX44)
Indic Syllabic Category
A property defining the structural categories of syllabic components in Indic scripts. (Source: UAX44)
ISO Comment (deprecated, stabilized)
ISO 10646 comment field. It was used for notes that appeared in parentheses in the 10646 names list, or contained an asterisk to mark an Annex P note.

As of Unicode 5.2.0, this field no longer contains any non-null values.

(Source: UAX44)
Join Control
Format control characters which have specific functions for control of cursive joining and ligation. (Source: UAX44)
Joining Type
Joining Group
Basic Arabic and Syriac character shaping properties, such as initial, medial and final shapes. See Section 8.2, Arabic in [Unicode]. (Source: UAX44)
Jamo Short Name
The Hangul Syllable names are derived from the Jamo Short Names, as described in Chapter 3, Conformance in [Unicode]. (Source: UAX44)
kAccountingNumeric
The value of the character when used in the writing of accounting numerals.

Accounting numerals are used in East Asia to prevent fraud. Because a number like ten (十) is easily turned into one thousand (千) with a stroke of a brush, monetary documents will often use an accounting form of the numeral ten (such as 拾) in its place.

The three numeric-value fields should have no overlap; that is, characters with a kAccountingNumeric value should not have a kPrimaryNumeric or kOtherNumeric value as well.
(Source: UAX38)
kBigFive
The Big Five mapping for this character in hex; note that this does not cover any of the Big Five extensions in common use, including the ETEN extensions. (Source: UAX38)
kCangjie
The cangjie input code for the character. This incorporates data from the file cangjie-table.b5 by Christian Wittern. (Source: UAX38)
kCantonese
The Cantonese pronunciation(s) for this character using the jyutping romanization.

A full description of jyutping can be found at <http://www.lshk.org/cantonese.php>. The main differences between jyutping and the Yale romanization previously used are:

1) Jyutping always uses tone numbers and does not distinguish the high falling and high level tones.
2) Jyutping always writes a long a as “aa”.
3) Jyutping uses “oe” and “eo” for the Yale “eu” vowel.
4) Jyutping uses “c” instead of “ch”, “z” instead of “j”, and “j” instead of “y” as initials.
5) A non-null initial is always explicitly written (thus “jyut” in jyutping instead of Yale’s “yut”).

Cantonese pronunciations are sorted alphabetically, not in order of frequency. (Source: UAX38)
kCCCII
The CCCII mapping for this character in hex. (Source: UAX38)
kCheungBauer
Data regarding the character in Cheung Kwan-hin and Robert S. Bauer, _The Representation of Cantonese with Chinese Characters_, Journal of Chinese Linguistics, Monograph Series Number 18, 2002. Each data value consists of three pieces, separated by semicolons: (1) the character’s radical-stroke index as a three-digit radical, slash, two-digit stroke count; (2) the character’s cangjie input code (if any); and (3) a comma-separated list of Cantonese readings using the jyutping romanization in alphabetical order. (Source: UAX38)
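The three-part layout of a kCheungBauer value lends itself to a small parser. The following is a sketch based only on the description above; the sample value in the usage line is made up for illustration and is not taken from the Unihan data.

```python
from typing import NamedTuple


class CheungBauer(NamedTuple):
    radical: int         # three-digit radical number
    strokes: int         # two-digit stroke count
    cangjie: str         # may be empty if no cangjie code is given
    readings: list[str]  # jyutping readings, alphabetical


def parse_cheung_bauer(value: str) -> CheungBauer:
    """Split 'RRR/SS;cangjie;reading,reading' into its pieces."""
    rad_stroke, cangjie, readings = value.split(";")
    radical, strokes = rad_stroke.split("/")
    return CheungBauer(int(radical), int(strokes), cangjie, readings.split(","))


# Hypothetical value, for illustration only
print(parse_cheung_bauer("030/05;RMG;gaa1,gaa3"))
```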
kCheungBauerIndex
The position of the character in Cheung Kwan-hin and Robert S. Bauer, _The Representation of Cantonese with Chinese Characters_, Journal of Chinese Linguistics, Monograph Series Number 18, 2002. The format is a three-digit page number followed by a two-digit position number, separated by a period. (Source: UAX38)
kCihaiT
The position of this character in the Cihai (辭海) dictionary, single volume edition, published in Hong Kong by the Zhonghua Bookstore, 1983 (reprint of the 1947 edition), ISBN 962-231-005-2.

The position is indicated by a decimal number. The digits to the left of the decimal are the page number. The first digit after the decimal is the row on the page, and the remaining two digits after the decimal are the position on the row. (Source: UAX38)
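The position encoding described above can be unpacked as follows. This is a minimal sketch that assumes exactly three digits after the decimal point; the sample value is hypothetical.

```python
import re

# page "." row (1 digit) position-in-row (2 digits)
_CIHAI_POSITION = re.compile(r"(\d+)\.(\d)(\d{2})")


def parse_cihai_position(value: str) -> tuple[int, int, int]:
    """Return (page, row, position on row) for a kCihaiT value."""
    match = _CIHAI_POSITION.fullmatch(value)
    if match is None:
        raise ValueError(f"not a kCihaiT position: {value!r}")
    page, row, position = (int(group) for group in match.groups())
    return page, row, position


# Hypothetical value: page 1004, row 2, third character on the row
print(parse_cihai_position("1004.203"))
```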
kCNS1986
The CNS 11643-1986 mapping for this character in hex. (Source: UAX38)
kCNS1992
The CNS 11643-1992 mapping for this character in hex. (Source: UAX38)
kCompatibilityVariant
The compatibility decomposition for this ideograph, derived from the UnicodeData.txt file. (Source: UAX38)
kCowles
The index or indices of this character in Roy T. Cowles, A Pocket Dictionary of Cantonese, Hong Kong: University Press, 1999.

Approximately 100 characters from Cowles which are not currently encoded are being submitted to the IRG by Unicode for inclusion in future versions of the standard. (Source: UAX38)
kDaeJaweon
The position of this character in the Dae Jaweon (Korean) dictionary used in the four-dictionary sorting algorithm. The position is in the form “page.position” with the final digit in the position being “0” for characters actually in the dictionary and “1” for characters not found in the dictionary and assigned a “virtual” position in the dictionary.

Thus, “1187.060” indicates the sixth character on page 1187. A character not in this dictionary but assigned a position between the 6th and 7th characters on page 1187 for sorting purposes would have the code “1187.061”.

The edition used is the first edition, published in Seoul by Samseong Publishing Co., Ltd., 1988. (Source: UAX38)
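The “page.position” form with its trailing virtual-position digit can be decoded as in the sketch below. It assumes a two-digit position followed by a single flag digit, as in the examples above; the same layout is described for the Kang Xi fields in this glossary. The function name is illustrative.

```python
import re

# page "." position (2 digits) flag (1 digit)
_DICT_POSITION = re.compile(r"(\d+)\.(\d{2})(\d)")


def parse_dictionary_position(value: str) -> tuple[int, int, bool]:
    """Return (page, position on page, is_virtual) for a 'page.position' value."""
    match = _DICT_POSITION.fullmatch(value)
    if match is None:
        raise ValueError(f"not a page.position value: {value!r}")
    page = int(match.group(1))
    position = int(match.group(2))
    # A final digit of "0" marks a character that is really in the dictionary
    is_virtual = match.group(3) != "0"
    return page, position, is_virtual


print(parse_dictionary_position("1187.060"))
print(parse_dictionary_position("1187.061"))
```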
kDefinition
An English definition for this character. Definitions are for modern written Chinese and are usually (but not always) the same as the definition in other Chinese dialects or non-Chinese languages. In some cases, synonyms are indicated. Fuller variant information can be found using the various variant fields.

Definitions specific to non-Chinese languages or Chinese dialects other than modern Mandarin are marked, e.g., (Cant.) or (J).

Major definitions are separated by semicolons, and minor definitions by commas. Any valid Unicode character (except for tab, double-quote, and any line break character) may be used within the definition field. (Source: UAX38)
kEACC
The EACC mapping for this character in hex. (Source: UAX38)
kFenn
Data on the character from The Five Thousand Dictionary (aka Fenn’s Chinese-English Pocket Dictionary) by Courtenay H. Fenn, Cambridge, Mass.: Harvard University Press, 1979.

The data here consists of a decimal number followed by a letter A through K, the letter P, or an asterisk. The decimal number gives the Soothill number for the character’s phonetic, and the letter is a rough frequency indication, with A indicating the 500 most common ideographs, B the next five hundred, and so on.

P is used by Fenn to indicate a rare character included in the dictionary only because it is the phonetic element in other characters.

An asterisk is used instead of a letter in the final position to indicate a character which belongs to one of Soothill’s phonetic groups but is not found in Fenn’s dictionary.

Characters which have a frequency letter but no Soothill phonetic group are assigned group 0. (Source: UAX38)
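A kFenn value as described here can be decoded into its Soothill group and a frequency band. The sketch assumes that every letter from A through K covers a block of 500 ideographs, extrapolating the "and so on" above, and it follows only the format stated in this entry; the sample values are made up.

```python
import re

# Soothill number followed by A-K, P, or an asterisk
_FENN = re.compile(r"(\d+)([A-KP*])")


def parse_fenn(value: str):
    """Return (soothill_group, marker, frequency_band).

    frequency_band is an inclusive (first, last) rank pair for letters
    A through K, and None for 'P' (phonetic only) and '*' (not in Fenn).
    """
    match = _FENN.fullmatch(value)
    if match is None:
        raise ValueError(f"not a kFenn value: {value!r}")
    group = int(match.group(1))
    marker = match.group(2)
    band = None
    if marker in "ABCDEFGHIJK":
        index = ord(marker) - ord("A")
        band = (index * 500 + 1, (index + 1) * 500)  # assumed 500 per letter
    return group, marker, band


# Hypothetical values
print(parse_fenn("13A"))
print(parse_fenn("7P"))
```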
kFennIndex
The position of this character in _Fenn’s Chinese-English Pocket Dictionary_ by Courtenay H. Fenn, Cambridge, Mass.: Harvard University Press, 1942. The position is indicated by a three-digit page number followed by a period and a two-digit position on the page. (Source: UAX38)
kFourCornerCode
The four-corner code(s) for the character. This data is derived from data provided in the public domain by Hartmut Bohn, Urs App, and Christian Wittern.

The four-corner system assigns each character a four-digit code, with each digit ranging from 0 through 9. Each digit is derived from the “shape” of one of the four corners of the character (upper-left, upper-right, lower-left, lower-right). An optional fifth digit can be used to further distinguish characters; the fifth digit is derived from the shape in the character’s center or region immediately to the left of the fourth corner.

The four-corner system is now used only rarely. Full descriptions are available online, e.g., at <http://en.wikipedia.org/wiki/Four_corner_input>.

Values in this field consist of four decimal digits, optionally followed by a period and fifth digit for a five-digit form. (Source: UAX38)
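The value syntax just described (four digits, optionally a period and a fifth digit) can be checked and split with a short regular expression. This sketch handles a single code; the sample values are hypothetical.

```python
import re

# four corner digits, optionally "." and a fifth digit
_FOUR_CORNER = re.compile(r"(\d{4})(?:\.(\d))?")


def parse_four_corner(value: str):
    """Return ((ul, ur, ll, lr), fifth_digit_or_None) for one four-corner code."""
    match = _FOUR_CORNER.fullmatch(value)
    if match is None:
        raise ValueError(f"not a four-corner code: {value!r}")
    corners = tuple(int(digit) for digit in match.group(1))
    fifth = int(match.group(2)) if match.group(2) is not None else None
    return corners, fifth


# Hypothetical codes
print(parse_four_corner("1234"))
print(parse_four_corner("1234.5"))
```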
kFrequency
A rough frequency measurement for the character based on analysis of traditional Chinese USENET postings; characters with a kFrequency of 1 are the most common, those with a kFrequency of 2 are less common, and so on, through a kFrequency of 5. (Source: UAX38)
kGB0
The GB 2312-80 mapping for this character in ku/ten form. (Source: UAX38)
kGB1
The GB 12345-90 mapping for this character in ku/ten form. (Source: UAX38)
kGB3
The GB 7589-87 mapping for this character in ku/ten form. (Source: UAX38)
kGB5
The GB 7590-87 mapping for this character in ku/ten form. (Source: UAX38)
kGB7
The “General Use Characters for Modern Chinese” mapping for this character. (Source: UAX38)
kGB8
The GB 8565-89 mapping for this character in ku/ten form. (Source: UAX38)
kGradeLevel
The primary grade in the Hong Kong school system by which a student is expected to know the character; this data is derived from 朗文初級中文詞典, Hong Kong: Longman, 2001. (Source: UAX38)
kGSR
The position of this character in Bernhard Karlgren’s Grammata Serica Recensa (1957).

This dataset contains a total of 7,405 records. References are given in the form DDDDa('), where “DDDD” is a set number in the range [0001..1260] zero-padded to 4-digits, “a” is a letter in the range [a..z] (excluding “w”), optionally followed by apostrophe ('). The data from which this mapping table is extracted contains a total of 10,023 references. References to inscriptional forms have been omitted. (Source: UAX38)
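The reference form DDDDa(') can be validated with a regular expression plus a range check on the set number. This is a sketch of the format exactly as stated above; the sample references are hypothetical.

```python
import re

# four-digit set number, a letter a-z except "w", optional apostrophe
_GSR = re.compile(r"(\d{4})([a-vx-z])('?)")


def parse_gsr(value: str) -> tuple[int, str, bool]:
    """Return (set_number, letter, has_apostrophe) for a kGSR reference."""
    match = _GSR.fullmatch(value)
    if match is None:
        raise ValueError(f"not a kGSR reference: {value!r}")
    set_number = int(match.group(1))
    if not 1 <= set_number <= 1260:
        raise ValueError(f"set number out of range: {set_number}")
    return set_number, match.group(2), match.group(3) == "'"


# Hypothetical references
print(parse_gsr("0001a"))
print(parse_gsr("1260c'"))
```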
kHangul
The modern Korean pronunciation(s) for this character in Hangul. (Source: UAX38)
kHanYu
The position of this character in the Hanyu Da Zidian (HDZ) Chinese character dictionary (bibliographic information below).

The first character assigned a given virtual position has an index ending in 1; the second assigned the same virtual position has an index ending in 2; and so on. (Source: UAX38)
kHanyuPinlu
The Pronunciations and Frequencies of this character, based in part on those appearing in 《現代漢語頻率詞典》 <Xiandai Hanyu Pinlu Cidian> (XDHYPLCD) [Modern Standard Beijing Chinese Frequency Dictionary]. (Source: UAX38)
kHanyuPinyin
The 漢語拼音 Hànyǔ Pīnyīn reading(s) appearing in the edition of 《漢語大字典》 Hànyǔ Dà Zìdiǎn (HDZ) specified in the “kHanYu” property description (q.v.). Each location has the form “ABCDE.XYZ” (as in “kHanYu”); multiple locations for a given pīnyīn reading are separated by “,” (comma). The list of locations is followed by “:” (colon), followed by a comma-separated list of one or more pīnyīn readings. Where multiple pīnyīn readings are associated with a given mapping, these are ordered as in HDZ (for the most part reflecting relative commonality). The following are representative records.

| U+34CE | 㓎 | 10297.260: qīn,qìn,qǐn |
| U+34D8 | 㓘 | 10278.080,10278.090: sù |
| U+5364 | 卤 | 10093.130: xī,lǔ 74609.020: lǔ,xī |
| U+5EFE | 廾 | 10513.110,10514.010,10514.020: gǒng |

For example, the “kHanyuPinyin” value for 卤 U+5364 is “10093.130: xī,lǔ 74609.020: lǔ,xī”. This means that 卤 U+5364 is found in “kHanYu” at entries 10093.130 and 74609.020. The former entry has the two pīnyīn readings xī and lǔ (in that order), whereas the latter entry has the readings lǔ and xī (reversing the order).

Multiple Value Order: Individual entries are in same order as they are found in the Hanyu Da Zidian. This is true both for the locations and the individual readings. While this is generally in the order of utility for modern Chinese, such is not invariably the case, as the example above illustrates.

This data was originally input by 井作恆 Jǐng Zuòhéng, proofed by 聃媽歌 Dān Māgē (Magda Danish, using software donated by 文林 Wénlín Institute, Inc. and tables prepared by 曲理查 Qū Lǐchá), and proofed again and prepared for the Unicode Consortium by 曲理查 Qū Lǐchá (2008-01-14). (Source: UAX38)
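The record structure described for kHanyuPinyin (one or more locations, a colon, then one or more readings, with several such groups per value) can be parsed as sketched below. The pattern tolerates optional whitespace after the colon so that it accepts the values as printed in the table above; the function name is illustrative.

```python
import re

# comma-separated locations, ":", optional space, comma-separated readings
_PINYIN_GROUP = re.compile(r"([\d.,]+):\s*([^\s:]+)")


def parse_hanyu_pinyin(value: str):
    """Return a list of (locations, readings) pairs, in the order given."""
    return [
        (match.group(1).split(","), match.group(2).split(","))
        for match in _PINYIN_GROUP.finditer(value)
    ]


# The kHanyuPinyin value quoted above for U+5364
for locations, readings in parse_hanyu_pinyin("10093.130: xī,lǔ 74609.020: lǔ,xī"):
    print(locations, readings)
```

Keeping the result as an ordered list rather than a dictionary preserves the Hanyu Da Zidian ordering of both locations and readings.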
kHDZRadBreak
Indicates that 《漢語大字典》 Hanyu Da Zidian has a radical break beginning at this character’s position. The field consists of the radical (with its Unicode code point), a colon, and then the Hanyu Da Zidian position as in the kHanYu field. (Source: UAX38)
kHKGlyph
The index of the character in 常用字字形表 (二零零零年修訂本),香港: 香港教育學院, 2000, ISBN 962-949-040-4. This publication gives the “proper” shapes for 4759 characters as used in the Hong Kong school system. The index is an integer, zero-padded to four digits. (Source: UAX38)
kHKSCS
Mappings to the Big Five extended code points used for the Hong Kong Supplementary Character Set. (Source: UAX38)
kIBMJapan
The IBM Japanese mapping for this character in hexadecimal. (Source: UAX38)
kIICore
A boolean indicating that a character is in IICore, the IRG-produced minimal set of required ideographs for East Asian use. A character is in IICore if and only if it has a value for the kIICore field.

The only value currently in this field is “2.1”, which is the identifier of the version of IICore used to populate this field. (Source: UAX38)
kIRGDaeJaweon
The position of this character in the Dae Jaweon (Korean) dictionary used in the four-dictionary sorting algorithm. The position is in the form “page.position” with the final digit in the position being “0” for characters actually in the dictionary and “1” for characters not found in the dictionary and assigned a “virtual” position in the dictionary.

Thus, “1187.060” indicates the sixth character on page 1187. A character not in this dictionary but assigned a position between the 6th and 7th characters on page 1187 for sorting purposes would have the code “1187.061”.

This field represents the official position of the character within the Dae Jaweon dictionary as used by the IRG in the four-dictionary sorting algorithm.

The edition used is the first edition, published in Seoul by Samseong Publishing Co., Ltd., 1988.
(Source: UAX38)
kIRGDaiKanwaZiten
The index of this character in the Dai Kanwa Ziten, aka Morohashi dictionary (Japanese) used in the four-dictionary sorting algorithm.

This field represents the official position of the character within the DaiKanwa dictionary as used by the IRG in the four-dictionary sorting algorithm. The edition used is the revised edition, published in Tokyo by Taishuukan Shoten, 1986. (Source: UAX38)
kIRGHanyuDaZidian
The position of this character in the Hanyu Da Zidian (PRC) dictionary used in the four-dictionary sorting algorithm. The position is in the form “volume page.position” with the final digit in the position being “0” for characters actually in the dictionary and “1” for characters not found in the dictionary and assigned a “virtual” position in the dictionary.

This field represents the official position of the character within the Hanyu Da Zidian dictionary as used by the IRG in the four-dictionary sorting algorithm.

The edition of the Hanyu Da Zidian used is the first edition, published in Chengdu by Sichuan Cishu Publishing, 1986. (Source: UAX38)
kIRGKangXi
The official IRG position of this character in the 《康熙字典》 Kang Xi Dictionary used in the four-dictionary sorting algorithm. The position is in the form “page.position” with the final digit in the position being “0” for characters actually in the dictionary and “1” for characters not found in the dictionary but assigned a “virtual” position in the dictionary.

Thus, “1187.060” indicates the sixth character on page 1187. A character not in this dictionary but assigned a position between the 6th and 7th characters on page 1187 for sorting purposes would have the code “1187.061”.

The edition of the Kang Xi Dictionary used is the 7th edition published by Zhonghua Bookstore in Beijing, 1989.
(Source: UAX38)
kIRG_GSource
The IRG “G” source mapping for this character in hex. The IRG G source consists of data from the following national standards, publications, and lists from the People’s Republic of China and Singapore. The versions of the standards used are those provided by the PRC to the IRG and may not always reflect published versions of the standards generally available.

G1 GB12345-90 with 58 Hong Kong and 92 Korean “Idu” characters
G3 GB7589-87 unsimplified forms
G5 GB7590-87 unsimplified forms
G7 General Purpose Hanzi List for Modern Chinese Language, and General List of Simplified Hanzi
GS Singapore Characters
G8 GB8565-88
G9 GB18030-2000
GE GB16500-95
G4K Siku Quanshu (四庫全書)
GBK Chinese Encyclopedia (中國大百科全書)
GCH Ci Hai (辞海)
GCY Ci Yuan (辭源)
GCYY Chinese Academy of Surveying and Mapping Ideographs (中国测绘科学院用字)
GFZ Founder Press System (方正排版系统)
GGH Gudai Hanyu Cidian (古代汉语词典)
GHC Hanyu Dacidian (漢語大詞典)
GHZ Hanyu Dazidian ideographs (漢語大字典)
GIDC ID system of the Ministry of Public Security of China, 2009
GJZ Commercial Press Ideographs (商务印书馆用字)
GKX Kangxi Dictionary ideographs (康熙字典) 9th edition (1958) including the addendum (康熙字典) 補遺
GXC Xiandai Hanyu Cidian (现代汉语词典)
GZFY Hanyu Fangyan Dacidian (汉语方言大辞典)
GZH ZhongHua ZiHai (中华字海)
GZJW Yinzhou Jinwen Jicheng Yinde (殷周金文集成引得)
(Source: UAX38)
kIRG_HSource
The IRG “H” source mapping for this character in hex. The IRG “H” source consists of data from the Hong Kong Supplementary Character Set – 2008. (Source: UAX38)
kIRG_JSource
The IRG “J” source mapping for this character in hex. The IRG “J” source consists of data from the following national standards and lists from Japan.

J0 JIS X 0208-1990
J1 JIS X 0212-1990
JA Unified Japanese IT Vendors Contemporary Ideographs, 1993
JH Hanyo-Denshi Program (汎用電子情報交換環境整備プログラム), 2002-2009
JK Japanese KOKUJI Collection
JARIB Association of Radio Industries and Businesses (ARIB) ARIB STD-B24 Version 5.1, March 14 2007 (Source: UAX38)
kIRG_KPSource
The IRG “KP” source mapping for this character in hex. The IRG “KP” source consists of data from the following national standards and lists from the Democratic People’s Republic of Korea (North Korea).

KP0 KPS 9566-97
KP1 KPS 10721-2000 (Source: UAX38)
kIRG_KSource
The IRG “K” source mapping for this character in hex. The IRG “K” source consists of data from the following national standards and lists from the Republic of Korea (South Korea).

K0 KS X 1001:2004 (formerly KS C 5601-1987)
K1 KS X 1002:2001 (formerly KS C 5657-1991)
K2 PKS C 5700-1 1994
K3 PKS C 5700-2 1994
K4 PKS 5700-3:1998
K5 Korean IRG Hanja Character Set 5th Edition: 2001

Note that the K4 source is expressed in hexadecimal, but unlike the other sources, it is not organized in row/column. The content of the repertoire covered by the K2, K3, K4, and K5 sources is in the process of being reedited in new Korean standards.
(Source: UAX38)
kIRG_MSource
The IRG “M” source mapping for this character. The IRG “M” source consists of data from the Macao Information System Character Set (澳門資訊系統字集). (Source: UAX38)
kIRG_TSource
The IRG “T” source mapping for this character in hex. The IRG “T” source consists of data from the following national standards and lists from the Republic of China (Taiwan).

T1 TCA-CNS 11643-1992 1st plane
T2 TCA-CNS 11643-1992 2nd plane
T3 TCA-CNS 11643-1992 3rd plane with some additional characters
T4 TCA-CNS 11643-1992 4th plane
T5 TCA-CNS 11643-1992 5th plane
T6 TCA-CNS 11643-1992 6th plane
T7 TCA-CNS 11643-1992 7th plane
TB TCA-CNS Ministry of Education, Hakka dialect, May 2007
TC TCA-CNS 11643-1992 12th plane
TD TCA-CNS 11643-1992 13th plane
TE TCA-CNS 11643-1992 14th plane
TF TCA-CNS 11643-1992 15th plane (Source: UAX38)
kIRG_USource
The IRG “U” source mapping for this character. U-source references are a reference into the U-source ideograph database; see UTR #45. These consist of “UTC” or “UCI” followed by a hyphen and a five-digit, zero-padded index into the database. (Source: UAX38)
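The U-source reference format (prefix, hyphen, five-digit index) is simple enough to validate with one pattern. A minimal sketch, with hypothetical sample references:

```python
import re

# "UTC" or "UCI", a hyphen, and a five-digit zero-padded index
_U_SOURCE = re.compile(r"(UTC|UCI)-(\d{5})")


def parse_u_source(value: str) -> tuple[str, int]:
    """Return (prefix, index) for a kIRG_USource reference."""
    match = _U_SOURCE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a U-source reference: {value!r}")
    return match.group(1), int(match.group(2))


# Hypothetical references
print(parse_u_source("UTC-00123"))
print(parse_u_source("UCI-00001"))
```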
kIRG_VSource
The IRG “V” source mapping for this character in hex. The IRG “V” source consists of data from the following national standards and lists from Vietnam.

V0 TCVN 5773:1993
V1 TCVN 6056:1995
V2 VHN 01:1998
V3 VHN 02:1998
V4 Dictionary on Nom 2006, Dictionary on Nom of Tay ethnic 2006, Lookup Table for Nom in the South 1994 (Source: UAX38)
kJapaneseKun
The Japanese pronunciation(s) of this character. (Source: UAX38)
kJapaneseOn
The Sino-Japanese pronunciation(s) of this character. (Source: UAX38)
kJis0
The JIS X 0208-1990 mapping for this character in ku/ten form. (Source: UAX38)
kJIS0213
The JIS X 0213-2000 mapping for this character in min/ku/ten form. (Source: UAX38)
kJis1
The JIS X 0212-1990 mapping for this character in ku/ten form. (Source: UAX38)
kKangXi
The position of this character in the 《康熙字典》 Kang Xi Dictionary used in the four-dictionary sorting algorithm. The position is in the form “page.position” with the final digit in the position being “0” for characters actually in the dictionary and “1” for characters not found in the dictionary but assigned a “virtual” position in the dictionary.

Thus, “1187.060” indicates the sixth character on page 1187. A character not in this dictionary but assigned a position between the 6th and 7th characters on page 1187 for sorting purposes would have the code “1187.061”.

The edition of the Kang Xi Dictionary used is the 7th edition published by Zhonghua Bookstore in Beijing, 1989.
(Source: UAX38)
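The “page.position” form can be unpacked as in the following Python sketch. The function name and the returned dictionary keys are assumptions chosen for illustration; only the two example values come from the text above.

```python
def parse_kangxi(value):
    """Split a kKangXi value such as "1187.060" into its parts."""
    page_text, position_text = value.split(".")
    return {
        "page": int(page_text),
        # All digits except the last give the character's place on the page.
        "character": int(position_text[:-1]),
        # A final digit of 1 marks a "virtual" position (not in the dictionary).
        "virtual": position_text[-1] == "1",
    }

print(parse_kangxi("1187.060"))
print(parse_kangxi("1187.061"))
```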
kKarlgren
The index of this character in _Analytic Dictionary of Chinese and Sino-Japanese_ by Bernhard Karlgren, New York: Dover Publications, Inc., 1974.

If the index is followed by an asterisk (*), then the index is an interpolated one, indicating where the character would be found if it were to have been included in the dictionary. Note that while the index itself is usually an integer, there are some cases where it is an integer followed by an “A”. (Source: UAX38)
kKorean
The Korean pronunciation(s) of this character, using the Yale romanization system. (See <http://en.wikipedia.org/wiki/Korean_romanization> for a discussion of the various Korean romanization systems.) (Source: UAX38)
kKPS0
The KPS 9566-97 mapping for this character in hexadecimal form. (Source: UAX38)
kKPS1
The KPS 10721-2000 mapping for this character in hexadecimal form. (Source: UAX38)
kKSC0
The KS X 1001:1992 (KS C 5601-1989) mapping for this character in ku/ten form. (Source: UAX38)
kKSC1
The KS X 1002:1991 (KS C 5657-1991) mapping for this character in ku/ten form. (Source: UAX38)
kLau
The index of this character in A Practical Cantonese-English Dictionary by Sidney Lau, Hong Kong: The Government Printer, 1977.

The index consists of an integer. Missing indices indicate unencoded characters which are being submitted to the IRG for inclusion in future versions of the standard. (Source: UAX38)
kMainlandTelegraph
The PRC telegraph code for this character, derived from “Kanzi denpou koudo henkan-hyou” (“Chinese character telegraph code conversion table”), Lin Jinyi, KDD Engineering and Consulting, Tokyo, 1984. (Source: UAX38)
kMandarin
The most customary pinyin reading for this character; that is, the reading most commonly used in modern text, with some preference given to readings most likely to be in sorted lists.

Multiple Value Order: When there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both. (Source: UAX38)
kMatthews
The index of this character in Mathews’ Chinese-English Dictionary by Robert H. Mathews, Cambridge: Harvard University Press, 1975.

Note that the field name is kMatthews instead of kMathews to maintain compatibility with earlier versions of this file, where it was inadvertently misspelled. (Source: UAX38)
kMeyerWempe
The index/indices of this character in the Student’s Cantonese-English Dictionary by Bernard F. Meyer and Theodore F. Wempe (3rd edition, 1947). The index is an integer, optionally followed by a lower-case Latin letter if the listing is in a subsidiary entry and not a main one. In some cases where the character is found in the radical-stroke index, but not in the main body of the dictionary, the integer is followed by an asterisk (e.g., U+50E5, which is listed as 736* as well as 1185a). (Source: UAX38)
kMorohashi
The index/indices of this character in the Dai Kanwa Ziten, aka Morohashi dictionary (Japanese) used in the four-dictionary sorting algorithm.

The edition used is the revised edition, published in Tokyo by Taishuukan Shoten, 1986. (Source: UAX38)
kNelson
The index of this character in The Modern Reader’s Japanese-English Character Dictionary by Andrew Nathaniel Nelson, Rutland, Vermont: Charles E. Tuttle Company, 1974. (Source: UAX38)
kOtherNumeric
The numeric value for the character in certain unusual, specialized contexts.

The three numeric-value fields should have no overlap; that is, characters with a kOtherNumeric value should not have a kAccountingNumeric or kPrimaryNumeric value as well.
(Source: UAX38)
kPhonetic
The phonetic index for the character from _Ten Thousand Characters: An Analytic Dictionary_, by G. Hugh Casey, S.J. Hong Kong: Kelley and Walsh, 1980. (Source: UAX38)
kPrimaryNumeric
The value of the character when used in the writing of numbers in the standard fashion.

The three numeric-value fields should have no overlap; that is, characters with a kPrimaryNumeric value should not have a kAccountingNumeric or kOtherNumeric value as well.
(Source: UAX38)
kPseudoGB1
A “GB 12345-90” code point assigned to this character for the purposes of including it within Unihan. Pseudo-GB1 codes were used to provide official code points for characters not already in national standards, such as characters used to write Cantonese, and so on. (Source: UAX38)
kRSAdobe_Japan1_6
Information on the glyphs in Adobe-Japan1-6 as contributed by Adobe. The value consists of a number of space-separated entries. Each entry consists of three pieces of information separated by a plus sign:

1) C or V. “C” indicates that the Unicode code point maps directly to the Adobe-Japan1-6 CID that appears after it, and “V” indicates that it is considered a variant form, and thus not directly encoded.

2) The Adobe-Japan1-6 CID.

3) Radical-stroke data for the indicated Adobe-Japan1-6 CID. The radical-stroke data consists of three pieces separated by periods: the KangXi radical (1-214), the number of strokes in the form the radical takes in the glyph, and the number of strokes in the residue. The standard Unicode radical-stroke form can be obtained by omitting the second value, and the total strokes in the glyph from adding the second and third values. (Source: UAX38)
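The three-part entries described above can be decoded as in this Python sketch, which also derives the standard radical-stroke form and the total stroke count the way the last paragraph explains. The function name, the dictionary keys, and the sample field value are illustrative assumptions.

```python
def parse_adobe_japan1_6(field):
    """Decode a kRSAdobe_Japan1_6 value into one dictionary per entry."""
    entries = []
    for entry in field.split():
        kind, cid, rs = entry.split("+")
        radical, radical_strokes, residue = (int(part) for part in rs.split("."))
        entries.append({
            "kind": kind,                      # "C" (direct) or "V" (variant)
            "cid": int(cid),                   # Adobe-Japan1-6 CID
            # Standard radical-stroke form: omit the second value.
            "rs_unicode": f"{radical}.{residue}",
            # Total strokes: add the second and third values.
            "total_strokes": radical_strokes + residue,
        })
    return entries

# Sample value made up for illustration.
for item in parse_adobe_japan1_6("C+1200+1.1.0 V+13460+9.2.4"):
    print(item)
```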
kRSJapanese
One or more Japanese radical/stroke counts for this character in the form “radical.additional strokes”. (Source: UAX38)
kRSKangXi
One or more KangXi radical/stroke counts for this character consistent with the value of the kKangXi field in the form “radical.additional strokes”. (Source: UAX38)
kRSKanWa
One or more Morohashi radical/stroke counts for this character in the form “radical.additional strokes”. (Source: UAX38)
kRSKorean
One or more Korean radical/stroke counts for this character in the form “radical.additional strokes”. (Source: UAX38)
kRSUnicode
One or more standard radical/stroke counts for this character in the form “radical.additional strokes”. The radical is indicated by a number in the range (1..214) inclusive. An apostrophe (') after the radical indicates a simplified version of the given radical. The “additional strokes” value is the residual stroke-count, the count of all strokes remaining after eliminating all strokes associated with the radical.

This field is also used for additional radical-stroke indices where either a character may be reasonably classified under more than one radical, or alternate stroke count algorithms may provide different stroke counts.

The first value is equal to the normative radical-stroke value defined in ISO/IEC 10646. (Source: UAX38)
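A minimal Python sketch of reading this field, assuming only the syntax described above (radical number, optional apostrophe, period, residual stroke count, with multiple values separated by spaces). The function name and sample value are illustrative, and real data may contain forms this sketch rejects.

```python
import re

# radical, optional apostrophe for a simplified radical, residual strokes
RS_VALUE = re.compile(r"([0-9]+)('?)\.([0-9]+)")

def parse_rs_unicode(field):
    """Return a list of (radical, simplified, additional_strokes) tuples."""
    results = []
    for value in field.split():
        match = RS_VALUE.fullmatch(value)
        if match is None or not 1 <= int(match.group(1)) <= 214:
            raise ValueError(f"malformed radical-stroke value: {value!r}")
        results.append(
            (int(match.group(1)), match.group(2) == "'", int(match.group(3)))
        )
    return results

# Sample value made up for illustration; the first entry is the normative one.
print(parse_rs_unicode("120'.3 9.4"))
```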
kSBGY
The position of this character in the Song Ben Guang Yun (SBGY) Medieval Chinese character dictionary (bibliographic and general information is given in UAX #38).

The 25334 character references are given in the form “ABC.XY”, in which: “ABC” is the zero-padded page number [004..546]; “XY” is the zero-padded number of the character on the page [01..73]. For example, 364.38 indicates the 38th character on Page 364 (i.e. 澍). Where a given Unicode Scalar Value (USV) has more than one reference, these are space-delimited. (Source: UAX38)
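The “ABC.XY” references can be split and range-checked as in the following Python sketch. The function name is an assumption; the ranges and the sample reference are the ones given above.

```python
def parse_sbgy(field):
    """Return (page, position) pairs from a space-delimited kSBGY value."""
    references = []
    for reference in field.split():
        page_text, position_text = reference.split(".")
        page, position = int(page_text), int(position_text)
        # Ranges stated in the definition: pages 004..546, positions 01..73.
        if not (4 <= page <= 546 and 1 <= position <= 73):
            raise ValueError(f"reference out of range: {reference!r}")
        references.append((page, position))
    return references

print(parse_sbgy("364.38"))
```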
kSemanticVariant
The Unicode value for a semantic variant for this character. A semantic variant is an x- or y-variant with similar or identical meaning which can generally be used in place of the indicated character.

The basic syntax is a Unicode scalar value. It may optionally be followed by additional data. The additional data is separated from the Unicode scalar value by a less-than sign (<), and may itself be subdivided into substrings by commas, each of which may be divided into two pieces by a colon. The additional data consists of a series of field tags for another field in the Unihan database indicating the source of the information. If subdivided, the final piece is a string consisting of the letters T (for tóng, U+540C 同), B (for bù, U+4E0D 不), Z (for zhèng, U+6B63 正), F (for fán, U+7E41 繁), or J (for jiǎn, U+7C21 簡/U+7B80 简).

T is used if the indicated source explicitly indicates the two are the same (e.g., by saying that the one character is “the same as” the other).

B is used if the source explicitly indicates that the two are used improperly one for the other.

Z is used if the source explicitly indicates that the given character is the preferred form. Thus, kHanYu indicates that U+5231 刱 and U+5275 創 are semantic variants and that U+5275 創 is the preferred form.

F is used if the source explicitly indicates that the given character is the traditional form.

J is used if the source explicitly indicates that the given character is the simplified form.

Data on simplified and traditional variations can be included in this field to document cases where different sources disagree on the nature of the relationship between two characters. The kSemanticVariant and kSpecializedSemanticVariant fields need not be consulted when interconverting between traditional and simplified Chinese. (Source: UAX38)
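The syntax described in this entry can be read with a short Python sketch like the one below. The function name and the sample values are illustrative assumptions; the sketch only splits on the separators named above (space, less-than sign, comma, colon) and does not validate the field tags or letters.

```python
def parse_semantic_variant(field):
    """Return a list of (code_point, [(source_tag, letters), ...]) tuples."""
    variants = []
    for value in field.split():
        target, _, extra = value.partition("<")
        sources = []
        for item in extra.split(",") if extra else []:
            # Each substring is a field tag, optionally followed by
            # a colon and a string of the letters T, B, Z, F, J.
            tag, _, letters = item.partition(":")
            sources.append((tag, letters))
        variants.append((int(target[2:], 16), sources))
    return variants

# Sample values made up for illustration.
print(parse_semantic_variant("U+5275<kHanYu:Z"))
print(parse_semantic_variant("U+3A17<kLau,kMatthews:T U+5275"))
```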
kSimplifiedVariant
The Unicode value(s) for the simplified Chinese variant(s) for this character. A full discussion of the kSimplifiedVariant and kTraditionalVariant fields is found in section 3.7.1 of UAX #38.

Much of the data on simplified and traditional variants was graciously supplied by Wenlin Institute, Inc. <http://www.wenlin.com>. (Source: UAX38)
kSpecializedSemanticVariant
The Unicode value for a specialized semantic variant for this character. The syntax is the same as for the kSemanticVariant field.

A specialized semantic variant is an x- or y-variant with similar or identical meaning only in certain contexts (such as accountants’ numerals).
(Source: UAX38)
kTaiwanTelegraph
The Taiwanese telegraph code for this character, derived from “Kanzi denpou koudo henkan-hyou” (“Chinese character telegraph code conversion table”), Lin Jinyi, KDD Engineering and Consulting, Tokyo, 1984.
(Source: UAX38)
kTang
The Tang dynasty pronunciation(s) of this character, derived from or consistent with _T’ang Poetic Vocabulary_ by Hugh M. Stimson, Far Eastern Publications, Yale Univ. 1976. An asterisk indicates that the word or morpheme represented in toto or in part by the given character with the given reading occurs more than four times in the seven hundred poems covered. (Source: UAX38)
kTotalStrokes
The total number of strokes in the character (including the radical), that is, the stroke count most commonly associated with the character in modern text using customary fonts.

Multiple Value Order: When there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both. (Source: UAX38)
kTraditionalVariant
The Unicode value(s) for the traditional Chinese variant(s) for this character. A full discussion of the kSimplifiedVariant and kTraditionalVariant fields is found in section 3.7.1 of UAX #38.

Much of the data on simplified and traditional variants was graciously supplied by Wenlin Institute, Inc. <http://www.wenlin.com>. (Source: UAX38)
kVietnamese
The character’s pronunciation(s) in Quốc ngữ. (Source: UAX38)
kXerox
The Xerox code for this character. (Source: UAX38)
kXHC1983
One or more Hànyǔ Pīnyīn readings as given in the Xiàndài Hànyǔ Cídiǎn.

Each pīnyīn reading is preceded by the character’s location(s) in the dictionary, separated from the reading by “:” (colon); multiple locations for a given reading are separated by “,” (comma); multiple “location: reading” values are separated by “ ” (space). Each location reference is of the form /[0-9]{4}\.[0-9]{3}\*?/ . The number preceding the period is the page number, zero-padded to four digits. The first two digits of the number following the period are the entry’s position on the page, zero-padded. The third digit is 0 for a main entry and greater than 0 for a parenthesized variant of the main entry. A trailing “*” (asterisk) on the location indicates an encoded variant substituted for an unencoded character (see UAX #38 for details). (Source: UAX38)
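A Python sketch of decoding this field, using the location pattern quoted above. The function name, dictionary keys, and sample value are assumptions made for illustration.

```python
import re

# Location pattern as given in the definition.
LOCATION = re.compile(r"[0-9]{4}\.[0-9]{3}\*?")

def parse_xhc1983(field):
    """Return a list of (reading, [location, ...]) tuples."""
    readings = []
    for item in field.split(" "):
        locations_text, reading = item.split(":")
        locations = []
        for location in locations_text.split(","):
            if LOCATION.fullmatch(location) is None:
                raise ValueError(f"malformed location: {location!r}")
            locations.append({
                "page": int(location[0:4]),
                "position": int(location[5:7]),
                "variant": int(location[7]),        # 0 means main entry
                "substituted": location.endswith("*"),
            })
        readings.append((reading, locations))
    return readings

# Sample value made up for illustration.
for reading, locations in parse_xhc1983("0001.010,0002.021*:ā 0003.050:a"):
    print(reading, locations)
```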
kZVariant
The Unicode value(s) for known z-variants of this character.

The basic syntax is a Unicode scalar value. It may optionally be followed by additional data. The additional data is separated from the Unicode scalar value by a less-than sign (<), and may be subdivided itself into substrings by commas. The additional data consists of a series of field tags for another field in the Unihan database indicating the source of the information. (Source: UAX38)
Line Break
Properties for line breaking. For more information, see Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [UAX14]. (Source: UAX44)
Logical Order Exception
A small number of spacing vowel letters occurring in certain Southeast Asian scripts such as Thai and Lao, which use a visual order display model. These letters are stored in text ahead of syllable-initial consonants, and require special handling for processes such as searching and sorting. (Source: UAX44)
Lowercase
Characters with the Lowercase property. For more information, see Chapter 4, Character Properties in [Unicode].

Generated from: Ll + Other Lowercase

(Source: UAX44)
Lowercase Mapping
Titlecase Mapping
Uppercase Mapping
Data for producing (in combination with the simple case mappings from UnicodeData.txt) the full case mappings. (Source: UAX44)
Math
Characters with the Math property. For more information, see Chapter 4, Character Properties in [Unicode].

Generated from: Sm + Other Math

(Source: UAX44)
Name
These names match exactly the names published in the code charts of the Unicode Standard. The derived Hangul Syllable names are omitted from this file; see Jamo.txt for their derivation. (Source: UAX44)
Name Alias
Normative formal aliases for characters with erroneous names, for control characters and some format characters, and for character abbreviations, as described in Chapter 4, Character Properties in [Unicode]. The aliases tagged with the type "correction" exactly match the formal aliases published in the Unicode Standard code charts. (Source: UAX44)
NFKC Casefold
A mapping designed for best behavior when doing caseless matching of strings interpreted as identifiers. (Abbreviated name: NFKC_CF)

For the definition of the related string transform toNFKC_Casefold() based on this mapping, see Section 3.13, Default Case Algorithms in [Unicode].

The mapping is listed in Field 2.

(Source: UAX44)
NFC Quick Check
NFKC Quick Check
NFD Quick Check
NFKD Quick Check
For property values, see Decompositions and Normalization. (Abbreviated names: NFD_QC, NFKD_QC, NFC_QC, NFKC_QC) (Source: UAX44)
Noncharacter Code Point
Code points permanently reserved for internal use. (Source: UAX44)
Numeric Type
Numeric Value
If the character has the property value Numeric_Type=Decimal, then the Numeric_Value of that digit is represented with an integer value (limited to the range 0..9) in fields 6, 7, and 8. Characters with the property value Numeric_Type=Decimal are restricted to digits which can be used in a decimal radix positional numeral system and which are encoded in the standard in a contiguous ascending range 0..9. See the discussion of decimal digits in Chapter 4, Character Properties in [Unicode]. (Source: UAX44)
If the character has the property value Numeric_Type=Digit, then the Numeric_Value of that digit is represented with an integer value (limited to the range 0..9) in fields 7 and 8, and field 6 is null. This covers digits that need special handling, such as the compatibility superscript digits. (Source: UAX44)
If the character has the property value Numeric_Type=Numeric, then the Numeric_Value of that character is represented with a positive or negative integer or rational number in this field, and fields 6 and 7 are null. This includes fractions such as, for example, "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.

Some characters have these properties based on values from the Unihan data files. See Numeric Type Han.

(Source: UAX44)
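The three numeric types can be told apart with Python's standard `unicodedata` module, whose `decimal()`, `digit()`, and `numeric()` functions correspond to fields 6, 7, and 8. This is a sketch only: the helper name is an assumption, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

def numeric_type(ch):
    """Classify a character as Decimal, Digit, Numeric, or None."""
    # Check the most restrictive type first: a Decimal character
    # also has digit and numeric values.
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"
    if unicodedata.digit(ch, None) is not None:
        return "Digit"
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"
    return "None"

for ch in ["5", "\u00b2", "\u2155", "A"]:
    # U+00B2 SUPERSCRIPT TWO, U+2155 VULGAR FRACTION ONE FIFTH
    print(f"U+{ord(ch):04X}", numeric_type(ch), unicodedata.numeric(ch, None))
```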
Numeric Type (Han)
Numeric Value (Han)
The characters tagged with either kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric are given the property value Numeric_Type=Numeric, and the Numeric_Value indicated in those tags.

Most characters have these numeric properties based on values from UnicodeData.txt. See Numeric_Type.

(Source: UAX44)
Other Alphabetic
Used in deriving the Alphabetic property. (Source: UAX44)
Other Default Ignorable Code Point
Used in deriving the Default_Ignorable_Code_Point property. (Source: UAX44)
Other Grapheme Extend
Used in deriving the Grapheme_Extend property. (Source: UAX44)
Other ID Continue
Used to maintain backward compatibility of ID Continue. (Source: UAX44)
Other ID Start
Used to maintain backward compatibility of ID Start. (Source: UAX44)
Other Lowercase
Used in deriving the Lowercase property. (Source: UAX44)
Other Math
Used in deriving the Math property. (Source: UAX44)
Other Uppercase
Used in deriving the Uppercase property. (Source: UAX44)
Pattern Syntax
Pattern White Space
Used for pattern syntax as described in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" [UAX31]. (Source: UAX44)
Plane
A Unicode plane is one of 17 contiguous sets of 65536 codepoints each. Characters are currently assigned in planes 0 to 3 and in plane 14; planes 4 to 13 contain no assigned characters. The last two planes (15 and 16) are reserved for private use.
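Because each plane holds 65536 (2^16) codepoints, the plane number of a codepoint is simply its value shifted right by 16 bits. A minimal Python sketch, with an assumed function name:

```python
def plane_of(code_point):
    """Return the plane number (0..16) of a Unicode codepoint."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode codepoint: {code_point!r}")
    # 17 planes of 0x10000 codepoints each cover 0x000000..0x10FFFF.
    return code_point >> 16

print(plane_of(ord("A")))   # plane 0
print(plane_of(0x1F600))    # plane 1
print(plane_of(0x10FFFF))   # plane 16
```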
Private Use
So-called “private use” areas are ranges of Unicode codepoints that are deliberately not assigned to characters. Application developers can use these codepoints to add their own extensions to Unicode.
Quotation Mark
Punctuation characters that function as quotation marks. (Source: UAX44)
Radical
Used in Ideographic Description Sequences. (Source: UAX44)
Script
Script values for use in regular expressions and elsewhere. For more information, see Unicode Standard Annex #24, "Unicode Script Property" [UAX24]. (Source: UAX44)
Script Extensions
Enumerated sets of Script values for use in regular expressions and elsewhere. For more information, see Unicode Standard Annex #24, "Unicode Script Property" [UAX24]. (Source: UAX44)
Sentence Break
See Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29] (Source: UAX44)
Simple Lowercase Mapping
Simple lowercase mapping (single character result). (Source: UAX44)
Simple Titlecase Mapping
Simple titlecase mapping (single character result).

Note: If this field is null, then the Simple_Titlecase_Mapping is the same as the Simple_Uppercase_Mapping for this character.

(Source: UAX44)
Simple Uppercase Mapping
Simple uppercase mapping (single character result).
If a character is part of an alphabet with case distinctions, and has a simple uppercase equivalent, then the uppercase equivalent is in this field. The simple mappings have a single character result, where the full mappings may have multi-character results. For more information, see Case and Case Mapping. (Source: UAX44)
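The difference between simple and full mappings shows up in Python, whose `str.upper()` applies the full case mappings. The sketch below lists characters whose uppercase form has more than one character, which a simple mapping cannot express. The helper name is an assumption, and Python's standard library does not expose the simple mappings directly.

```python
def has_multichar_uppercase(ch):
    """True if the full uppercase mapping of ch is longer than one character."""
    return len(ch.upper()) > 1

# U+00DF LATIN SMALL LETTER SHARP S, U+FB01 LATIN SMALL LIGATURE FI
for ch in ["a", "\u00df", "\ufb01"]:
    print(f"U+{ord(ch):04X}", repr(ch.upper()), has_multichar_uppercase(ch))
```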
Soft Dotted
Characters with a "soft dot", like i or j. An accent placed on these characters causes the dot to disappear. An explicit dot above can be added where required, such as in Lithuanian. (Source: UAX44)
STerm
Sentence Terminal. Used in Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29]. (Source: UAX44)
Terminal Punctuation
Punctuation characters that generally mark the end of textual units. (Source: UAX44)
Unicode
A standard that maps characters to codepoints, that is, to numeric representations. The Unicode Standard is maintained by the Unicode Consortium, and its character repertoire is kept synchronized with the international standard ISO/IEC 10646.
Unicode 1 Name
Old name as published in Unicode 1.0. This name is only provided when it is significantly different from the current name for the character. The value of field 10 for control characters does not always match the Unicode 1.0 names. Instead, field 10 contains ISO 6429 names for control functions, for printing in the code charts. (Source: UAX44)
Unicode Radical Stroke
The Unicode radical-stroke count, based on the tag kRSUnicode. (Source: UAX44)
Unified Ideograph
A property which specifies the exact set of Unified CJK Ideographs in the standard. This set excludes CJK Compatibility Ideographs (which have canonical decompositions to Unified CJK Ideographs), as well as characters from the CJK Symbols and Punctuation block. The property is used in the definition of Ideographic Description Sequences. (Source: UAX44)
Uppercase
Characters with the Uppercase property. For more information, see Chapter 4, Character Properties in [Unicode].

Generated from: Lu + Other Uppercase

(Source: UAX44)
Variation Selector
Indicates characters that are Variation Selectors. For details on the behavior of these characters, see StandardizedVariants.html, Section 16.4, Variation Selectors in [Unicode], and Unicode Technical Standard #37, "Unicode Ideographic Variation Database" [UTS37]. (Source: UAX44)
White Space
Spaces, separator characters and other control characters which should be treated by programming languages as "white space" for the purpose of parsing elements. See also Line Break, Grapheme Cluster Break, Sentence Break, and Word Break, which classify space characters and related controls somewhat differently for particular text segmentation contexts. (Source: UAX44)
Word Break
See Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29] (Source: UAX44)