GB 18030

From Wikipedia, the free encyclopedia
GB 18030 encoding layout. "Half codes" indicates codes used in pairs as four-byte codes.
MIME / IANA: GB18030
Alias(es): Code page 54936
Language(s): International, but primarily meant for Chinese
Standard: GB 18030-2005, GB 18030-2000
Classification: Unicode Transformation Format, extended ASCII,[a] variable-width encoding, CJK encoding
Extends: EUC-CN, GBK
Transforms / Encodes: ISO 10646 (Unicode)
Preceded by: GBK, GB2312
[a] Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.

GB 18030 is a Chinese government standard, described as Information Technology — Chinese coded character set, that defines the language and character support required for software in China. GB18030 is the registered Internet name for the official character set of the People's Republic of China (PRC), superseding GB2312.[1] As a Unicode Transformation Format[a] (i.e. an encoding of all Unicode code points), GB18030 supports both simplified and traditional Chinese characters. It is also compatible with legacy encodings including GB2312, CP936,[b] and GBK 1.0.
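The two properties described above, compatibility with the legacy encodings and coverage of all Unicode code points, can be checked with a short sketch. It assumes Python's built-in gb2312, gbk, and gb18030 codecs; the sample strings and variable names are illustrative and not taken from the standard.

```python
# Sketch using Python's built-in codecs; the sample strings are illustrative.
simplified = "汉字"            # both characters are present in GB2312
traditional = "漢"             # absent from GB2312, present in GBK
supplementary = "\U0001F600"   # a code point outside the Basic Multilingual Plane

# Legacy compatibility: characters from GB2312 and GBK keep their byte sequences.
legacy_bytes = simplified.encode("gb2312")
same_as_legacy = (
    legacy_bytes == simplified.encode("gbk") == simplified.encode("gb18030")
)
gbk_compatible = traditional.encode("gbk") == traditional.encode("gb18030")

# Full Unicode coverage: a supplementary-plane character is encoded in four bytes.
four_byte = supplementary.encode("gb18030")
round_trip = four_byte.decode("gb18030") == supplementary

print(same_as_legacy, gbk_compatible, len(four_byte), round_trip)
```

The same supplementary-plane string cannot be encoded with the gbk or gb2312 codecs, which is the practical difference between GB 18030 and the encodings it extends.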

In addition to the "GB18030 character encoding", this standard contains requirements about which scripts must be supported, font support, etc.[2]

History

The GB18030 character set is formally called "Chinese National Standard GB 18030-2005: Information Technology—Chinese coded character set". GB abbreviates Guójiā Biāozhǔn (国家标准), which means "national standard" in Chinese. The standard was published by the China Standard Press, Beijing, on 8 November 2005. Only a portion of the standard is mandatory.[2] Since 1 May 2006, support for the mandatory subset has been officially required for all software products sold in the PRC.

Different Unicode mappings between GB 18030 versions
GB byte sequence    Unicode code point (GB 18030-2000)    Unicode code point (GB 18030-2005)
A8 BC (ḿ)           U+E7C7                                U+1E3F ḿ
81 35 F4 37         U+1E3F ḿ                              U+E7C7

An older version of the standard, known as "Chinese National Standard GB 18030-2000: Information Technology—Chinese ideograms coded character set for information interchange—Extension for the basic set", was published on 17 March 2000. The encoding scheme is unchanged in the new version, and the only difference in GB-to-Unicode mapping is that GB 18030-2000 mapped the byte sequence A8 BC (ḿ) to the private use code point U+E7C7 and the byte sequence 81 35 F4 37 (without specifying any glyph) to U+1E3F (ḿ), whereas GB 18030-2005 swaps these two assignments.[3]: 534 More code points are now associated with characters because of updates to Unicode, especially the addition of CJK Unified Ideographs Extension B. Some characters used by ethnic minorities in China, such as Mongolian characters and Tibetan characters (-1997 and -2006), have been added as well, which accounts for the renaming of the standard.
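The swap between the two versions can be written down as a small lookup table. This is a minimal sketch that transcribes only the two differing assignments described above; the names `VERSION_DIFFERENCES`, `decode_differing`, and `is_swapped` are hypothetical helpers, not part of any library.

```python
# The two byte sequences whose Unicode mapping differs between versions.
A8BC = bytes.fromhex("A8BC")
FOUR_BYTE = bytes.fromhex("8135F437")

# Hypothetical table holding only the differing assignments, per version.
VERSION_DIFFERENCES = {
    "GB 18030-2000": {A8BC: 0xE7C7, FOUR_BYTE: 0x1E3F},
    "GB 18030-2005": {A8BC: 0x1E3F, FOUR_BYTE: 0xE7C7},
}


def decode_differing(sequence: bytes, version: str) -> str:
    """Return the character that a differing byte sequence maps to in a version."""
    return chr(VERSION_DIFFERENCES[version][sequence])


def is_swapped() -> bool:
    """Check that the 2005 assignments are the 2000 assignments exchanged."""
    old = VERSION_DIFFERENCES["GB 18030-2000"]
    new = VERSION_DIFFERENCES["GB 18030-2005"]
    return old[A8BC] == new[FOUR_BYTE] and old[FOUR_BYTE] == new[A8BC]


for version in VERSION_DIFFERENCES:
    code_point = ord(decode_differing(A8BC, version))
    print(f"{version}: A8 BC -> U+{code_point:04X}")
```

A decoder that has to handle data produced under both versions could consult such a table before falling back to the shared mapping, although how real implementations choose a version is outside what the text above specifies.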

Compared with its predecessors, GB 18030 changes the Unicode mapping of the 81 characters that were provisionally assigned a Unicode Private Use Area (PUA) code point (U+E000–F8FF) in GBK 1.0 and were later encoded in Unicode.[4] This is specified in Appendix E of GB 18030.[3]: 534[5]: 499 There are 24 characters in GB 18030-2005 that are still mapped to the Unicode PUA.[6] According to Ken Lunde, the 2018 draft of a new revision of GB 18030 will finally eliminate these mappings.[7]
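Whether a mapping target is a private use code point can be tested against the range cited above. The sketch below assumes that range (U+E000 to U+F8FF) and uses three sample rows from this article's table of private use characters, pairing the GBK 1.0 code point with the one later assigned in Unicode 4.1; `is_private_use` and `SAMPLE_ROWS` are illustrative names.

```python
# Private Use Area of the Basic Multilingual Plane, as cited in the text.
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF


def is_private_use(code_point: int) -> bool:
    """Return True if the code point lies in U+E000..U+F8FF."""
    return PUA_FIRST <= code_point <= PUA_LAST


# Sample rows: GB byte sequence -> (GBK 1.0 code point, Unicode 4.1 code point).
SAMPLE_ROWS = {
    "A8 BF": (0xE7C8, 0x01F9),
    "A9 89": (0xE7E7, 0x303E),
    "FE 50": (0xE815, 0x2E81),
}

for sequence, (gbk_cp, unicode_cp) in SAMPLE_ROWS.items():
    print(
        f"{sequence}: U+{gbk_cp:04X} private use={is_private_use(gbk_cp)}, "
        f"U+{unicode_cp:04X} private use={is_private_use(unicode_cp)}"
    )
```

Text converted under a mapping that still targets the private use range loses its meaning once it leaves a system that shares that convention, which is why the later mappings move to assigned code points.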

Private use characters in GB-to-Unicode mappings
GB byte sequence; Unicode code point in GBK 1.0[8][3]: 534, GB 18030-2005[6], and Unicode 4.1 (private use: U+E000–F8FF)
A6 D9[9]: 108 U+E78D U+FE10
A6 DA U+E78E U+FE12
A6 DB U+E78F U+FE11
A6 DC U+E790 U+FE13
A6 DD U+E791 U+FE14
A6 DE U+E792 U+FE15
A6 DF U+E793 U+FE16
A6 EC U+E794 U+FE17
A6 ED U+E795 U+FE18
A6 F3 U+E796 U+FE19
A8 BC U+E7C7 U+1E3F ḿ
A8 BF U+E7C8 U+01F9 ǹ
A9 89 U+E7E7 U+303E
A9 8A U+E7E8 U+2FF0
A9 8B U+E7E9 U+2FF1
A9 8C U+E7EA U+2FF2
A9 8D U+E7EB U+2FF3
A9 8E U+E7EC U+2FF4
A9 8F U+E7ED U+2FF5
A9 90 U+E7EE U+2FF6
A9 91 U+E7EF U+2FF7
A9 92 U+E7F0 U+2FF8
A9 93 U+E7F1 U+2FF9
A9 94[9]: 173 U+E7F2 U+2FFA
A9 95 U+E7F3 U+2FFB
FE 50 U+E815 U+2E81
FE 51 U+E816 U+20087