GB 18030

	GB 18030 encoding layout. "Half codes" indicates codes used in pairs as four-byte codes.
MIME / IANA	GB18030
Alias(es)	Code page 54936
Language(s)	International, but primarily meant for Chinese
Standard	GB 18030-2005, GB 18030-2000
Classification	Unicode Transformation Format, extended ASCII, variable-width encoding, CJK encoding
Extends	EUC-CN, GBK
Transforms / Encodes	ISO 10646 (Unicode)
Preceded by	GBK, GB2312
	^ Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.

GB 18030 is a Chinese government standard, titled "Information Technology — Chinese coded character set", that defines the language and character support required of software in China. GB18030 is the registered Internet name for the official character set of the People's Republic of China (PRC), superseding GB2312.^[1] As a Unicode Transformation Format^[a] (i.e. an encoding of all Unicode code points), GB18030 supports both simplified and traditional Chinese characters. It is also compatible with legacy encodings including GB2312, CP936,^[b] and GBK 1.0.

In addition to the "GB18030 character encoding", this standard contains requirements about which scripts must be supported, font support, etc.^[2]

History

The GB18030 character set is formally called "Chinese National Standard GB 18030-2005: Information Technology—Chinese coded character set". GB abbreviates Guójiā Biāozhǔn (国家标准), which means national standard in Chinese. The standard was published by the China Standard Press, Beijing, 8 November 2005. Only a portion of the standard is mandatory.^[2] Since 1 May 2006, support for the mandatory subset is officially required for all software products sold in the PRC.

Different Unicode mappings between GB 18030 versions
GB byte sequence	Unicode code point in GB 18030-2000	Unicode code point in GB 18030-2005
A8 BC (ḿ)	`U+E7C7`	U+1E3F ḿ
81 35 F4 37	U+1E3F ḿ	`U+E7C7`

An older version of the standard, known as "Chinese National Standard GB 18030-2000: Information Technology—Chinese ideograms coded character set for information interchange—Extension for the basic set", was published on 17 March 2000. The encoding scheme is unchanged in the new version, and the only difference in the GB-to-Unicode mapping is that GB 18030-2000 mapped the byte sequence A8 BC (ḿ) to the private use code point U+E7C7 and the byte sequence 81 35 F4 37 (with no glyph specified) to U+1E3F (ḿ), whereas GB 18030-2005 swaps these two assignments.^[3]^: 534 More code points are now associated with characters because of updates to Unicode, especially the addition of CJK Unified Ideographs Extension B. Some characters used by ethnic minorities in China, such as Mongolian characters and Tibetan characters (-1997 and -2006), have been added as well, which accounts for the renaming of the standard.
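
The swap between the two editions can be illustrated with a small lookup table. The following is a minimal sketch, not a real codec: it covers only the two byte sequences named above, and the names `SWAPPED_MAPPINGS`, `code_point_for`, and `format_code_point` are hypothetical, chosen for this example.

```python
# Hypothetical lookup tables covering only the two byte sequences whose
# Unicode mappings differ between the editions (values taken from the text).
SWAPPED_MAPPINGS = {
    "GB 18030-2000": {
        bytes.fromhex("A8BC"): 0xE7C7,      # private use code point
        bytes.fromhex("8135F437"): 0x1E3F,  # ḿ
    },
    "GB 18030-2005": {
        bytes.fromhex("A8BC"): 0x1E3F,      # ḿ
        bytes.fromhex("8135F437"): 0xE7C7,  # private use code point
    },
}


def code_point_for(sequence: bytes, edition: str) -> int:
    """Return the code point assigned to one of the two swapped sequences."""
    return SWAPPED_MAPPINGS[edition][sequence]


def format_code_point(code_point: int) -> str:
    """Format a code point in the usual U+XXXX notation."""
    return f"U+{code_point:04X}"


if __name__ == "__main__":
    for edition, table in SWAPPED_MAPPINGS.items():
        for sequence, code_point in table.items():
            print(edition, sequence.hex().upper(), format_code_point(code_point))
```

Because the same two byte sequences appear in both tables with their values exchanged, data encoded under one edition and decoded under the other would differ only for these two entries.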

Compared with its ancestors, GB 18030's mapping to Unicode has been modified for the 81 characters that were provisionally assigned a Unicode Private Use Area (PUA) code point (U+E000–U+F8FF) in GBK 1.0 and that have since been encoded in Unicode.^[4] This is specified in Appendix E of GB 18030.^[3]^: 534^[5]^: 499 There are 24 characters in GB 18030-2005 that are still mapped to the Unicode PUA.^[6] According to Ken Lunde, the 2018 draft of a new revision of GB 18030 will finally eliminate these mappings.^[7]
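
Whether a given mapping still points into the Private Use Area can be checked with a simple range test. The sketch below assumes only the range quoted above (U+E000 to U+F8FF); the names `is_private_use`, `SAMPLE_ROWS`, and `moved_out_of_pua` are hypothetical, and the three sample rows are taken from the mapping table in this section.

```python
# Basic Multilingual Plane Private Use Area, the range quoted in the text.
# (Unicode also defines supplementary private use planes; they are out of
# scope for this sketch.)
PUA_FIRST = 0xE000
PUA_LAST = 0xF8FF


def is_private_use(code_point: int) -> bool:
    """Return True if the code point lies in U+E000..U+F8FF."""
    return PUA_FIRST <= code_point <= PUA_LAST


# Sample rows: GB byte sequence -> (GBK 1.0 private use code point,
# code point later assigned in Unicode).
SAMPLE_ROWS = {
    "A8 BF": (0xE7C8, 0x01F9),
    "A9 89": (0xE7E7, 0x303E),
    "FE 50": (0xE815, 0x2E81),
}


def moved_out_of_pua(rows):
    """List byte sequences whose mapping moved from the PUA to a regular code point."""
    return [
        sequence
        for sequence, (old, new) in rows.items()
        if is_private_use(old) and not is_private_use(new)
    ]


if __name__ == "__main__":
    print(moved_out_of_pua(SAMPLE_ROWS))
```

The same predicate could be run over a full mapping table to count how many entries remain in the PUA for a given edition.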

Private use characters in GB-to-Unicode mappings (Unicode code points; private use code points are shown in backticks)
GB byte sequence	GBK 1.0^[8]^[3]^: 534	GB 18030-2005^[6]	Unicode 4.1
A6 D9^[9]^: 108		`U+E78D`	U+FE10 ︐
A6 DA		`U+E78E`	U+FE12 ︒
A6 DB		`U+E78F`	U+FE11 ︑
A6 DC		`U+E790`	U+FE13 ︓
A6 DD		`U+E791`	U+FE14 ︔
A6 DE		`U+E792`	U+FE15 ︕
A6 DF		`U+E793`	U+FE16 ︖
A6 EC		`U+E794`	U+FE17 ︗
A6 ED		`U+E795`	U+FE18 ︘
A6 F3		`U+E796`	U+FE19 ︙
A8 BC	`U+E7C7`	U+1E3F ḿ	U+1E3F ḿ
A8 BF	`U+E7C8`	U+01F9 ǹ	U+01F9 ǹ
A9 89	`U+E7E7`	U+303E 〾	U+303E 〾
A9 8A	`U+E7E8`	U+2FF0 ⿰	U+2FF0 ⿰
A9 8B	`U+E7E9`	U+2FF1 ⿱	U+2FF1 ⿱
A9 8C	`U+E7EA`	U+2FF2 ⿲	U+2FF2 ⿲
A9 8D	`U+E7EB`	U+2FF3 ⿳	U+2FF3 ⿳
A9 8E	`U+E7EC`	U+2FF4 ⿴	U+2FF4 ⿴
A9 8F	`U+E7ED`	U+2FF5 ⿵	U+2FF5 ⿵
A9 90	`U+E7EE`	U+2FF6 ⿶	U+2FF6 ⿶
A9 91	`U+E7EF`	U+2FF7 ⿷	U+2FF7 ⿷
A9 92	`U+E7F0`	U+2FF8 ⿸	U+2FF8 ⿸
A9 93	`U+E7F1`	U+2FF9 ⿹	U+2FF9 ⿹
A9 94^[9]^: 173	`U+E7F2`	U+2FFA ⿺	U+2FFA ⿺
A9 95	`U+E7F3`	U+2FFB ⿻	U+2FFB ⿻
FE 50	`U+E815`	U+2E81 ⺁	U+2E81 ⺁
FE 51	`U+E816`	`U+E816`	U+20087 𠂇
