KPS 9566

Alias(es): ISO-IR-202 (1997 version)
Language(s): Korean, English, Russian; partial support: Greek, Japanese
Standard: KPS 9566
Current status: Used only in North Korea.
Classification: ISO-2022-compatible DBCS, CJK encoding
Encoding formats: UHC-style encoding,[1] ISO 2022
Other related encoding(s): KS X 1001, GB 12052

KPS 9566 ("DPRK Standard Korean Graphic Character Set for Information Interchange")[2] is a North Korean standard specifying a character encoding for the Chosŏn'gŭl (Hangul) writing system used for the Korean language. The edition of 1997 specified an ISO 2022-compliant 94×94 two-byte coded character set. Subsequent editions have added further encoded characters outside of the 94×94 plane, in a manner comparable to UHC or GBK.[3]

KPS 9566 differs in approach from KS X 1001, its South Korean counterpart, in using a different ordering of chosŏn'gŭl,[4] in encoding explicit vertical presentation forms of punctuation, in not encoding duplicate hanja for multiple readings, and in including several characters specific to the North Korean political system, including special encodings for the names of the country's past and present leaders (Kim Il-sung, Kim Jong-il and Kim Jong-un).[1][2][3][5]

Although KPS 9566 was the original source of several characters added to Unicode,[6] not all KPS 9566 characters have Unicode equivalents. Those which do not are mapped to similar Unicode characters or to the Private Use Area.[7]

Background and other standards

The ASCII character set originated in the United States in 1963, and was revised in 1967 to the form it has today.[8] ASCII also became accepted as an international standard in 1967, becoming ECMA-6,[8] designated ISO/IEC 646 by the International Organization for Standardization.[9] It is presently designated ANSI X3.4-1986 and ISO 646:1991.[10] ASCII was a 7-bit, single-byte encoding including 94 graphical characters, the space, and 33 control codes, which provided basic support for representing American English text as a series of bytes.[8][10]

The next edition of ISO 646, published in 1972, revised the standard to introduce the concept of national versions of the code, allowing countries to replace a few less commonly used codes with their own required characters. At the same time, work on defining extension mechanisms for ASCII was underway, with the intention of being applicable to both 7-bit and 8-bit environments. This was completed in 1973 and published as JIS X 0202, ECMA-35 and ISO 2022.[11] ISO 2022 specifies mechanisms for using single-byte and multiple-byte character sets with a certain structure in both 7-bit and 8-bit environments, and for declaring and switching between them in a standard fashion using shift codes and escape sequences.[12]

Countries in East Asia, due to using large repertoires of Chinese characters, introduced standardised double-byte encodings (DBCS) for their writing systems, since the number of characters representable in a single-byte code was not sufficient. In an ISO 2022 compliant DBCS, every character can be represented with two ASCII printing character bytes; the location of a character can be referenced by these byte values, or by two numbers from 1 to 94 (a kuten), equal to the respective bytes minus 32.[13] The first registered ISO 2022 compliant DBCS, and the first East Asian DBCS to be established as a national standard, was the first edition of JIS X 0208 (Japan), published in 1978.[14][15] This was followed by GB 2312 (Mainland China) in 1980, and by Wansung code (South Korea; first designated KS C 5601-1987) in 1987.[16][15] Big5 (Taiwan), defined in 1984, did not follow the ISO 2022 structure.[16] When used in an 8-bit (rather than 7-bit) environment, GB 2312 and Wansung code were usually used with the eighth bit set, with ASCII or a similar SBCS used with the eighth bit unset; these encoding schemes are known as EUC-CN and EUC-KR, respectively.[17]
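The byte arithmetic described above can be sketched in Python. This is a minimal illustration, not part of any standard: the function names are invented here, and the only facts relied on are that the 7-bit form adds 32 (0x20) to each of the two numbers and that the EUC form is the same pair with the eighth bit set.

```python
def kuten_to_gl(row: int, cell: int) -> bytes:
    """Return the 7-bit byte pair for a row-cell position (each in 1..94)."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    # Each number is offset by 32, landing in the ASCII printing range 0x21..0x7E.
    return bytes([row + 0x20, cell + 0x20])


def kuten_to_euc(row: int, cell: int) -> bytes:
    """Return the 8-bit EUC byte pair: the 7-bit pair with the eighth bit set."""
    return bytes(b | 0x80 for b in kuten_to_gl(row, cell))


def euc_to_kuten(pair: bytes) -> tuple:
    """Invert kuten_to_euc, rejecting bytes outside 0xA1..0xFE."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("not a two-byte EUC code")
    return pair[0] - 0xA0, pair[1] - 0xA0
```

For example, position 01-01 becomes the bytes 0x21 0x21 in a 7-bit environment and 0xA1 0xA1 in EUC form, and position 94-94 becomes 0x7E 0x7E and 0xFE 0xFE respectively.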

Although the Korean writing system includes individual symbols (jamo) for consonants and vowels, serving as an alphabet, Korean text is properly typeset with these symbols composed into blocks for each syllable. Wansung code included individual Korean syllable blocks separately, treating them as a large set of characters similarly to hanja,[18] and was first defined by the third edition of the South Korean standard KS C 5601. The first edition had defined an encoding of individual jamo which allowed syllable blocks to be encoded as sequences, which was named N-byte Hangul, and had not been adopted as widely as intended.[19][20]

Wansung code did not encode all possible modern Korean syllables, only a selection of the 2350 most common,[2] although it allowed them to be specified using combining sequences, which often were not supported.[18] An alternative encoding, also South Korean, named Johab did, and served as a competitor to Wansung for some time.[19] Unified Hangul Code (UHC), introduced by Microsoft with Windows 95, extended EUC-KR, allowing the use of invalid EUC double-byte codes to represent all other syllables available in Johab.[18] A similar approach was taken by the Mainland Chinese GBK encoding, extending GB 2312 with support for Traditional Chinese and for less common Chinese characters by encoding them to double-byte codes invalid in EUC-CN.[16]
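The difference between codes inside the 94×94 EUC plane and the otherwise invalid codes reused by such extensions can be sketched as follows. This is an illustration only: the function name is invented, every byte of 0x80 or above is assumed to start a two-byte code, and the exact extension ranges of UHC or GBK are deliberately not modelled.

```python
def split_codes(data: bytes):
    """Yield (code, kind) pairs, where kind is 'ascii', 'plane' or 'outside-plane'."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Eighth bit unset: a single-byte ASCII (or similar SBCS) code.
            yield data[i:i + 1], "ascii"
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated double-byte code")
        pair = data[i:i + 2]
        # A valid EUC code has both bytes in 0xA1..0xFE; anything else is a
        # code that plain EUC leaves unused and an extension may assign.
        in_plane = all(0xA1 <= b <= 0xFE for b in pair)
        yield pair, "plane" if in_plane else "outside-plane"
        i += 2
```

Applied to the bytes 0x41 0xB0 0xA1 0x81 0x41, this yields one ASCII code, one code inside the plane, and one code outside it.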

South Korea was not the only country developing an ISO 2022 DBCS for Korean: the Mainland Chinese GB 12052 was published in 1989. This was not closely related to Wansung code, although it also included composed syllables. Instead, it corresponded to GB 2312 with Korean syllables (and 94 hanja) replacing the Chinese characters, except for the inclusion of a dollar sign in place of a yuan sign. It may have been developed for use by the Korean minority in north-eastern China.[2]

Likewise, North Korea developed KPS 9566. Although North Korea and South Korea both use Korean Chosŏn'gŭl (Hangul) as their primary writing system, they use different lexicographical orders.[21] Hence, character ordering differs between Wansung code and KPS 9566.[4]

KPS 9566 has undergone several revisions, including editions of 1997 and 2003,[22] mainly to enhance compatibility with Unicode. These are commonly indicated by specifying the year (e.g. KPS 9566-97, 9566-2003). The current edition as of the release of Red Star OS 3.0 appears to be KPS 9566-2011, which adds Kim Jong-un to the list of leaders.[3] The publicly available code chart for the 1997 edition of KPS 9566 shows an ISO 2022 94×94 plane.[23] The more recent editions, from what sources of information are available outside of North Korea itself, appear to define additional allocations outside of the EUC plane (similarly to GBK or UHC).[3]

Due to the interoperability issues arising from the use of multiple national standards and platform- or font-specific proprietary character encodings, the Unicode standard was developed with the intent of allowing all representable text to be interchanged in a single, universal format. The first edition of Unicode was published in 1991 and 1992,[24] and ISO/IEC 10646 was established in sync with Unicode in 1993.[25] Unicode formats are preferred for international use on the World Wide Web, where legacy character encodings are treated as partial encodings of Unicode by means of mapping files.[26][27]

Design

In principle, KPS 9566 is similar to the Wansung character set defined by the South Korean KS X 1001 standard, although the two are not compatible. Both encode a section of punctuation, symbols, jamo, kana and alphabetical characters, followed by a subset of the possible modern chosŏn'gŭl syllables, followed by a section of hanja.[2] However, KPS 9566 uses a different ordering of jamo and syllables to conform with North Korean lexicographical ordering standards.[4] KPS 9566 also includes 28 explicitly rotated punctuation characters for vertical typography, which KS X 1001 does not, and encodes each hanja only once, whereas KS X 1001 encodes several hanja with multiple readings multiple times.[2]

KPS 9566-97 encodes a total of 2679 chosŏn'gŭl syllables and 4653 hanja. This provides better coverage than the 2350 syllables encoded by Wansung code: for instance, the 똠 character, used in the title of a noted Korean literary work, does not have an assigned Wansung codepoint, but has one (38-02) in KPS 9566.[2] The hanja section includes 4652 characters from the Unified Repertoire and Ordering and one from CJK Unified Ideographs Extension A. The entirety of row 15, the latter half of row 44 (after the syllables block) and the latter half of row 94 (after the hanja block) may be used for user-defined purposes.[2]

KPS 9566 is especially distinguished by its inclusion of several special characters from North Korean political life. Specifically, it includes the hammer, sickle and brush emblem of the Workers' Party of Korea, both uncircled and circled[7] (code points 12-01 and 12-02),[23] and two groups of three special-purpose characters which spell out the names of the North Korean leaders Kim Il-sung (김일성) and Kim Jong-il (김정일) in a special decorative font (code points 04-72 to 04-74 and 04-75 to 04-77, respectively).[28] The syllables for Kim and Il, which are identical in the spelling of both names, are encoded twice. KPS 9566-2011 additionally includes the name of Kim Jong-un (김정은) as code points 04-78 to 04-80.[3][5]

Due to these special characters, there is currently no full round-trip compatibility between KPS 9566 and Unicode, unless unsupported characters are mapped to the Private Use Area.[1]
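How Private Use Area assignments make a mapping reversible can be sketched as below. This is a hypothetical illustration: the function name and the choice of U+E000 as the first Private Use code point are assumptions made here, and the resulting assignments are not those of any published KPS 9566 mapping table.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # Private Use Area in the Basic Multilingual Plane


def build_tables(mapped: dict, unmapped: list, pua_start: int = PUA_START):
    """Build forward and reverse tables keyed by (row, cell) positions.

    mapped:   {(row, cell): character} for positions with a Unicode equivalent.
    unmapped: [(row, cell), ...] for positions without one; each receives its
              own Private Use code point so that no information is lost.
    """
    to_unicode = dict(mapped)
    for offset, code in enumerate(unmapped):
        code_point = pua_start + offset
        if code_point > PUA_END:
            raise ValueError("ran out of Private Use Area code points")
        to_unicode[code] = chr(code_point)
    from_unicode = {char: code for code, char in to_unicode.items()}
    if len(from_unicode) != len(to_unicode):
        # Two positions share one character, so decoding then encoding is lossy.
        raise ValueError("mapping is not one-to-one")
    return to_unicode, from_unicode


# 38-02 (똠) and the emblem positions 12-01 and 12-02 are taken from the text;
# the Private Use code points they receive here are arbitrary.
to_unicode, from_unicode = build_tables({(38, 2): "똠"}, [(12, 1), (12, 2)])
```

Mapping an unsupported position to a merely similar existing character instead would trip the one-to-one check whenever that character already has its own position, which is the round-trip failure described above.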

KPS 10721

North Korea also developed a second character set, KPS 10721 "Code of the supplementary Korean Hanja Set for Information Interchange", which was published in 2000. KPS 10721 encodes a set of at least 19469 hanja[2] additional to those included in KPS 9566. As of 2009, these did not all have mappings to Unicode, but included 10358 from the Unified Repertoire and Ordering, 3187 from CJK Unified Ideographs Extension A and 107 from CJK Compatibility Ideographs (all in the Basic Multilingual Plane), as well as 5767 from CJK Unified Ideographs Extension B and 50 from CJK Compatibility Ideographs Supplement (in the Supplementary Ideographic Plane).[2]

Besides the mapping of these hanja to Unicode, little is known about the KPS 10721 standard outside of North Korea.[2][5] North Korean reference glyphs are not provided for these hanja in the Unicode code charts, due to a lack of suitable font data available to the Unicode Consortium.[29] Unicode hanja characters with KPS 9566 or KPS 10721 sources are nonetheless cross-referenced to their KPS codes in the Unihan database with the key kIRG_KPSource.[30]
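Reading these cross-references from Unihan data can be sketched as follows. Several details are assumptions rather than statements from this article: that each data line has three tab-separated fields (code point, key, value), that values take the form KP0- or KP1- followed by four hexadecimal digits, and that KP0 denotes KPS 9566 and KP1 denotes KPS 10721. The function name is invented, and the sample lines in the usage note are made up for illustration rather than copied from the database.

```python
import re

# Assumed value syntax: source prefix, hyphen, four hexadecimal digits.
_VALUE = re.compile(r"^(KP[01])-([0-9A-F]{4})$")
# Assumed meaning of the two prefixes.
_STANDARD = {"KP0": "KPS 9566", "KP1": "KPS 10721"}


def parse_kp_sources(lines):
    """Return {character: (standard, hex code)} for kIRG_KPSource lines."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 3 or fields[1] != "kIRG_KPSource":
            continue  # some other Unihan key
        match = _VALUE.match(fields[2])
        if match is None:
            continue  # value not in the assumed syntax
        char = chr(int(fields[0][2:], 16))  # "U+4E00" -> "一"
        result[char] = (_STANDARD[match.group(1)], match.group(2))
    return result
```

Given the made-up line "U+4E00<TAB>kIRG_KPSource<TAB>KP0-AAAA", the function would report the character 一 as having a KPS 9566 source with code AAAA.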

Documentation and relationship to Unicode

Unicode's initial coverage of Korean syllables, added in version 1.0, was based on Wansung code. In Unicode version 2.0, a new block of Korean syllables (the present Hangul Syllables block) was added, based on the syllable repertoire available in Johab, and the previous block was deleted (it is now occupied by CJK Unified Ideographs Extension A). This was done under the assumption that no Unicode-encoded Korean data existed yet, but became known as the "Korean mess", and the responsible committees pledged not to make such an incompatible change in the future,[31] a pledge codified by the Unicode Stability Policy.[32]

The code chart for KPS 9566-97, published April 1997,[2] was submitted to the ISO International Register of Coded Character Sets for registration for use with ISO/IEC 2022. It was registered in June 1998 with the number ISO-IR-202. This code chart is publicly available from the Information Processing Society of Japan.[23]

In August 1999, the North Korean national body submitted a document to WG2 (ISO/IEC JTC 1/SC 2 Working Group 2), the ISO body responsible for ISO/IEC 10646, the international standard corresponding to Unicode. This document requested the addition of the KPS 9566 codes to the existing cross-references from the CJK Unified Ideographs charts, the addition of 80 symbol characters from KPS 9566 which did not have existing Unicode mappings, a resolution to the difference in collation order between KPS 9566 and Unicode (due to the order of the characters in Unicode following the South Korean encodings) and the addition of 8 combining jamo. It also asked WG2 to edit the existing Unicode character and block names to use the term "Korean character" rather than "Hangul".[33] An expanded version of this proposal, broken into several documents, was submitted as a work item in December 1999.[34]

A detailed response was submitted by the Swedish representative in March 2000, opposing several of the points and elaborating on Sweden's vote against the proposal. This response stated that changing the encoding of the Korean characters again would cause major disruption, even more so than the first time, which was done when comparatively few implementations existed, but which in retrospect should not have been done. It explained that few or no languages can be collated correctly by code point value, that a tailoring for the Unicode Collation Algorithm or ISO/IEC 14651 (then being drafted) should be used for that purpose, and that normative names of characters already assigned cannot be changed, due to the stability policy, although non-normative translations to other languages can be employed. It suggested that a machine-readable mapping file between Unicode and KPS 9566 could be provided by the North Korean body itself, and would be more useful than a printed cross-reference in the standard document. Regarding the proposed additional characters, the response stated that characters which would have compatibility decompositions in Unicode should not be added and that logos, including those of political parties, and special characters for names of particular persons should not be added.[35]

In July 2000, the North Korean body wrote to WG2, accusing them of developing both versions of the Unicode encoding for Korean on the basis of South Korean proposals only, without consulting North Korea, accusing them of putting the commercial interests of companies and fears of international confusion above respect for North Korea's sovereignty, and stating that North Korea would regard further refusal to change the name and order of the Korean characters in Unicode as an insult to their sovereign dignity and as compromising the ISO's claims to impartiality. They reiterated their demand for WG2 and Unicode to "correct" the order of the Korean characters, and to "correct" the names "Hangul Jamo" and "Hangul Syllable" to "Korean Alphabet" and "Korean Syllable".[4]

In August 2000, the North Korean national body submitted a more detailed version of their requests in a series of five consecutive proposals. These requested the addition of 14 additional jamo characters,[36] the addition of 82 symbol characters,[37] and the use of the term "Korean alphabet" instead of "Hangul",[38] provided supporting evidence for the North Korean collation order,[21] and requested addition of the North Korean hanja repertoire.[39] These proposals were discussed in two meetings between North Korean, South Korean, Swedish and other WG2 representatives in September 2000, in which the North Korean body was asked to provide manuscript evidence for the additional jamo characters, to resubmit their symbols proposal with symbols which had already been accepted into Unicode removed, and to consider using ISO/IEC 14651, then at final draft stage, for collation purposes.[40]

In September 2001, the North Korean national body submitted a revised series of proposals requesting the addition of several KPS 9566 and KPS 10721 characters, including 70 symbol characters, to Unicode.[41][42] This version of the proposal included a section of document excerpts demonstrating use of several characters, with short explanations of their purpose. The Workers' Party of Korea symbol was named the "Hammer and Sickle and Brush",[41] renamed from "Mark of the Workers' Party of Korea" in earlier versions of the proposal,[37] and justified as being used as an identifying symbol on maps.[41] As justification for the proposed characters for leaders' names, they explained that the leaders' names often appear with a different size and font weight in North Korean publications for the purpose of emphasis.[41] A follow-up by South Korean WG2 representatives requested evidence, names in Korean and justifications for adding certain of these characters, and noted that non-emphasised versions of the characters for the leaders' names already existed.[43] A meeting of North and South Korean representatives from WG2 was convened in October 2001, which recommended 47 of the symbol characters for addition to Unicode, and suggested that the leaders' names and WPK symbols be raised for further discussion by WG2.[44]

A subsequent feedback document from February 2002 regarding the North Korean proposed additions requested that the "tea" symbol for a tea house be accepted as a more general "hot beverage" symbol, equating it with symbols used in guidebooks to denote hot or non-alcoholic beverages. It also recommended that the reference glyph for the existing codepoint for an umbrella without rain be modified to harmonise with the proposed reference glyph for the umbrella with rain, equating them to the "keep dry" symbols used on packaging, and raised the question of which lightning bolt and high voltage warning symbols in existing symbol collections could be unified with the proposed "high voltage" character.[45] All three of these characters were accepted into Unicode in version 4.0.[46] It also recommended that the horizontal-barred fractions and the left-up pointing scissors be encoded using a variation selector, since the scissors did not accompany a differently-oriented pair of scissors, and since the existing Unicode fraction codepoints unified the skewed and horizontal forms.[45]

In November 2002, the South Korean body published a set of three-way tables mapping characters between the KPS 9566, KS X 1001 (as EUC-KR) and ISO/IEC 10646 standards as they existed in 2000. These tables had been prepared without input from North Korea.[47]

In August 2004, a pair of mapping tables between KPS 9566-2003 and Unicode were submitted to the OpenOffice.org project by an individual using the name "ooprojlover", who stated that they represented the updated version of the KPS 9566 standard and requested that support be added.[22] These files mapped the characters unavailable in Unicode to the Private Use Area, and included additional encoded forms for other syllable blocks outside of the main ISO-IR-202 plane. A mapping table was later published by the Unicode Consortium in 2011, based on this mapping data but with errors corrected with reference to the ISO-IR chart.[1]

Copies of Red Star OS 3.0 include fonts for a more recent edition of KPS 9566, appearing to be KPS 9566-2011. The mapping table used by Red Star OS internally has been successfully extracted. Besides adding Kim Jong-un to the list of leaders, KPS 9566-2011 amends the mappings of certain vertical forms compared to the 2003 mappings (taking advantage of the Vertical Forms block added in Unicode 4.1), and also includes several additional hanja and symbols encoded outside of the ISO-IR-202 plane. Several of these additional symbols are also mapped to the Private Use Area; however, their identity is not known, since no names or reference glyphs for those characters are known outside of North Korea.[3]

Impact on Unicode today

Several current Unicode characters were added to Unicode 4.0 as a result of the North Korean proposals, although not always at the original proposed codepoints. These include HOT BEVERAGE (☕, proposed as TEA SYMBOL), which was proposed as a map symbol for marking a tea house, and the flag symbols WHITE FLAG (⚐) and BLACK FLAG (⚑), which were proposed as map symbols for sites of battles and military victories.[6] These characters were proposed for the provisional code points U+270A, U+268E and U+268F respectively,[44] but encoded at the final code points U+2615, U+2690 and U+2691 respectively.[48] They also include a series of directional bold arrows in the range U+2B05 through U+2B0D.[44] These were added at the code points they were proposed for, except that the north-east and north-west arrows were swapped compared to the proposal.[50] The proposed rightward arrow was not added separately, since it was mapped to an existing character in the Dingbats block.[49]

Other pictographic characters which were included in the North Korean proposal include the umbrella with raindrops (☔), the lightning bolt for high voltage (⚡) and the warning triangle (⚠).[44] Following some discussion about which other high voltage symbol glyphs in use represented the same character as the one from the North Korean proposal,[45] and which glyph would be best to include for it in the Unicode code chart,[51] and following modification of the code chart glyph of the existing umbrella character without rain (U+2602, ☂) to harmonise with the new umbrella with raindrops from the North Korean proposal,[45][53] these characters were also added in Unicode 4.0, at the same time as the flags and the beverage symbol.[46][48][51] Although proposed for the provisional code points U+2618, U+267F and U+267E,[44] they were given the final code points U+2614, U+26A1 and U+26A0 respectively.[48]
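The final code points given above can be checked against the character names shipped with Python's standard unicodedata module. This is only a convenience sketch: the dictionary keys are informal descriptions chosen here, not names used in the proposals.

```python
import unicodedata

# Final code points as stated in the text; the keys are informal labels.
FINAL_CODE_POINTS = {
    "tea / hot beverage": 0x2615,
    "white flag": 0x2690,
    "black flag": 0x2691,
    "umbrella with raindrops": 0x2614,
    "high voltage": 0x26A1,
    "warning triangle": 0x26A0,
}


def describe(code_point: int) -> str:
    """Format a code point together with its formal Unicode character name."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


for label, code_point in FINAL_CODE_POINTS.items():
    print(f"{label}: {describe(code_point)}")
```

The same function can be applied to the arrow range U+2B05 through U+2B0D to list the names of the bold arrows.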

Of these characters, the hot beverage, umbrella with raindrops, lightning bolt and warning triangle, and the upward, downward and leftward arrows were subsequently selected as mappings from the Japanese cellular emoji sets,[54] making a total of seven current Unicode emoji which were originally added to Unicode at the request of North Korea. The umbrella with raindrops and the upward, downward and leftward arrows were also unified with characters from the ARIB extensions used in Japanese broadcasting,[55] which include several characters now classified as emoji,[56] and were mapped to Unicode in Unicode 5.2.[57] However, the pair of white and black flags used as emoji or in emoji regional and identity flag sequences is a different, "waving" set added in Unicode 7.0 (U+1F3F3