KS X 1001

MIME / IANA: ks_c_5601-1987
Alias(es): KS C 5601
Language(s): Korean, English, Russian; partial support: Greek, Japanese
Standard: KS X 1001
Classification: ISO-2022-compatible DBCS, CJK encoding
Encoding formats: EUC-KR, ISO 2022, UHC, Johab
Preceded by: N-byte Hangul code (KS C 5601-1974)
Other related encoding(s): KS X 1002, KPS 9566, JIS X 0208, GB 2312, GB 12052

KS X 1001, "Code for Information Interchange (Hangul and Hanja)",[a][1] formerly called KS C 5601, is a South Korean coded character set standard to represent hangul and hanja characters on a computer.

KS X 1001 is encoded by the most common legacy (pre-Unicode) character encodings for Korean, including EUC-KR and Microsoft's Unified Hangul Code (UHC). It contains Korean Hangul syllables, CJK ideographs (Hanja), Greek, Cyrillic, Japanese (Hiragana and Katakana) and some other characters.

KS X 1001 is arranged as a 94×94 table, following the structure of 2-byte code words in ISO 2022 and EUC. Therefore, its code points are pairs of integers 1–94. However, some encodings (UHC and Johab), in addition to providing codes for every code point, provide additional codes for characters otherwise representable only as code point sequences.

History

This standard was previously known as KS C 5601. It has been revised several times, including in 1987, 1992, 1998 and 2002.

The present double-byte Wansung (완성, Wanseong, 'precomposing')[1] character set was standardised by the third edition of KS C 5601,[2] which was published in 1987.[3] It is an ISO 2022-compatible encoding, typically used in EUC form, which assigns double-byte codes to non-Hangul characters, Hangul jamo, and the most common Hangul syllables. By contrast, Johab (조합, Johap, 'combining')[1] is not compatible with ISO 2022 but assigns double-byte codes to all Hangul syllables that use modern jamo.[2] Wansung is technically a variable-length encoding, since other syllables can be represented with eight-byte sequences (using the jamo and the Hangul Filler character), but this feature is not always implemented.[4]

The earliest edition of KS C 5601, published in 1974,[2] defined a variable-length[2] 7-bit character set which assigned single-byte code points to 51[3] basic Hangul jamo, somewhat analogously to JIS C 6220, in an encoding known as "N-byte Hangul".[5] The second edition, published in 1982, retained the main character set from the 1974 edition but defined two supplementary sets, including a version of Johab. Neither edition was adopted as widely as intended.[2]

Wansung was kept unchanged in the 1987 and 1992 editions. The 1992 edition added further annex material,[3] including the definition of the Johab encoding[6] in annex 3 and the older N-byte Hangul encoding in annex 4.[1][5] The Johab annex was published in response to industry use of Johab as a competing encoding to Wansung; at the time it was used by Hangul Word Processor. After Microsoft introduced Unified Hangul Code in Windows 95, and Hangul Word Processor abandoned Johab in favour of Unicode in 2000, Johab ceased to be commonly used.[2]

Encodings

(A screenshot of an old version of Firefox showing Big5, GB2312, GBK, GB18030, HZ, ISO-2022-CN, Big5-HKSCS, EUC-TW, EUC-JP, ISO-2022-JP, Shift_JIS, EUC-KR, UHC, Johab and ISO-2022-KR as available encodings under the CJK sub-menu.)
Various CJK encodings, including four based on KS X 1001, supported by Mozilla Firefox as of 2004. (This support has been reduced in later versions to avoid certain cross site scripting attacks.)

Encoding schemes of KS X 1001 include EUC-KR (in both ASCII and ISO 646-KR based variants, the latter of which includes a won currency sign (₩) at byte 0x5C rather than a backslash) and ISO-2022-KR,[7] as well as ISO-2022-JP-2 (which also encodes JIS X 0208 and JIS X 0212). These all have the drawback that they only assign codes for the 2350 precomposed Hangul syllables which have their own KS X 1001 code points (out of 11172 in total, not counting those using obsolete jamo), and require others to use eight-byte composition sequences, which are not supported by some partial implementations of the standard.[4]

The Johab encoding (stipulated in annex 3 of the 1992 version of the standard) and the EUC-KR superset known as Unified Hangul Code (UHC, also called Windows-949) provide single codes for all 11172 Hangul syllables.[7][6] ISO-2022-KR and Johab are rarely used. Some operating systems extend this standard in other non-uniform ways, e.g. the EUC-KR extensions MacKorean on the classic Mac OS, and IBM-949 by IBM.
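The syllable counts above can be checked with a short sketch. It assumes a CPython installation with the standard `euc_kr` and `cp949` codecs; the helper name `two_byte_count` is illustrative only. The total of 11172 is 19 initials × 21 medials × 28 finals (counting "no final"), which is the size of the Unicode Hangul Syllables block.

```python
# Count modern Hangul syllables that receive a single two-byte code.
FIRST, LAST = 0xAC00, 0xD7A3  # Unicode Hangul Syllables block

def two_byte_count(codec: str) -> int:
    """Number of syllables the named codec encodes as exactly two bytes."""
    count = 0
    for cp in range(FIRST, LAST + 1):
        try:
            encoded = chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue  # this codec has no representation for the syllable
        if len(encoded) == 2:
            count += 1
    return count

total = 19 * 21 * 28  # initials x medials x finals (including "none")
print(total, two_byte_count("euc_kr"), two_byte_count("cp949"))
```

Syllables that do not count as two-byte under `euc_kr` are either rejected or emitted as a longer composition sequence, depending on the implementation; the sketch treats both cases the same way.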

Hangul Filler

The Hangul Filler character is used to introduce eight-byte Hangul composition sequences[8][9] and to stand in for an absent element (usually an empty final) in such a sequence.[9]
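As an illustration, the following sketch builds such an eight-byte sequence in EUC-KR form for a modern syllable. It assumes the sequence order filler, initial, medial, final, and that the row 4 characters map in order to U+3131 onwards (so that cell = code point − 0x3130); the names `row4_bytes` and `composition_sequence` are hypothetical and do not come from the standard.

```python
# Compatibility jamo for each Unicode initial and final index.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
FINALS = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"
FILLER = "\u3164"  # Hangul Filler, KS X 1001 code point 04-52

def row4_bytes(jamo: str) -> bytes:
    """EUC-KR bytes of a row 4 character (assumed cell = cp - 0x3130)."""
    cell = ord(jamo) - 0x3130
    if not 1 <= cell <= 94:
        raise ValueError(f"not a row 4 character: {jamo!r}")
    return bytes([0xA4, 0xA0 + cell])

def composition_sequence(syllable: str) -> bytes:
    """Eight-byte sequence: filler, initial, medial, final (or filler)."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError("not a modern Hangul syllable")
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    parts = [
        FILLER,                                  # introduces the sequence
        INITIALS[initial],
        chr(0x314F + medial),                    # vowels are contiguous from U+314F
        FINALS[final - 1] if final else FILLER,  # filler stands in for no final
    ]
    return b"".join(row4_bytes(p) for p in parts)

print(composition_sequence("\ub620").hex())  # U+B620, a syllable outside the 2350
```

A decoder that supports composition would need the reverse lookup, and would have to decide what to do with sequences that use obsolete jamo.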

Unicode includes the Wansung code Hangul Filler in the Hangul Compatibility Jamo block for round-trip compatibility, but uses its own system (with its own, differently used, filler characters) for composing Hangul. The KS X 1001 Hangul composition system is not used in Unicode, and the filler renders merely as an empty space; KS X 1001 composition sequences using modern jamo may be mapped to precomposed characters in Unicode.[9] This is not usually done with Unified Hangul Code.

For round-trip compatibility, Unicode also includes the N-byte Hangul code Hangul Filler separately in the Halfwidth and Fullwidth Forms block, named the "Halfwidth Hangul Filler".
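The two filler characters can be inspected with Python's standard `unicodedata` module. This is only a small sketch; the block ranges in the `BLOCKS` table are written out by hand rather than read from a library, and `describe` is an illustrative helper name.

```python
import unicodedata

# Block ranges stated by hand (inclusive start and end code points).
BLOCKS = {
    "Hangul Compatibility Jamo": (0x3130, 0x318F),
    "Halfwidth and Fullwidth Forms": (0xFF00, 0xFFEF),
}

def describe(ch: str) -> tuple[str, str]:
    """Return (character name, containing block) for one of the fillers."""
    cp = ord(ch)
    for block, (start, end) in BLOCKS.items():
        if start <= cp <= end:
            return unicodedata.name(ch), block
    raise ValueError(f"U+{cp:04X} is outside the listed blocks")

for filler in ("\u3164", "\uffa0"):
    print(f"U+{ord(filler):04X}", *describe(filler), sep=" | ")
```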

N-byte Hangul code

This is the N-byte Hangul code,[5] as specified by KS C 5601-1974 and by annex 4 of KS C 5601-1992. The second half of IBM's Code page 1040[10] is a superset of this, assigning the characters ¢¬\~ (although not £) to the same locations as in Code page 1041, while the unextended N-Byte Hangul (besides C0 control code replacement graphics in some usage contexts, shared with IBM-1040) is Code page 891.[11] Character 0x40/0xC0 is a Hangul Filler (see above), used in combining sequences.

Similarly to its Japanese counterpart JIS C 6220 (JIS X 0201), N-byte Hangul code could be used as a 7-bit encoding, with character allocations over the range 0x40 through 0x7C.[5] The chart below shows the code in an 8-bit environment with the high bit set (i.e. over 0xC0 through 0xFC), as it is used in e.g. code page 891 or 1040.

KS C 5601-1974 / N-byte Hangul[12]
0 1 2 3 4 5 6 7 8 9 A B C D E F
8x
9x
Ax
Bx
Cx HWHF
Dx
Ex
Fx
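The relationship between the 7-bit and 8-bit forms described above amounts to setting the high bit. A minimal sketch, with the hypothetical helper names `to_8bit` and `to_7bit`:

```python
LOW, HIGH = 0x40, 0x7C  # 7-bit range of N-byte Hangul allocations

def to_8bit(code7: int) -> int:
    """Move a 7-bit N-byte Hangul code into the high-bit-set range."""
    if not LOW <= code7 <= HIGH:
        raise ValueError(f"0x{code7:02X} is outside 0x40-0x7C")
    return code7 | 0x80

def to_7bit(code8: int) -> int:
    """Inverse of to_8bit: clear the high bit again."""
    if not LOW | 0x80 <= code8 <= HIGH | 0x80:
        raise ValueError(f"0x{code8:02X} is outside 0xC0-0xFC")
    return code8 & 0x7F

print(hex(to_8bit(0x40)))  # position of the Hangul Filler in the 8-bit chart
```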

Wansung code charts

Following are the code charts for KS X 1001 in Wansung layout. Where a pair of hexadecimal numbers is given, the smaller is used when encoded over GL (0x21-0x7E), as in ISO-2022-KR when the Korean set has been shifted to, and the larger is used in the more typical case of it being encoded over GR (0xA1-0xFE), as in EUC-KR or UHC. Johab changes the arrangement to encode all 11172 Hangul clusters separately and in order.
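The pairs of hexadecimal numbers follow directly from the row and cell numbers: each is offset by 0x20 for GL or by 0xA0 for GR. A minimal sketch, where `code_point_to_bytes` is an illustrative name rather than part of any library:

```python
def code_point_to_bytes(row: int, cell: int, *, gr: bool = True) -> bytes:
    """Encode a KS X 1001 code point (row-cell, each 1-94) over GR or GL."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in the range 1-94")
    offset = 0xA0 if gr else 0x20  # GR: 0xA1-0xFE, GL: 0x21-0x7E
    return bytes([row + offset, cell + offset])

# Code point 04-52 (the Hangul Filler) in both forms.
print(code_point_to_bytes(4, 52).hex(), code_point_to_bytes(4, 52, gr=False).hex())
```

The GL form is only meaningful once the Korean set has been shifted to, as in ISO-2022-KR; the sketch does not emit the shift or designation sequences.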

To illustrate vendor differences in implementation, multiple Unicode mappings are shown for some characters. Apple's HangulTalk extensions to the Wansung plane (i.e. where both bytes are in the 0xA1-0xFE range) are shown, but other HangulTalk extension ranges are not. The additional codes for composed syllables in Unified Hangul Code, and IBM's extensions in IBM-949, are also not shown, since both fall outside of the Wansung plane.

Lead bytes

KS X 1001 (Wansung code)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax SP[b] 1-_ 2-_ 3-_ 4-_ 5-_ 6-_ 7-_ 8-_ 9-_ 10-_ 11-_ 12-_ 13-_ 14-_ 15-_
3x/Bx 16-_ 17-_ 18-_ 19-_ 20-_ 21-_ 22-_ 23-_ 24-_ 25-_ 26-_ 27-_ 28-_ 29-_ 30-_ 31-_
4x/Cx 32-_ 33-_ 34-_ 35-_ 36-_ 37-_ 38-_ 39-_ 40-_ 41-_ 42-_ 43-_ 44-_ 45-_ 46-_ 47-_
5x/Dx 48-_ 49-_ 50-_ 51-_ 52-_ 53-_ 54-_ 55-_ 56-_ 57-_ 58-_ 59-_ 60-_ 61-_ 62-_ 63-_
6x/Ex 64-_ 65-_ 66-_ 67-_ 68-_ 69-_ 70-_ 71-_ 72-_ 73-_ 74-_ 75-_ 76-_ 77-_ 78-_ 79-_
7x/Fx 80-_ 81-_ 82-_ 83-_ 84-_ 85-_ 86-_ 87-_ 88-_ 89-_ 90-_ 91-_ 92-_ 93-_ 94-_ DEL[b]

Non-Hanja non-precomposed sets

Character set 0x21 / 0xA1 (row number 1, special characters)

This set contains punctuation and other symbols, excluding punctuation present in KS X 1003 (which is included in row 3). Encodings which combine KS X 1001 with single-byte ASCII may use alternative Unicode mapping to the Halfwidth and Fullwidth Forms block for the backslash. Unicode mapping of the wave dash (tilde dash) also differs between vendors, and may be U+301C (favoured by IBM and Apple)[13][14][15] or U+223C (favoured by Microsoft).[16][17] Compare the similar but not identical handling of the JIS wave dash, and the handling of the tilde in the next row.

Except for the backslash, if two mappings are shown below, the first is used by Apple and the second is used by Microsoft.[15][17]

KS X 1001 (prefixed with 0x21 / 0xA1)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax IDSP · ¨ SHY / / \/ /
3x/Bx ± ×
4x/Cx ÷ ° ¢/ £/ ¥/
5x/Dx §
6x/Ex
7x/Fx ¬/

Character set 0x22 / 0xA2 (row number 2, special characters)

This set contains additional punctuation and symbols. Similarly to the tilde character in the previous row, different mappings are used by Apple and Microsoft for the tilde character in this row (U+02DC by Apple, U+FF5E by Microsoft),[15][17] which is intended to be shown as a raised tilde, whereas the tilde in the previous row is intended to be shown in-line at dash height.[18] Mapping of the circled dot also differs.[15][17]

The euro and registered trademark sign were added to the standard in 1998, while the Korean postal mark (㉾) was added in 2002.[1] These three code points, as with the still-unused code points, have been put to use for other, non-standard, purposes by vendors, e.g. for boxed list markers by Apple.[19] Microsoft updated its Unified Hangul Code implementation to add the 1998 additions including the euro sign, but did not add the Korean postal mark when it was added to the standard.[20]

KS X 1001 (prefixed with 0x22 / 0xA2)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax ´ ˜/ ˇ ˘ ˝ ˚ ˙ ¸ ˛ ¡ ¿
3x/Bx ː ¤
4x/Cx /
5x/Dx
6x/Ex /1[c] ®/2[c] /3[c] 4[c] 5[c] 6[c] 7[c] 8[c] 9[c] [10][d]
7x/Fx [11][d] [12][d] [13][d] [14][d] [15][d] [16][d] [17][d] [18][d] [19][d] [20][d] [e] [f] [g]
  Additions by Apple
  Later standard additions colliding with Apple additions

Character set 0x23 / 0xA3 (row number 3, basic Latin / ISO 646-KR)

This set corresponds to KS X 1003 (the ISO 646 variant for Korean, a similar set to ASCII), but as two-byte codes preceded by 0x23 (or 0xA3 in GR-delegated (EUC) form). It includes the English alphabet / Basic Latin alphabet, western Arabic numerals and punctuation.

Compare the Roman set of JIS X 0201, which differs by including a Yen sign rather than a Won sign. Contrast the third rows of KPS 9566 and of JIS X 0208, which follow the ISO 646 layout but only include letters and digits.

KS X 1001 (prefixed with 0x23 / 0xA3); non-fullwidth mappings
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax ! " # $ % & ' ( ) * + , - . /
3x/Bx 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x/Cx @ A B C D E F G H I J K L M N O
5x/Dx P Q R S T U V W X Y Z [ ₩ ] ^ _
6x/Ex ` a b c d e f g h i j k l m n o
7x/Fx p q r s t u v w x y z { | }

Encodings such as EUC-KR and UHC combine KS X 1001 with single-byte ASCII or KS X 1003, and hence use alternative Unicode mappings to the Halfwidth and Fullwidth Forms block for the double-byte representations of these characters.

KS X 1001 (prefixed with 0x23 / 0xA3); fullwidth mappings
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax
3x/Bx
4x/Cx
5x/Dx _
6x/Ex
7x/Fx
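The fullwidth mapping can be observed with Python's `euc_kr` codec, assuming a CPython build that ships it. The sketch decodes the two-byte form of an ASCII character (lead byte 0xA3, trail byte equal to the ASCII byte with the high bit set) and then folds it back with NFKC normalization; `decode_row3` is a hypothetical helper name.

```python
import unicodedata

def decode_row3(ascii_char: str) -> str:
    """Decode the double-byte EUC-KR form of a printable ASCII character."""
    byte = ord(ascii_char)
    if not 0x21 <= byte <= 0x7E:
        raise ValueError("expected a printable ASCII character other than space")
    return bytes([0xA3, byte | 0x80]).decode("euc_kr")

wide = decode_row3("A")
print(f"U+{ord(wide):04X}", unicodedata.name(wide))
# NFKC folds the fullwidth form back to its ordinary counterpart.
print(unicodedata.normalize("NFKC", wide))
```

Normalizing with NFKC loses the distinction between the single-byte and double-byte forms, so it is unsuitable where round-tripping back to the original bytes matters.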

Character set 0x24 / 0xA4 (row number 4, Hangul jamo)

This set includes modern Hangul consonants, followed by vowels, both ordered by South Korean collation customs, followed by obsolete consonants. When used individually, these characters map to the Unicode Hangul Compatibility Jamo block, and do not have a one-to-one mapping with the position-specific characters in the Hangul Jamo block. Compare with row 4 of the North Korean KPS 9566. Character 04-52 is a Hangul Filler (see above), used in combining sequences.

KS X 1001 (prefixed with 0x24 / 0xA4)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax
3x/Bx
4x/Cx
5x/Dx HF
6x/Ex
7x/Fx

Character set 0x25 / 0xA5 (row number 5, Roman numerals and Greek)

This set contains Roman numerals and basic support for the Greek alphabet, without diacritics or the final sigma. Apple includes some additional punctuation in this row, as well as some black circled list markers continuing from those in row 6.[19]

Contrast row 6 of KPS 9566, which includes the same characters but in a different layout.

KS X 1001 (prefixed with 0x25 / 0xA5)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax
3x/Bx
4x/Cx Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο
5x/Dx Π Ρ Σ Τ Υ Φ Χ Ψ Ω !︀[h] 。︀[i] [j] [j]
6x/Ex α β γ δ ε ζ η θ ι κ λ μ ν ξ ο
7x/Fx π ρ σ τ υ φ χ ψ ω (27)[k] (28)[l] (29)[m] (30)[n]
  Additions by Apple

Character set 0x26 / 0xA6 (row number 6, box drawing)

This row contains characters for drawing boxes in a semigraphic context. Apple also includes some black circled list markers.[19]

KS X 1001 (prefixed with 0x26 / 0xA6)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax
3x/Bx
4x/Cx
5x/Dx
6x/Ex
7x/Fx (21)[o] (22)[p] (23)[q] (24)[r] (25)[s] (26)[t]
  Additions by Apple

Character set 0x27 / 0xA7 (row number 7, unit symbols)

This row contains unit symbols as single characters, including those which consist of multiple letters. Apple also includes some circled list markers continuing from those in row 8.[19]

Compare and contrast with the repertoire of unit symbols included in row 8 of KPS 9566.

KS X 1001 (prefixed with 0x27 / 0xA7)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax
3x/Bx
4x/Cx
5x/Dx
6x/Ex
7x/Fx
  Additions by Apple

Character set 0x28 / 0xA8 (row number 8, extended Latin, encircled, fractions)

KS X 1001 (prefixed with 0x28 / 0xA8)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax Æ Ð ª Ħ IJ Ŀ Ł Ø Œ º Þ Ŧ Ŋ
3x/Bx
4x/Cx
5x/Dx
6x/Ex
7x/Fx ½ ¼ ¾

Character set 0x29 / 0xA9 (row number 9, extended Latin, encircled, superscript and subscript)

KS X 1001 (prefixed with 0x29 / 0xA9)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax æ đ ð ħ ı ij ĸ ŀ ł ø œ ß þ ŧ ŋ
3x/Bx ʼn
4x/Cx
5x/Dx
6x/Ex
7x/Fx ¹ ² ³

Character set 0x2A / 0xAA (row number 10, Hiragana)

This set contains Hiragana for writing the Japanese language. Apple also includes some bracketed list markers continuing from those in row 9.[19]

Compare row 10 of KPS 9566, which uses the same layout. Compare and contrast row 4 of JIS X 0208, which also uses the same layout, but in a different row.

KS X 1001 (prefixed with 0x2A / 0xAA)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax
3x/Bx
4x/Cx
5x/Dx
6x/Ex
7x/Fx (21)[u] (22)[u] (23)[u] (24)[u] (25)[u] (26)[u]
  Additions by Apple

Character set 0x2B / 0xAB (row number 11, Katakana)

This set contains Katakana for writing the Japanese language. However, the Japanese long vowel mark, which is used in katakana text and included in row 1 of JIS X 0208, is not included.[23] Apple also includes some bracketed list markers continuing from those in rows 9 and 10.[19]

Compare row 11 of KPS 9566, which uses the same layout. Compare and contrast row 5 of JIS X 0208, which also uses the same layout, but in a different row.

KS X 1001 (prefixed with 0x2B / 0xAB)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax
3x/Bx
4x/Cx
5x/Dx
6x/Ex
7x/Fx (27)[u] (28)[u] (29)[u] (30)[u]
  Additions by Apple

Character set 0x2C / 0xAC (row number 12, Cyrillic)

This set contains the modern Russian alphabet, and is not necessarily sufficient to represent other forms of the Cyrillic script. Apple also includes some black boxed list markers.[19]

Compare row 5 of KPS 9566 and row 7 of JIS X 0208, which use the same layout (but in a different row).

KS X 1001 (prefixed with 0x2C / 0xAC)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax А Б В Г Д Е Ё Ж З И Й К Л М Н
3x/Bx О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э
4x/Cx Ю Я 1[v] 2[v] 3[v] 4[v] 5[v] 6[v] 7[v] 8[v] 9[v] [10][w] [11][w] [12][w] [13][w] [14][w]
5x/Dx [15][w] а б в г д е ё ж з и й к л м н
6x/Ex о п р с т у ф х ц ч ш щ ъ ы ь э
7x/Fx ю я [16][w] [17][w] [18][w] [19][w] [20][w]
  Additions by Apple

Extended character set 0x2D / 0xAD (row number 13, Apple additional punctuation)

Apple additions to KS X 1001 (prefixed with 0x2D / 0xAD)[19]
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x/Ax [x] [x] [x] [x] [y] [y] [y] [y] [z] [z]