Chinese character description languages

From Wikipedia, the free encyclopedia

Chinese character description languages are several proposed languages for describing Chinese (or CJK) characters as accurately and completely as possible. They capture information such as a character's components, its strokes (basic and complex), the stroke order, and the position of each element within an empty background square. They are designed to overcome the lack of structural information inherent in a bitmap description. This richer information can be used to distinguish variants of characters that are unified into one code point by Unicode and ISO/IEC 10646, and to provide an alternative representation for rare characters that do not yet have a standardized encoding in Unicode or ISO/IEC 10646. Many aim to work for both Kaishu and Song styles, and to expose the character's internal structure, which makes characters easier to look up by indexing their internal make-up and by cross-referencing similar characters.

CDL

Figure: The cascading-components approach used by CDL.

Character Description Language (CDL) is an XML-based font technology co-created by Tom Bishop and Richard Cook for Wenlin Institute, Inc. It was designed for describing any CJK character, but it is suitable for describing any glyph.

This XML-based declarative language defines the stroke order of each component (a subunit of the glyph similar to a radical, but not necessarily bearing the semantic significance of a true radical), as well as assembly of previously defined components to build up ever more complex characters. Many of these components are characters in their own right, in addition to serving as building-block components.

The background is a square of 128 pixels on each side. On this background:

  1. Each of about 50 strokes can be drawn in SVG.
  2. A basic component is composed by calling several strokes. Within the component, each stroke is positioned by its bottom-left and top-right corners. Transformations such as reduction and enlargement are possible. There are more than 1,000 basic components.
  3. A character is composed by calling several components. Within the character, each component is positioned by its bottom-left and top-right corners. So that a component fits its proper portion of the character's rectangular block, it may be transformed (e.g., reduced or enlarged horizontally or vertically) when it is embedded as a building block in a more complex character.

Accordingly, a set of fewer than 50 strokes[1] allows one to construct a set of about 1,000 components,[2] which may in turn be embedded within the descriptions of tens of thousands of characters.[2] A change in the shape of one of the basic strokes is implicitly applied within each character that embeds that stroke. Likewise, a change to a component is implicitly applied within every character whose assembly uses that component.[2]
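The nesting of strokes inside components inside characters can be illustrated with a minimal Python sketch. It is not CDL itself (CDL is XML-based) and does not use real CDL data: the names `Box`, `COMPONENTS`, `STROKES`, and `flatten`, and the two sample components, are hypothetical. It assumes only what is described above: a 128-unit square, and children positioned by two opposite corners and scaled to fit.

```python
from dataclasses import dataclass

GRID = 128  # side of the square background, as described above


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by two opposite corners (x1, y1) and (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def place(self, inner: "Box") -> "Box":
        """Map `inner`, expressed on the full GRID square, into this box."""
        sx = (self.x2 - self.x1) / GRID
        sy = (self.y2 - self.y1) / GRID
        return Box(self.x1 + inner.x1 * sx, self.y1 + inner.y1 * sy,
                   self.x1 + inner.x2 * sx, self.y1 + inner.y2 * sy)


# Hypothetical definitions: each component lists (child name, box on the grid).
STROKES = {"h"}  # a single illustrative stroke type
COMPONENTS = {
    "two_bars": [("h", Box(0, 0, 128, 64)), ("h", Box(0, 64, 128, 128))],
    "pair": [("two_bars", Box(0, 0, 64, 128)),
             ("two_bars", Box(64, 0, 128, 128))],
}


def flatten(name, box=Box(0, 0, GRID, GRID)):
    """Expand a component into a flat list of (stroke name, final box)."""
    if name in STROKES:
        return [(name, box)]
    placed = []
    for child, child_box in COMPONENTS[name]:
        placed.extend(flatten(child, box.place(child_box)))
    return placed


for stroke, where in flatten("pair"):
    print(stroke, where)
```

Because `pair` refers to `two_bars` by name, editing the `two_bars` entry changes every character that embeds it, which mirrors the implicit propagation described above.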

T. Bishop and R. Cook explain this as follows:

The stroke count of one character is generally related to the stroke counts of other characters. Most characters are built from components, and as long as the stroke counts of those components are defined, there is rarely any difficulty in adding them together to obtain the combined stroke count. Therefore, if a standard defines the strokes of a few thousand characters, it implicitly defines the strokes of many thousands of additional characters.[3]
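The additive property described in this passage can be sketched in a few lines of Python. The tables below are illustrative assumptions, not CDL data: `BASE_STROKES` holds stroke counts for a few simple characters, and `DECOMPOSITION` lists hypothetical component breakdowns for compound ones.

```python
# Stroke counts for a few simple characters (illustrative table).
BASE_STROKES = {"木": 4, "日": 4, "月": 4}

# Hypothetical component breakdowns for compound characters.
DECOMPOSITION = {
    "林": ["木", "木"],
    "森": ["木", "林"],
    "明": ["日", "月"],
}


def stroke_count(ch):
    """Return the stroke count of `ch` by summing over its components."""
    if ch in BASE_STROKES:
        return BASE_STROKES[ch]
    return sum(stroke_count(part) for part in DECOMPOSITION[ch])


print(stroke_count("森"))  # 4 + (4 + 4) = 12
```

Only three base counts are stored, yet the counts of all three compound characters follow from them.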

As of 2020, nearly 100,000 Chinese characters have been described via CDL.[4]

HanGlyph

HanGlyph is a character description language intended for supplying missing rare characters in documents (addressing the Chinese equivalent of the gaiji problem).[5] Documents can contain markup for missing characters, which automatically triggers the generation of small fonts that provide those characters. The language itself is a simple postfix notation describing strokes and ways to combine them. The prototype software uses MetaPost to render the characters and embed them in LaTeX documents. The language was presented by Wai Wong in 1997,[6] and papers about its implementation in MetaPost and LaTeX appeared at TeX user group conferences in 2003.[7][8]
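To show what a postfix notation for combining strokes means in practice, here is a minimal Python sketch of a stack-based reader. It does not reproduce actual HanGlyph syntax: the stroke tokens (`h`, `v`, `d`) and the operator names `beside` and `above` are hypothetical, chosen only to illustrate the general technique.

```python
# Hypothetical operator names and arities; NOT actual HanGlyph syntax.
OPERATORS = {"beside": 2, "above": 2}


def parse_postfix(tokens):
    """Build a nested tuple tree from a postfix token list using a stack."""
    stack = []
    for tok in tokens:
        if tok in OPERATORS:
            arity = OPERATORS[tok]
            if len(stack) < arity:
                raise ValueError(f"operator {tok!r} lacks operands")
            args = stack[-arity:]
            del stack[-arity:]
            stack.append((tok, *args))
        else:
            stack.append(tok)  # any other token is treated as a stroke name
    if len(stack) != 1:
        raise ValueError("expression does not reduce to a single glyph")
    return stack[0]


tree = parse_postfix("h v beside d above".split())
print(tree)  # ('above', ('beside', 'h', 'v'), 'd')
```

In postfix form each operator follows its operands, so no parentheses are needed to express nesting.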

Ideographic Description Sequences

Chapter 12 of the Unicode specification[9] defines a syntax for "Ideographic Description Sequences" (IDSes) intended for use in describing characters not included in the standard in terms of combinations of components that do have code points. Twelve special characters in the range U+2FF0 to U+2FFB act as prefix operators to combine other characters or sequences to form larger characters.

Ideographic Description Characters in Unicode

Character   Unicode Character Number   Full Unicode Name
⿰          U+2FF0                     Ideographic description character left to right
⿱          U+2FF1                     Ideographic description character above to below
⿲          U+2FF2                     Ideographic description character left to middle and right
⿳          U+2FF3                     Ideographic description character above to middle and below
⿴          U+2FF4                     Ideographic description character full surround
⿵          U+2FF5                     Ideographic description character surround from above
⿶          U+2FF6                     Ideographic description character surround from below
⿷          U+2FF7                     Ideographic description character surround from left
⿸          U+2FF8                     Ideographic description character surround from upper left
⿹          U+2FF9                     Ideographic description character surround from upper right
⿺          U+2FFA                     Ideographic description character surround from lower left
⿻          U+2FFB                     Ideographic description character overlaid
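Because the twelve characters act as prefix operators, an IDS can be read with a short recursive parser. The Python sketch below is illustrative and is not a full implementation of the Unicode syntax (it ignores length limits and other restrictions). The function name `parse_ids` is an assumption; the arities follow the character names, with U+2FF2 and U+2FF3 taking three operands and the rest taking two.

```python
# Arity of each ideographic description character in U+2FF0..U+2FFB.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = 3  # left to middle and right
ARITY["\u2FF3"] = 3  # above to middle and below


def parse_ids(seq):
    """Parse a prefix-notation IDS string into a nested tuple tree."""

    def parse(i):
        if i >= len(seq):
            raise ValueError("sequence ended before all operands were read")
        ch = seq[i]
        if ch not in ARITY:
            return ch, i + 1  # a plain component character
        i += 1
        children = []
        for _ in range(ARITY[ch]):
            child, i = parse(i)
            children.append(child)
        return (ch, *children), i

    tree, end = parse(0)
    if end != len(seq):
        raise ValueError("unexpected characters after a complete sequence")
    return tree


# 森 described as 木 above a left-to-right pair of 木.
print(parse_ids("⿱木⿰木木"))  # ('⿱', '木', ('⿰', '木', '木'))
```

An operand may itself be a sequence, which is how larger characters are described by nesting.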

These sequences are useful for describing to the reader a character that is not directly printable, either because it is absent from a given font or because it is absent from the Unicode standard altogether. For example, the Sawndip character "