Universal Character Set

The Universal Character Set (UCS) is a coded character set defined by the international standard ISO/IEC 10646. It maps well over a hundred thousand abstract characters, each identified by an unambiguous name, to integers called code points.
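As a small illustration of the name-to-integer mapping, the sketch below uses Python's standard `unicodedata` module, which exposes the character names and code points shared by Unicode and ISO/IEC 10646. The helper name `describe` is made up for this example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and character name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A
print(describe("\u20ac"))  # U+20AC EURO SIGN

# The mapping also works in the other direction, from name to character.
alpha = unicodedata.lookup("GREEK SMALL LETTER ALPHA")
print(f"U+{ord(alpha):04X}")  # U+03B1
```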

Since 1991, the Unicode Consortium has been working with ISO to develop the Unicode Standard and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Version 2.0 of the Unicode Standard are identical to those of ISO/IEC 10646-1:1993 with its first seven published amendments. After Unicode 3.0 was published in February 2000, the new and updated characters were brought into the UCS via ISO/IEC 10646-1:2000.

The UCS has over 1.1 million code points, but only the first 65,536 (the Basic Multilingual Plane, or BMP) were in common use before 2000. This began to change in 2000, when the People's Republic of China mandated that computer systems sold there support GB18030, which requires support for code points beyond the BMP.
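The arithmetic behind these figures can be sketched in a few lines. This assumes the code space ends at hexadecimal 10FFFF and is divided into planes of 65,536 code points each; the function names `plane_of` and `in_bmp` are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point of the UCS code space


def plane_of(code_point: int) -> int:
    """Return the plane number (0 is the BMP) that contains a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the UCS code space")
    return code_point >> 16  # each plane holds 0x10000 code points


def in_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


print(MAX_CODE_POINT + 1)  # 1114112 code points in total
print(in_bmp(0x4E2D))      # True: a common CJK ideograph in the BMP
print(plane_of(0x20000))   # 2: a CJK ideograph beyond the BMP
```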

Many code points, even in the BMP, are deliberately not assigned to characters, to allow for future expansion or to minimize conflicts with other encoding forms.

Encoding Forms of the Universal Character Set

ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest is UCS-2, which uses a single code value between 0 and 65535 for each character, so that each value can be represented as exactly two bytes (one 16-bit word). UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. UCS-2 by itself cannot represent code points outside the BMP. Those code points are represented by pairs of special code values from what is called the S (Special) Zone of the BMP, each pair consisting of an RC-element from the high-half zone and an RC-element from the low-half zone.

In Unicode terminology these code values are called high surrogates and low surrogates respectively, and the encoding form that extends UCS-2 with surrogate pairs is called UTF-16.
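The pairing of a high-half and a low-half code value can be sketched as follows. The function names are invented for this example, and the constants are the usual surrogate ranges (D800 to DBFF and DC00 to DFFF); Python's built-in `utf-16-be` codec serves only as a cross-check.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point beyond the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need a pair")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)  # MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")          # D834 DD1E

# Cross-check against Python's own UTF-16 encoder.
expected = chr(0x1D11E).encode("utf-16-be")
print(high.to_bytes(2, "big") + low.to_bytes(2, "big") == expected)  # True
```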

Another encoding is UCS-4, which uses a single code value between 0 and, theoretically, hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF, and ISO/IEC 10646 has stated that all future character assignments will stay in that range), so that each value can be represented as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but it requires twice as much storage as UCS-2.
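A minimal sketch of the fixed-width property and its storage cost, using Python's `utf-32-be` codec as a stand-in for big-endian UCS-4 within the 10FFFF range. The helper `code_point_at` is hypothetical.

```python
def code_point_at(buffer: bytes, index: int) -> int:
    """Read the index-th character of a big-endian UCS-4 buffer."""
    # Fixed width: character i always occupies bytes 4*i .. 4*i + 3.
    return int.from_bytes(buffer[4 * index:4 * index + 4], "big")


bmp_text = "A\u20ac"                   # two BMP characters
full_text = bmp_text + "\U0001D11E"    # plus one character beyond the BMP

print(len(bmp_text.encode("utf-16-be")))   # 4 bytes at two bytes each
print(len(bmp_text.encode("utf-32-be")))   # 8 bytes at four bytes each

ucs4 = full_text.encode("utf-32-be")
print(len(ucs4))                           # 12 bytes for three characters
print(f"{code_point_at(ucs4, 2):X}")       # 1D11E, found by direct indexing
```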

Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". There is no UCS-16; the authors who make this error usually intended to refer to UCS-2 or UTF-16.

History of ISO 10646

ISO set out to compose the universal character set in 1989, and a draft of ISO 10646 was published in 1990. That draft was very different from the current standard. It defined 128 groups of 256 planes of 256 rows of 256 cells, for an apparent total of 2,147,483,648 characters, but only 679,477,248 characters could actually be coded, because the policy was that the byte values of control characters (0x00 to 0x1F and 0x80 to 0x9F, in hexadecimal notation) must not appear anywhere in a code. The Latin capital letter A, for example, was located in group 0x20, plane 0x20, row 0x20, cell 0x41.
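Both totals can be reproduced with a short calculation. The sketch assumes that the group byte ranges over 0x00 to 0x7F (128 values) and that the control-character restriction applies to all four bytes. That reading is an assumption, but it yields the figures quoted above.

```python
# Byte values reserved for control characters in the 1990 draft.
CONTROL_BYTES = set(range(0x00, 0x20)) | set(range(0x80, 0xA0))

# Values usable for the plane, row and cell bytes.
allowed = [b for b in range(0x100) if b not in CONTROL_BYTES]

# Values usable for the group byte (assumed to range over 0x00-0x7F).
allowed_groups = [b for b in allowed if b < 0x80]

apparent_total = 128 * 256 ** 3
usable_total = len(allowed_groups) * len(allowed) ** 3

print(len(allowed), len(allowed_groups))  # 192 96
print(apparent_total)                     # 2147483648
print(usable_total)                       # 679477248

# Group 0x20, plane 0x20, row 0x20, cell 0x41 contains no control byte.
print(all(b in allowed for b in (0x20, 0x20, 0x20, 0x41)))  # True
```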

The characters of this early draft could be coded in three ways. UCS-4 used four bytes for every character, enabling the simple encoding of all characters. UCS-2 used two bytes for every character; it encoded the first plane (0x20, the Basic Multilingual Plane, containing the first 36,864 code points) directly, and reached other planes and groups by switching to them with ISO 2022 escape sequences. UTF-1 encoded all characters as byte sequences of varying length (1 to 5 bytes, none of them a control character value).

In 1990, therefore, there were two initiatives for a universal character set: Unicode, with 16 bits for every character (65,536 possible characters), and ISO 10646. The software companies refused to accept the complexity and size requirements of the ISO standard and voted against it. The ISO standardisers realised they could not continue to support the standard in its then-current state and negotiated the unification of their standard with Unicode. Two changes were made: the limitation upon characters (the prohibition of control character values) was lifted, so that a character like 0x0000101F would be permissible; and the repertoire of the Basic Multilingual Plane was synchronised with that of Unicode.

Meanwhile, the situation changed in the Unicode standard itself: 65,536 characters were no longer considered sufficient, and from version 2.0 onwards the standard supports the encoding of 1,112,064 characters by means of the UTF-16 surrogate mechanism. For that reason, ISO 10646 was limited to as many characters as could be encoded by UTF-16 and no more, that is, a little over a million characters instead of over 2,000 million. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard, limited to the UTF-16 range, under the name UTF-32. As for UTF-1, no one used it, because of its bad design (no way of distinguishing between single bytes, lead bytes and trail bytes, a problem similar to that of the Shift-JIS encoding of Japanese) and its poor performance (many division operations). Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed variable-width encoding, which came to be called UTF-8.
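The figure of 1,112,064 and the property that UTF-1 lacked can both be illustrated briefly. The sketch assumes 17 planes of 65,536 code points, minus the 2,048 surrogate code values, and uses the standard UTF-8 byte ranges. The function name `utf8_byte_kind` is hypothetical, and byte values that never occur in valid UTF-8 are not treated specially.

```python
PLANES = 17                              # planes 0 to 16 (up to 10FFFF)
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1    # 2048 values reserved for pairs
encodable = PLANES * 0x10000 - SURROGATE_COUNT
print(encodable)  # 1112064


def utf8_byte_kind(byte: int) -> str:
    """Classify one UTF-8 byte by its value alone, without any context."""
    if byte < 0x80:
        return "single"  # a complete one-byte character
    if byte < 0xC0:
        return "trail"   # continuation byte, 0x80-0xBF
    return "lead"        # first byte of a multi-byte sequence


encoded = "A\u20ac".encode("utf-8")      # bytes 41 E2 82 AC
print([utf8_byte_kind(b) for b in encoded])
# ['single', 'lead', 'trail', 'trail']
```

Because each byte announces its own role, a decoder that starts in the middle of a stream can skip trail bytes until it reaches the next character boundary.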

Differences between ISO 10646 and Unicode

ISO 10646 and Unicode have an identical repertoire and numbering: the same characters with the same numbers exist in both standards. The difference between them is that Unicode adds rules and specifications that are lacking in ISO 10646. ISO 10646 is a simple character map, an extension of previous standards like ISO 8859. In contrast, Unicode adds rules for collation, normalisation forms, and the bidirectional algorithm for scripts like Hebrew and Arabic. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO 10646; Unicode must be implemented.
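Two of these additions can be seen through Python's standard `unicodedata` module, which exposes Unicode character properties and normalisation. This is only a sketch of the kind of data Unicode layers on top of the character map, not an implementation of the bidirectional algorithm.

```python
import unicodedata

# Normalisation: two code point sequences that denote the same text.
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True: LATIN SMALL LETTER E WITH ACUTE

# Bidirectional classes: input to the bidirectional algorithm.
print(unicodedata.bidirectional("A"))       # L  (left-to-right)
print(unicodedata.bidirectional("\u05d0"))  # R  (Hebrew letter alef)
print(unicodedata.bidirectional("\u0627"))  # AL (Arabic letter alef)
```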

Some applications support ISO 10646 but not Unicode. One example is Linux xterm, which can properly display all ISO 10646 characters that have a one-to-one character-to-glyph mapping and a single directionality, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features).

Citing the Universal Character Set

ISO 10646 is a general, informal citation for the ISO/IEC 10646 family of standards, and is acceptable in most prose. Although Unicode is a separate standard, the term is used just as often, informally, when discussing the UCS. However, any normative reference to the UCS as a publication should cite a particular part and version, using the form ISO/IEC 10646-{part}:{year}; for example: ISO/IEC 10646-1:1993.

Correlation to Unicode

  • ISO/IEC 10646-1:1993 ≈ Unicode 1.1
  • ISO/IEC 10646-1:2000 ≈ Unicode 3.0
  • ISO/IEC 10646-2:2001 ≈ Unicode 3.2
  • ISO/IEC 10646:2003 ≈ Unicode 4.0

Related ISO

Related ISO standards from the List of ISO standards are: ISO 646, ISO 2022, ISO 6429, ISO 8859, ISO 14651

See also Unicode, UTF-16, UTF-8
