X-Fonter 14.0

Character Encoding

In the early sixties it became obvious that computers needed a shared character encoding in order to communicate with each other. One choice made back then was to represent data in bytes of 8 bits, which gives a byte 256 possible values (from 0 to 255).
The most popular character encoding was Bob Bemer's ASCII (American Standard Code for Information Interchange), also known as ANSI X3.4. This encoding uses only 7 bits, however, and thus consists of only 128 characters.
These characters include letters, digits, punctuation marks, and nonprintable control characters such as backspace, tab, and carriage return.

    0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0  NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
1  DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
2   SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5   P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7   p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~ DEL
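The table can be read as "row digit, then column digit" of a character's hexadecimal code. The following Python sketch illustrates this; the helper name `table_position` is made up for this example and is not part of X-Fonter.

```python
def table_position(ch):
    """Return the (row, column) hex digits of a 7-bit ASCII character
    in the 16-column table above."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    # High 3 bits select the row, low 4 bits select the column.
    return format(code >> 4, "X"), format(code & 0x0F, "X")


print(table_position("A"))  # ('4', '1'), so 'A' has code 0x41
print(table_position("~"))  # ('7', 'E'), so '~' has code 0x7E
```

Because every code fits in 7 bits, the row digit never exceeds 7.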

A problem with this basic ASCII encoding is that it is based on the English language. Other languages need additional characters, such as accented letters, or use completely different scripts, such as Greek, Cyrillic, Hebrew and Arabic.

This problem was solved by using the 8th bit for the Extended ASCII Character Set. The additional 128 character slots, ranging from 128 through 255, were used to represent additional special, mathematical, graphic, and foreign characters.
There are several possible codepages, each with its own extended character set (the first 128 characters are always the same). Some codepages contain accented Roman characters, and others contain non-Roman character sets like Greek, Cyrillic, Hebrew or Arabic.

These are the most commonly used codepages:

1250 (Central Europe)
1251 (Cyrillic (Russian))
1252 (ANSI (Western))
1253 (Greek)
1254 (Turkish)
1255 (Hebrew)
1256 (Arabic)
1257 (Baltic)
1361 (Korean)
874 (Thai)
932 (Japanese Shift-JIS)
936 (Simplified Chinese GBK)
949 (Korean Wansung)
950 (Traditional Chinese Big5)
OEM (DOS Compatible)
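To see what a codepage does in practice, the sketch below decodes the same byte value under four of the codepages listed above, using the codec names from Python's standard library (`cp1250`, `cp1251`, `cp1252`, `cp1253`). It is only an illustration and says nothing about how X-Fonter itself handles codepages.

```python
raw = bytes([0xE9])  # one byte from the extended range 128..255
codepages = ["cp1250", "cp1251", "cp1252", "cp1253"]

# The same byte maps to a different character in each codepage.
decoded = {cp: raw.decode(cp) for cp in codepages}
for cp, ch in decoded.items():
    print(f"{cp}: {ch} (U+{ord(ch):04X})")

# Bytes in the ASCII range 0..127 decode identically everywhere.
ascii_results = {b"Font".decode(cp) for cp in codepages}
print(ascii_results)  # {'Font'}
```

Text stored in one of these codepages therefore cannot be read back correctly without knowing which codepage was used to write it.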

ANSI: The ISO Latin character set (ISO 8859-1, also called Latin-1) includes the ASCII characters, with codes 0 to 127, plus 128 more characters with codes 128 to 255. The ISO Latin character set has become the first 256 characters of the Unicode character set. Yet another character set is the ANSI set (Windows codepage 1252), which is nearly identical to ISO Latin. The two differ only in codes 128 to 159, which ISO Latin reserves for control characters and which ANSI uses for printable characters such as the euro sign and typographic quotation marks. Those ANSI characters have Unicode values higher than 255, so ANSI and Unicode agree only on codes 0 to 127 and 160 to 255.
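The relationship between ISO Latin, ANSI and Unicode can be checked byte by byte. The sketch below assumes that Python's `latin-1` codec stands for ISO Latin and `cp1252` for the Western ANSI codepage; the variable names are illustrative.

```python
# Decoding ISO Latin maps every byte value straight to the Unicode
# character with the same number.
assert all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))

# Collect the byte values where ANSI (cp1252) disagrees with ISO Latin.
differing = []
for b in range(256):
    raw = bytes([b])
    try:
        ansi = raw.decode("cp1252")
    except UnicodeDecodeError:
        ansi = None  # a few slots are unassigned in Python's cp1252 codec
    if ansi != raw.decode("latin-1"):
        differing.append(b)

print(f"{differing[0]:#04x} .. {differing[-1]:#04x}")  # 0x80 .. 0x9f

# Example: byte 0x80 is the euro sign in ANSI, far above 255 in Unicode.
euro = bytes([0x80]).decode("cp1252")
print(euro, f"U+{ord(euro):04X}")  # € U+20AC
```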

Unicode

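Unicode is an international standard that assigns a unique number to every character. It was originally designed around a two-byte (16-bit) encoding, which makes it possible to store 65,536 different characters (instead of 256 for a one-byte encoding like ANSI). The standard has since grown beyond 16 bits: its code range now extends to U+10FFFF, and the UTF-16 encoding represents characters above 65,535 as a pair of 16-bit units. This makes Unicode well suited as an encoding scheme for international use. The first 128 Unicode characters are the same as the ASCII characters; in a 16-bit encoding each one is stored with an extra zero byte, which comes first in big-endian byte order.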

East Asian languages make the limits of one-byte encodings especially clear. Their scripts are not alphabetic: written Chinese, for example, uses thousands of distinct characters, far more than the 256 values that 8 bits can represent.

This is why Unicode has become so prevalent. Its original 16-bit code allows for 65,536 different values, which is enough to encode the commonly used characters of the popular Asian languages (Chinese, Korean, Japanese, etc.). ASCII codes are also preserved: to convert ASCII to Unicode, take each one-byte ASCII code and zero-extend it to 16 bits. The result is the 16-bit Unicode version of the ASCII character.
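The zero-extension described above can be tried out directly. This Python sketch assumes UTF-16 as the 16-bit Unicode encoding; the sample string and variable names are chosen only for illustration.

```python
text = "Font"
ascii_bytes = text.encode("ascii")

# Zero-extend each one-byte ASCII code to 16 bits (big-endian order:
# the zero byte comes first).
zero_extended = b"".join(bytes([0, b]) for b in ascii_bytes)

# The result is exactly the big-endian UTF-16 form of the same text.
utf16_be = text.encode("utf-16-be")
print(zero_extended == utf16_be)  # True

# In little-endian order the zero byte follows the ASCII byte instead.
utf16_le = text.encode("utf-16-le")
print(utf16_le[:2])  # b'F\x00'

# Characters beyond the first 65,536 need two 16-bit units in UTF-16.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
clef_size = len(clef.encode("utf-16-be"))
print(clef_size)  # 4
```

The last lines show that 16 bits are no longer enough for every Unicode character, even though they cover all of ASCII and the most common characters of most languages.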

EBCDIC

EBCDIC stands for "Extended Binary Coded Decimal Interchange Code" and was developed by IBM for its (legacy) mainframes. EBCDIC uses the full 8 bits available to it, so parity checking cannot be used on an 8-bit system. Also, EBCDIC has a wider range of control characters than ASCII.

EBCDIC is not supported by X-Fonter (or by Windows in general).
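Although X-Fonter does not handle EBCDIC, the difference from ASCII is easy to see elsewhere. The sketch below uses Python's `cp037` codec, one of several EBCDIC variants, purely as an illustration; the sample text is arbitrary.

```python
text = "Font 14"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # cp037 is one EBCDIC variant

print(ascii_bytes.hex(" "))
print(ebcdic_bytes.hex(" "))

# ASCII stays within 7 bits; EBCDIC uses the 8th bit even for
# ordinary letters and digits.
print(max(ascii_bytes) < 0x80)               # True
print(all(b >= 0x80 for b in b"A1".decode("ascii").encode("cp037")))  # True

# The letter 'A' is 0x41 in ASCII but 0xC1 in EBCDIC.
print(hex("A".encode("cp037")[0]))  # 0xc1
```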

More Information

More information regarding character encoding can be found on the following pages:

https://gnosis.cx/publish/programming/unicode_primer.html
https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=WindowsCodepages