The Czech and Slovak Character Encoding Mess Explained

Last modified by Lukas Petrlik 2012/02/22 20:45

This page is under construction.

This text was created in 1996 to explain to the author of GNU Recode why his utility should support the Kamenický encoding even though it had not been standardized. The Kamenický encoding was eventually supported in GNU Recode, though in the meantime several locally created character recoding utilities had become more widely used.

Currently, most operating systems and programs support UNICODE, which makes the contents of this document obsolete. I retain it here for historical reference.

The Czech and Slovak Alphabets

The Czech and Slovak languages have always been written using the Latin alphabet. (Czech and Slovak were never written in Cyrillic. See Wikipedia for the 6 Slavonic languages which use Cyrillic.)

Because the plain Latin alphabet does not have enough letters to cover all phonemes used in Czech and Slovak, some of the phonemes are denoted by letters with diacritic marks. For example, "long a" is denoted by á, and the phoneme sh is denoted by š.

The following table lists English, Czech and Slovak names of the diacritic marks.

English name	Czech name (ČSN 36 9103)	Slovak name
acute accent	čárka nad písmenem, silný přízvuk	dĺžeň, čiarka, akút
breve	breve
caron	háček	mäkčeň
cedilla	háček pod písmenem, cedilie
circumflex accent	vokáň	vokáň, cirkumflex
diaeresis	dvě tečky nad písmenem, přehláska	trema, dve bodky
dot above	tečka nad písmenem	bodka
double acute accent	dvojčárka
ogonek	ocásek
ring above	kroužek nad písmenem	krúžok
stroke	přeškrtnutí

[Missing: some Slovak accent names]

The Czech alphabet contains the following letters:

a á b c č d ď e é ě f g h ch i í j k l m n ň o ó p q r ř s š t ť u ú ů v w x y ý z ž

The digraph "ch" is treated as a single character.

The Czech characters "ř", "ě" and "ů" are not used in the Slovak language.

The Slovak alphabet contains the following letters:

a á ä b c č d ď dz dž e é f g h ch i í j k l ĺ ľ m n ň o ó ô p q r ŕ s š t ť u ú v w x y ý z ž

The digraphs "ch", "dz" and "dž" are treated as single characters in the Slovak language. The Slovak characters "ä", "ô", "ŕ", "ĺ" and "ľ" are not used in the Czech language.
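
The digraph rule matters whenever a word has to be split into alphabet letters. The following Python fragment is only a minimal sketch of that rule, assuming a purely mechanical match on adjacent characters; the names `split_letters`, `CZECH_DIGRAPHS` and `SLOVAK_DIGRAPHS` are hypothetical, and words in which the two characters merely happen to meet are not treated specially.

```python
# Digraphs that count as single letters, as listed in the text above.
CZECH_DIGRAPHS = ("ch",)
SLOVAK_DIGRAPHS = ("ch", "dz", "dž")


def split_letters(word, digraphs):
    """Split a word into alphabet letters, keeping digraphs together."""
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        # Compare case-insensitively so that "Ch" and "CH" are also matched.
        if len(pair) == 2 and pair.lower() in digraphs:
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


print(split_letters("chléb", CZECH_DIGRAPHS))
print(split_letters("džem", SLOVAK_DIGRAPHS))
```

With the Czech list, "chléb" has four letters rather than five; with the Slovak list, "džem" has three.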

Which Encodings do Czechs and Slovaks Actually Use?

Most computers now use Windows, Linux or MacOS X, so we use the encodings supported there:

UNICODE and cp1250 on Windows,
UNICODE and Macintosh CE encoding on Mac,
UNICODE and ISO Latin2 on Linux.

Some legacy programs still use encodings from the PC era such as PC Latin 2 or Kamenický, but these are not widely used.

The remaining ones - KOI-8 CS2 and Cork - are included here only for completeness.

All encodings covered in this document (except for Cork) are ASCII extensions. There are other encodings which can be used with Czech and Slovak, esp. several EBCDIC-based encodings for mainframes. These are not described in this document.

Kamenický

Kamenický encoding (aka KEYBCS2) was used on IBM compatible PC's running MS DOS.

The encoding was defined by the behavior of the Public Domain "KEYBCS2" utility, written in 1986 by the Kamenický brothers. When the utility was activated, it uploaded a new font into the VGA character generator. The utility also allowed the user to change the keyboard layout by pressing a combination of keys.

For a long time the Kamenický encoding was the most popular encoding on PC's, because it preserved all the important graphical symbols used in MS DOS programs (frames and boxes). The encoding was supported by the popular T602 text editor, it could be printed directly on many matrix printers, and it was used by people running the FidoNET network.
When IBM and Microsoft came out with PC Latin 2 (cp852), the situation slowly shifted towards acceptance of the new encoding.

Some of the local software vendors used the designation cp895 for the Kamenický encoding (the first localized FoxPro used it), but this code page was not defined at the time by either IBM or Microsoft. Some MS DOS software came in both cp852 (PC Latin 2) and cp895 (Kamenický) versions.

(Code page 895 is now used by IBM for a Japanese Latin encoding.)

PC Latin 2

PC Latin 2 (alias PC L2) was the first encoding which covered both Czech and Slovak and was officially supported by IBM and Microsoft on MS DOS. Most late MS DOS and OS/2 programs used it by default or had an option for using it. The Czechoslovak standard ČSN 36 9103 recommended its use on PC's.
The PC Latin 2 encoding has all of the ISO 8859-2 printable characters, but the accented letters have different positions.
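
Both statements can be checked with a few lines of Python. This is only a sketch, assuming that Python's built-in `cp852` and `iso8859_2` codecs follow the respective definitions; the helper `encodable` and the sample word are chosen here for illustration.

```python
def encodable(ch, codec):
    """Return True if the character can be represented in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Upper half of ISO 8859-2 (positions 0xA0-0xFF): the non-ASCII printable characters.
iso_chars = bytes(range(0xA0, 0x100)).decode("iso8859_2")

# ISO Latin 2 characters that PC Latin 2 cannot represent (expected: none).
missing = [ch for ch in iso_chars if not encodable(ch, "cp852")]

word = "řeřicha"
iso_bytes = word.encode("iso8859_2")
pc_bytes = word.encode("cp852")

# Reading PC Latin 2 bytes as if they were ISO Latin 2 garbles the accented letters.
garbled = pc_bytes.decode("iso8859_2", errors="replace")

print("missing from cp852:", missing)
print("ISO Latin 2 bytes: ", iso_bytes.hex())
print("PC Latin 2 bytes:  ", pc_bytes.hex())
print("misread as ISO:    ", garbled)
```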

The encoding is defined by IBM as code page 852. MS DOS manuals describe cp852 as the "Slavic (Latin II)" code page. Note that some of the languages covered by cp852 are not Slavonic languages, e.g. Hungarian.

Most Czech and Slovak users knew it only as Latin 2 (which is the name used by IBM), and they did not even know that PC Latin 2 is very different from ISO Latin 2.

ISO Latin 2

ISO Latin 2 is the ISO 8859-2 (1987) standard. It is recommended by ISO for use with modern Albanian, Croatian, Czech, English, German, Hungarian, Polish, Romanian, Slovak and Slovene. It is used mostly on Unices and other Nice Systems. IBM code page 912 is the same as ISO 8859-2.
A character encoding almost conforming to ISO 8859-2 is defined in ČSN 36 9103 under the name KOI-8 L2 (see "The ČSN 36 9103 Standard" below). The KOI-8 L2 encoding is registered by ISO under the registration number 139.
See also the ISO Latin 2 table.

KOI-8 CS2

This encoding is defined in ČSN 36 9103. It treats "ch" and "CH" as single letters (as used in the Czech alphabet), and the most frequently used accented characters can be obtained simply by setting the sign bit (the most significant bit of the byte). This encoding was used in old terminals, but it did not last long. Some well-known MS DOS software (for example, the T602 text editor) had options for using it.

CP1250

MS Windows uses cp1250, which contains all the printable characters of ISO Latin 2.
Some people believe that cp1250 is a superset of ISO 8859-2, but it is not. Most accented letters of cp1250 are at the same character positions as in ISO Latin 2, but not all of them. A total of 14 letters are in different positions; 8 of these are used in Czech/Slovak.
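
The two counts can be reproduced mechanically. The sketch below assumes that Python's built-in `iso8859_2` and `cp1250` codecs match the respective definitions, and it counts cased letters only (the spacing caron sign also changes position, but it is not a letter). The set `CZ_SK_ACCENTED` is built from the alphabets listed earlier in this document.

```python
import unicodedata

# Accented letters of Czech and Slovak (lower case), from the alphabets above.
CZ_SK_ACCENTED = set("áäčďéěíĺľňóôŕřšťúůýž")

moved_letters = []
for byte in range(0xA0, 0x100):
    ch = bytes([byte]).decode("iso8859_2")
    # Consider upper-case and lower-case letters only; skip symbols and accents.
    if unicodedata.category(ch) not in ("Lu", "Ll"):
        continue
    # The letter has moved if cp1250 stores it at a different byte value.
    if ch.encode("cp1250") != bytes([byte]):
        moved_letters.append(ch)

moved_cz_sk = [ch for ch in moved_letters if ch.lower() in CZ_SK_ACCENTED]

print(len(moved_letters), "".join(moved_letters))
print(len(moved_cz_sk), "".join(moved_cz_sk))
```

The letters that matter for Czech and Slovak are Š, Ť, Ž and Ľ in both cases, which is why text converted with the wrong assumption looks almost right.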

The cp1250 encoding also uses the positions 128-159 for printable characters (these positions form the C1 area used for control purposes in ISO Latin 2 and other ISO 2022 conforming codes).

Cp1250 was introduced in Windows 3.1 and it was the default in CS, EE, Hungarian and Polish editions of Windows.

MacOS CentralEurope

The MacOS CE (aka Macintosh CE) character set is intended for use with Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Slovak and Slovenian. The encoding was used in Czech, Polish and Hungarian MacOS localizations.
See also the MacOS CE table.

Cork

The Cork (aka T1) encoding is used by most European TUGs (national TeX Users Groups) for TeX internal T1 font encoding. The encoding was defined in 1990 at the TUG meeting held in Cork. The TeX DC font family is T1-encoded.
This encoding is not an ASCII extension, because it contains printable characters in the lowest 32 positions (0 - 31) used for control purposes in ASCII.
See also the Cork encoding table.

The ČSN 36 9103 Standard

The standard uses obscure language and requires careful reading. If you cannot understand the following text, it is because I followed its "good" example. :-)
The Czechoslovak standard ČSN 36 9103 defines the following character encodings: KOI-8 K1, KOI-8 L2, KOI-8 CS2, DKOI K1, DKOI K2, DKOI L2 and DKOI CS2. The KOI-8 codes conform to ISO recommendations (they are ASCII-based) and the DKOI codes do not (they are EBCDIC-based encodings).
The standard is a Czechoslovak extension of the CMEA (Council for Mutual Economic Assistance) standard ST SEV 358-88. The new encodings (which aren't defined in the original CMEA standard) are KOI-8 L2, KOI-8 CS2, DKOI L2 and DKOI CS2. The remaining encodings are for the Cyrillic alphabet that was used for communication within CMEA - these were never in regular use in our country.

The definition of KOI-8 L2 is stated to conform to ISO 8859-2 (1987), except for the characters $, _ and the currency symbol (164), which have different graphic representations. RFC 1345 lists KOI-8 L2 as "charset CSN_369103", because it is the only character encoding from the standard that is registered by ISO (ISO IR 139).
Appendix 5, "8-bit Codes for Personal Computers", contains an informative description of the character encoding PC Latin 2 defined by IBM. This encoding is known as IBM Code Page 852, but the cp number is not mentioned in the standard.
The ČSN 36 9103 standard was valid until 1996, when it was replaced in the Czech Republic with ČSN ISO/IEC 4873. The new standard was created by translating the international standard ISO/IEC 4873: 1991 into Czech.
See also the KOI-8 L2 table.

Charset Tables

The character set tables are presented in the format described in RFC 1345, Section 2. Most Latin characters are denoted using the following two-character mnemonics:

English name	TeX	RFC1345 mnemonic
acute accent	\'{x}, \'{i}	x' ''
breve	\u{x}	x( '(
caron	\v{x}	x< '<
cedilla	\c{x}	x, ',
circumflex accent	\^{x}	x> '>
diaeresis	\"{x}	x: ':
dot above	\.{x}	x. '.
double acute accent	\H{x}	x" '"
grave accent	\`{x}	x! '!
macron	\={x}	x- 'm
ogonek	\k{x} (LaTeX)	x; ';
ring above	\accent23x, \aa	x0 aa '0
stroke	\l, \L, \o, \O	x/
tilde	\~{x}	x? '?
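
The scheme in the table (base letter followed by one character for the accent) can be sketched in Python using Unicode decomposition. This is an illustration only, not the method used to produce the charset tables: the function name `mnemonic` is hypothetical, stroked letters are passed through unchanged because Unicode does not decompose them, and the special "aa" form is not handled.

```python
import unicodedata

# Combining mark -> second character of the mnemonic, following the table above.
ACCENT_SUFFIX = {
    "\u0301": "'",   # acute accent
    "\u0306": "(",   # breve
    "\u030C": "<",   # caron
    "\u0327": ",",   # cedilla
    "\u0302": ">",   # circumflex accent
    "\u0308": ":",   # diaeresis
    "\u0307": ".",   # dot above
    "\u030B": '"',   # double acute accent
    "\u0300": "!",   # grave accent
    "\u0304": "-",   # macron
    "\u0328": ";",   # ogonek
    "\u030A": "0",   # ring above
    "\u0303": "?",   # tilde
}


def mnemonic(ch):
    """Return the two-character mnemonic for a letter carrying one accent."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) == 2 and decomposed[1] in ACCENT_SUFFIX:
        return decomposed[0] + ACCENT_SUFFIX[decomposed[1]]
    # Plain letters, stroked letters and anything else are returned unchanged.
    return ch


print(" ".join(mnemonic(ch) for ch in "ěščřžýáíéůúďťň"))
```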

The following additional mnemonics are used for characters missing in RFC 1345:

character mnemonic	descriptive name
@CH	CAPITAL CZECH LETTER CH (the digraph "CH") [ČSN]
@ch	SMALL CZECH LETTER CH (the digraph "ch") [ČSN]
@I,	LATIN CAPITAL LETTER I WITH CEDILLA
@i,	LATIN SMALL LETTER I WITH CEDILLA
@j.	LATIN SMALL LETTER J DOTLESS
@SS	LATIN CAPITAL LETTER SHARP S (German) (the digraph "SS")
@U,	LATIN CAPITAL LETTER U WITH CEDILLA
@u,	LATIN SMALL LETTER U WITH CEDILLA

Acknowledgments

Thanks to all the readers who contributed to the improvement of this FAQ.

Special thanks to Josef Tkadlec who reviewed the charset tables, and to Jiřina Chudková for Slovak accent names, and last but not least to Anthony Cirelli for helping me to improve the English of this text.

Sources Used

ČSN 36 9103. Systémy zpracování informací: 8bitově kódované soubory symbolů (Information processing: 8-bit code for information interchange.) Vydavatelství norem Praha, 1989.

Gašparíková, Z. - Kamiš, A.: Slovensko-český slovník. SPN Praha 1987. (The Slovak-Czech Dictionary.)

IBM: IBM OS/2 Warp 4. Klávesnice a kódové stránky. (Keyboards and Code Pages.) IBM, 1996.

Knuth, D. E.: The TeXbook. Addison-Wesley, Reading, Massachusetts, 1986.

Lamport, L.: LaTeX. Addison-Wesley, Reading, Massachusetts, 1986.

List of IANA Registered Character Sets.

RFC 1345. Character Mnemonics & Character Sets. [Tables for ISO Latin 2 (ISO_8859-2:1987), PC Latin 2 (IBM852) and KOI-8 L2 (CSN_369103).]

The cp1250_WinLatin2 to Unicode table, 2.00.

The MacOS_CentralEurope to Unicode table, 0.2. [This table also contains verbal description of the code.]