Unicode, UCS, UTF, BMP, BOM

Posted on 2007-01-22 17:42:25
An article from CSDN (author: fmddlmyy)

This is a light read written by a programmer for programmers. "Light" means you can fairly painlessly clear up some concepts you were never quite sure about and add to your knowledge, a bit like leveling up in an RPG. Two questions motivated me to put this article together:

Question 1:
With "Save As" in Windows Notepad you can convert a file among the GBK, Unicode, Unicode big endian and UTF-8 encodings. They are all just .txt files, so how does Windows tell which encoding a file uses?

I noticed long ago that txt files saved as Unicode, Unicode big endian and UTF-8 start with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian) and EF BB BF (UTF-8). But what standard are these markers based on?

Question 2:
I recently came across ConvertUTF.c on the web, which converts among UTF-32, UTF-16 and UTF-8. I already knew about Unicode (UCS-2), GBK and UTF-8, but this program confused me: I could not remember how UTF-16 relates to UCS-2.

After digging through the relevant material I finally sorted these questions out, and picked up some Unicode details along the way. I wrote them up as this article for anyone who has had similar doubts. I have tried to keep it accessible, but the reader is expected to know what a byte is and what hexadecimal is.

0. Big endian and little endian
Big endian and little endian are the two ways a CPU can order the bytes of a multi-byte number. For example, the Unicode code of the character '汉' is 6C49. When it is written to a file, should 6C come first or 49? Writing 6C first is big endian; writing 49 first is little endian.
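
A minimal C sketch of this (my own illustration; the names are arbitrary) writes 0x6C49 to a byte buffer in both orders and also reports which order the machine it runs on uses:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint16_t han = 0x6C49;                /* Unicode code of '汉' */

        unsigned char be[2] = { han >> 8, han & 0xFF };  /* big endian: 6C 49 */
        unsigned char le[2] = { han & 0xFF, han >> 8 };  /* little endian: 49 6C */

        printf("big endian   : %02X %02X\n", be[0], be[1]);
        printf("little endian: %02X %02X\n", le[0], le[1]);

        /* How does this CPU itself store the 16-bit value? */
        unsigned char *p = (unsigned char *)&han;
        printf("this machine : %02X %02X (%s)\n", p[0], p[1],
               p[0] == 0x6C ? "big endian" : "little endian");
        return 0;
    }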

The word "endian" comes from Gulliver's Travels. The civil war in Lilliput was fought over whether an egg should be cracked at the big end (Big-Endian) or the small end (Little-Endian); it led to six rebellions, cost one emperor his life and another his throne.

In Chinese, "endian" is usually translated as 字节序 (byte order), and big endian and little endian are called 大尾 (big-tail) and 小尾 (small-tail).

1. Character encodings and internal codes, with a digression on Chinese character encodings

Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII; to handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

GB2312 (1980) contains 7,445 characters in total: 6,763 Chinese characters and 682 other symbols. In the Chinese-character area the internal code uses high bytes B0-F7 and low bytes A1-FE, which gives 72*94 = 6768 code positions; 5 of them, D7FA-D7FE, are unused.

GB2312 covers too few Chinese characters. The 1995 extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese-character area and a graphic-symbol area; the Chinese-character area holds 21,003 characters. GB18030, published in 2000, is the official national standard that supersedes GBK 1.0. It contains 27,484 Chinese characters as well as the scripts of major minority languages such as Tibetan, Mongolian and Uyghur. PC platforms are now required to support GB18030; embedded products are exempt for the time being, which is why mobile phones and MP3 players generally support only GB2312.

From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same code in each of them, and each later standard simply supports more characters. English and Chinese text can therefore be processed uniformly; a Chinese (double-byte) code is recognized by the fact that the most significant bit of its high byte is not 0. In programmers' terminology, GB2312, GBK and GB18030 all belong to the double-byte character sets (DBCS).

On some Chinese Windows systems the default internal code is still GBK; it can be upgraded to GB18030 with the GB18030 upgrade package. The characters that GB18030 adds over GBK are rarely needed by ordinary users, though, so we usually still use GBK to refer to the internal code of Chinese Windows.

Here are a few more details:

The original GB2312 text actually defines qu-wei (region-position) codes; to get the internal code, A0 is added to both the high byte and the low byte.

In DBCS, the GB internal code is always stored big endian, i.e. high byte first.

In GB2312 the most significant bit of both bytes is 1, but only 128*128 = 16384 code positions satisfy this condition, so in GBK and GB18030 the most significant bit of the low byte may be 0. This does not affect parsing a DBCS character stream: whenever a byte with its high bit set is encountered, that byte and the one after it are taken together as one double-byte code, regardless of the high bit of the low byte.
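
A sketch of that scanning rule (the function and variable names are my own): the following C code walks a GBK byte stream and counts single-byte and double-byte characters purely by testing the high bit of the lead byte. The sample bytes BA BA D7 D6 are "汉字" in GBK, the same byte stream discussed in the appendix below:

    #include <stdio.h>

    /* Walk a DBCS (e.g. GBK) byte stream: a byte with its high bit set starts
       a two-byte character; any other byte is a one-byte (ASCII) character. */
    static void count_dbcs(const unsigned char *s, size_t len) {
        size_t i = 0, single = 0, dbl = 0;
        while (i < len) {
            if ((s[i] & 0x80) && i + 1 < len) { dbl++;    i += 2; }
            else                              { single++; i += 1; }
        }
        printf("%zu single-byte, %zu double-byte characters\n", single, dbl);
    }

    int main(void) {
        const unsigned char text[] = { 0xBA, 0xBA, 0xD7, 0xD6, ' ', 'o', 'k' };
        count_dbcs(text, sizeof text);   /* prints: 3 single-byte, 2 double-byte characters */
        return 0;
    }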

2. Unicode, UCS and UTF

As mentioned above, the encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode, on the other hand, is compatible only with ASCII (more precisely, with ISO-8859-1) and not with the GB codes. For example, the Unicode code of '汉' is 6C49, while its GB code is BABA.

Unicode is also a character encoding, but one designed by international bodies and able to accommodate the writing systems of all the world's languages. Its formal name is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. You can also read UCS as an abbreviation of "Unicode Character Set".

According to Wikipedia (http://zh.wikipedia.org/wiki/ ), there were historically two organizations that tried to design a universal character set independently: the International Organization for Standardization (ISO) and a consortium of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.

Around 1991 both sides realized that the world did not need two incompatible character sets, so they began to merge their work and to cooperate on a single code table. From Unicode 2.0 onward, the Unicode project has used the same character repertoire and code values as ISO 10646-1.

Both projects still exist today and publish their standards independently. At the time of writing, the Unicode Consortium's latest version is Unicode 4.1.0 (2005), and ISO's latest standard is ISO 10646-3:2003.

UCS only specifies how characters are assigned codes; it says nothing about how those codes are transmitted or stored. For example, the UCS code of '汉' is 6C49: I could transmit and store it as the four ASCII digits "6C49", or I could use UTF-8 and represent it as the three consecutive bytes E6 B1 89. The only requirement is that both ends of the communication agree. UTF-8, UTF-7 and UTF-16 are all widely accepted schemes. A particular advantage of UTF-8 is that it is fully compatible with ASCII. UTF stands for "UCS Transformation Format".

The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encodings in the usual RFC style: clear, brisk and still rigorous. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs the IETF maintains are the foundation of every specification on the Internet.

2.1. Internal code and code page

The Windows kernel now uses Unicode internally, so at the kernel level it can support all the world's languages. But because a huge amount of existing software and documents use some language-specific encoding such as GBK, Windows cannot simply drop support for the existing encodings and switch entirely to Unicode.

Windows uses code pages to adapt to different countries and regions. A code page can be understood as the internal code mentioned above; the code page corresponding to GBK is CP936.

Microsoft has also defined a code page for GB18030: CP54936. However, because GB18030 includes some 4-byte codes while Windows code pages support only single-byte and double-byte codes, this code page cannot really be used.

3. UCS-2, UCS-4 and the BMP

UCS comes in two forms: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes each character in two bytes and UCS-4 in four bytes (actually only 31 bits are used; the most significant bit must be 0). Let us play with a bit of simple arithmetic:

UCS-2 has 2^16 = 65536 code positions; UCS-4 has 2^31 = 2147483648.

UCS-4 is divided by its highest byte (whose most significant bit is 0) into 2^7 = 128 groups. Each group is divided by the next byte into 256 planes, each plane by the third byte into 256 rows, and each row contains 256 cells. Cells in the same row differ only in their last byte.

Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the UCS-4 code positions whose two high bytes are 0 form the BMP.

Dropping the two leading zero bytes from a UCS-4 code in the BMP gives its UCS-2 code; prepending two zero bytes to a UCS-2 code gives its UCS-4 code in the BMP. At the time of writing, the UCS-4 specification has not yet assigned any characters outside the BMP.
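
As a small illustration of this decomposition (a sketch of my own), the following C code splits a 31-bit UCS-4 value into group, plane, row and cell and checks whether it lies in the BMP:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t ucs4 = 0x00006C49;             /* '汉' as a UCS-4 value */

        unsigned group = (ucs4 >> 24) & 0x7F;   /* highest byte, top bit 0 */
        unsigned plane = (ucs4 >> 16) & 0xFF;
        unsigned row   = (ucs4 >>  8) & 0xFF;
        unsigned cell  =  ucs4        & 0xFF;

        printf("group=%u plane=%u row=%02X cell=%02X\n", group, plane, row, cell);
        if (group == 0 && plane == 0)           /* the BMP: high two bytes are 0 */
            printf("in the BMP, UCS-2 value = %04X\n", ucs4 & 0xFFFF);
        return 0;
    }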

4. The UTF encodings

UTF-8 encodes UCS using 8-bit units. The mapping from UCS-2 to UTF-8 is as follows:

UCS-2 code (hex)      UTF-8 byte stream (binary)
0000 - 007F           0xxxxxxx
0080 - 07FF           110xxxxx 10xxxxxx
0800 - FFFF           1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of '汉' is 6C49. Since 6C49 falls in the range 0800-FFFF, the three-byte template must be used: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 110001 001001; filling these bits into the x positions of the template, from left to right, gives 11100110 10110001 10001001, i.e. E6 B1 89.
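
The table above translates directly into code. Here is a minimal sketch of my own that handles only BMP code points, exactly as in the table; for 0x6C49 it prints E6 B1 89:

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one UCS-2 (BMP) code point into UTF-8; returns the byte count. */
    static int ucs2_to_utf8(uint16_t c, unsigned char out[3]) {
        if (c < 0x80) {                      /* 0000-007F: 0xxxxxxx */
            out[0] = (unsigned char)c;
            return 1;
        } else if (c < 0x800) {              /* 0080-07FF: 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (c >> 6));
            out[1] = (unsigned char)(0x80 | (c & 0x3F));
            return 2;
        } else {                             /* 0800-FFFF: 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xE0 | (c >> 12));
            out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (c & 0x3F));
            return 3;
        }
    }

    int main(void) {
        unsigned char buf[3];
        int n = ucs2_to_utf8(0x6C49, buf);   /* '汉' */
        for (int i = 0; i < n; i++) printf("%02X ", buf[i]);
        printf("\n");                        /* prints: E6 B1 89 */
        return 0;
    }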

You can use Notepad to check that our encoding is correct. Note that UltraEdit automatically converts UTF-8 text files to UTF-16 when opening them, which can cause confusion; you can turn that option off in its settings. A better tool is Hex Workshop.

UTF-16 encodes UCS using 16-bit units. For UCS codes below 0x10000, the UTF-16 encoding is simply the corresponding 16-bit unsigned integer; for codes at or above 0x10000 an algorithm is defined. Since the UCS-2 codes actually in use, that is, the BMP of UCS-4, are necessarily below 0x10000, for the time being UTF-16 can be regarded as essentially the same as UCS-2. But UCS-2 is only a coding scheme, while UTF-16 is used for actual transmission, so byte order has to be considered.
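
The algorithm alluded to here is the UTF-16 surrogate-pair scheme of RFC 2781 (mentioned above). A minimal sketch of my own, for a code point at or above 0x10000:

    #include <stdio.h>
    #include <stdint.h>

    /* Encode a code point >= 0x10000 as a UTF-16 surrogate pair (RFC 2781). */
    static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo) {
        uint32_t v = cp - 0x10000;                 /* 20 bits remain */
        *hi = (uint16_t)(0xD800 + (v >> 10));      /* high (lead) surrogate */
        *lo = (uint16_t)(0xDC00 + (v & 0x3FF));    /* low (trail) surrogate */
    }

    int main(void) {
        uint16_t hi, lo;
        to_surrogates(0x233B4, &hi, &lo);          /* example character used in RFC 3629 below */
        printf("U+233B4 -> %04X %04X\n", hi, lo);  /* prints: D84C DFB4 */
        return 0;
    }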

5. UTF byte order and the BOM

UTF-8 uses single bytes as its encoding unit, so byte order is not an issue. UTF-16 uses two-byte units, so before interpreting UTF-16 text you first have to know the byte order of each unit. For example, the Unicode code of '奎' is 594E and that of '乙' is 4E59. If we receive the UTF-16 byte stream 59 4E, is it '奎' or '乙'?

The method the Unicode specification recommends for marking byte order is the BOM. Here BOM is not the BOM of "Bill Of Materials" but the Byte Order Mark, and it is a rather clever little idea:

UCS contains a character called "ZERO WIDTH NO-BREAK SPACE" whose code is FEFF, while FFFE is not a valid UCS character and so should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the rest of the byte stream.

Then, if the receiver sees FEFF, the byte stream is big-endian; if it sees FFFE, the stream is little-endian. That is why "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding itself. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (you can verify this with the encoding method described above). So if the receiver sees a byte stream starting with EF BB BF, it knows the stream is UTF-8.

This is exactly how Windows marks the encoding of text files: with a BOM.
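
Putting the three signatures together, a small BOM sniffer could look like the C sketch below (my own; the fallback label and the default file name "test.txt" are arbitrary choices):

    #include <stdio.h>

    /* Guess a text file's encoding from its first bytes, the way Notepad does. */
    static const char *sniff_bom(const unsigned char *b, size_t n) {
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 little endian";
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 big endian";
        return "no BOM: ANSI, i.e. the default code page (e.g. GBK)";
    }

    int main(int argc, char **argv) {
        FILE *f = fopen(argc > 1 ? argv[1] : "test.txt", "rb");
        if (!f) { perror("fopen"); return 1; }
        unsigned char buf[3];
        size_t n = fread(buf, 1, sizeof buf, f);
        printf("%s\n", sniff_bom(buf, n));
        fclose(f);
        return 0;
    }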

6. Further reading

The main reference for this article is "A short overview of ISO/IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html ).

I also found two other documents that look good, but since my original questions had already been answered, I did not read them:

"Understanding Unicode: A general introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a )

"Character set encoding basics: Understanding character set encodings and legacy encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03 )

I have written a package that converts among UTF-8, UCS-2 and GBK, in versions with and without the Windows API. When I have time I will tidy it up and put it on my personal home page ( http://fmddlmyy.home4u.china.com ).

Appendix 1: More on qu-wei codes, GB2312, internal codes and code pages

Some readers still had questions about this sentence in the article: "The original GB2312 text actually defines qu-wei codes; to get the internal code, A0 is added to both the high byte and the low byte."

Let me explain in a bit more detail:

"The original GB2312 text" refers to the 1980 national standard GB 2312-80, "Code of Chinese Graphic Character Set for Information Interchange, Primary Set". This standard encodes Chinese characters and symbols with two numbers: the first is called the qu (region) and the second the wei (position), hence the name qu-wei code. Regions 1-9 hold Chinese symbols, regions 16-55 the level-1 Chinese characters, and regions 56-87 the level-2 Chinese characters. Windows still ships a qu-wei input method; for example, typing 1601 produces '啊'.

The internal code is the character encoding used inside the operating system. Early operating systems had language-specific internal codes. Today's Windows uses Unicode uniformly inside and adapts to individual languages via code pages, so the notion of "internal code" has become blurred: Microsoft usually calls the encoding designated by the default code page the internal code, but on particular occasions, for example when dealing with GB18030, it will also say that its internal code is Unicode.

A code page is a character encoding for a particular language or script. For example, the code page for GBK is CP936, for Big5 it is CP950, and for GB2312 it is CP20936.

Windows has the notion of a default code page, i.e. which encoding is used by default to interpret characters. Suppose Notepad opens a text file whose contents are the byte stream BA BA D7 D6. How should Windows interpret it?

As Unicode, as GBK, as Big5, or as ISO 8859-1? Interpreted as GBK it yields the two characters '汉字'. Interpreted under other encodings, the bytes may map to no character at all, or to the wrong characters, "wrong" meaning not what the author of the text intended; that is when mojibake appears.

The answer is that Windows interprets the byte stream of a text file according to the current default code page, which can be set in the Regional Options of the Control Panel. The "ANSI" entry in Notepad's Save As dialog simply means saving with the encoding of the default code page.

The internal code of Windows is Unicode, and technically it can support several code pages at the same time. As long as a file can state which encoding it uses and the user has the corresponding code page installed, Windows can display it correctly; an HTML file, for example, can specify its charset.

Some HTML authors, especially English-speaking ones, assume everyone in the world uses English and do not specify a charset. If such an author uses characters in the range 0x80-0xFF and Chinese Windows interprets them under the default GBK, mojibake results. The fix is simply to add a charset declaration to the HTML file, for example <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">; as long as the code page the original author used is compatible with ISO 8859-1, the mojibake disappears.

Back to qu-wei codes: the qu-wei code of '啊' is 1601, which in hexadecimal is 0x10, 0x01. This collides with the ASCII encoding that computers use everywhere, so to stay compatible with ASCII's 00-7F range we add A0 to both the high and the low byte of the qu-wei code; the code of '啊' thus becomes B0A1. We also call the codes with the two A0s added "GB2312", even though the original GB2312 text never mentions this at all.
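
The conversion is literally two additions; a tiny C sketch of my own, for concreteness:

    #include <stdio.h>

    int main(void) {
        int qu = 16, wei = 1;                             /* qu-wei code 1601 for '啊' */
        unsigned char hi = (unsigned char)(qu  + 0xA0);   /* 0x10 + 0xA0 = 0xB0 */
        unsigned char lo = (unsigned char)(wei + 0xA0);   /* 0x01 + 0xA0 = 0xA1 */
        printf("GB2312 bytes: %02X %02X\n", hi, lo);      /* prints: B0 A1 */
        return 0;
    }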

http://fmddlmyy.home4u.china.com/text6.html

-- FlyingFang - 24 Jun 2005
OP | Posted on 2007-01-22 17:42:57

Re: Unicode, UCS, UTF, BMP, BOM

A short overview of
ISO/IEC 10646 and Unicode
By Olle Järnefors <ojarnef@admin.kth.se>

Summary
The purpose of this text is to give a brief technical overview of the new character set standard ISO/IEC 10646 and the closely related Unicode standard. I have omitted descriptions of the history of the standard as well as general talk about why a standard of this type is badly needed.

Previous knowledge
The reader should have some knowledge about coded character sets, have seen an ASCII table, and know of some 8-bit character sets, like Latin-1 (ISO/IEC 8859-1).

Document history
Various drafts of this text have previously been available over Internet, the latest of which is version Ap4 (from 1993-09-14).

1993-09-14, version Ap4: Last draft

1996-02-24, version A: Final document, prepared for the IAB character set workshop 1996-02-29/1996-03-01

1996-02-26, version Ar1: Added one item to the author presentation. HTML home added. Section 1: added three limitations of plain text removed by UCS. Section 5: paragraph about privat

About the author
Having joined SIS-ITS/AG2 (the Swedish standardization working group corresponding to ISO/IEC JTC1/SC2 -- Character sets and information coding) in 1988, I made contributions to the Swedish comments on several drafts of the ISO/IEC 10646 standard. I also had the pleasure to take part in the big merger of Unicode and ISO/IEC 10646 that was accomplished at three meetings during 1991 in San Francisco, Geneva and Paris, representing Sweden on the ISO side. I have also worked with character set standardization in European standardization (CEN/TC304) and within IETF. Lately, I have provided character set knowledge to and edited the first proposal for extending ISO/IEC 10646 with a major historical script, the Runic script.

Original home
The latest version of this text is available at
<URL:ftp://ftp.admin.kth.se/pub/misc/ucs/unicode-iso10646-oview.txta;type=A>

HTML home
An HTML version of this text is available at
<URL:http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html>
Table of contents with synopsis
1. Most important facts
ISO/IEC 10646 = UCS. Universal in scope. Multi-octet character set. Relation to Unicode. Plain and rich text

2. The structure of the coding space
The half-filled UCS-2. The unused UCS-4. Cell, row, plane, group. Relation to ISO/IEC 8859-1. UCS-2 = BMP = plane 0 of group 0.

3. Implementation levels
Level 1 (enough for Europe, the Middle East, East Asia). Level 2 (needed for South Asia). Bi-directional text. Precomposed characters, combining characters, composite sequences.

4. Adaptation to data communication needs
UCS transformation formats. UTF-8: UCS represented in 8-bit text. UTF-7: UCS-2 represented in 7-bit text. UTF-16: Part of UCS-4 represented in UCS-2.

5. What is accepted as a character in UCS?
Existing coded character sets amalgamated. CJK unification. Characters not shapes, not meanings. Compatibility characters. Private use characters.

6. References

7. Annex: Overview of the BMP (group=00, plane=00)
1. Most important facts
ISO/IEC 10646 is a relatively new character set standard, published in 1993 by the International Organization for Standardization (ISO). Its name is "Universal Multiple-Octet Coded Character Set". Throughout this overview I use its acronym, UCS.

UCS is the first officially standardized coded character set with the purpose of eventually including all characters used in all the written languages of the world (and, in addition, all mathematical and other symbols). This is certainly a very ambitious goal, but the current first edition at least covers all major languages and all commercially important languages.

To be able to give every character of this grand repertoire a unique coded representation, the designers of UCS chose a uniform encoding, using bit sequences consisting of 16 or 31 bits (in the two coding forms, UCS-2 and UCS-4). This is the reason for the phrase "multi-octet" in the name of the standard.

Unicode is a coded character set specified by a consortium of major American computer manufacturers, primarily to overcome the chaos of different coded character sets in use when creating multilingual programs and internationalizing software. From version 1.1 on, Unicode is scrupulously kept compatible with ISO/IEC 10646 and its extensions. The consortium is also an important contributor to the ISO work to further develop ISO/IEC 10646.

In short, Unicode can be characterized as the (restricted) 2-octet form of UCS on (the most general) implementation level 3, with addition of a more precise specification of the bi-directional behavior of characters, when used in the Arabic and Hebrew scripts. Unicode is presently at version 1.1. Extensions in the soon forthcoming version 2.0 will make it possible to access also the wider coding space of UCS-4, within this 16-bit encoding.

UCS is intended to be usable both for internal data representation in computer systems and in data communication. UCS is already employed in commercial products from Microsoft, Novell, Apple and others. It is implemented in free software like Linux, and is proposed for inclusion in advanced data communication standards like HTML.

Strong but in my opinion ill-founded criticism has met UCS from programmer groups in Japan. It has, however, recently been adopted as a Japanese national standard.

ISO/IEC 10646 is a fundamental standard, potentially affecting almost all parts of information technology. But it specifies only a coded character set, not a complete system for text representation. It provides the basis for internationalization, but does not in itself give a complete solution of the problems in this field.

The simple kind of text for whose representation a coded character set standard is sufficient, plain text, is essentially only a linear sequence of graphic characters, with a fixed division into lines and possibly pages.

ISO/IEC 10646 and Unicode remove some assumptions often made about plain text, assumptions which simplify implementations but are untenable for multilingual text and for monolingual text in some languages:

* Plain text does not need to be monospaced. (Proportional plain text in the Latin script has existed in the Apple Macintosh computers since the middle of the 80's.)
* Characters cannot be identified with glyphs. Different graphic forms to be used in different situations are needed for some characters, e.g. Arabic letters.
* Characters do not in general specify the language of the text. UCS is a completely language-neutral standard.

For several important aspects of text, as treated in modern text processing programs, UCS needs to be supplemented by further standards or rules, so-called higher-level text protocols. Some examples of these aspects are tables, mathematical formulas, information about the language of text fragments, text variations like italic text and different text sizes, choice of particular fonts, content mark-up, document structure, hyperlinks. This is called rich text. (Some standards for rich text are HTML, SGML, Microsoft RTF.)

The evolution of ISO/IEC 10646 and, in parallel, Unicode will continue for a long period of time, mostly by additions of scripts and symbol collections. This overview describes the first edition of the standard from 1993, but some of the extensions that are about to be adopted are also touched upon.
2. The structure of the coding space
In the first version of UCS, 34203 different characters are included. Of these 21204 are ideographic characters used in Chinese, Japanese and Korean, and 6656 are Korean Hangul syllabograms. To guarantee that the coding space will not be filled up even in the future -- 2 octets give 65536 different character positions -- a 4-octet form of UCS (UCS-4) is also defined.

The 65536 positions in the 2-octet form of UCS are divided into 256 rows with 256 cells in each. The first octet of a character representation gives the row number, the second the cell number. The first row, row 0, contains exactly the same characters as ISO/IEC 8859-1. The first 128 characters are thus the ASCII characters. The octet representing an ISO/IEC 8859-1 character is easily transformed to the representation in UCS, by putting a 0 octet in front of it. UCS includes the same control characters as ISO/IEC 8859 and these are also in row 0. An overview of the content of all rows is found in the annex.

In the 4-octet form more than 2 billion (2147483648) different characters can be represented. (The first bit of the first octet must be 0 so only 31 of the 32 bits are used by UCS.) This coding space is subdivided into 128 groups, each containing 256 planes. The first octet in a character representation indicates the group number and the second the plane number. The third and fourth octets give the row number and the cell number of the character. Those characters that can be represented by the 2-octet form of UCS belong to plane 0 of group 0, which is called the Basic Multilingual Plane, BMP. The 4-octet representation of a character in the BMP is produced by putting two 0 octets before its 2-octet representation.

Still no characters have been allocated to positions outside the BMP, and only the 2-octet form is used in practice.
3. Implementation levels
Independently of the two encoding forms of UCS, the standard ISO/IEC 10646 also draws a distinction between three different implementation levels. The full coded character set is available on level 3. On the lower levels certain subsets of the characters are not usable. This restricts the range of languages that can be coded on these levels. On the other hand it makes simpler implementations possible.

A full implementation of the Unicode standard amounts to an implementation at level 3 of UCS.

* The simplest implementation level 1 works exactly like the older simple coded character sets, such as ASCII and ISO/IEC 8859-1: Each graphic character occupies one position and moves the active position one step in the writing direction (even though the movement need not be constant; it is not if a proportional font is used). This model works well for among others the Latin, Greek, and Cyrillic scripts. On this level the composite letters, consisting of a base letter and one or more diacritical marks, which are used in certain languages, are included as single characters in their own right. UCS includes the composite letters of all official languages and also of most other languages with a well-established orthography using these scripts.

Also the Arabic and Hebrew scripts are handled on this level, but they introduce an extra complication: Arabic and Hebrew are normally written from right to left, but when words in e.g. the Latin script are included within such text, these are written in their normal direction, from left to right. In computer memory all characters are stored in the logical order, i.e. the order in which the sounds of the words are pronounced and the letters normally are input. When displayed, the writing direction of some words must be changed, relative to the order in memory. Two alternative methods to handle bi-directional text can be used together with UCS, one based on the international standard ISO/IEC 6429 and one defined for Unicode.

Other languages for which implementation level 1 is sufficient are Japanese and Chinese. These are not affected by any of the two complications noted above. For these languages it is the big number of different characters that make implementations difficult.

* On implementation level 2 also the South-Asian scripts, e.g. Devanagari used on the Indian subcontinent, can be handled. These cause further complications for display software, since in many cases both the appearance and the relative position of a letter are determined by the letters that surround it.

* On the full implementation level 3 conforming programs also must be able to handle independent combining characters, e.g. accents and other diacritical marks that are printed over, under or through ordinary letters. Such characters can be freely combined with other characters and UCS sets no limit on the number of combining characters attached to a base character. A difference compared to some other coded character sets is that the codes for combining characters are stored after the code of the base character, not before it.

A complication for programming is that on this level some composite characters can each be coded in several different ways. As an example, the Danish letter "A with ring above and acute accent" can be represented in three different ways:

01FA
(the simple representation that must be used on level 1 and 2)

00C5 0301
("A with ring above" + combining acute accent)

0041 030A 0301
("A" + combining ring above + combining acute accent)

(The code positions in UCS are usually given in hexadecimal notation. 01FA indicates two octets, first the octet with the value 1, corresponding to row 1, then the octet with the hexadecimal value FA, corresponding to cell 250 in that row.)

Formally, the first alternative above is considered as a representation of a single precomposed character, while the second and third alternatives represent different composite sequences of several characters. Programs on implementation level 3 should, however, treat these three alternatives as fully equivalent representations of the same thing.

Implementation level 3 is necessary for full support of the Korean Hangul script and also for full support of IPA, the International Phonetic Alphabet. It also removes artificial restrictions on the possibilities of combining accents and similar marks with other characters in ways not anticipated when the composite characters of implementation level 1 were chosen.

4. Adaptation to data communication needs
Many data communication protocols treat octets with values in the hexadecimal range 00-1F specially; they represent control characters in most 7-bit and 8-bit character sets. It is even the case that the most used protocol for electronic mail, classical SMTP, explicitly forbids the 128 octets > hex 7F. In certain datatypes used in data communication, e.g. domain names on Internet, even harder restrictions are imposed on allowed octets. In some important operating systems, notably Unix, even some octets that in ASCII represent graphic characters cannot be used in file names.

When UCS is used in these contexts, the simple solution to just partition the 16-bit or 31-bit codes into 2 or 4 octets does not work. For many graphic characters this will produce octets in the ranges forbidden by the above mentioned protocols and operating system designs.

For these reasons, several algorithmic transformation methods have been defined for UCS data. The UTF-1 method (UCS Transformation Format No. 1), defined in an annex to ISO/IEC 10646, is of little interest and will be withdrawn. More important are the following:

* UTF-8: The codes in the first half of the first row of the BMP, i.e. the characters that also can be found in ASCII, are in this transformation format replaced by their ASCII codes, which are octets in the range hex 00-7F. The other codes of UCS are transformed to between two and six octets in the range hex 80-FF. A text only containing characters in the BMP is transformed to the same octet sequence, irrespective of whether it was coded with UCS-2 or UCS-4.

* UTF-7: This is a transformation format specially designed for the extreme requirements of Internet e-mail using the classical SMTP protocol. It transforms UCS-2-coded text to a sequence of octets that all are <= 7F. In this encoding most ASCII characters of the UCS-2 text are replaced by their ASCII octet. All other characters are transformed to a representation using around 2.7 octets per character.

* UTF-16: Unlike UTF-8 and UTF-7, this transformation reduces UCS-4-coded text to a UCS-2-based encoding and the result can only be used by so called 8-bit safe programs and processes, where all octet values are allowed. All UCS-4 codes in the BMP are reduced to the corresponding code in UCS-2. In addition, UCS-4 codes in the 10 following planes of group 0 are transformed to two UCS-2 codes. 4096 codes in the BMP are reserved for this. This makes the characters that in the future may be allocated to 1048576 code positions of UCS-4 outside the BMP available in the 16-bit UCS-2 coded character set. The other code positions in UCS-4 are still unusable in the UTF-16 transformation format. One motivation for defining UTF-16 has been that it will make it possible for software implementing Unicode to cope with the expansion of UCS outside the BMP for the foreseeable future.

UTF-8 and UTF-16 will be added to ISO/IEC 10646 in the next revision of the standard, and are included in the forthcoming Unicode version 2.0. UTF-7 is a specification of IETF, the Internet Engineering Task Force, and formally unrelated to ISO/IEC 10646.
5. What is accepted as a character in UCS?
The character repertoire of the first version of UCS is based on an amalgamation of all internationally standardized coded character sets and the most important company-defined de facto standards for coded character sets that existed in 1991. Whenever what was deemed as the same character was found in different coded character sets, these were unified into one character with one code in UCS. But two different characters in the same coded character set were never unified. Also the letters of some scripts with no existing standard coded character set, and vast collections of mathematical symbols, technical symbols, geometric shapes, dingbats and other conventional signs were included in the repertoire of UCS.

When deciding on whether a graphic character should be added to UCS, the most important principle has been that a new character must differ from all already included characters both in meaning and in appearance to be accepted.

Alternative graphic forms of existing characters (font variants, glyphs) are consequently not given UCS codes of their own. In Chinese, Japanese and Korean there is a very big number of ideographic characters which have the same historical origin and only minor differences in appearance between the three languages. These national variants of the same ideographic character have been given a joint UCS code, a solution which is known as CJK unification.

On the other hand, not even a completely new way of using an existing character -- the same appearance but different meanings -- is sufficient justification to get it included in UCS as a separate character. For example the punctuation mark asterisk, "*", of considerable age in itself, has in recent years also been used as a multiplication sign in different programming languages. This case is regarded as two different uses of the same character, which is given only one UCS representation.

There are two important exceptions from the criteria for character sameness outlined above:

* Letters with exactly the same appearance that occur in different scripts are given different codes. There are for example one Latin "P", one Greek "Ρ" (capital rho), and one Cyrillic "Р" (Cyrillic R).

* A comparatively small number of characters have been accepted in UCS only because they occur in other, practically important coded character sets. This is to make possible the fully reversible conversion of data coded in these coded character sets to UCS and back again to the original character set (round-trip convertibility). Such characters are called compatibility characters. One example is the character SUPERSCRIPT TWO which can be found in UCS only because it is included in the coded character set ISO/IEC 8859-1.

What is said here is only a general outline of the principles used to identify individual characters to be given a code position in UCS and Unicode. These are unfortunately not described at all in the text of ISO/IEC 10646. In many specific cases it is of course not at all clear how to apply them. Quite a number of the decisions made are fairly arbitrary.

One important feature of UCS is that a large number of code positions are reserved for private use characters. No future revision of ISO/IEC 10646 will use these positions. There is room for 6400 private characters in the 2-octet form, and more in the 4-octet form.
6. References

UCS is defined in:
ISO/IEC International Standard 10646-1:1993(E): Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. International Organization for Standardization, Geneva, 1993.

Unicode version 1.0 is defined in two books:
The Unicode Consortium: The Unicode Standard Worldwide Character Encoding. Version 1.0. Volume 1 (Architecture, non-ideographic characters) Addison-Wesley, 1991

The Unicode Consortium: The Unicode Standard Worldwide Character Encoding. Version 1.0. Volume 2 (Ideographic characters) Addison-Wesley, 1992

The changes made between version 1.0 and version 1.1 are specified in:
Unicode Technical Report #4: The Unicode Standard, Version 1.1 The Unicode Consortium, 1993

Definitions of the various transformation formats proposed to be included in ISO/IEC 10646 and Unicode 2.0 are available on the Internet:
UTF-7 Encoding Form
[HTML-version of RFC 1642]
http://www.stonehand.com/unicode/standard/utf7.html

UCS Transformation Format 8 (UTF-8) [HTML-version of ISO-document ISO/IEC JTC1/SC2/WG2 N1036]http://www.stonehand.com/unicode/standard/wg2n1036.html

UCS Transformation Format 16 (UTF-16) [HTML-version of ISO-document ISO/IEC JTC1/SC2/WG2 N1035]http://www.stonehand.com/unicode/standard/wg2n1035.html

Internet sites with much information about Unicode:
http://www.stonehand.com/unicode/

ftp://ftp.stonehand.com/pub/

ftp://unicode.org/pub/


A good account of the history of ISO work on multi-octet character sets and the merger between ISO/IEC 10646 and Unicode can be found in:
Michael Y. Ksar: Untying tongues. ISO/IEC breaks down computer barriers in processing worldwide languages ISO Bulletin, No. 6 (June 1993)

Annex: Overview of the BMP (group=00, plane=00)

_______ ___________________________________________________________________

Row(s) Content (script, other groups of characters, reserved area)
_______ ___________________________________________________________________

======= A-ZONE (alphabetical characters and symbols) =======================
00 (Control characters,) Basic Latin, Latin-1 Supplement (=ISO/IEC 8859-1)
01 Latin Extended-A, Latin Extended-B
02 Latin Extended-B, IPA Extensions, Spacing Modifier Letters
03 Combining Diacritical Marks, Basic Greek, Greek Symbols and Coptic
04 Cyrillic
05 Armenian, Hebrew
06 Basic Arabic, Arabic Extended
07--08 (Reserved for future standardization)
09 Devanagari, Bengali
0A Gurmukhi, Gujarati
0B Oriya, Tamil
0C Telugu, Kannada
0D Malayalam
0E Thai, Lao
0F (Reserved for future standardization)
10 Georgian
11 Hangul Jamo
12--1D (Reserved for future standardization)
1E Latin Extended Additional
1F Greek Extended
20 General Punctuation, Super/subscripts, Currency, Combining Symbols
21 Letterlike Symbols, Number Forms, Arrows
22 Mathematical Operators
23 Miscellaneous Technical Symbols
24 Control Pictures, OCR, Enclosed Alphanumerics
25 Box Drawing, Block Elements, Geometric Shapes
26 Miscellaneous Symbols
27 Dingbats
28--2F (Reserved for future standardization)
30 CJK Symbols and Punctuation, Hiragana, Katakana
31 Bopomofo, Hangul Compatibility Jamo, CJK Miscellaneous
32 Enclosed CJK Letters and Months
33 CJK Compatibility
34--4D Hangul

======= I-ZONE (ideographic characters) ===================================
4E--9F CJK Unified Ideographs

======= O-ZONE (open zone) ================================================
A0--DF (Reserved for future standardization)

======= R-ZONE (restricted use zone) ======================================
E0--F8 (Private Use Area)
F9--FA CJK Compatibility Ideographs
FB Alphabetic Presentation Forms, Arabic Presentation Forms-A
FC--FD Arabic Presentation Forms-A
FE Combining Half Marks, CJK Compatibility Forms, Small Forms, Arabic-B
FF Halfwidth and Fullwidth Forms, Specials



Author: Olle Järnefors <ojarnef@admin.kth.se>
Maintainer: Peter Svanberg <psv@nada.kth.se> Organization: Royal Institute of Technology (KTH), Stockholm, Sweden
Version: Ar1
Document type: overview
Newest version at: ftp://ftp.admin.kth.se/pub/misc/ucs/unicode-iso10646-oview.txta
URL: http://www.nada.kth.se/i18n/unicode-iso10646-oview.html
This version updated: 1996-02-26
OP | Posted on 2007-01-22 17:43:47

Re: Unicode, UCS, UTF, BMP, BOM

Network Working Group F. Yergeau
Request for Comments: 3629 Alis Technologies
STD: 63 November 2003
Obsoletes: 2279
Category: Standards Track


UTF-8, a transformation format of ISO 10646

Status of this Memo

This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

ISO/IEC 10646-1 defines a large character set called the Universal
Character Set (UCS) which encompasses most of the world's writing
systems. The originally proposed encodings of the UCS, however, were
not compatible with many current applications and protocols, and this
has led to the development of UTF-8, the object of this memo. UTF-8
has the characteristic of preserving the full US-ASCII range,
providing compatibility with file systems, parsers and other software
that rely on US-ASCII values but are transparent to other values.
This memo obsoletes and replaces RFC 2279.

Table of Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Notational conventions . . . . . . . . . . . . . . . . . . . . 3
3. UTF-8 definition . . . . . . . . . . . . . . . . . . . . . . . 4
4. Syntax of UTF-8 Byte Sequences . . . . . . . . . . . . . . . . 5
5. Versions of the standards . . . . . . . . . . . . . . . . . . 6
6. Byte order mark (BOM) . . . . . . . . . . . . . . . . . . . . 6
7. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
8. MIME registration . . . . . . . . . . . . . . . . . . . . . . 9
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10
10. Security Considerations . . . . . . . . . . . . . . . . . . . 10
11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 11
12. Changes from RFC 2279 . . . . . . . . . . . . . . . . . . . . 11
13. Normative References . . . . . . . . . . . . . . . . . . . . . 12
14. Informative References . . . . . . . . . . . . . . . . . . . . 12
15. URI's . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
16. Intellectual Property Statement . . . . . . . . . . . . . . . 13
17. Author's Address . . . . . . . . . . . . . . . . . . . . . . . 13
18. Full Copyright Statement . . . . . . . . . . . . . . . . . . . 14

1. Introduction

ISO/IEC 10646 [ISO.10646] defines a large character set called the
Universal Character Set (UCS), which encompasses most of the world's
writing systems. The same set of characters is defined by the
Unicode standard [UNICODE], which further defines additional
character properties and other application details of great interest
to implementers. Up to the present time, changes in Unicode and
amendments and additions to ISO/IEC 10646 have tracked each other, so
that the character repertoires and code point assignments have
remained in sync. The relevant standardization committees have
committed to maintain this very useful synchronism.

ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. In an
encoding form, each character is represented as one or more encoding
units. All standard UCS encoding forms except UTF-8 have an encoding
unit larger than one octet, making them hard to use in many current
applications and protocols that assume 8 or even 7 bit characters.

UTF-8, the object of this memo, has a one-octet encoding unit. It
uses all bits of an octet, but has the quality of preserving the full
US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one
octet having the normal US-ASCII value, and any octet with such a
value can only stand for a US-ASCII character, and nothing else.

UTF-8 encodes UCS characters as a varying number of octets, where the
number of octets, and the value of each, depend on the integer value
assigned to the character in ISO/IEC 10646 (the character number,
a.k.a. code position, code point or Unicode scalar value). This
encoding form has the following characteristics (all values are in
hexadecimal):

o Character numbers from U+0000 to U+007F (US-ASCII repertoire)
correspond to octets 00 to 7F (7 bit US-ASCII values). A direct
consequence is that a plain ASCII string is also a valid UTF-8
string.

o US-ASCII octet values do not appear otherwise in a UTF-8 encoded
character stream. This provides compatibility with file systems
or other software (e.g., the printf() function in C libraries)
that parse based on US-ASCII values but are transparent to other
values.

o Round-trip conversion is easy between UTF-8 and other encoding
forms.

o The first octet of a multi-octet sequence indicates the number of
octets in the sequence.

o The octet values C0, C1, F5 to FF never appear.

o Character boundaries are easily found from anywhere in an octet
stream.

o The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

o The Boyer-Moore fast search algorithm can be used with UTF-8 data.

o UTF-8 strings can be fairly reliably recognized as such by a
simple algorithm, i.e., the probability that a string of
characters in any other encoding appears as valid UTF-8 is low,
diminishing with increasing string length.

UTF-8 was devised in September 1992 by Ken Thompson, guided by design
criteria specified by Rob Pike, with the objective of defining a UCS
transformation format usable in the Plan9 operating system in a non-
disruptive manner. Thompson's design was stewarded through
standardization by the X/Open Joint Internationalization Group XOJIG
(see [FSS_UTF]), bearing the names FSS-UTF (variant FSS/UTF), UTF-2
and finally UTF-8 along the way.

2. Notational conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].

UCS characters are designated by the U+HHHH notation, where HHHH is a
string of from 4 to 6 hexadecimal digits representing the character
number in ISO/IEC 10646.

3. UTF-8 definition

UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and
formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646]

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets. The
only octet of a "sequence" of one has the higher-order bit set to 0,
the remaining 7 bits being used to encode the character number. In a
sequence of n octets, n>1, the initial octet has the n higher-order
bits set to 1, followed by a bit set to 0. The remaining bit(s) of
that octet contain bits from the number of the character to be
encoded. The following octet(s) all have the higher-order bit set to
1 and the following bit set to 0, leaving 6 bits in each to contain
bits from the character to be encoded.

The table below summarizes the format of these different octet types.
The letter x indicates bits available for encoding bits of the
character number.

Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Encoding a character to UTF-8 proceeds as follows:

1. Determine the number of octets required from the character number
and the first column of the table above. It is important to note
that the rows of the table are mutually exclusive, i.e., there is
only one valid way to encode a given character.

2. Prepare the high-order bits of the octets as per the second
column of the table.

3. Fill in the bits marked x from the bits of the character number,
expressed in binary. Start by putting the lowest-order bit of
the character number in the lowest-order position of the last
octet of the sequence, then put the next higher-order bit of the
character number in the next higher-order position of that octet,
etc. When the x bits of the last octet are filled in, move on to
the next to last octet, then to the preceding one, etc. until all
x bits are filled in.

The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters. When encoding in UTF-8 from UTF-16 data, it is necessary
to first decode the UTF-16 data to obtain character numbers, which
are then encoded in UTF-8 as described above. This contrasts with
CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
use on the Internet. CESU-8 operates similarly to UTF-8 but encodes
the UTF-16 code values (16-bit quantities) instead of the character
number (code point). This leads to different results for character
numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
valid UTF-8.

Decoding a UTF-8 character proceeds as follows:

1. Initialize a binary number with all bits set to 0. Up to 21 bits
may be needed.

2. Determine which bits encode the character number from the number
of octets in the sequence and the second column of the table
above (the bits marked x).

3. Distribute the bits from the sequence to the binary number, first
the lower-order bits from the last octet of the sequence and
proceeding to the left until no x bits are left. The binary
number is now equal to the character number.

Implementations of the decoding algorithm above MUST protect against
decoding invalid sequences. For instance, a naive implementation may
decode the overlong UTF-8 sequence C0 80 into the character U+0000,
or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
invalid sequences may have security consequences or cause other
problems. See Security Considerations (Section 10) below.
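
(A reader's illustration, not part of the RFC text: the table in Section 3 and the warning above can be turned into a small validating decoder. The C sketch below rejects overlong forms, surrogates and values above U+10FFFF; the authoritative definition remains [UNICODE].)

    #include <stdio.h>
    #include <stdint.h>

    /* Decode one UTF-8 sequence from s (n bytes available). Returns the number
       of bytes consumed, or 0 for an invalid sequence; stores the character
       number in *cp. */
    static int utf8_decode(const unsigned char *s, size_t n, uint32_t *cp) {
        if (n == 0) return 0;
        if (s[0] < 0x80) { *cp = s[0]; return 1; }
        int len;
        uint32_t c;
        if      ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
        else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
        else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
        else return 0;                            /* stray tail byte or F8-FF */
        if (n < (size_t)len) return 0;
        for (int i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80) return 0;  /* each tail must be 10xxxxxx */
            c = (c << 6) | (s[i] & 0x3F);
        }
        static const uint32_t min[5] = { 0, 0, 0x80, 0x800, 0x10000 };
        if (c < min[len]                          /* overlong (e.g. C0 80) */
            || (c >= 0xD800 && c <= 0xDFFF)       /* surrogates are forbidden */
            || c > 0x10FFFF) return 0;
        *cp = c;
        return len;
    }

    int main(void) {
        const unsigned char good[] = { 0xE6, 0xB1, 0x89 };  /* U+6C49 */
        const unsigned char bad[]  = { 0xC0, 0x80 };        /* overlong NUL */
        uint32_t cp = 0;
        int k = utf8_decode(good, sizeof good, &cp);
        printf("good: %d bytes -> U+%04X\n", k, (unsigned)cp);
        printf("bad : %d (0 means rejected)\n", utf8_decode(bad, sizeof bad, &cp));
        return 0;
    }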

4. Syntax of UTF-8 Byte Sequences

For the convenience of implementors using ABNF, a definition of UTF-8
in ABNF syntax is given here.

A UTF-8 string is a sequence of octets representing a sequence of UCS
characters. An octet sequence is valid UTF-8 only if it matches the
following syntax, which is derived from the rules for encoding UTF-8
and is expressed in the ABNF of [RFC2234].

UTF8-octets = *( UTF8-char )
UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1 = %x00-7F
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
%xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
%xF4 %x80-8F 2( UTF8-tail )
UTF8-tail = %x80-BF

NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This
grammar is believed to describe the same thing Unicode describes, but
does not claim to be authoritative. Implementors are urged to rely
on the authoritative source, rather than on this ABNF.

5. Versions of the standards

ISO/IEC 10646 is updated from time to time by publication of
amendments and additional parts; similarly, new versions of the
Unicode standard are published over time. Each new version obsoletes
and replaces the previous one, but implementations, and more
significantly data, are not updated instantly.

In general, the changes amount to adding new characters, which does
not pose particular problems with old data. In 1996, Amendment 5 to
the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded
the Korean Hangul block, thereby making any previous data containing
Hangul characters invalid under the new version. Unicode 2.0 has the
same difference from Unicode 1.1. The justification for allowing
such an incompatible change was that there were no major
implementations and no significant amounts of data containing Hangul.
The incident has been dubbed the "Korean mess", and the relevant
committees have pledged to never, ever again make such an
incompatible change (see Unicode Consortium Policies [1]).

New versions, and in particular any incompatible changes, have
consequences regarding MIME charset labels, to be discussed in MIME
registration (Section 8).

6. Byte order mark (BOM)

The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known
informally as "BYTE ORDER MARK" (abbreviated "BOM"). This character
can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but
the BOM name hints at a second possible usage of the character: to
prepend a U+FEFF character to a stream of UCS characters as a
"signature". A receiver of such a serialized stream may then use the
initial character as a hint that the stream consists of UCS
characters and also to recognize which UCS encoding is involved and,
with encodings having a multi-octet encoding unit, as a way to
recognize the serialization order of the octets. UTF-8 having a
single-octet encoding unit, this last function is useless and the BOM
will always appear as the octet sequence EF BB BF.

It is important to understand that the character U+FEFF appearing at
any position other than the beginning of a stream MUST be interpreted
with the semantics for the zero-width non-breaking space, and MUST
NOT be interpreted as a signature. When interpreted as a signature,
the Unicode standard suggests that an initial U+FEFF character may be
stripped before processing the text. Such stripping is necessary in
some cases (e.g., when concatenating two strings, because otherwise
the resulting string may contain an unintended "ZERO WIDTH NO-BREAK
SPACE" at the connection point), but might affect an external process
at a different layer (such as a digital signature or a count of the
characters) that is relying on the presence of all characters in the
stream. It is therefore RECOMMENDED to avoid stripping an initial
U+FEFF interpreted as a signature without a good reason, to ignore it
instead of stripping it when appropriate (such as for display) and to
strip it only when really necessary.

U+FEFF in the first position of a stream MAY be interpreted as a
zero-width non-breaking space, and is not always a signature. In an
attempt at diminishing this uncertainty, Unicode 3.2 adds a new
character, U+2060 "WORD JOINER", with exactly the same semantics and
usage as U+FEFF except for the signature function, and strongly
recommends its exclusive use for expressing word-joining semantics.
Eventually, following this recommendation will make it all but
certain that any initial U+FEFF is a signature, not an intended "ZERO
WIDTH NO-BREAK SPACE".

In the meantime, the uncertainty unfortunately remains and may affect
Internet protocols. Protocol specifications MAY restrict usage of
U+FEFF as a signature in order to reduce or eliminate the potential
ill effects of this uncertainty. In the interest of striking a
balance between the advantages (reduction of uncertainty) and
drawbacks (loss of the signature function) of such restrictions, it
is useful to distinguish a few cases:

o A protocol SHOULD forbid use of U+FEFF as a signature for those
textual protocol elements that the protocol mandates to be always
UTF-8, the signature function being totally useless in those
cases.

o A protocol SHOULD also forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol provides
character encoding identification mechanisms, when it is expected
that implementations of the protocol will be in a position to
always use the mechanisms properly. This will be the case when
the protocol elements are maintained tightly under the control of
the implementation from the time of their creation to the time of
their (properly labeled) transmission.

o A protocol SHOULD NOT forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol does not
provide character encoding identification mechanisms, when a ban
would be unenforceable, or when it is expected that
implementations of the protocol will not be in a position to
always use the mechanisms properly. The latter two cases are
likely to occur with larger protocol elements such as MIME
entities, especially when implementations of the protocol will
obtain such entities from file systems, from protocols that do not
have encoding identification mechanisms for payloads (such as FTP)
or from other protocols that do not guarantee proper
identification of character encoding (such as HTTP).

When a protocol forbids use of U+FEFF as a signature for a certain
protocol element, then any initial U+FEFF in that protocol element
MUST be interpreted as a "ZERO WIDTH NO-BREAK SPACE". When a
protocol does NOT forbid use of U+FEFF as a signature for a certain
protocol element, then implementations SHOULD be prepared to handle a
signature in that element and react appropriately: using the
signature to identify the character encoding as necessary and
stripping or ignoring the signature as appropriate.

7. Examples

The character sequence U+0041 U+2262 U+0391 U+002E "A<NOT IDENTICAL
TO><ALPHA>." is encoded in UTF-8 as follows:

--+--------+-----+--
41 E2 89 A2 CE 91 2E
--+--------+-----+--

The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo",
meaning "the Korean language") is encoded in UTF-8 as follows:

--------+--------+--------
ED 95 9C EA B5 AD EC 96 B4
--------+--------+--------

The character sequence U+65E5 U+672C U+8A9E (Japanese "nihongo",
meaning "the Japanese language") is encoded in UTF-8 as follows:

--------+--------+--------
E6 97 A5 E6 9C AC E8 AA 9E
--------+--------+--------

The character U+233B4 (a Chinese character meaning 'stump of tree'),
prepended with a UTF-8 BOM, is encoded in UTF-8 as follows:

--------+-----------
EF BB BF F0 A3 8E B4
--------+-----------

8. MIME registration

This memo serves as the basis for registration of the MIME charset
parameter for UTF-8, according to [RFC2978]. The charset parameter
value is "UTF-8". This string labels media types containing text
consisting of characters from the repertoire of ISO/IEC 10646
including all amendments at least up to amendment 5 of the 1993
edition (Korean block), encoded to a sequence of octets using the
encoding scheme outlined above. UTF-8 is suitable for use in MIME
content types under the "text" top-level type.

It is noteworthy that the label "UTF-8" does not contain a version
identification, referring generically to ISO/IEC 10646. This is
intentional, the rationale being as follows:

A MIME charset label is designed to give just the information needed
to interpret a sequence of bytes received on the wire into a sequence
of characters, nothing more (see [RFC2045], section 2.2). As long as
a character set standard does not change incompatibly, version
numbers serve no purpose, because one gains nothing by learning from
the tag that newly assigned characters may be received that one
doesn't know about. The tag itself doesn't teach anything about the
new characters, which are going to be received anyway.

Hence, as long as the standards evolve compatibly, the apparent
advantage of having labels that identify the versions is only that,
apparent. But there is a disadvantage to such version-dependent
labels: when an older application receives data accompanied by a
newer, unknown label, it may fail to recognize the label and be
completely unable to deal with the data, whereas a generic, known
label would have triggered mostly correct processing of the data,
which may well not contain any new characters.

Now the "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
change, in principle contradicting the appropriateness of a version
independent MIME charset label as described above. But the
compatibility problem can only appear with data containing Korean
Hangul characters encoded according to Unicode 1.1 (or equivalently
ISO/IEC 10646 before amendment 5), and there is arguably no such data
to worry about, this being the very reason the incompatible change
was deemed acceptable.

In practice, then, a version-independent label is warranted, provided
the label is understood to refer to all versions after Amendment 5,
and provided no incompatible change actually occurs. Should
incompatible changes occur in a later version of ISO/IEC 10646, the
MIME charset label defined here will stay aligned with the previous
version until and unless the IETF specifically decides otherwise.

9. IANA Considerations

The entry for UTF-8 in the IANA charset registry has been updated to
point to this memo.

10. Security Considerations

Implementers of UTF-8 need to consider the security aspects of how
they handle illegal UTF-8 sequences. It is conceivable that in some
circumstances an attacker would be able to exploit an incautious
UTF-8 parser by sending it an octet sequence that is not permitted by
the UTF-8 syntax.

A particularly subtle form of this attack can be carried out against
a parser which performs security-critical validity checks against the
UTF-8 encoded form of its input, but interprets certain illegal octet
sequences as characters. For example, a parser might prohibit the
NUL character when encoded as the single-octet sequence 00, but
erroneously allow the illegal two-octet sequence C0 80 and interpret
it as a NUL character. Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F. This last exploit has
actually been used in a widespread virus attacking Web servers in
2001; thus, the security threat is very real.

Another security issue occurs when encoding to UTF-8: the ISO/IEC
10646 description of UTF-8 allows encoding character numbers up to
U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore
a risk of buffer overflow if the range of character numbers is not
explicitly limited to U+10FFFF or if buffer sizing doesn't take into
account the possibility of 5- and 6-byte sequences.

Security may also be impacted by a characteristic of several
character encodings, including UTF-8: the "same thing" (as far as a
user can tell) can be represented by several distinct character
sequences. For instance, an e with acute accent can be represented
by the precomposed U+00E9 E ACUTE character or by the canonically
equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though
UTF-8 provides a single byte sequence for each character sequence,
the existence of multiple character sequences for "the same thing"
may have security consequences whenever string matching, indexing,
searching, sorting, regular expression matching and selection are
involved. An example would be string matching of an identifier
appearing in a credential and in access control list entries. This
issue is amenable to solutions based on Unicode Normalization Forms,
see [UAX15].

11. Acknowledgements

The following have participated in the drafting and discussion of
this memo: James E. Agenbroad, Harald Alvestrand, Andries Brouwer,
Mark Davis, Martin J. Duerst, Patrick Faltstrom, Ned Freed, David
Goldsmith, Tony Hansen, Edwin F. Hart, Paul Hoffman, David Hopwood,
Simon Josefsson, Kent Karlsson, Dan Kohn, Markus Kuhn, Michael Kung,
Alain LaBonte, Ira McDonald, Alexey Melnikov, MURATA Makoto, John
Gardiner Myers, Chris Newman, Dan Oscarsson, Roozbeh Pournader,
Murray Sargent, Markus Scherer, Keld Simonsen, Arnold Winkler,
Kenneth Whistler and Misha Wolf.

12. Changes from RFC 2279

o Restricted the range of characters to 0000-10FFFF (the UTF-16
accessible range).

o Made Unicode the source of the normative definition of UTF-8,
keeping ISO/IEC 10646 as the reference for characters.

o Straightened out terminology. UTF-8 now described in terms of an
encoding form of the character number. UCS-2 and UCS-4 almost
disappeared.

o Turned the note warning against decoding of invalid sequences into
a normative MUST NOT.

o Added a new section about the UTF-8 BOM, with advice for
protocols.

o Removed suggested UNICODE-1-1-UTF-8 MIME charset registration.

o Added an ABNF syntax for valid UTF-8 octet sequences

o Expanded Security Considerations section, in particular impact of
Unicode normalization

13. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.

[ISO.10646] International Organization for Standardization,
"Information Technology - Universal Multiple-octet coded
Character Set (UCS)", ISO/IEC Standard 10646, comprised
of ISO/IEC 10646-1:2000, "Information technology --
Universal Multiple-Octet Coded Character Set (UCS) --
Part 1: Architecture and Basic Multilingual Plane",
ISO/IEC 10646-2:2001, "Information technology --
Universal Multiple-Octet Coded Character Set (UCS) --
Part 2: Supplementary Planes" and ISO/IEC 10646-
1:2000/Amd 1:2002, "Mathematical symbols and other
characters".

[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version
4.0", defined by The Unicode Standard, Version 4.0
(Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1),
April 2003, <http://www.unicode.org/unicode/standard/
versions/enumeratedversions.html#Unicode_4_0_0>.

14. Informative References

[CESU-8] Phipps, T., "Unicode Technical Report #26: Compatibility
Encoding Scheme for UTF-16: 8-Bit (CESU-8)", UTR 26,
April 2002,
<http://www.unicode.org/unicode/reports/tr26/>.

[FSS_UTF] X/Open Company Ltd., "X/Open Preliminary Specification --
File System Safe UCS Transformation Format (FSS-UTF)",
May 1993, <http://wwwold.dkuug.dk/jtc1/sc22/wg20/docs/
N193-FSS-UTF.pdf>.

[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message
Bodies", RFC 2045, November 1996.

[RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, November 1997.

[RFC2978] Freed, N. and J. Postel, "IANA Charset Registration
Procedures", BCP 19, RFC 2978, October 2000.

[UAX15] Davis, M. and M. Duerst, "Unicode Standard Annex #15:
Unicode Normalization Forms", An integral part of The
Unicode Standard, Version 4.0.0, April 2003, <http://
www.unicode.org/unicode/reports/tr15>.

[US-ASCII] American National Standards Institute, "Coded Character
Set - 7-bit American Standard Code for Information
Interchange", ANSI X3.4, 1986.

15. URIs

[1] <http://www.unicode.org/unicode/standard/policies.html>

16. Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any
intellectual property or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; neither does it represent that it
has made any effort to identify any such rights. Information on the
IETF's procedures with respect to rights in standards-track and
standards-related documentation can be found in BCP-11. Copies of
claims of rights made available for publication and any assurances of
licenses to be made available, or the result of an attempt made to
obtain a general license or permission for the use of such
proprietary rights by implementors or users of this specification can
be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights which may cover technology that may be required to practice
this standard. Please address the information to the IETF Executive
Director.

17. Author's Address

Francois Yergeau
Alis Technologies
100, boul. Alexis-Nihon, bureau 600
Montreal, QC H4M 2P2
Canada

Phone: +1 514 747 2547
Fax: +1 514 747 2561
EMail: fyergeau@alis.com

18. Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assignees.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

Funding for the RFC Editor function is currently provided by the
Internet Society.