diff options
author | Tim J. Robbins <tjr@FreeBSD.org> | 2002-10-10 22:56:18 +0000 |
---|---|---|
committer | Tim J. Robbins <tjr@FreeBSD.org> | 2002-10-10 22:56:18 +0000 |
commit | 972baa3747e149e6345e27925bf2441a99cc1898 (patch) | |
tree | 818571da6c27befa8ace080e6465f3a6668ac52e /lib/libc/locale/utf8.5 | |
parent | 9b30d71989b1191abe1a5212fa0bf89bbe2524f1 (diff) | |
download | src-972baa3747e149e6345e27925bf2441a99cc1898.tar.gz src-972baa3747e149e6345e27925bf2441a99cc1898.zip |
Add a UTF-8 encoding method, which will eventually replace the antique
"UTF2" method. Although UTF-8 and the old UTF2 encoding are compatible
for 16-bit characters, the new UTF-8 implementation is much more strict
about rejecting malformed input and also handles the full 31 bit range
of characters.
Notes
Notes:
svn path=/head/; revision=104828
Diffstat (limited to 'lib/libc/locale/utf8.5')
-rw-r--r-- | lib/libc/locale/utf8.5 | 115 |
1 files changed, 115 insertions, 0 deletions
diff --git a/lib/libc/locale/utf8.5 b/lib/libc/locale/utf8.5 new file mode 100644 index 000000000000..079e9eabe8e1 --- /dev/null +++ b/lib/libc/locale/utf8.5 @@ -0,0 +1,115 @@ +.\" Copyright (c) 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" This code is derived from software contributed to Berkeley by +.\" Paul Borman at Krystal Technologies. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)utf2.4 8.1 (Berkeley) 6/4/93 +.\" $FreeBSD$ +.\" +.Dd October 10, 2002 +.Dt UTF8 5 +.Os +.Sh NAME +.Nm utf8 +.Nd "UTF-8, a transformation format of ISO 10646" +.Sh SYNOPSIS +.Nm ENCODING +.Qq UTF-8 +.Sh DESCRIPTION +The +.Nm UTF-8 +encoding represents UCS-4 characters as a sequence of octets, using +between 1 and 6 for each character. +It is backwards compatible with +.Tn ASCII , +so 0x00-0x7f refer to the +.Tn ASCII +character set. +The multibyte encoding of non- +.Tn ASCII +characters +consist entirely of bytes whose high order bit is set. +The actual +encoding is represented by the following table: +.Bd -literal +[0x00000000 - 0x0000007f] [00000000.0bbbbbbb] -> 0bbbbbbb +[0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb +[0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb] -> + 1110bbbb, 10bbbbbb, 10bbbbbb +[0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] -> + 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb +[0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> + 111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb +[0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> + 1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb +.Ed +.Pp +If more than a single representation of a value exists (for example, +0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always +used. +Longer ones are detected as an error as they pose a potential +security risk, and destroy the 1:1 character:octet sequence mapping. +.Sh COMPATIBILITY +The +.Nm +encoding supersedes the +.Xr utf2 4 +encoding. +The only differences between the two are that +.Nm +handles the full 31-bit character set of +.Tn ISO +10646 +whereas +.Xr utf2 4 +is limited to a 16-bit character set, +and that +.Xr utf2 4 +accepts redundant, non-"shortest form" representations of characters. +.Sh SEE ALSO +.Xr euc 4 , +.Xr utf2 4 +.Rs +.%A "F. Yergeau" +.%T "UTF-8, a transformation format of ISO 10646" +.%O "RFC 2279" +.%D "January 1998" +.Re +.Sh STANDARDS +The +.Nm +encoding is compatible with RFC 2279. +.Sh BUGS +Byte order marker (BOM) characters are neither added nor removed +from UTF-8-encoded wide character +.Xr stdio 3 +streams. |