glib/tests/utf8.txt
Owen Taylor 956f00ed96 move $enable_debug down below checks for GCC to avoid setting CFLAGS
Fri Jan  5 11:25:42 2001  Owen Taylor  <otaylor@redhat.com>

	* configure.in (PACKAGE): move $enable_debug down below
	checks for GCC to avoid setting CFLAGS prematurely,
	change checks to avoid adding -g twice.

	* gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
	0 termination.

	* gutf8.c (g_utf8_to_ucs4): Terminate result with 0.

	* tests/mainloop-test.c (main): Fix uses of
	g_main_loop_destroy().

	* tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
	Tests for unicode-conversion code.

	* gconvert.c (g_convert, g_convert_with_fallback): work around
	a couple of GNU libc bugs.

	* gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
	arguments to match g_convert(). Document.

	* gunicode.[ch]:
	  - Implement conversion functions to and from UTF-16
	  - Standardize unicode conversion functions on prototype like
	    g_convert.
	  - Add a lot of error checking to unicode conversion functions.

	* gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
	variant of g_utf8_to_ucs4.

	* gutf8.c (g_utf8_validate):
	 - add g_return_if_fail (str != NULL).
	 - add checks for overlong strings, non-valid Unicode characters (>= 110000)
	   and single surrogates.
2001-01-05 21:22:47 +00:00

298 lines
3.2 KiB
Plaintext

# This file is derived from
#
# http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
#
# Which was created by Markus Kuhn <mkuhn@acm.org> - 2000-09-02
#
# Lines beginning with # and blank lines are ignored
#
# Beyond that, this file consists of a series of test cases. Each test case consists of
# 2 or 3 lines:
#
# 1. A UTF-8 string
# 2. A status
#    VALID      : The string is a valid UTF-8 representation of valid Unicode
#    INCOMPLETE : The string has a partial character at the end
#    NOTUNICODE : The string is valid UTF-8, but the characters represented
#                 are not valid Unicode
#    OVERLONG   : The string includes overlong sequences
#    MALFORMED  : The string is not valid UTF-8
# 3. If the status is VALID or NOTUNICODE, the UCS-4 representation of the string,
#    as a series of hex numbers.
# 1 Some correct UTF-8 text
κόσμε
VALID
03ba 1f79 03c3 03bc 03b5
# 2.1 First possible sequence of a certain length
#
# FIXME - handle NULLS?
#
# [ NULL BYTE ]
#VALID
#0000
€
VALID
0080
ࠀ
VALID
0800
𐀀
VALID
00010000
øˆ€€€
NOTUNICODE
00200000
ü„€€€€
NOTUNICODE
04000000
# 2.2 Last possible sequence of a certain length

VALID
0000007f
߿
VALID
000007ff
ï¿¿
NOTUNICODE
0000ffff
÷¿¿¿
NOTUNICODE
001fffff
û¿¿¿¿
NOTUNICODE
03ffffff
ý¿¿¿¿¿
NOTUNICODE
7fffffff
# 2.3 Other boundary conditions
퟿
VALID
d7ff

VALID
e000
�
VALID
fffd
􏿿
VALID
0010ffff
ô<EFBFBD>€€
NOTUNICODE
00110000
# 3.1 Unexpected continuation bytes
MALFORMED
¿
MALFORMED
€¿
MALFORMED
€¿€
MALFORMED
€¿€¿
MALFORMED
€¿€¿€
MALFORMED
€¿€¿€¿
MALFORMED
€¿€¿€¿€
MALFORMED
<EFBFBD>ƒ„…†‡ˆ‰ŠŒ<EFBFBD>Ž<EFBFBD><EFBFBD>“”•˜™šœ<EFBFBD>žŸ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿
MALFORMED
# 3.2 Lonely start characters
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
MALFORMED
à á â ã ä å æ ç è é ê ë ì í î ï
MALFORMED
ð ñ ò ó ô õ ö ÷
MALFORMED
ø ù ú û
MALFORMED
ü ý
MALFORMED
# 3.3 Sequences with last continuation byte missing
À
INCOMPLETE
à€
INCOMPLETE
ð€€
INCOMPLETE
ø€€€
INCOMPLETE
ü€€€€
INCOMPLETE
ß
INCOMPLETE
ï¿
INCOMPLETE
÷¿¿
INCOMPLETE
û¿¿¿
INCOMPLETE
ý¿¿¿¿
INCOMPLETE
# 3.4 Concatenation of incomplete sequences
Àà€ð€€ø€€€ü€€€€ßï¿÷¿¿û¿¿¿ý¿¿¿¿
MALFORMED
# 3.5 Impossible bytes
þ
MALFORMED
ÿ
MALFORMED
þþÿÿ
MALFORMED
# Examples of an overlong ASCII character
À¯
OVERLONG
à€¯
OVERLONG
ð€€¯
OVERLONG
ø€€€¯
OVERLONG
ü€€€€¯
OVERLONG
# Maximum overlong sequences
Á¿
OVERLONG
àŸ¿
OVERLONG
ð<EFBFBD>¿¿
OVERLONG
ø‡¿¿¿
OVERLONG
üƒ¿¿¿¿
OVERLONG
# Overlong representation of the NUL character
À€
OVERLONG
à€€
OVERLONG
ð€€€
OVERLONG
ø€€€€
OVERLONG
ü€€€€€
OVERLONG
# Illegal code positions
# Single UTF-16 surrogates
í €
NOTUNICODE
d800
í­¿
NOTUNICODE
db7f
í®€
NOTUNICODE
db80
í¯¿
NOTUNICODE
dbff
í°€
NOTUNICODE
dc00
í¾€
NOTUNICODE
df80
í¿¿
NOTUNICODE
dfff
# Paired UTF-16 surrogates
𐀀
NOTUNICODE
d800 dc00
𐏿
NOTUNICODE
d800 dfff
í­¿í°€
NOTUNICODE
db7f dc00
í­¿í¿¿
NOTUNICODE
db7f dfff
󰀀
NOTUNICODE
db80 dc00
󰏿
NOTUNICODE
db80 dfff
􏰀
NOTUNICODE
dbff dc00
􏿿
NOTUNICODE
dbff dfff
# Other illegal code positions
￾
NOTUNICODE
fffe
ï¿¿
NOTUNICODE
ffff
################
#
# Some more tests, not from Markus Kuhn's file
#
# Mixed plane 0 and higher planes
A𐀀B􏿿C
VALID
41 00010000 42 10ffff 43
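
The layout described in the header comment (a string line, a status line, and a UCS-4 line only for VALID and NOTUNICODE) can be read with a small parser. The following is a minimal sketch in Python, not the harness GLib itself uses (that lives in tests/unicode-encoding.c). The names `Case` and `parse_cases` are hypothetical, and the sketch assumes the file is read as raw bytes, since most test strings are deliberately not valid UTF-8.

```python
from typing import List, NamedTuple

# Status keywords listed in the file header.
STATUSES = {"VALID", "INCOMPLETE", "NOTUNICODE", "OVERLONG", "MALFORMED"}
# Only these statuses are followed by a line of UCS-4 hex values.
HAS_CODEPOINTS = {"VALID", "NOTUNICODE"}


class Case(NamedTuple):  # hypothetical name, for illustration only
    raw: bytes             # test string, kept as bytes (may be invalid UTF-8)
    status: str            # one of STATUSES
    codepoints: List[int]  # empty unless status is VALID or NOTUNICODE


def parse_cases(data: bytes) -> List[Case]:
    """Split the raw file contents into test cases."""
    # Comment lines (starting with '#') and blank lines are ignored.
    lines = [
        ln.rstrip(b"\r")
        for ln in data.split(b"\n")
        if ln.strip() and not ln.startswith(b"#")
    ]
    cases = []
    i = 0
    while i < len(lines):
        if i + 1 >= len(lines):
            raise ValueError("test string without a status line")
        raw = lines[i]
        status = lines[i + 1].decode("ascii").strip()
        if status not in STATUSES:
            raise ValueError(f"unknown status {status!r}")
        i += 2
        codepoints: List[int] = []
        if status in HAS_CODEPOINTS:
            if i >= len(lines):
                raise ValueError("missing UCS-4 line after " + status)
            # The third line is a series of hex numbers.
            codepoints = [int(tok, 16) for tok in lines[i].decode("ascii").split()]
            i += 1
        cases.append(Case(raw, status, codepoints))
    return cases
```

To use it, open the file in binary mode and pass the contents to `parse_cases`. For a VALID case, decoding `raw` as UTF-8 and taking `ord()` of each character should reproduce `codepoints`.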