From 956f00ed96228526cbeda1432df1f729e6f13322 Mon Sep 17 00:00:00 2001
From: Owen Taylor
Date: Fri, 5 Jan 2001 21:22:47 +0000
Subject: [PATCH] move $enable_debug down below checks for GCC to avoid setting CFLAGS

Fri Jan 5 11:25:42 2001 Owen Taylor

	* configure.in (PACKAGE): move $enable_debug down below
	checks for GCC to avoid setting CFLAGS prematurely,
	change checks to avoid adding -g twice.

	* gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
	0 termination.

	* gutf8.c (g_utf8_to_ucs4): Terminate result with 0.

	* tests/mainloop-test.c (main): Fix uses of
	g_main_loop_destroy().

	* tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
	Tests for unicode-conversion code.

	* gconvert.c (g_convert, g_convert_with_fallback): work around
	a couple of GNU libc bugs.

	* gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
	arguments to match g_convert(). Document.

	* gunicode.[ch]:
	- Implement conversion functions to and from UTF-16
	- Standardize unicode conversion functions on prototype like
	  g_convert.
	- Add a lot of error checking to unicode conversion functions.

	* gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
	variant of g_utf8_to_ucs4.

	* gutf8.c (g_utf8_validate):
	- add g_return_if_fail (str != NULL).
	- add checks for overlong strings, non-valid Unicode characters (>= 110000)
	  and single surrogates.
--- ChangeLog | 37 ++ ChangeLog.pre-2-0 | 37 ++ ChangeLog.pre-2-10 | 37 ++ ChangeLog.pre-2-12 | 37 ++ ChangeLog.pre-2-2 | 37 ++ ChangeLog.pre-2-4 | 37 ++ ChangeLog.pre-2-6 | 37 ++ ChangeLog.pre-2-8 | 37 ++ configure.in | 24 +- gconvert.c | 258 +++++++++--- gconvert.h | 24 +- glib/gconvert.c | 258 +++++++++--- glib/gconvert.h | 24 +- glib/gunicode.h | 45 ++- glib/gutf8.c | 840 ++++++++++++++++++++++++++++++++++++++- gunicode.h | 45 ++- gutf8.c | 840 ++++++++++++++++++++++++++++++++++++++- tests/Makefile.am | 4 +- tests/mainloop-test.c | 4 +- tests/unicode-encoding.c | 411 +++++++++++++++++++ tests/utf8.txt | 297 ++++++++++++++ 21 files changed, 3192 insertions(+), 178 deletions(-) create mode 100644 tests/unicode-encoding.c create mode 100644 tests/utf8.txt diff --git a/ChangeLog b/ChangeLog index 10269d507..6478f6a24 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,40 @@ +Fri Jan 5 11:25:42 2001 Owen Taylor + + * configure.in (PACKAGE): move $enable_debug down below + checks for GCC to avoid setting CFLAGS prematurely, + change checks to avoid adding -g twice. + + * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean + 0 termination. + + * gutf8.c (g_utf8_to_ucs4): Terminate result with 0. + + * tests/mainloop-test.c (main): Fix uses of + g_main_loop_destroy(). + + * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt: + Tests for unicode-conversion code. + + * gconvert.c (g_convert, g_convert_with_fallback): work around + a couple of GNU libc bugs. + + * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize + arguments to match g_convert(). Document. + + * gunicode.[ch]: + - Implement conversion functions to and from UTF-16 + - Standardize unicode conversion functions on prototype like + g_convert. + - Add a lot of error checking to unicode conversion functions. + + * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking + variant of g_utf8_to_ucs4. + + * gutf8.c (g_utf8_validate): + - add g_return_if_fail (str != NULL). 
+ - add checks for overlong strings, non-valid Unicode characters (>= 110000) + and single surrogates. + 2001-01-05 Tor Lillqvist * testglib.c (main): Add test for g_path_skip_root(). diff --git a/ChangeLog.pre-2-0 b/ChangeLog.pre-2-0 index 10269d507..6478f6a24 100644 --- a/ChangeLog.pre-2-0 +++ b/ChangeLog.pre-2-0 @@ -1,3 +1,40 @@ +Fri Jan 5 11:25:42 2001 Owen Taylor + + * configure.in (PACKAGE): move $enable_debug down below + checks for GCC to avoid setting CFLAGS prematurely, + change checks to avoid adding -g twice. + + * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean + 0 termination. + + * gutf8.c (g_utf8_to_ucs4): Terminate result with 0. + + * tests/mainloop-test.c (main): Fix uses of + g_main_loop_destroy(). + + * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt: + Tests for unicode-conversion code. + + * gconvert.c (g_convert, g_convert_with_fallback): work around + a couple of GNU libc bugs. + + * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize + arguments to match g_convert(). Document. + + * gunicode.[ch]: + - Implement conversion functions to and from UTF-16 + - Standardize unicode conversion functions on prototype like + g_convert. + - Add a lot of error checking to unicode conversion functions. + + * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking + variant of g_utf8_to_ucs4. + + * gutf8.c (g_utf8_validate): + - add g_return_if_fail (str != NULL). + - add checks for overlong strings, non-valid Unicode characters (>= 110000) + and single surrogates. + 2001-01-05 Tor Lillqvist * testglib.c (main): Add test for g_path_skip_root(). diff --git a/ChangeLog.pre-2-10 b/ChangeLog.pre-2-10 index 10269d507..6478f6a24 100644 --- a/ChangeLog.pre-2-10 +++ b/ChangeLog.pre-2-10 @@ -1,3 +1,40 @@ +Fri Jan 5 11:25:42 2001 Owen Taylor + + * configure.in (PACKAGE): move $enable_debug down below + checks for GCC to avoid setting CFLAGS prematurely, + change checks to avoid adding -g twice. 
+ + * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean + 0 termination. + + * gutf8.c (g_utf8_to_ucs4): Terminate result with 0. + + * tests/mainloop-test.c (main): Fix uses of + g_main_loop_destroy(). + + * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt: + Tests for unicode-conversion code. + + * gconvert.c (g_convert, g_convert_with_fallback): work around + a couple of GNU libc bugs. + + * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize + arguments to match g_convert(). Document. + + * gunicode.[ch]: + - Implement conversion functions to and from UTF-16 + - Standardize unicode conversion functions on prototype like + g_convert. + - Add a lot of error checking to unicode conversion functions. + + * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking + variant of g_utf8_to_ucs4. + + * gutf8.c (g_utf8_validate): + - add g_return_if_fail (str != NULL). + - add checks for overlong strings, non-valid Unicode characters (>= 110000) + and single surrogates. + 2001-01-05 Tor Lillqvist * testglib.c (main): Add test for g_path_skip_root(). diff --git a/ChangeLog.pre-2-12 b/ChangeLog.pre-2-12 index 10269d507..6478f6a24 100644 --- a/ChangeLog.pre-2-12 +++ b/ChangeLog.pre-2-12 @@ -1,3 +1,40 @@ +Fri Jan 5 11:25:42 2001 Owen Taylor + + * configure.in (PACKAGE): move $enable_debug down below + checks for GCC to avoid setting CFLAGS prematurely, + change checks to avoid adding -g twice. + + * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean + 0 termination. + + * gutf8.c (g_utf8_to_ucs4): Terminate result with 0. + + * tests/mainloop-test.c (main): Fix uses of + g_main_loop_destroy(). + + * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt: + Tests for unicode-conversion code. + + * gconvert.c (g_convert, g_convert_with_fallback): work around + a couple of GNU libc bugs. + + * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize + arguments to match g_convert(). Document. 
+ + * gunicode.[ch]: + - Implement conversion functions to and from UTF-16 + - Standardize unicode conversion functions on prototype like + g_convert. + - Add a lot of error checking to unicode conversion functions. + + * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking + variant of g_utf8_to_ucs4. + + * gutf8.c (g_utf8_validate): + - add g_return_if_fail (str != NULL). + - add checks for overlong strings, non-valid Unicode characters (>= 110000) + and single surrogates. + 2001-01-05 Tor Lillqvist * testglib.c (main): Add test for g_path_skip_root(). diff --git a/ChangeLog.pre-2-2 b/ChangeLog.pre-2-2 index 10269d507..6478f6a24 100644 --- a/ChangeLog.pre-2-2 +++ b/ChangeLog.pre-2-2 @@ -1,3 +1,40 @@ +Fri Jan 5 11:25:42 2001 Owen Taylor + + * configure.in (PACKAGE): move $enable_debug down below + checks for GCC to avoid setting CFLAGS prematurely, + change checks to avoid adding -g twice. + + * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean + 0 termination. + + * gutf8.c (g_utf8_to_ucs4): Terminate result with 0. + + * tests/mainloop-test.c (main): Fix uses of + g_main_loop_destroy(). + + * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt: + Tests for unicode-conversion code. + + * gconvert.c (g_convert, g_convert_with_fallback): work around + a couple of GNU libc bugs. + + * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize + arguments to match g_convert(). Document. + + * gunicode.[ch]: + - Implement conversion functions to and from UTF-16 + - Standardize unicode conversion functions on prototype like + g_convert. + - Add a lot of error checking to unicode conversion functions. + + * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking + variant of g_utf8_to_ucs4. + + * gutf8.c (g_utf8_validate): + - add g_return_if_fail (str != NULL). + - add checks for overlong strings, non-valid Unicode characters (>= 110000) + and single surrogates. + 2001-01-05 Tor Lillqvist * testglib.c (main): Add test for g_path_skip_root(). 
diff --git a/ChangeLog.pre-2-4 b/ChangeLog.pre-2-4 index 10269d507..6478f6a24 100644 --- a/ChangeLog.pre-2-4 +++ b/ChangeLog.pre-2-4 @@ -1,3 +1,40 @@ +Fri Jan 5 11:25:42 2001 Owen Taylor + + * configure.in (PACKAGE): move $enable_debug down below + checks for GCC to avoid setting CFLAGS prematurely, + change checks to avoid adding -g twice. + + * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean + 0 termination. + + * gutf8.c (g_utf8_to_ucs4): Terminate result with 0. + + * tests/mainloop-test.c (main): Fix uses of + g_main_loop_destroy(). + + * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt: + Tests for unicode-conversion code. + + * gconvert.c (g_convert, g_convert_with_fallback): work around + a couple of GNU libc bugs. + + * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize + arguments to match g_convert(). Document. + + * gunicode.[ch]: + - Implement conversion functions to and from UTF-16 + - Standardize unicode conversion functions on prototype like + g_convert. + - Add a lot of error checking to unicode conversion functions. + + * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking + variant of g_utf8_to_ucs4. + + * gutf8.c (g_utf8_validate): + - add g_return_if_fail (str != NULL). + - add checks for overlong strings, non-valid Unicode characters (>= 110000) + and single surrogates. + 2001-01-05 Tor Lillqvist * testglib.c (main): Add test for g_path_skip_root(). diff --git a/ChangeLog.pre-2-6 b/ChangeLog.pre-2-6 index 10269d507..6478f6a24 100644 --- a/ChangeLog.pre-2-6 +++ b/ChangeLog.pre-2-6 @@ -1,3 +1,40 @@ +Fri Jan 5 11:25:42 2001 Owen Taylor + + * configure.in (PACKAGE): move $enable_debug down below + checks for GCC to avoid setting CFLAGS prematurely, + change checks to avoid adding -g twice. + + * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean + 0 termination. + + * gutf8.c (g_utf8_to_ucs4): Terminate result with 0. + + * tests/mainloop-test.c (main): Fix uses of + g_main_loop_destroy(). 
+ + * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt: + Tests for unicode-conversion code. + + * gconvert.c (g_convert, g_convert_with_fallback): work around + a couple of GNU libc bugs. + + * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize + arguments to match g_convert(). Document. + + * gunicode.[ch]: + - Implement conversion functions to and from UTF-16 + - Standardize unicode conversion functions on prototype like + g_convert. + - Add a lot of error checking to unicode conversion functions. + + * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking + variant of g_utf8_to_ucs4. + + * gutf8.c (g_utf8_validate): + - add g_return_if_fail (str != NULL). + - add checks for overlong strings, non-valid Unicode characters (>= 110000) + and single surrogates. + 2001-01-05 Tor Lillqvist * testglib.c (main): Add test for g_path_skip_root(). diff --git a/ChangeLog.pre-2-8 b/ChangeLog.pre-2-8 index 10269d507..6478f6a24 100644 --- a/ChangeLog.pre-2-8 +++ b/ChangeLog.pre-2-8 @@ -1,3 +1,40 @@ +Fri Jan 5 11:25:42 2001 Owen Taylor + + * configure.in (PACKAGE): move $enable_debug down below + checks for GCC to avoid setting CFLAGS prematurely, + change checks to avoid adding -g twice. + + * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean + 0 termination. + + * gutf8.c (g_utf8_to_ucs4): Terminate result with 0. + + * tests/mainloop-test.c (main): Fix uses of + g_main_loop_destroy(). + + * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt: + Tests for unicode-conversion code. + + * gconvert.c (g_convert, g_convert_with_fallback): work around + a couple of GNU libc bugs. + + * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize + arguments to match g_convert(). Document. + + * gunicode.[ch]: + - Implement conversion functions to and from UTF-16 + - Standardize unicode conversion functions on prototype like + g_convert. + - Add a lot of error checking to unicode conversion functions. 
+ + * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking + variant of g_utf8_to_ucs4. + + * gutf8.c (g_utf8_validate): + - add g_return_if_fail (str != NULL). + - add checks for overlong strings, non-valid Unicode characters (>= 110000) + and single surrogates. + 2001-01-05 Tor Lillqvist * testglib.c (main): Add test for g_path_skip_root(). diff --git a/configure.in b/configure.in index f68f79423..58562bed4 100644 --- a/configure.in +++ b/configure.in @@ -114,15 +114,6 @@ if test "x$enable_threads" != "xyes"; then enable_threads=no fi -if test "x$enable_debug" = "xyes"; then - test "$cflags_set" = set || CFLAGS="$CFLAGS -g" - GLIB_DEBUG_FLAGS="-DG_ENABLE_DEBUG" -else - if test "x$enable_debug" = "xno"; then - GLIB_DEBUG_FLAGS="-DG_DISABLE_ASSERT -DG_DISABLE_CHECKS" - fi -fi - AC_DEFINE_UNQUOTED(G_COMPILED_WITH_DEBUGGING, "${enable_debug}", [Whether glib was compiled with debugging enabled]) @@ -154,6 +145,21 @@ AC_PROG_CC AM_PROG_CC_STDC AC_PROG_INSTALL +if test "x$enable_debug" = "xyes"; then + if test x$cflags_set != xset ; then + case " $CFLAGS " in + *[[\ \ ]]-g[[\ \ ]]*) ;; + *) CFLAGS="$CFLAGS -g" ;; + esac + fi + + GLIB_DEBUG_FLAGS="-DG_ENABLE_DEBUG" +else + if test "x$enable_debug" = "xno"; then + GLIB_DEBUG_FLAGS="-DG_DISABLE_ASSERT -DG_DISABLE_CHECKS" + fi +fi + # define a MAINT-like variable REBUILD which is set if Perl # and awk are found, so autogenerated sources can be rebuilt AC_PROG_AWK diff --git a/gconvert.c b/gconvert.c index 2169b6d4e..344902f44 100644 --- a/gconvert.c +++ b/gconvert.c @@ -170,7 +170,11 @@ g_convert (const gchar *str, p = str; inbytes_remaining = len; - outbuf_size = len + 1; /* + 1 for nul in case len == 1 */ + + /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */ + /* + 1 for nul in case len == 1 */ + outbuf_size = ((len + 3) & ~3) + 1; + outbytes_remaining = outbuf_size - 1; /* -1 for nul */ outp = dest = g_malloc (outbuf_size); @@ -188,11 +192,20 @@ g_convert (const gchar *str, case E2BIG: { size_t 
used = outp - dest; - outbuf_size *= 2; - dest = g_realloc (dest, outbuf_size); - outp = dest + used; - outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */ + /* glibc's iconv can return E2BIG even if there is space + * remaining if an internal buffer is exhausted. The + * folllowing is a heuristic to catch this. The 16 is + * pretty arbitrary. + */ + if (used + 16 > outbuf_size) + { + outbuf_size = (outbuf_size - 1) * 2 + 1; + dest = g_realloc (dest, outbuf_size); + + outp = dest + used; + outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */ + } goto again; } @@ -353,7 +366,9 @@ g_convert_with_fallback (const gchar *str, * for the original string while we are converting the fallback */ p = utf8; - outbuf_size = len + 1; /* + 1 for nul in case len == 1 */ + /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */ + /* + 1 for nul in case len == 1 */ + outbuf_size = ((len + 3) & ~3) + 1; outbytes_remaining = outbuf_size - 1; /* -1 for nul */ outp = dest = g_malloc (outbuf_size); @@ -373,11 +388,20 @@ g_convert_with_fallback (const gchar *str, case E2BIG: { size_t used = outp - dest; - outbuf_size *= 2; - dest = g_realloc (dest, outbuf_size); - - outp = dest + used; - outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */ + + /* glibc's iconv can return E2BIG even if there is space + * remaining if an internal buffer is exhausted. The + * folllowing is a heuristic to catch this. The 16 is + * pretty arbitrary. + */ + if (used + 16 > outbuf_size) + { + outbuf_size = (outbuf_size - 1) * 2 + 1; + dest = g_realloc (dest, outbuf_size); + + outp = dest + used; + outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */ + } break; } @@ -458,18 +482,44 @@ g_convert_with_fallback (const gchar *str, /* * g_locale_to_utf8 * + * + */ + +/** + * g_locale_to_utf8: + * @opsysstring: a string in the encoding of the current locale + * @len: the length of the string, or -1 if the string is + * NULL-terminated. 
+ * @bytes_read: location to store the number of bytes in the + * input string that were successfully converted, or %NULL. + * Even if the conversion was succesful, this may be + * less than len if there were partial characters + * at the end of the input. If the error + * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value + * stored will the byte fofset after the last valid + * input sequence. + * @bytes_written: the stored in the output buffer (not including the + * terminating nul. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError may occur. + * * Converts a string which is in the encoding used for strings by * the C runtime (usually the same as that used by the operating * system) in the current locale into a UTF-8 string. - */ - + * + * Return value: The converted string, or %NULL on an error. + **/ gchar * -g_locale_to_utf8 (const gchar *opsysstring, GError **error) +g_locale_to_utf8 (const gchar *opsysstring, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error) { #ifdef G_OS_WIN32 - gint i, clen, wclen, first; - const gint len = strlen (opsysstring); + gint i, clen, total_len, wclen, first; + const gint len = len < 0 ? 
strlen (opsysstring) : len; wchar_t *wcs, wc; gchar *result, *bp; const wchar_t *wcp; @@ -478,26 +528,26 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error) wclen = MultiByteToWideChar (CP_ACP, 0, opsysstring, len, wcs, len); wcp = wcs; - clen = 0; + total_len = 0; for (i = 0; i < wclen; i++) { wc = *wcp++; if (wc < 0x80) - clen += 1; + total_len += 1; else if (wc < 0x800) - clen += 2; + total_len += 2; else if (wc < 0x10000) - clen += 3; + total_len += 3; else if (wc < 0x200000) - clen += 4; + total_len += 4; else if (wc < 0x4000000) - clen += 5; + total_len += 5; else - clen += 6; + total_len += 6; } - result = g_malloc (clen + 1); + result = g_malloc (total_len + 1); wcp = wcs; bp = result; @@ -553,6 +603,11 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error) g_free (wcs); + if (bytes_read) + *bytes_read = len; + if (bytes_written) + *bytes_written = total_len; + return result; #else @@ -562,26 +617,48 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error) if (g_get_charset (&charset)) return g_strdup (opsysstring); - str = g_convert (opsysstring, strlen (opsysstring), - "UTF-8", charset, NULL, NULL, error); + str = g_convert (opsysstring, len, + "UTF-8", charset, bytes_read, bytes_written, error); return str; #endif } -/* - * g_locale_from_utf8 - * - * The reverse of g_locale_to_utf8. - */ - +/** + * g_locale_from_utf8: + * @utf8string: a UTF-8 encoded string + * @len: the length of the string, or -1 if the string is + * NULL-terminated. + * @bytes_read: location to store the number of bytes in the + * input string that were successfully converted, or %NULL. + * Even if the conversion was succesful, this may be + * less than len if there were partial characters + * at the end of the input. If the error + * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value + * stored will the byte fofset after the last valid + * input sequence. + * @bytes_written: the stored in the output buffer (not including the + * terminating nul. 
+ * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError may occur. + * + * Converts a string from UTF-8 to the encoding used for strings by + * the C runtime (usually the same as that used by the operating + * system) in the current locale. + * + * Return value: The converted string, or %NULL on an error. + **/ gchar * -g_locale_from_utf8 (const gchar *utf8string, GError **error) +g_locale_from_utf8 (const gchar *utf8string, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error) { #ifdef G_OS_WIN32 gint i, mask, clen, mblen; - const gint len = strlen (utf8string); + const gint len = len < 0 ? strlen (utf8string) : len; wchar_t *wcs, *wcp; gchar *result; guchar *cp, *end, c; @@ -671,6 +748,11 @@ g_locale_from_utf8 (const gchar *utf8string, GError **error) result[mblen] = 0; g_free (wcs); + if (bytes_read) + *bytes_read = len; + if (bytes_written) + *bytes_written = mblen; + return result; #else @@ -681,39 +763,123 @@ g_locale_from_utf8 (const gchar *utf8string, GError **error) return g_strdup (utf8string); str = g_convert (utf8string, strlen (utf8string), - charset, "UTF-8", NULL, NULL, error); + charset, "UTF-8", bytes_read, bytes_written, error); return str; #endif } -/* Filenames are in UTF-8 unless specificially requested otherwise */ - +/** + * g_filename_to_utf8: + * @opsysstring: a string in the encoding for filenames + * @len: the length of the string, or -1 if the string is + * NULL-terminated. + * @bytes_read: location to store the number of bytes in the + * input string that were successfully converted, or %NULL. + * Even if the conversion was succesful, this may be + * less than len if there were partial characters + * at the end of the input. If the error + * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value + * stored will the byte fofset after the last valid + * input sequence. + * @bytes_written: the stored in the output buffer (not including the + * terminating nul. 
+ * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError may occur. + * + * Converts a string which is in the encoding used for filenames + * into a UTF-8 string. + * + * Return value: The converted string, or %NULL on an error. + **/ gchar* -g_filename_to_utf8 (const gchar *string, GError **error) - +g_filename_to_utf8 (const gchar *opsysstring, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error) { #ifdef G_OS_WIN32 - return g_locale_to_utf8 (string, error); + return g_locale_to_utf8 (opsysstring, len, + bytes_read, bytes_written, + error); #else if (getenv ("G_BROKEN_FILENAMES")) - return g_locale_to_utf8 (string, error); + return g_locale_to_utf8 (opsysstring, len, + bytes_read, bytes_written, + error); - return g_strdup (string); + if (bytes_read || bytes_written) + { + gint len = strlen (opsysstring); + + if (bytes_read) + *bytes_read = len; + if (bytes_written) + *bytes_written = len; + } + + if (len < 0) + return g_strdup (opsysstring); + else + return g_strndup (opsysstring, len); #endif } +/** + * g_filename_from_utf8: + * @utf8string: a UTF-8 encoded string + * @len: the length of the string, or -1 if the string is + * NULL-terminated. + * @bytes_read: location to store the number of bytes in the + * input string that were successfully converted, or %NULL. + * Even if the conversion was succesful, this may be + * less than len if there were partial characters + * at the end of the input. If the error + * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value + * stored will the byte fofset after the last valid + * input sequence. + * @bytes_written: the stored in the output buffer (not including the + * terminating nul. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError may occur. + * + * Converts a string from UTF-8 to the encoding used for filenames. 
+ * + * Return value: The converted string, or %NULL on an error. + **/ gchar* -g_filename_from_utf8 (const gchar *string, GError **error) +g_filename_from_utf8 (const gchar *utf8string, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error) { #ifdef G_OS_WIN32 - return g_locale_from_utf8 (string, error); + return g_locale_from_utf8 (utf8string, len, + bytes_read, bytes_written, + error); #else if (getenv ("G_BROKEN_FILENAMES")) - return g_locale_from_utf8 (string, error); + return g_locale_from_utf8 (utf8string, len, + bytes_read, bytes_written, + error); - return g_strdup (string); + if (bytes_read || bytes_written) + { + gint len = strlen (utf8string); + + if (bytes_read) + *bytes_read = len; + if (bytes_written) + *bytes_written = len; + } + + if (len < 0) + return g_strdup (utf8string); + else + return g_strndup (utf8string, len); #endif } diff --git a/gconvert.h b/gconvert.h index ce19b3672..e11a10609 100644 --- a/gconvert.h +++ b/gconvert.h @@ -76,14 +76,30 @@ gchar* g_convert_with_fallback (const gchar *str, /* Convert between libc's idea of strings and UTF-8. */ -gchar* g_locale_to_utf8 (const gchar *opsysstring, GError **error); -gchar* g_locale_from_utf8 (const gchar *utf8string, GError **error); +gchar* g_locale_to_utf8 (const gchar *opsysstring, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error); +gchar* g_locale_from_utf8 (const gchar *utf8string, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error); /* Convert between the operating system (or C runtime) * representation of file names and UTF-8. 
*/ -gchar* g_filename_to_utf8 (const gchar *opsysstring, GError **error); -gchar* g_filename_from_utf8 (const gchar *utf8string, GError **error); +gchar* g_filename_to_utf8 (const gchar *opsysstring, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error); +gchar* g_filename_from_utf8 (const gchar *utf8string, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error); G_END_DECLS diff --git a/glib/gconvert.c b/glib/gconvert.c index 2169b6d4e..344902f44 100644 --- a/glib/gconvert.c +++ b/glib/gconvert.c @@ -170,7 +170,11 @@ g_convert (const gchar *str, p = str; inbytes_remaining = len; - outbuf_size = len + 1; /* + 1 for nul in case len == 1 */ + + /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */ + /* + 1 for nul in case len == 1 */ + outbuf_size = ((len + 3) & ~3) + 1; + outbytes_remaining = outbuf_size - 1; /* -1 for nul */ outp = dest = g_malloc (outbuf_size); @@ -188,11 +192,20 @@ g_convert (const gchar *str, case E2BIG: { size_t used = outp - dest; - outbuf_size *= 2; - dest = g_realloc (dest, outbuf_size); - outp = dest + used; - outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */ + /* glibc's iconv can return E2BIG even if there is space + * remaining if an internal buffer is exhausted. The + * folllowing is a heuristic to catch this. The 16 is + * pretty arbitrary. 
+ */ + if (used + 16 > outbuf_size) + { + outbuf_size = (outbuf_size - 1) * 2 + 1; + dest = g_realloc (dest, outbuf_size); + + outp = dest + used; + outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */ + } goto again; } @@ -353,7 +366,9 @@ g_convert_with_fallback (const gchar *str, * for the original string while we are converting the fallback */ p = utf8; - outbuf_size = len + 1; /* + 1 for nul in case len == 1 */ + /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */ + /* + 1 for nul in case len == 1 */ + outbuf_size = ((len + 3) & ~3) + 1; outbytes_remaining = outbuf_size - 1; /* -1 for nul */ outp = dest = g_malloc (outbuf_size); @@ -373,11 +388,20 @@ g_convert_with_fallback (const gchar *str, case E2BIG: { size_t used = outp - dest; - outbuf_size *= 2; - dest = g_realloc (dest, outbuf_size); - - outp = dest + used; - outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */ + + /* glibc's iconv can return E2BIG even if there is space + * remaining if an internal buffer is exhausted. The + * folllowing is a heuristic to catch this. The 16 is + * pretty arbitrary. + */ + if (used + 16 > outbuf_size) + { + outbuf_size = (outbuf_size - 1) * 2 + 1; + dest = g_realloc (dest, outbuf_size); + + outp = dest + used; + outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */ + } break; } @@ -458,18 +482,44 @@ g_convert_with_fallback (const gchar *str, /* * g_locale_to_utf8 * + * + */ + +/** + * g_locale_to_utf8: + * @opsysstring: a string in the encoding of the current locale + * @len: the length of the string, or -1 if the string is + * NULL-terminated. + * @bytes_read: location to store the number of bytes in the + * input string that were successfully converted, or %NULL. + * Even if the conversion was succesful, this may be + * less than len if there were partial characters + * at the end of the input. 
If the error + * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value + * stored will the byte fofset after the last valid + * input sequence. + * @bytes_written: the stored in the output buffer (not including the + * terminating nul. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError may occur. + * * Converts a string which is in the encoding used for strings by * the C runtime (usually the same as that used by the operating * system) in the current locale into a UTF-8 string. - */ - + * + * Return value: The converted string, or %NULL on an error. + **/ gchar * -g_locale_to_utf8 (const gchar *opsysstring, GError **error) +g_locale_to_utf8 (const gchar *opsysstring, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error) { #ifdef G_OS_WIN32 - gint i, clen, wclen, first; - const gint len = strlen (opsysstring); + gint i, clen, total_len, wclen, first; + const gint len = len < 0 ? strlen (opsysstring) : len; wchar_t *wcs, wc; gchar *result, *bp; const wchar_t *wcp; @@ -478,26 +528,26 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error) wclen = MultiByteToWideChar (CP_ACP, 0, opsysstring, len, wcs, len); wcp = wcs; - clen = 0; + total_len = 0; for (i = 0; i < wclen; i++) { wc = *wcp++; if (wc < 0x80) - clen += 1; + total_len += 1; else if (wc < 0x800) - clen += 2; + total_len += 2; else if (wc < 0x10000) - clen += 3; + total_len += 3; else if (wc < 0x200000) - clen += 4; + total_len += 4; else if (wc < 0x4000000) - clen += 5; + total_len += 5; else - clen += 6; + total_len += 6; } - result = g_malloc (clen + 1); + result = g_malloc (total_len + 1); wcp = wcs; bp = result; @@ -553,6 +603,11 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error) g_free (wcs); + if (bytes_read) + *bytes_read = len; + if (bytes_written) + *bytes_written = total_len; + return result; #else @@ -562,26 +617,48 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error) if (g_get_charset 
(&charset)) return g_strdup (opsysstring); - str = g_convert (opsysstring, strlen (opsysstring), - "UTF-8", charset, NULL, NULL, error); + str = g_convert (opsysstring, len, + "UTF-8", charset, bytes_read, bytes_written, error); return str; #endif } -/* - * g_locale_from_utf8 - * - * The reverse of g_locale_to_utf8. - */ - +/** + * g_locale_from_utf8: + * @utf8string: a UTF-8 encoded string + * @len: the length of the string, or -1 if the string is + * NULL-terminated. + * @bytes_read: location to store the number of bytes in the + * input string that were successfully converted, or %NULL. + * Even if the conversion was succesful, this may be + * less than len if there were partial characters + * at the end of the input. If the error + * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value + * stored will the byte fofset after the last valid + * input sequence. + * @bytes_written: the stored in the output buffer (not including the + * terminating nul. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError may occur. + * + * Converts a string from UTF-8 to the encoding used for strings by + * the C runtime (usually the same as that used by the operating + * system) in the current locale. + * + * Return value: The converted string, or %NULL on an error. + **/ gchar * -g_locale_from_utf8 (const gchar *utf8string, GError **error) +g_locale_from_utf8 (const gchar *utf8string, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error) { #ifdef G_OS_WIN32 gint i, mask, clen, mblen; - const gint len = strlen (utf8string); + const gint len = len < 0 ? 
strlen (utf8string) : len; wchar_t *wcs, *wcp; gchar *result; guchar *cp, *end, c; @@ -671,6 +748,11 @@ g_locale_from_utf8 (const gchar *utf8string, GError **error) result[mblen] = 0; g_free (wcs); + if (bytes_read) + *bytes_read = len; + if (bytes_written) + *bytes_written = mblen; + return result; #else @@ -681,39 +763,123 @@ g_locale_from_utf8 (const gchar *utf8string, GError **error) return g_strdup (utf8string); str = g_convert (utf8string, strlen (utf8string), - charset, "UTF-8", NULL, NULL, error); + charset, "UTF-8", bytes_read, bytes_written, error); return str; #endif } -/* Filenames are in UTF-8 unless specificially requested otherwise */ - +/** + * g_filename_to_utf8: + * @opsysstring: a string in the encoding for filenames + * @len: the length of the string, or -1 if the string is + * NULL-terminated. + * @bytes_read: location to store the number of bytes in the + * input string that were successfully converted, or %NULL. + * Even if the conversion was succesful, this may be + * less than len if there were partial characters + * at the end of the input. If the error + * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value + * stored will the byte fofset after the last valid + * input sequence. + * @bytes_written: the stored in the output buffer (not including the + * terminating nul. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError may occur. + * + * Converts a string which is in the encoding used for filenames + * into a UTF-8 string. + * + * Return value: The converted string, or %NULL on an error. 
+ **/ gchar* -g_filename_to_utf8 (const gchar *string, GError **error) - +g_filename_to_utf8 (const gchar *opsysstring, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error) { #ifdef G_OS_WIN32 - return g_locale_to_utf8 (string, error); + return g_locale_to_utf8 (opsysstring, len, + bytes_read, bytes_written, + error); #else if (getenv ("G_BROKEN_FILENAMES")) - return g_locale_to_utf8 (string, error); + return g_locale_to_utf8 (opsysstring, len, + bytes_read, bytes_written, + error); - return g_strdup (string); + if (bytes_read || bytes_written) + { + gint len = strlen (opsysstring); + + if (bytes_read) + *bytes_read = len; + if (bytes_written) + *bytes_written = len; + } + + if (len < 0) + return g_strdup (opsysstring); + else + return g_strndup (opsysstring, len); #endif } +/** + * g_filename_from_utf8: + * @utf8string: a UTF-8 encoded string + * @len: the length of the string, or -1 if the string is + * NULL-terminated. + * @bytes_read: location to store the number of bytes in the + * input string that were successfully converted, or %NULL. + * Even if the conversion was succesful, this may be + * less than len if there were partial characters + * at the end of the input. If the error + * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value + * stored will the byte fofset after the last valid + * input sequence. + * @bytes_written: the stored in the output buffer (not including the + * terminating nul. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError may occur. + * + * Converts a string from UTF-8 to the encoding used for filenames. + * + * Return value: The converted string, or %NULL on an error. 
+ **/ gchar* -g_filename_from_utf8 (const gchar *string, GError **error) +g_filename_from_utf8 (const gchar *utf8string, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error) { #ifdef G_OS_WIN32 - return g_locale_from_utf8 (string, error); + return g_locale_from_utf8 (utf8string, len, + bytes_read, bytes_written, + error); #else if (getenv ("G_BROKEN_FILENAMES")) - return g_locale_from_utf8 (string, error); + return g_locale_from_utf8 (utf8string, len, + bytes_read, bytes_written, + error); - return g_strdup (string); + if (bytes_read || bytes_written) + { + gint len = strlen (utf8string); + + if (bytes_read) + *bytes_read = len; + if (bytes_written) + *bytes_written = len; + } + + if (len < 0) + return g_strdup (utf8string); + else + return g_strndup (utf8string, len); #endif } diff --git a/glib/gconvert.h b/glib/gconvert.h index ce19b3672..e11a10609 100644 --- a/glib/gconvert.h +++ b/glib/gconvert.h @@ -76,14 +76,30 @@ gchar* g_convert_with_fallback (const gchar *str, /* Convert between libc's idea of strings and UTF-8. */ -gchar* g_locale_to_utf8 (const gchar *opsysstring, GError **error); -gchar* g_locale_from_utf8 (const gchar *utf8string, GError **error); +gchar* g_locale_to_utf8 (const gchar *opsysstring, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error); +gchar* g_locale_from_utf8 (const gchar *utf8string, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error); /* Convert between the operating system (or C runtime) * representation of file names and UTF-8. 
*/ -gchar* g_filename_to_utf8 (const gchar *opsysstring, GError **error); -gchar* g_filename_from_utf8 (const gchar *utf8string, GError **error); +gchar* g_filename_to_utf8 (const gchar *opsysstring, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error); +gchar* g_filename_from_utf8 (const gchar *utf8string, + gint len, + gint *bytes_read, + gint *bytes_written, + GError **error); G_END_DECLS diff --git a/glib/gunicode.h b/glib/gunicode.h index 93f368337..db4800a63 100644 --- a/glib/gunicode.h +++ b/glib/gunicode.h @@ -206,18 +206,39 @@ gchar *g_utf8_strchr (const gchar *p, gchar *g_utf8_strrchr (const gchar *p, gunichar c); -gunichar2 *g_utf8_to_utf16 (const gchar *str, - gint len); -gunichar * g_utf8_to_ucs4 (const gchar *str, - gint len); -gunichar * g_utf16_to_ucs4 (const gunichar2 *str, - gint len); -gchar * g_utf16_to_utf8 (const gunichar2 *str, - gint len); -gunichar * g_ucs4_to_utf16 (const gunichar *str, - gint len); -gchar * g_ucs4_to_utf8 (const gunichar *str, - gint len); +gunichar2 *g_utf8_to_utf16 (const gchar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); +gunichar * g_utf8_to_ucs4 (const gchar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); +gunichar * g_utf8_to_ucs4_fast (const gchar *str, + gint len, + gint *items_written); +gunichar * g_utf16_to_ucs4 (const gunichar2 *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); +gchar * g_utf16_to_utf8 (const gunichar2 *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); +gunichar2 *g_ucs4_to_utf16 (const gunichar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); +gchar * g_ucs4_to_utf8 (const gunichar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); /* Convert a single character into UTF-8. outbuf must have at * least 6 bytes of space. 
Returns the number of bytes in the diff --git a/glib/gutf8.c b/glib/gutf8.c index f584080bc..788b74ad9 100644 --- a/glib/gutf8.c +++ b/glib/gutf8.c @@ -33,6 +33,8 @@ #include #endif +#define _(s) (s) + #define UTF8_COMPUTE(Char, Mask, Len) \ if (Char < 128) \ { \ @@ -67,6 +69,14 @@ else \ Len = -1; +#define UTF8_LENGTH(Char) \ + ((Char) < 0x80 ? 1 : \ + ((Char) < 0x800 ? 2 : \ + ((Char) < 0x10000 ? 3 : \ + ((Char) < 0x200000 ? 4 : \ + ((Char) < 0x4000000 ? 5 : 6))))) + + #define UTF8_GET(Result, Chars, Count, Mask, Len) \ (Result) = (Chars)[0] & (Mask); \ for ((Count) = 1; (Count) < (Len); ++(Count)) \ @@ -79,6 +89,13 @@ (Result) <<= 6; \ (Result) |= ((Chars)[(Count)] & 0x3f); \ } + +#define UNICODE_VALID(Char) \ + ((Char) < 0x110000 && \ + ((Char) < 0xD800 || (Char) >= 0xE000) && \ + (Char) != 0xFFFE && (Char) != 0xFFFF) + + gchar g_utf8_skip[256] = { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, @@ -473,33 +490,272 @@ unicode_strrchr (const char *p, gunichar c) #endif +/* Like g_utf8_get_char, but take a maximum length + * and return (gunichar)-2 on incomplete trailing character + */ +static inline gunichar +g_utf8_get_char_extended (const gchar *p, int max_len) +{ + gint i, len; + gunichar wc = (guchar) *p; + + if (wc < 0x80) + { + return wc; + } + else if (wc < 0xc0) + { + return (gunichar)-1; + } + else if (wc < 0xe0) + { + len = 2; + wc &= 0x1f; + } + else if (wc < 0xf0) + { + len = 3; + wc &= 0x0f; + } + else if (wc < 0xf8) + { + len = 4; + wc &= 0x07; + } + else if (wc < 0xfc) + { + len = 5; + wc &= 0x03; + } + else if (wc < 0xfe) + { + len = 6; + wc &= 0x01; + } + else + { + return (gunichar)-1; + } + + if (len == -1) + return (gunichar)-1; + if (max_len >= 0 && len > max_len) + { + for (i = 1; i < max_len; i++) + { + if ((((guchar *)p)[i] & 0xc0) != 0x80) + return (gunichar)-1; + } + return (gunichar)-2; + } + + for (i = 1; i < len; ++i) + { + gunichar ch = ((guchar 
*)p)[i]; + + if ((ch & 0xc0) != 0x80) + { + if (ch) + return (gunichar)-1; + else + return (gunichar)-2; + } + + wc <<= 6; + wc |= (ch & 0x3f); + } + + if (UTF8_LENGTH(wc) != len) + return (gunichar)-1; + + return wc; +} + /** - * g_utf8_to_ucs4: - * @str: a UTF-8 encoded strnig - * @len: the length of @ - * + * g_utf8_to_ucs4_fast: + * @str: a UTF-8 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is %NULL terminated. + * @items_written: location to store the number of characters in the + * result, or %NULL. + * * Convert a string from UTF-8 to a 32-bit fixed width - * representation as UCS-4. + * representation as UCS-4, assuming valid UTF-8 input. + * This function is roughly twice as fast as g_utf8_to_ucs4() + * but does no error checking on the input. * * Return value: a pointer to a newly allocated UCS-4 string. * This value must be freed with g_free() **/ gunichar * -g_utf8_to_ucs4 (const char *str, int len) +g_utf8_to_ucs4_fast (const gchar *str, + gint len, + gint *items_written) { + gint j, charlen; gunichar *result; gint n_chars, i; const gchar *p; + + g_return_val_if_fail (str != NULL, NULL); + + p = str; + n_chars = 0; + if (len < 0) + { + while (*p) + { + p = g_utf8_next_char (p); + ++n_chars; + } + } + else + { + while (*p && p < str + len) + { + p = g_utf8_next_char (p); + ++n_chars; + } + } - n_chars = g_utf8_strlen (str, len); - result = g_new (gunichar, n_chars); + result = g_new (gunichar, n_chars + 1); p = str; for (i=0; i < n_chars; i++) { - result[i] = g_utf8_get_char (p); - p = g_utf8_next_char (p); + gunichar wc = ((unsigned char *)p)[0]; + + if (wc < 0x80) + { + result[i] = wc; + p++; + } + else + { + if (wc < 0xe0) + { + charlen = 2; + wc &= 0x1f; + } + else if (wc < 0xf0) + { + charlen = 3; + wc &= 0x0f; + } + else if (wc < 0xf8) + { + charlen = 4; + wc &= 0x07; + } + else if (wc < 0xfc) + { + charlen = 5; + wc &= 0x03; + } + else + { + charlen = 6; + wc &= 0x01; + } + + for (j = 1; j < charlen; 
j++) + { + wc <<= 6; + wc |= ((unsigned char *)p)[j] & 0x3f; + } + + result[i] = wc; + p += charlen; + } } + result[i] = 0; + + if (items_written) + *items_written = i; + + return result; +} + +/** + * g_utf8_to_ucs4: + * @str: a UTF-8 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is %NULL terminated. + * @items_read: location to store number of bytes read, or %NULL. + * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be + * returned in case @str contains a trailing partial + * character. If an error occurs then the index of the + * invalid input is stored here. + * @items_written: location to store number of characters written or %NULL. + * The value here stored does not include the trailing 0 + * character. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * + * Convert a string from UTF-8 to a 32-bit fixed width + * representation as UCS-4. A trailing 0 will be added to the + * string after the converted text. + * + * Return value: a pointer to a newly allocated UCS-4 string. + * This value must be freed with g_free(). If an + * error occurs, %NULL will be returned and + * @error set. 
+ **/ +gunichar * +g_utf8_to_ucs4 (const gchar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) +{ + gunichar *result = NULL; + gint n_chars, i; + const gchar *in; + + in = str; + n_chars = 0; + while ((len < 0 || str + len - in > 0) && *in) + { + gunichar wc = g_utf8_get_char_extended (in, str + len - in); + if (wc & 0x80000000) + { + if (wc == (gunichar)-2) + { + if (items_read) + break; + else + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT, + _("Partial character sequence at end of input")); + } + else + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid byte sequence in conversion input")); + + goto err_out; + } + + n_chars++; + + in = g_utf8_next_char (in); + } + + result = g_new (gunichar, n_chars + 1); + + in = str; + for (i=0; i < n_chars; i++) + { + result[i] = g_utf8_get_char (in); + in = g_utf8_next_char (in); + } + result[i] = 0; + + if (items_written) + *items_written = n_chars; + + err_out: + if (items_read) + *items_read = in - str; return result; } @@ -507,35 +763,569 @@ g_utf8_to_ucs4 (const char *str, int len) /** * g_ucs4_to_utf8: * @str: a UCS-4 encoded string - * @len: the length of @ - * + * @len: the maximum length of @str to use. If < 0, then + * the string is %NULL terminated. + * @items_read: location to store number of characters read read, or %NULL. + * @items_written: location to store number of bytes written or %NULL. + * The value here stored does not include the trailing 0 + * byte. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * * Convert a string from a 32-bit fixed width representation as UCS-4. - * to UTF-8. + * to UTF-8. The result will be terminated with a 0 byte. * * Return value: a pointer to a newly allocated UTF-8 string. - * This value must be freed with g_free() + * This value must be freed with g_free(). 
If an + * error occurs, %NULL will be returned and + * @error set. **/ gchar * -g_ucs4_to_utf8 (const gunichar *str, int len) +g_ucs4_to_utf8 (const gunichar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) { gint result_length; - gchar *result, *p; + gchar *result = NULL; + gchar *p; gint i; result_length = 0; - for (i = 0; i < len ; i++) - result_length += g_unichar_to_utf8 (str[i], NULL); + for (i = 0; len < 0 || i < len ; i++) + { + if (!str[i]) + break; - result_length++; + if (str[i] >= 0x80000000) + { + if (items_read) + *items_read = i; + + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Character out of range for UTF-8")); + goto err_out; + } + + result_length += UTF8_LENGTH (str[i]); + } result = g_malloc (result_length + 1); p = result; - for (i = 0; i < len ; i++) - p += g_unichar_to_utf8 (str[i], p); + i = 0; + while (p < result + result_length) + p += g_unichar_to_utf8 (str[i++], p); *p = '\0'; + if (items_written) + *items_written = p - result; + + err_out: + if (items_read) + *items_read = i; + + return result; +} + +#define SURROGATE_VALUE(h,l) (((h) - 0xd800) * 0x400 + (l) - 0xdc00 + 0x10000) + +/** + * g_utf16_to_utf8: + * @str: a UTF-16 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is terminated with a 0 character. + * @items_read: location to store number of words read, or %NULL. + * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be + * returned in case @str contains a trailing partial + * character. If an error occurs then the index of the + * invalid input is stored here. + * @items_written: location to store number of bytes written, or %NULL. + * The value stored here does not include the trailing + * 0 byte. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * + * Convert a string from UTF-16 to UTF-8. 
The result will be + * terminated with a 0 byte. + * + * Return value: a pointer to a newly allocated UTF-8 string. + * This value must be freed with g_free(). If an + * error occurs, %NULL will be returned and + * @error set. + **/ +gchar * +g_utf16_to_utf8 (const gunichar2 *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) +{ + /* This function and g_utf16_to_ucs4 are almost exactly identical - The lines that differ + * are marked. + */ + const gunichar2 *in; + gchar *out; + gchar *result = NULL; + gint n_bytes; + gunichar high_surrogate; + + g_return_val_if_fail (str != 0, NULL); + + n_bytes = 0; + in = str; + high_surrogate = 0; + while ((len < 0 || in - str < len) && *in) + { + gunichar2 c = *in; + gunichar wc; + + if (c >= 0xdc00 && c < 0xe000) /* low surrogate */ + { + if (high_surrogate) + { + wc = SURROGATE_VALUE (high_surrogate, c); + high_surrogate = 0; + } + else + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + goto err_out; + } + } + else + { + if (high_surrogate) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + goto err_out; + } + + if (c >= 0xd800 && c < 0xdc00) /* high surrogate */ + { + high_surrogate = c; + goto next1; + } + else + wc = c; + } + + /********** DIFFERENT for UTF8/UCS4 **********/ + n_bytes += UTF8_LENGTH (wc); + + next1: + in++; + } + + if (high_surrogate && !items_read) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT, + _("Partial character sequence at end of input")); + goto err_out; + } + + /* At this point, everything is valid, and we just need to convert + */ + /********** DIFFERENT for UTF8/UCS4 **********/ + result = g_malloc (n_bytes + 1); + + high_surrogate = 0; + out = result; + in = str; + while (out < result + n_bytes) + { + gunichar2 c = *in; + gunichar wc; + + if (c >= 0xdc00 && c < 0xe000) /* low surrogate */ + { + 
wc = SURROGATE_VALUE (high_surrogate, c); + high_surrogate = 0; + } + else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */ + { + high_surrogate = c; + goto next2; + } + else + wc = c; + + /********** DIFFERENT for UTF8/UCS4 **********/ + out += g_unichar_to_utf8 (wc, out); + + next2: + in++; + } + + /********** DIFFERENT for UTF8/UCS4 **********/ + *out = '\0'; + + if (items_written) + /********** DIFFERENT for UTF8/UCS4 **********/ + *items_written = out - result; + + err_out: + if (items_read) + *items_read = in - str; + + return result; +} + +/** + * g_utf16_to_ucs4: + * @str: a UTF-16 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is terminated with a 0 character. + * @items_read: location to store number of words read, or %NULL. + * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be + * returned in case @str contains a trailing partial + * character. If an error occurs then the index of the + * invalid input is stored here. + * @items_written: location to store number of characters written, or %NULL. + * The value stored here does not include the trailing + * 0 character. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * + * Convert a string from UTF-16 to UCS-4. The result will be + * terminated with a 0 character. + * + * Return value: a pointer to a newly allocated UCS-4 string. + * This value must be freed with g_free(). If an + * error occurs, %NULL will be returned and + * @error set. 
+ **/ +gunichar * +g_utf16_to_ucs4 (const gunichar2 *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) +{ + const gunichar2 *in; + gchar *out; + gchar *result = NULL; + gint n_bytes; + gunichar high_surrogate; + + g_return_val_if_fail (str != 0, NULL); + + n_bytes = 0; + in = str; + high_surrogate = 0; + while ((len < 0 || in - str < len) && *in) + { + gunichar2 c = *in; + gunichar wc; + + if (c >= 0xdc00 && c < 0xe000) /* low surrogate */ + { + if (high_surrogate) + { + wc = SURROGATE_VALUE (high_surrogate, c); + high_surrogate = 0; + } + else + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + goto err_out; + } + } + else + { + if (high_surrogate) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + goto err_out; + } + + if (c >= 0xd800 && c < 0xdc00) /* high surrogate */ + { + high_surrogate = c; + goto next1; + } + else + wc = c; + } + + /********** DIFFERENT for UTF8/UCS4 **********/ + n_bytes += sizeof (gunichar); + + next1: + in++; + } + + if (high_surrogate && !items_read) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT, + _("Partial character sequence at end of input")); + goto err_out; + } + + /* At this point, everything is valid, and we just need to convert + */ + /********** DIFFERENT for UTF8/UCS4 **********/ + result = g_malloc (n_bytes + 4); + + high_surrogate = 0; + out = result; + in = str; + while (out < result + n_bytes) + { + gunichar2 c = *in; + gunichar wc; + + if (c >= 0xdc00 && c < 0xe000) /* low surrogate */ + { + wc = SURROGATE_VALUE (high_surrogate, c); + high_surrogate = 0; + } + else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */ + { + high_surrogate = c; + goto next2; + } + else + wc = c; + + /********** DIFFERENT for UTF8/UCS4 **********/ + *(gunichar *)out = wc; + out += sizeof (gunichar); + + next2: + in++; + } + + /********** 
DIFFERENT for UTF8/UCS4 **********/ + *(gunichar *)out = 0; + + if (items_written) + /********** DIFFERENT for UTF8/UCS4 **********/ + *items_written = (out - result) / sizeof (gunichar); + + err_out: + if (items_read) + *items_read = in - str; + + return (gunichar *)result; +} + +/** + * g_utf8_to_utf16: + * @str: a UTF-8 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is %NULL terminated. + + * @items_read: location to store number of bytes read, or %NULL. + * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be + * returned in case @str contains a trailing partial + * character. If an error occurs then the index of the + * invalid input is stored here. + * @items_written: location to store number of words written, or %NULL. + * The value stored here does not include the trailing + * 0 word. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * + * Convert a string from UTF-8 to UTF-16. A 0 word will be + * added to the result after the converted text. + * + * Return value: a pointer to a newly allocated UTF-16 string. + * This value must be freed with g_free(). If an + * error occurs, %NULL will be returned and + * @error set. 
+ **/ +gunichar2 * +g_utf8_to_utf16 (const gchar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) +{ + gunichar2 *result = NULL; + gint n16; + const gchar *in; + gint i; + + g_return_val_if_fail (str != NULL, NULL); + + in = str; + n16 = 0; + while ((len < 0 || str + len - in > 0) && *in) + { + gunichar wc = g_utf8_get_char_extended (in, str + len - in); + if (wc & 0x80000000) + { + if (wc == (gunichar)-2) + { + if (items_read) + break; + else + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT, + _("Partial character sequence at end of input")); + } + else + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid byte sequence in conversion input")); + + goto err_out; + } + + if (wc < 0xd800) + n16 += 1; + else if (wc < 0xe000) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + + goto err_out; + } + else if (wc < 0x10000) + n16 += 1; + else if (wc < 0x110000) + n16 += 2; + else + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Character out of range for UTF-16")); + + goto err_out; + } + + in = g_utf8_next_char (in); + } + + result = g_new (gunichar2, n16 + 1); + + in = str; + for (i = 0; i < n16;) + { + gunichar wc = g_utf8_get_char (in); + + if (wc < 0x10000) + { + result[i++] = wc; + } + else + { + result[i++] = (wc - 0x10000) / 0x400 + 0xd800; + result[i++] = (wc - 0x10000) % 0x400 + 0xdc00; + } + + in = g_utf8_next_char (in); + } + + result[i] = 0; + + if (items_written) + *items_written = n16; + + err_out: + if (items_read) + *items_read = in - str; + + return result; +} + +/** + * g_ucs4_to_utf16: + * @str: a UCS-4 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is terminated with a zero character. + * @items_read: location to store number of bytes read, or %NULL. 
+ * If an error occurs then the index of the invalid input + * is stored here. + * @items_written: location to store number of words written, or %NULL. + * The value stored here does not include the trailing + * 0 word. + * @error: location to store the error occuring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * + * Convert a string from UCS-4 to UTF-16. A 0 word will be + * added to the result after the converted text. + * + * Return value: a pointer to a newly allocated UTF-16 string. + * This value must be freed with g_free(). If an + * error occurs, %NULL will be returned and + * @error set. + **/ +gunichar2 * +g_ucs4_to_utf16 (const gunichar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) +{ + gunichar2 *result = NULL; + gint n16; + gint i, j; + + n16 = 0; + i = 0; + while ((len < 0 || i < len) && str[i]) + { + gunichar wc = str[i]; + + if (wc < 0xd800) + n16 += 1; + else if (wc < 0xe000) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + + goto err_out; + } + else if (wc < 0x10000) + n16 += 1; + else if (wc < 0x110000) + n16 += 2; + else + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Character out of range for UTF-16")); + + goto err_out; + } + + i++; + } + + result = g_new (gunichar2, n16 + 1); + + for (i = 0, j = 0; j < n16; i++) + { + gunichar wc = str[i]; + + if (wc < 0x10000) + { + result[j++] = wc; + } + else + { + result[j++] = (wc - 0x10000) / 0x400 + 0xd800; + result[j++] = (wc - 0x10000) % 0x400 + 0xdc00; + } + } + result[j] = 0; + + if (items_written) + *items_written = n16; + + err_out: + if (items_read) + *items_read = i; + return result; } @@ -567,6 +1357,8 @@ g_utf8_validate (const gchar *str, { const gchar *p; + + g_return_val_if_fail (str != NULL, FALSE); if (end) *end = str; @@ -591,8 +1383,14 @@ g_utf8_validate (const 
gchar *str, UTF8_GET (result, p, i, mask, len); + if (UTF8_LENGTH (result) != len) /* Check for overlong UTF-8 */ + break; + if (result == (gunichar)-1) break; + + if (!UNICODE_VALID (result)) + break; p += len; } diff --git a/gunicode.h b/gunicode.h index 93f368337..db4800a63 100644 --- a/gunicode.h +++ b/gunicode.h @@ -206,18 +206,39 @@ gchar *g_utf8_strchr (const gchar *p, gchar *g_utf8_strrchr (const gchar *p, gunichar c); -gunichar2 *g_utf8_to_utf16 (const gchar *str, - gint len); -gunichar * g_utf8_to_ucs4 (const gchar *str, - gint len); -gunichar * g_utf16_to_ucs4 (const gunichar2 *str, - gint len); -gchar * g_utf16_to_utf8 (const gunichar2 *str, - gint len); -gunichar * g_ucs4_to_utf16 (const gunichar *str, - gint len); -gchar * g_ucs4_to_utf8 (const gunichar *str, - gint len); +gunichar2 *g_utf8_to_utf16 (const gchar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); +gunichar * g_utf8_to_ucs4 (const gchar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); +gunichar * g_utf8_to_ucs4_fast (const gchar *str, + gint len, + gint *items_written); +gunichar * g_utf16_to_ucs4 (const gunichar2 *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); +gchar * g_utf16_to_utf8 (const gunichar2 *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); +gunichar2 *g_ucs4_to_utf16 (const gunichar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); +gchar * g_ucs4_to_utf8 (const gunichar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error); /* Convert a single character into UTF-8. outbuf must have at * least 6 bytes of space. 
Returns the number of bytes in the diff --git a/gutf8.c b/gutf8.c index f584080bc..788b74ad9 100644 --- a/gutf8.c +++ b/gutf8.c @@ -33,6 +33,8 @@ #include #endif +#define _(s) (s) + #define UTF8_COMPUTE(Char, Mask, Len) \ if (Char < 128) \ { \ @@ -67,6 +69,14 @@ else \ Len = -1; +#define UTF8_LENGTH(Char) \ + ((Char) < 0x80 ? 1 : \ + ((Char) < 0x800 ? 2 : \ + ((Char) < 0x10000 ? 3 : \ + ((Char) < 0x200000 ? 4 : \ + ((Char) < 0x4000000 ? 5 : 6))))) + + #define UTF8_GET(Result, Chars, Count, Mask, Len) \ (Result) = (Chars)[0] & (Mask); \ for ((Count) = 1; (Count) < (Len); ++(Count)) \ @@ -79,6 +89,13 @@ (Result) <<= 6; \ (Result) |= ((Chars)[(Count)] & 0x3f); \ } + +#define UNICODE_VALID(Char) \ + ((Char) < 0x110000 && \ + ((Char) < 0xD800 || (Char) >= 0xE000) && \ + (Char) != 0xFFFE && (Char) != 0xFFFF) + + gchar g_utf8_skip[256] = { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, @@ -473,33 +490,272 @@ unicode_strrchr (const char *p, gunichar c) #endif +/* Like g_utf8_get_char, but take a maximum length + * and return (gunichar)-2 on incomplete trailing character + */ +static inline gunichar +g_utf8_get_char_extended (const gchar *p, int max_len) +{ + gint i, len; + gunichar wc = (guchar) *p; + + if (wc < 0x80) + { + return wc; + } + else if (wc < 0xc0) + { + return (gunichar)-1; + } + else if (wc < 0xe0) + { + len = 2; + wc &= 0x1f; + } + else if (wc < 0xf0) + { + len = 3; + wc &= 0x0f; + } + else if (wc < 0xf8) + { + len = 4; + wc &= 0x07; + } + else if (wc < 0xfc) + { + len = 5; + wc &= 0x03; + } + else if (wc < 0xfe) + { + len = 6; + wc &= 0x01; + } + else + { + return (gunichar)-1; + } + + if (len == -1) + return (gunichar)-1; + if (max_len >= 0 && len > max_len) + { + for (i = 1; i < max_len; i++) + { + if ((((guchar *)p)[i] & 0xc0) != 0x80) + return (gunichar)-1; + } + return (gunichar)-2; + } + + for (i = 1; i < len; ++i) + { + gunichar ch = ((guchar *)p)[i]; + + if ((ch & 
0xc0) != 0x80) + { + if (ch) + return (gunichar)-1; + else + return (gunichar)-2; + } + + wc <<= 6; + wc |= (ch & 0x3f); + } + + if (UTF8_LENGTH(wc) != len) + return (gunichar)-1; + + return wc; +} + /** - * g_utf8_to_ucs4: - * @str: a UTF-8 encoded strnig - * @len: the length of @ - * + * g_utf8_to_ucs4_fast: + * @str: a UTF-8 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is %NULL terminated. + * @items_written: location to store the number of characters in the + * result, or %NULL. + * * Convert a string from UTF-8 to a 32-bit fixed width - * representation as UCS-4. + * representation as UCS-4, assuming valid UTF-8 input. + * This function is roughly twice as fast as g_utf8_to_ucs4() + * but does no error checking on the input. * * Return value: a pointer to a newly allocated UCS-4 string. * This value must be freed with g_free() **/ gunichar * -g_utf8_to_ucs4 (const char *str, int len) +g_utf8_to_ucs4_fast (const gchar *str, + gint len, + gint *items_written) { + gint j, charlen; gunichar *result; gint n_chars, i; const gchar *p; + + g_return_val_if_fail (str != NULL, NULL); + + p = str; + n_chars = 0; + if (len < 0) + { + while (*p) + { + p = g_utf8_next_char (p); + ++n_chars; + } + } + else + { + while (*p && p < str + len) + { + p = g_utf8_next_char (p); + ++n_chars; + } + } - n_chars = g_utf8_strlen (str, len); - result = g_new (gunichar, n_chars); + result = g_new (gunichar, n_chars + 1); p = str; for (i=0; i < n_chars; i++) { - result[i] = g_utf8_get_char (p); - p = g_utf8_next_char (p); + gunichar wc = ((unsigned char *)p)[0]; + + if (wc < 0x80) + { + result[i] = wc; + p++; + } + else + { + if (wc < 0xe0) + { + charlen = 2; + wc &= 0x1f; + } + else if (wc < 0xf0) + { + charlen = 3; + wc &= 0x0f; + } + else if (wc < 0xf8) + { + charlen = 4; + wc &= 0x07; + } + else if (wc < 0xfc) + { + charlen = 5; + wc &= 0x03; + } + else + { + charlen = 6; + wc &= 0x01; + } + + for (j = 1; j < charlen; j++) + { + wc <<= 6; + wc 
|= ((unsigned char *)p)[j] & 0x3f; + } + + result[i] = wc; + p += charlen; + } } + result[i] = 0; + + if (items_written) + *items_written = i; + + return result; +} + +/** + * g_utf8_to_ucs4: + * @str: a UTF-8 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is %NULL terminated. + * @items_read: location to store number of bytes read, or %NULL. + * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be + * returned in case @str contains a trailing partial + * character. If an error occurs then the index of the + * invalid input is stored here. + * @items_written: location to store number of characters written or %NULL. + * The value stored here does not include the trailing 0 + * character. + * @error: location to store the error occurring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * + * Convert a string from UTF-8 to a 32-bit fixed width + * representation as UCS-4. A trailing 0 will be added to the + * string after the converted text. + * + * Return value: a pointer to a newly allocated UCS-4 string. + * This value must be freed with g_free(). If an + * error occurs, %NULL will be returned and + * @error set.
+ **/ +gunichar * +g_utf8_to_ucs4 (const gchar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) +{ + gunichar *result = NULL; + gint n_chars, i; + const gchar *in; + + in = str; + n_chars = 0; + while ((len < 0 || str + len - in > 0) && *in) + { + gunichar wc = g_utf8_get_char_extended (in, str + len - in); + if (wc & 0x80000000) + { + if (wc == (gunichar)-2) + { + if (items_read) + break; + else + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT, + _("Partial character sequence at end of input")); + } + else + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid byte sequence in conversion input")); + + goto err_out; + } + + n_chars++; + + in = g_utf8_next_char (in); + } + + result = g_new (gunichar, n_chars + 1); + + in = str; + for (i=0; i < n_chars; i++) + { + result[i] = g_utf8_get_char (in); + in = g_utf8_next_char (in); + } + result[i] = 0; + + if (items_written) + *items_written = n_chars; + + err_out: + if (items_read) + *items_read = in - str; return result; } @@ -507,35 +763,569 @@ g_utf8_to_ucs4 (const char *str, int len) /** * g_ucs4_to_utf8: * @str: a UCS-4 encoded string - * @len: the length of @ - * + * @len: the maximum length of @str to use. If < 0, then + * the string is %NULL terminated. + * @items_read: location to store number of characters read, or %NULL. + * @items_written: location to store number of bytes written or %NULL. + * The value stored here does not include the trailing 0 + * byte. + * @error: location to store the error occurring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * * Convert a string from a 32-bit fixed width representation as UCS-4. - * to UTF-8. + * to UTF-8. The result will be terminated with a 0 byte. * * Return value: a pointer to a newly allocated UTF-8 string. - * This value must be freed with g_free() + * This value must be freed with g_free().
If an + * error occurs, %NULL will be returned and + * @error set. **/ gchar * -g_ucs4_to_utf8 (const gunichar *str, int len) +g_ucs4_to_utf8 (const gunichar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) { gint result_length; - gchar *result, *p; + gchar *result = NULL; + gchar *p; gint i; result_length = 0; - for (i = 0; i < len ; i++) - result_length += g_unichar_to_utf8 (str[i], NULL); + for (i = 0; len < 0 || i < len ; i++) + { + if (!str[i]) + break; - result_length++; + if (str[i] >= 0x80000000) + { + if (items_read) + *items_read = i; + + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Character out of range for UTF-8")); + goto err_out; + } + + result_length += UTF8_LENGTH (str[i]); + } result = g_malloc (result_length + 1); p = result; - for (i = 0; i < len ; i++) - p += g_unichar_to_utf8 (str[i], p); + i = 0; + while (p < result + result_length) + p += g_unichar_to_utf8 (str[i++], p); *p = '\0'; + if (items_written) + *items_written = p - result; + + err_out: + if (items_read) + *items_read = i; + + return result; +} + +#define SURROGATE_VALUE(h,l) (((h) - 0xd800) * 0x400 + (l) - 0xdc00 + 0x10000) + +/** + * g_utf16_to_utf8: + * @str: a UTF-16 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is terminated with a 0 character. + * @items_read: location to store number of words read, or %NULL. + * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be + * returned in case @str contains a trailing partial + * character. If an error occurs then the index of the + * invalid input is stored here. + * @items_written: location to store number of bytes written, or %NULL. + * The value stored here does not include the trailing + * 0 byte. + * @error: location to store the error occurring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * + * Convert a string from UTF-16 to UTF-8.
The result will be + * terminated with a 0 byte. + * + * Return value: a pointer to a newly allocated UTF-8 string. + * This value must be freed with g_free(). If an + * error occurs, %NULL will be returned and + * @error set. + **/ +gchar * +g_utf16_to_utf8 (const gunichar2 *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) +{ + /* This function and g_utf16_to_ucs4 are almost exactly identical - The lines that differ + * are marked. + */ + const gunichar2 *in; + gchar *out; + gchar *result = NULL; + gint n_bytes; + gunichar high_surrogate; + + g_return_val_if_fail (str != 0, NULL); + + n_bytes = 0; + in = str; + high_surrogate = 0; + while ((len < 0 || in - str < len) && *in) + { + gunichar2 c = *in; + gunichar wc; + + if (c >= 0xdc00 && c < 0xe000) /* low surrogate */ + { + if (high_surrogate) + { + wc = SURROGATE_VALUE (high_surrogate, c); + high_surrogate = 0; + } + else + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + goto err_out; + } + } + else + { + if (high_surrogate) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + goto err_out; + } + + if (c >= 0xd800 && c < 0xdc00) /* high surrogate */ + { + high_surrogate = c; + goto next1; + } + else + wc = c; + } + + /********** DIFFERENT for UTF8/UCS4 **********/ + n_bytes += UTF8_LENGTH (wc); + + next1: + in++; + } + + if (high_surrogate && !items_read) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT, + _("Partial character sequence at end of input")); + goto err_out; + } + + /* At this point, everything is valid, and we just need to convert + */ + /********** DIFFERENT for UTF8/UCS4 **********/ + result = g_malloc (n_bytes + 1); + + high_surrogate = 0; + out = result; + in = str; + while (out < result + n_bytes) + { + gunichar2 c = *in; + gunichar wc; + + if (c >= 0xdc00 && c < 0xe000) /* low surrogate */ + { + 
wc = SURROGATE_VALUE (high_surrogate, c); + high_surrogate = 0; + } + else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */ + { + high_surrogate = c; + goto next2; + } + else + wc = c; + + /********** DIFFERENT for UTF8/UCS4 **********/ + out += g_unichar_to_utf8 (wc, out); + + next2: + in++; + } + + /********** DIFFERENT for UTF8/UCS4 **********/ + *out = '\0'; + + if (items_written) + /********** DIFFERENT for UTF8/UCS4 **********/ + *items_written = out - result; + + err_out: + if (items_read) + *items_read = in - str; + + return result; +} + +/** + * g_utf16_to_ucs4: + * @str: a UTF-16 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is terminated with a 0 character. + * @items_read: location to store number of words read, or %NULL. + * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be + * returned in case @str contains a trailing partial + * character. If an error occurs then the index of the + * invalid input is stored here. + * @items_written: location to store number of characters written, or %NULL. + * The value stored here does not include the trailing + * 0 character. + * @error: location to store the error occurring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * + * Convert a string from UTF-16 to UCS-4. The result will be + * terminated with a 0 character. + * + * Return value: a pointer to a newly allocated UCS-4 string. + * This value must be freed with g_free(). If an + * error occurs, %NULL will be returned and + * @error set.
+ **/ +gunichar * +g_utf16_to_ucs4 (const gunichar2 *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) +{ + const gunichar2 *in; + gchar *out; + gchar *result = NULL; + gint n_bytes; + gunichar high_surrogate; + + g_return_val_if_fail (str != 0, NULL); + + n_bytes = 0; + in = str; + high_surrogate = 0; + while ((len < 0 || in - str < len) && *in) + { + gunichar2 c = *in; + gunichar wc; + + if (c >= 0xdc00 && c < 0xe000) /* low surrogate */ + { + if (high_surrogate) + { + wc = SURROGATE_VALUE (high_surrogate, c); + high_surrogate = 0; + } + else + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + goto err_out; + } + } + else + { + if (high_surrogate) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + goto err_out; + } + + if (c >= 0xd800 && c < 0xdc00) /* high surrogate */ + { + high_surrogate = c; + goto next1; + } + else + wc = c; + } + + /********** DIFFERENT for UTF8/UCS4 **********/ + n_bytes += sizeof (gunichar); + + next1: + in++; + } + + if (high_surrogate && !items_read) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT, + _("Partial character sequence at end of input")); + goto err_out; + } + + /* At this point, everything is valid, and we just need to convert + */ + /********** DIFFERENT for UTF8/UCS4 **********/ + result = g_malloc (n_bytes + 4); + + high_surrogate = 0; + out = result; + in = str; + while (out < result + n_bytes) + { + gunichar2 c = *in; + gunichar wc; + + if (c >= 0xdc00 && c < 0xe000) /* low surrogate */ + { + wc = SURROGATE_VALUE (high_surrogate, c); + high_surrogate = 0; + } + else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */ + { + high_surrogate = c; + goto next2; + } + else + wc = c; + + /********** DIFFERENT for UTF8/UCS4 **********/ + *(gunichar *)out = wc; + out += sizeof (gunichar); + + next2: + in++; + } + + /********** 
DIFFERENT for UTF8/UCS4 **********/ + *(gunichar *)out = 0; + + if (items_written) + /********** DIFFERENT for UTF8/UCS4 **********/ + *items_written = (out - result) / sizeof (gunichar); + + err_out: + if (items_read) + *items_read = in - str; + + return (gunichar *)result; +} + +/** + * g_utf8_to_utf16: + * @str: a UTF-8 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is %NULL terminated. + + * @items_read: location to store number of bytes read, or %NULL. + * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be + * returned in case @str contains a trailing partial + * character. If an error occurs then the index of the + * invalid input is stored here. + * @items_written: location to store number of words written, or %NULL. + * The value stored here does not include the trailing + * 0 word. + * @error: location to store the error occurring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * + * Convert a string from UTF-8 to UTF-16. A 0 word will be + * added to the result after the converted text. + * + * Return value: a pointer to a newly allocated UTF-16 string. + * This value must be freed with g_free(). If an + * error occurs, %NULL will be returned and + * @error set.
+ **/ +gunichar2 * +g_utf8_to_utf16 (const gchar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) +{ + gunichar2 *result = NULL; + gint n16; + const gchar *in; + gint i; + + g_return_val_if_fail (str != NULL, NULL); + + in = str; + n16 = 0; + while ((len < 0 || str + len - in > 0) && *in) + { + gunichar wc = g_utf8_get_char_extended (in, str + len - in); + if (wc & 0x80000000) + { + if (wc == (gunichar)-2) + { + if (items_read) + break; + else + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT, + _("Partial character sequence at end of input")); + } + else + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid byte sequence in conversion input")); + + goto err_out; + } + + if (wc < 0xd800) + n16 += 1; + else if (wc < 0xe000) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + + goto err_out; + } + else if (wc < 0x10000) + n16 += 1; + else if (wc < 0x110000) + n16 += 2; + else + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Character out of range for UTF-16")); + + goto err_out; + } + + in = g_utf8_next_char (in); + } + + result = g_new (gunichar2, n16 + 1); + + in = str; + for (i = 0; i < n16;) + { + gunichar wc = g_utf8_get_char (in); + + if (wc < 0x10000) + { + result[i++] = wc; + } + else + { + result[i++] = (wc - 0x10000) / 0x400 + 0xd800; + result[i++] = (wc - 0x10000) % 0x400 + 0xdc00; + } + + in = g_utf8_next_char (in); + } + + result[i] = 0; + + if (items_written) + *items_written = n16; + + err_out: + if (items_read) + *items_read = in - str; + + return result; +} + +/** + * g_ucs4_to_utf16: + * @str: a UCS-4 encoded string + * @len: the maximum length of @str to use. If < 0, then + * the string is terminated with a zero character. + * @items_read: location to store number of bytes read, or %NULL. 
+ * If an error occurs then the index of the invalid input + * is stored here. + * @items_written: location to store number of words written, or %NULL. + * The value stored here does not include the trailing + * 0 word. + * @error: location to store the error occurring, or %NULL to ignore + * errors. Any of the errors in #GConvertError other than + * %G_CONVERT_ERROR_NO_CONVERSION may occur. + * + * Convert a string from UCS-4 to UTF-16. A 0 word will be + * added to the result after the converted text. + * + * Return value: a pointer to a newly allocated UTF-16 string. + * This value must be freed with g_free(). If an + * error occurs, %NULL will be returned and + * @error set. + **/ +gunichar2 * +g_ucs4_to_utf16 (const gunichar *str, + gint len, + gint *items_read, + gint *items_written, + GError **error) +{ + gunichar2 *result = NULL; + gint n16; + gint i, j; + + n16 = 0; + i = 0; + while ((len < 0 || i < len) && str[i]) + { + gunichar wc = str[i]; + + if (wc < 0xd800) + n16 += 1; + else if (wc < 0xe000) + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Invalid sequence in conversion input")); + + goto err_out; + } + else if (wc < 0x10000) + n16 += 1; + else if (wc < 0x110000) + n16 += 2; + else + { + g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Character out of range for UTF-16")); + + goto err_out; + } + + i++; + } + + result = g_new (gunichar2, n16 + 1); + + for (i = 0, j = 0; j < n16; i++) + { + gunichar wc = str[i]; + + if (wc < 0x10000) + { + result[j++] = wc; + } + else + { + result[j++] = (wc - 0x10000) / 0x400 + 0xd800; + result[j++] = (wc - 0x10000) % 0x400 + 0xdc00; + } + } + result[j] = 0; + + if (items_written) + *items_written = n16; + + err_out: + if (items_read) + *items_read = i; + return result; } @@ -567,6 +1357,8 @@ g_utf8_validate (const gchar *str, { const gchar *p; + + g_return_val_if_fail (str != NULL, FALSE); if (end) *end = str; @@ -591,8 +1383,14 @@ g_utf8_validate (const
gchar *str, UTF8_GET (result, p, i, mask, len); + if (UTF8_LENGTH (result) != len) /* Check for overlong UTF-8 */ + break; + if (result == (gunichar)-1) break; + + if (!UNICODE_VALID (result)) + break; p += len; } diff --git a/tests/Makefile.am b/tests/Makefile.am index 756e1b789..1d8ce8a1a 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -33,7 +33,8 @@ test_programs = \ thread-test \ threadpool-test \ tree-test \ - type-test + type-test \ + unicode-encoding test_scripts = run-markup-tests.sh @@ -71,6 +72,7 @@ thread_test_LDADD = $(thread_LDADD) threadpool_test_LDADD = $(thread_LDADD) tree_test_LDADD = $(progs_LDADD) type_test_LDADD = $(progs_LDADD) +unicode_encoding_LDADD = $(progs_LDADD) lib_LTLIBRARIES = libmoduletestplugin_a.la libmoduletestplugin_b.la diff --git a/tests/mainloop-test.c b/tests/mainloop-test.c index 2652d63e6..422a669cc 100644 --- a/tests/mainloop-test.c +++ b/tests/mainloop-test.c @@ -155,7 +155,7 @@ adder_thread (gpointer data) g_free (channels); - g_main_loop_destroy (addr_data.loop); + g_main_loop_unref (addr_data.loop); g_print ("Timeout run %d times\n", addr_data.count); @@ -393,7 +393,7 @@ main (int argc, g_timeout_add (RECURSER_TIMEOUT, recurser_start, NULL); g_main_loop_run (main_loop); - g_main_loop_destroy (main_loop); + g_main_loop_unref (main_loop); #endif return 0; diff --git a/tests/unicode-encoding.c b/tests/unicode-encoding.c new file mode 100644 index 000000000..498137b89 --- /dev/null +++ b/tests/unicode-encoding.c @@ -0,0 +1,411 @@ +#include +#include +#include +#include +#include + +static gint exit_status = 0; + +void +croak (char *format, ...) +{ + va_list va; + + va_start (va, format); + vfprintf (stderr, format, va); + va_end (va); + + exit (1); +} + +void +fail (char *format, ...) 
+{ + va_list va; + + va_start (va, format); + vfprintf (stderr, format, va); + va_end (va); + + exit_status |= 1; +} + +typedef enum +{ + VALID, + INCOMPLETE, + NOTUNICODE, + OVERLONG, + MALFORMED +} Status; + +static gboolean +ucs4_equal (gunichar *a, gunichar *b) +{ + while (*a && *b && (*a == *b)) + { + a++; + b++; + } + + return (*a == *b); +} + +static gboolean +utf16_equal (gunichar2 *a, gunichar2 *b) +{ + while (*a && *b && (*a == *b)) + { + a++; + b++; + } + + return (*a == *b); +} + +static gint +utf16_count (gunichar2 *a) +{ + gint result = 0; + + while (a[result]) + result++; + + return result; +} + +static void +process (gint line, + gchar *utf8, + Status status, + gunichar *ucs4, + gint ucs4_len) +{ + const gchar *end; + gboolean is_valid = g_utf8_validate (utf8, -1, &end); + GError *error = NULL; + gint items_read, items_written; + + switch (status) + { + case VALID: + if (!is_valid) + { + fail ("line %d: valid but g_utf8_validate returned FALSE\n", line); + return; + } + break; + case NOTUNICODE: + case INCOMPLETE: + case OVERLONG: + case MALFORMED: + if (is_valid) + { + fail ("line %d: invalid but g_utf8_validate returned TRUE\n", line); + return; + } + break; + } + + if (status == INCOMPLETE) + { + gunichar *ucs4_result; + + ucs4_result = g_utf8_to_ucs4 (utf8, -1, NULL, NULL, &error); + + if (!error || !g_error_matches (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT)) + { + fail ("line %d: incomplete input not properly detected\n", line); + return; + } + g_clear_error (&error); + + ucs4_result = g_utf8_to_ucs4 (utf8, -1, &items_read, NULL, &error); + + if (!ucs4_result || items_read == strlen (utf8)) + { + fail ("line %d: incomplete input not properly detected\n", line); + return; + } + + g_free (ucs4_result); + } + + if (status == VALID || status == NOTUNICODE) + { + gunichar *ucs4_result; + gchar *utf8_result; + + ucs4_result = g_utf8_to_ucs4 (utf8, -1, &items_read, &items_written, &error); + if (!ucs4_result) + { + fail ("line %d: 
conversion to ucs4 failed: %s\n", line, error->message); + return; + } + + if (!ucs4_equal (ucs4_result, ucs4) || + items_read != strlen (utf8) || + items_written != ucs4_len) + { + fail ("line %d: results of conversion to ucs4 do not match expected.\n", line); + return; + } + + g_free (ucs4_result); + + ucs4_result = g_utf8_to_ucs4_fast (utf8, -1, &items_written); + + if (!ucs4_equal (ucs4_result, ucs4) || + items_written != ucs4_len) + { + fail ("line %d: results of conversion to ucs4 do not match expected.\n", line); + return; + } + + utf8_result = g_ucs4_to_utf8 (ucs4_result, -1, &items_read, &items_written, &error); + if (!utf8_result) + { + fail ("line %d: conversion back to utf8 failed: %s\n", line, error->message); + return; + } + + if (strcmp (utf8_result, utf8) != 0 || + items_read != ucs4_len || + items_written != strlen (utf8)) + { + fail ("line %d: conversion back to utf8 did not match original\n", line); + return; + } + + g_free (utf8_result); + g_free (ucs4_result); + } + + if (status == VALID) + { + gunichar2 *utf16_expected_tmp; + gunichar2 *utf16_expected; + gunichar2 *utf16_from_utf8; + gunichar2 *utf16_from_ucs4; + gunichar *ucs4_result; + gint bytes_written; + gint n_chars; + gchar *utf8_result; + + if (!(utf16_expected_tmp = (gunichar2 *)g_convert (utf8, -1, "UTF-16", "UTF-8", + NULL, &bytes_written, NULL))) + { + fail ("line %d: could not convert to UTF-16 via g_convert\n", line); + return; + } + + /* zero-terminate and remove BOM + */ + n_chars = bytes_written / 2; + if (utf16_expected_tmp[0] == 0xfeff) /* BOM */ + { + n_chars--; + utf16_expected = g_new (gunichar2, n_chars + 1); + memcpy (utf16_expected, utf16_expected_tmp + 1, sizeof(gunichar2) * n_chars); + } + else if (utf16_expected_tmp[0] == 0xfffe) /* ANTI-BOM */ + { + fail ("line %d: conversion via iconv to \"UTF-16\" is not native-endian\n", line); + return; + } + else + { + utf16_expected = g_new (gunichar2, n_chars + 1); + memcpy (utf16_expected, utf16_expected_tmp, sizeof(gunichar2) *
n_chars); + } + + utf16_expected[n_chars] = '\0'; + + if (!(utf16_from_utf8 = g_utf8_to_utf16 (utf8, -1, &items_read, &items_written, &error))) + { + fail ("line %d: conversion to ucs16 failed: %s\n", line, error->message); + return; + } + + if (items_read != strlen (utf8) || + utf16_count (utf16_from_utf8) != items_written) + { + fail ("line %d: length error in conversion to ucs16\n", line); + return; + } + + if (!(utf16_from_ucs4 = g_ucs4_to_utf16 (ucs4, -1, &items_read, &items_written, &error))) + { + fail ("line %d: conversion to ucs16 failed: %s\n", line, error->message); + return; + } + + if (items_read != ucs4_len || + utf16_count (utf16_from_ucs4) != items_written) + { + fail ("line %d: length error in conversion to ucs16\n", line); + return; + } + + if (!utf16_equal (utf16_from_utf8, utf16_expected) || + !utf16_equal (utf16_from_ucs4, utf16_expected)) + { + fail ("line %d: results of conversion to ucs16 do not match\n", line); + return; + } + + if (!(utf8_result = g_utf16_to_utf8 (utf16_from_utf8, -1, &items_read, &items_written, &error))) + { + fail ("line %d: conversion back to utf8 failed: %s\n", line, error->message); + return; + } + + if (items_read != utf16_count (utf16_from_utf8) || + items_written != strlen (utf8)) + { + fail ("line %d: length error in conversion from ucs16 to utf8\n", line); + return; + } + + if (!(ucs4_result = g_utf16_to_ucs4 (utf16_from_ucs4, -1, &items_read, &items_written, &error))) + { + fail ("line %d: conversion back to utf8/ucs4 failed\n", line); + return; + } + + if (items_read != utf16_count (utf16_from_utf8) || + items_written != ucs4_len) + { + fail ("line %d: length error in conversion from ucs16 to ucs4\n", line); + return; + } + + if (strcmp (utf8, utf8_result) != 0 || + !ucs4_equal (ucs4, ucs4_result)) + { + fail ("line %d: conversion back to utf8/ucs4 did not match original\n", line); + return; + } + + g_free (utf16_expected_tmp); + g_free (utf16_expected); + g_free (utf16_from_utf8); + g_free (utf16_from_ucs4); 
+ g_free (utf8_result); + g_free (ucs4_result); + } +} + +int +main (int argc, char **argv) +{ + gchar *srcdir = getenv ("srcdir"); + gchar *testfile; + gchar *contents; + GError *error = NULL; + gchar *p, *end; + char *tmp; + gint state = 0; + gint line = 1; + gint start_line = 0; /* Quiet GCC */ + gchar *utf8 = NULL; /* Quiet GCC */ + GArray *ucs4; + Status status = VALID; /* Quiet GCC */ + + if (!srcdir) + srcdir = "."; + + testfile = g_strconcat (srcdir, "/", "utf8.txt", NULL); + + g_file_get_contents (testfile, &contents, NULL, &error); + if (error) + croak ("Cannot open utf8.txt: %s", error->message); + + ucs4 = g_array_new (TRUE, FALSE, sizeof(gunichar)); + + p = contents; + + /* Loop over lines */ + while (*p) + { + while (*p && (*p == ' ' || *p == '\t')) + p++; + + end = p; + while (*end && *end != '\n') + end++; + + if (!*p || *p == '#' || *p == '\n') + goto next_line; + + tmp = g_strstrip (g_strndup (p, end - p)); + + switch (state) + { + case 0: + /* UTF-8 string */ + start_line = line; + utf8 = tmp; + tmp = NULL; + break; + + case 1: + /* Status */ + if (!strcmp (tmp, "VALID")) + status = VALID; + else if (!strcmp (tmp, "INCOMPLETE")) + status = INCOMPLETE; + else if (!strcmp (tmp, "NOTUNICODE")) + status = NOTUNICODE; + else if (!strcmp (tmp, "OVERLONG")) + status = OVERLONG; + else if (!strcmp (tmp, "MALFORMED")) + status = MALFORMED; + else + croak ("Invalid status on line %d\n", line); + + if (status != VALID && status != NOTUNICODE) + state++; /* No UCS-4 data */ + + break; + + case 2: + /* UCS-4 version */ + + p = strtok (tmp, " \t"); + while (p) + { + gchar *endptr; + + gunichar ch = strtoul (p, &endptr, 16); + if (*endptr != '\0') + croak ("Invalid UCS-4 character on line %d\n", line); + + g_array_append_val (ucs4, ch); + + p = strtok (NULL, " \t"); + } + + break; + } + + g_free (tmp); + state = (state + 1) % 3; + + if (state == 0) + { + process (start_line, utf8, status, (gunichar *)ucs4->data, ucs4->len); + g_array_set_size (ucs4, 0); + 
g_free (utf8); + } + + next_line: + p = end; + if (*p && *p == '\n') + p++; + + line++; + } + + return exit_status; +} diff --git a/tests/utf8.txt b/tests/utf8.txt new file mode 100644 index 000000000..8197d0bf9 --- /dev/null +++ b/tests/utf8.txt @@ -0,0 +1,297 @@ +# This file is derived from +# +# http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt +# +# Which was created by Markus Kuhn - 2000-09-02 +# +# lines beginning with # and blank lines are ignored +# +# Beyond that, this file consists of a series of test cases. Each test case consists of +# 2 or 3 lines: +# +# 1. A UTF-8 string +# 2. A status +# VALID : The string is a valid UTF-8 representation of valid Unicode +# INCOMPLETE : The string has a partial character at the end +# NOTUNICODE : The string is valid UTF-8, but the characters represented +# are not valid unicode ( +# OVERLONG : The string includes overlong sequences +# MALFORMED : The string is not valid UTF-8 +# 3. If the status is VALID or NOTUNICODE, the UCS-4 representation of the string, +# as a series of hex numbers. + +# 1 Some correct UTF-8 text +κόσμε +VALID +03ba 1f79 03c3 03bc 03b5 + +# 2.1 First possible sequence of a certain length +# +# FIXME - handle NULLS?
+# +# [ NULL BYTE ] +#VALID +#0000 + +€ +VALID +0080 + +ࠀ +VALID +0800 + +𐀀 +VALID +00010000 + + +NOTUNICODE +00200000 + + +NOTUNICODE +04000000 + + +VALID +0000007f + +߿ +VALID +000007ff + +￿ +NOTUNICODE +0000ffff + + +NOTUNICODE +001fffff + + +NOTUNICODE +03ffffff + + +NOTUNICODE +7fffffff + +# 2.3 Other boundary conditions + +퟿ +VALID +d7ff + + +VALID +e000 + +� +VALID +fffd + +􏿿 +VALID +0010ffff + + +NOTUNICODE +00110000 + +# 3.1 Unexpected continuation bytes + + +MALFORMED + +MALFORMED + +MALFORMED + +MALFORMED + +MALFORMED + +MALFORMED + +MALFORMED + +MALFORMED + +MALFORMED + +# 3.2 Lonely start characters + + +MALFORMED + +MALFORMED + +MALFORMED + +MALFORMED + +MALFORMED + +# 3.3 Sequences with last continuation byte missing + + +INCOMPLETE + +INCOMPLETE + +INCOMPLETE + +INCOMPLETE + +INCOMPLETE + +INCOMPLETE + +INCOMPLETE + +INCOMPLETE + +INCOMPLETE + +INCOMPLETE + +# 3.4 Concatenation of incomplete sequences + + +MALFORMED + +# 3.5 Impossible bytes + + +MALFORMED + +MALFORMED + +MALFORMED + +# Examples of an overlong ASCII character + + +OVERLONG + +OVERLONG + +OVERLONG + +OVERLONG + +OVERLONG + +# Maximum overlong sequences + + +OVERLONG + +OVERLONG + +OVERLONG + +OVERLONG + +OVERLONG + +# Overlong representation of the NUL character + + +OVERLONG + +OVERLONG + +OVERLONG + +OVERLONG + +OVERLONG + +# Illegal code positions + +# Single UTF-16 surrogates + + +NOTUNICODE +d800 + + +NOTUNICODE +db7f + + +NOTUNICODE +db80 + + +NOTUNICODE +dbff + + +NOTUNICODE +dc00 + + +NOTUNICODE +df80 + + +NOTUNICODE +dfff + +# Paired UTF-16 surrogates + + +NOTUNICODE +d800 dc00 + + +NOTUNICODE +d800 dfff + + +NOTUNICODE +db7f dc00 + + +NOTUNICODE +db7f dfff + + +NOTUNICODE +db80 dc00 + + +NOTUNICODE +db80 dfff + + +NOTUNICODE +dbff dc00 + + +NOTUNICODE +dbff dfff + +# Other illegal code positions + +￾ +NOTUNICODE +fffe + +￿ +NOTUNICODE +ffff + +################ +# +# Some more tests, not from Markus Kuhn's file +# + +# Mixed plane 0 and higher planes + +A𐀀B􏿿C +VALID +41 
00010000 42 10ffff 43
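
Note on the surrogate arithmetic: the final test case above mixes plane 0 characters with U+10000 and U+10FFFF, which exercises the surrogate-pair paths in g_utf8_to_utf16(), g_ucs4_to_utf16() and SURROGATE_VALUE(). The following is a minimal standalone sketch of that arithmetic only, not the GLib implementation. It assumes plain <stdint.h> types in place of gunichar/gunichar2, and the helper names encode_utf16() and decode_surrogates() are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the SURROGATE_VALUE macro added to gutf8.c. */
#define SURROGATE_VALUE(h,l) (((h) - 0xd800) * 0x400 + (l) - 0xdc00 + 0x10000)

/* Hypothetical helper: split one code point into UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if the code point
 * is a lone surrogate or lies beyond U+10FFFF, mirroring the range
 * checks in the patch.
 */
static int
encode_utf16 (uint32_t wc, uint16_t out[2])
{
  if (wc >= 0xd800 && wc < 0xe000)
    return 0;                       /* surrogates are not characters */

  if (wc < 0x10000)
    {
      out[0] = (uint16_t) wc;       /* plane 0: a single unit */
      return 1;
    }

  if (wc < 0x110000)
    {
      out[0] = (uint16_t) ((wc - 0x10000) / 0x400 + 0xd800); /* high */
      out[1] = (uint16_t) ((wc - 0x10000) % 0x400 + 0xdc00); /* low */
      return 2;
    }

  return 0;                         /* out of range for UTF-16 */
}

/* Hypothetical helper: recombine a high/low surrogate pair. */
static uint32_t
decode_surrogates (uint16_t high, uint16_t low)
{
  return SURROGATE_VALUE ((uint32_t) high, (uint32_t) low);
}
```

With these helpers, U+10000 encodes to d800 dc00 and U+10FFFF to dbff dfff, and decoding either pair gives back the original code point.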