gunicode: Switch compose_array table from guint16 to gunichar

Unicode has finally specified a codepoint above U+FFFF which has a
decomposition: U+16125 GURUNG KHEMA VOWEL SIGN AI. It is new in
Unicode 16, which the following commits will add support for.

So far, we’ve managed to store the reverse-lookup from decomposed pairs
to their composed form using a 16-bit integer. Now we have to switch to
storing the composed form in a 32-bit `gunichar` as U+16125 won’t fit
otherwise.
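
As a minimal standalone sketch of why the 16-bit element type no longer
suffices: the types below are `<stdint.h>` stand-ins for `guint16` and
`gunichar`, and the helper names are hypothetical, not GLib API.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the GLib types, so the sketch builds without GLib. */
typedef uint16_t sketch_guint16;
typedef uint32_t sketch_gunichar;

/* Hypothetical helper: the value a 16-bit table slot would end up holding. */
static sketch_guint16
store_in_16_bits (sketch_gunichar composed)
{
  return (sketch_guint16) composed;  /* keeps only the low 16 bits */
}

/* Hypothetical helper: does a codepoint survive a 16-bit round trip? */
static int
fits_in_16_bits (sketch_gunichar composed)
{
  return (sketch_gunichar) store_in_16_bits (composed) == composed;
}

/* Small self-check showing the truncation of U+16125. */
void
sketch_truncation_check (void)
{
  assert (fits_in_16_bits (0x00C5));              /* a BMP codepoint fits */
  assert (!fits_in_16_bits (0x16125));            /* U+16125 does not */
  assert (store_in_16_bits (0x16125) == 0x6125);  /* the high bit is lost */
}
```

Stored in a 16-bit slot, U+16125 would silently become U+6125, which is
why the generator script previously refused to continue in that case.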

This introduces no functional changes, but roughly doubles the in-memory
size of the `compose_array` table, from 9176 bytes to 19932 bytes.
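
The growth of the array itself follows from the element width alone. A
sketch with made-up dimensions (the real `$n_first` and `$n_second` come
from the Unicode data and are not stated here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dimensions, chosen only to keep the arithmetic visible. */
enum { SKETCH_N_FIRST = 4, SKETCH_N_SECOND = 3 };

/* The same table shape with 16-bit and with 32-bit elements. */
static const uint16_t sketch_narrow_table[SKETCH_N_FIRST][SKETCH_N_SECOND] = {{0}};
static const uint32_t sketch_wide_table[SKETCH_N_FIRST][SKETCH_N_SECOND] = {{0}};

size_t
sketch_narrow_bytes (void)
{
  return sizeof sketch_narrow_table;  /* 4 * 3 * 2 */
}

size_t
sketch_wide_bytes (void)
{
  return sizeof sketch_wide_table;    /* 4 * 3 * 4 */
}

/* Self-check: widening the element type doubles the array size. */
void
sketch_size_check (void)
{
  assert (sketch_wide_bytes () == 2 * sketch_narrow_bytes ());
}
```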

The code which uses this lookup table, in `gunidecomp.c`, was already
implicitly converting the loaded value to a `gunichar`, so it needs no
changes.
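
A sketch of that kind of lookup, assuming the caller reads the table
entry into a 32-bit variable. The table contents, indices and function
name here are hypothetical and much simpler than the real code in
`gunidecomp.c`:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t sketch_gunichar;  /* stand-in for gunichar */

/* Hypothetical 2x2 table; real indices are derived from the Unicode data. */
static const sketch_gunichar sketch_compose_array[2][2] = {
  { 0x0000c5, 0 },
  { 0, 0x016125 },
};

/* Returns 1 and stores the composed codepoint if the pair composes. */
int
sketch_combine (unsigned first, unsigned second, sketch_gunichar *result)
{
  sketch_gunichar value;

  if (first >= 2 || second >= 2)
    return 0;

  /* This assignment reads the same whether the elements are 16 or 32 bits
   * wide; a narrower element is simply widened on load. */
  value = sketch_compose_array[first][second];
  if (value == 0)
    return 0;  /* 0 marks "no composition for this pair" */

  *result = value;
  return 1;
}

/* Self-check: a composed form above U+FFFF comes back intact. */
void
sketch_combine_check (void)
{
  sketch_gunichar out = 0;

  assert (sketch_combine (1, 1, &out) == 1 && out == 0x16125);
}
```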

When we update to Unicode 16, the new `NormalizationTest.txt` file
contains a test which will check that composed codepoints above U+FFFF
work. Specifically, U+11391 TULU-TIGALARI LETTER AU is tested.

Signed-off-by: Philip Withnall <pwithnall@gnome.org>

Helps: #3470
Philip Withnall 2024-10-18 14:37:40 +01:00
parent e9902a66a9
commit ad51ff8038

@@ -1352,7 +1352,7 @@ sub output_composition_table
     # Output array of composition pairs
     print OUT <<EOT;
-static const guint16 compose_array[$n_first][$n_second] = {
+static const gunichar compose_array[$n_first][$n_second] = {
 EOT
     for (my $i = 0; $i < $n_first; $i++) {
@@ -1361,10 +1361,7 @@ EOT
         for (my $j = 0; $j < $n_second; $j++) {
             print OUT ", " if $j;
             if (exists $reverse{"$i|$j"}) {
-                if ($reverse{"$i|$j"} > 0xFFFF) {
-                    die "time to switch compose_array to gunichar" ;
-                }
-                printf OUT "0x%04x", $reverse{"$i|$j"};
+                printf OUT "0x%06x", $reverse{"$i|$j"};
             } else {
                 print OUT " 0";
             }
@@ -1377,7 +1374,7 @@ EOT
 };
 EOT
-    $bytes_out += $n_first * $n_second * 2;
+    $bytes_out += $n_first * $n_second * 4;
     printf STDERR "Generated %d bytes in compose tables\n", $bytes_out;
 }
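
The switch from `0x%04x` to `0x%06x` only changes zero-padding in the
generated source, not the stored values. Perl's `printf` follows the C
conversions for these formats, so the effect can be sketched in C. The
helper names are hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one table entry the old way (4 hex digits). */
int
sketch_format_old (char *buf, size_t len, unsigned value)
{
  return snprintf (buf, len, "0x%04x", value);
}

/* Hypothetical helper: format one table entry the new way (6 hex digits). */
int
sketch_format_new (char *buf, size_t len, unsigned value)
{
  return snprintf (buf, len, "0x%06x", value);
}

/* Self-check: both formats keep the value; only the padding differs. */
void
sketch_format_check (void)
{
  char buf[16];

  sketch_format_old (buf, sizeof buf, 0x16125);
  assert (strcmp (buf, "0x16125") == 0);   /* 5 digits, column overflows */

  sketch_format_new (buf, sizeof buf, 0x16125);
  assert (strcmp (buf, "0x016125") == 0);  /* padded to 6 digits */
}
```

Six hex digits are enough for every Unicode codepoint, since the maximum
is U+10FFFF, so the generated columns stay aligned.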