Merge branch 'gregex_description_update' into 'main'

GRegex: update class description

See merge request GNOME/glib!4646
2025-11-27 04:29:51 +01:00 · 2025-05-30 11:39:50 +00:00
parent b7ce23703e 2ab485a472
commit c1ff515f09
1 changed file with 120 additions and 45 deletions
--- a/glib/gregex.c
+++ b/glib/gregex.c
@@ -42,58 +42,133 @@
 /**
 * GRegex:
 *
- * A `GRegex` is the "compiled" form of a regular expression pattern.
+ * A `GRegex` is a compiled form of a regular expression.
 * 
 * After instantiating a `GRegex`, you can use its methods to find matches
 * in a string, replace matches within a string, or split the string at matches.
 *
- * `GRegex` implements regular expression pattern matching using syntax and
+ * `GRegex` implements regular expression pattern matching using syntax and 
- * semantics similar to Perl regular expression. See the
+ * semantics (such as character classes, quantifiers, and capture groups) 
- * [PCRE documentation](man:pcrepattern(3)) for the syntax definition.
+ * similar to Perl regular expressions. See the
 * [PCRE documentation](man:pcre2pattern(3)) for details.
 *
- * Some functions accept a @start_position argument, setting it differs
+ * A typical scenario for regex pattern matching is to check if a string 
- * from just passing over a shortened string and setting %G_REGEX_MATCH_NOTBOL
+ * matches a pattern. The following statements implement this scenario.
- * in the case of a pattern that begins with any kind of lookbehind assertion.
+ * 
- * For example, consider the pattern `\Biss\B` which finds occurrences of `iss`
+ * ``` { .c }
- * in the middle of words. (`\B` matches only if the current position in the
+ * const char *regex_pattern = ".*GLib.*";
- * subject is not a word boundary.) When applied to the string `Mississipi`
+ * const char *string_to_search = "You will love the GLib implementation of regex";
- * from the fourth byte, namely `issipi`, it does not match, because `\B` is
+ * g_autoptr(GMatchInfo) match_info = NULL;
- * always false at the start of the subject, which is deemed to be a word
+ * g_autoptr(GRegex) regex = NULL;
 * boundary. However, if the entire string is passed, but with
 * @start_position set to 4, it finds the second occurrence of `iss` because
 * it is able to look behind the starting point to discover that it is
 * preceded by a letter.
 *
- * Note that, unless you set the %G_REGEX_RAW flag, all the strings passed
+ * regex = g_regex_new (regex_pattern, G_REGEX_DEFAULT, G_REGEX_MATCH_DEFAULT, NULL);
- * to these functions must be encoded in UTF-8. The lengths and the positions
+ * g_assert (regex != NULL);
- * inside the strings are in bytes and not in characters, so, for instance,
+ * 
- * `\xc3\xa0` (i.e., `à`) is two bytes long but it is treated as a
+ * if (g_regex_match (regex, string_to_search, G_REGEX_MATCH_DEFAULT, &match_info))
- * single character. If you set %G_REGEX_RAW the strings can be non-valid
+ *   {
- * UTF-8 strings and a byte is treated as a character, so `\xc3\xa0` is two
+ *     int start_pos, end_pos;
- * bytes and two characters long.
+ *     g_match_info_fetch_pos (match_info, 0, &start_pos, &end_pos);
 *     g_print ("Match successful! Overall pattern matches bytes %d to %d\n", start_pos, end_pos);
 *   }
 * else
 *   {
 *     g_print ("No match!\n");
 *   }
 * ```
 * 
 * The constructor for `GRegex` includes two sets of bitmapped flags:
 * * [flags@GLib.RegexCompileFlags]—These flags 
 * control how GLib compiles the regex. There are options for case 
 * sensitivity, multiline, ignoring whitespace, etc.
 * * [flags@GLib.RegexMatchFlags]—These flags control 
 * `GRegex`’s matching behavior, such as anchoring and customizing definitions 
 * for newline characters.
 * 
 * Some regex patterns include backslash escapes, such as `\d` (digit) or
 * `\D` (non-digit). In a C string literal, each of those backslashes must
 * itself be escaped. For example, the pattern `"\\d\\D"` matches a digit
 * followed by a non-digit.
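
To illustrate the escaping rule: the C compiler collapses each `\\` in a string literal into a single backslash, so the regex engine receives `\d\D`. The following is a minimal sketch in plain C, with no GLib required. `count_backslashes` is a hypothetical helper introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the backslash characters that remain in
 * the string after the compiler has processed the literal. */
static size_t
count_backslashes (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (*s == '\\')
      n++;
  return n;
}

/* The source text between the quotes is six characters long, but the
 * compiled string holds only four: backslash, d, backslash, D. */
static const char *regex_pattern = "\\d\\D";
```

Here `strlen (regex_pattern)` is 4 and `count_backslashes (regex_pattern)` is 2, which is the form the pattern compiler expects.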
 *
- * When matching a pattern, `\n` matches only against a `\n` character in
+ * GLib’s implementation of pattern matching includes a `start_position` 
- * the string, and `\r` matches only a `\r` character. To match any newline
+ * argument for some of the match, replace, and split methods. Specifying 
- * sequence use `\R`. This particular group matches either the two-character
+ * a start position provides flexibility when you want to ignore the first 
- * sequence CR + LF (`\r\n`), or one of the single characters LF (linefeed,
+ * _n_ characters of a string, but want to incorporate backslash assertions 
- * U+000A, `\n`), VT vertical tab, U+000B, `\v`), FF (formfeed, U+000C, `\f`),
+ * at character _n_ - 1. For example, a database field contains inconsistent
- * CR (carriage return, U+000D, `\r`), NEL (next line, U+0085), LS (line
+ * spelling for a job title: `healthcare provider` and `health-care provider`.
- * separator, U+2028), or PS (paragraph separator, U+2029).
+ * The database manager wants to make the spelling consistent by adding a 
 * hyphen when it is missing. The following regex pattern tests for the string
 * `care` when it is not at a word boundary (that is, preceded by a letter
 * rather than a hyphen) and followed by a whitespace character.
 *
- * The behaviour of the dot, circumflex, and dollar metacharacters are
+ * ``` { .c }
- * affected by newline characters, the default is to recognize any newline
+ * const char *regex_pattern = "\\Bcare\\s";
- * character (the same characters recognized by `\R`). This can be changed
+ * ```
 * with `G_REGEX_NEWLINE_CR`, `G_REGEX_NEWLINE_LF` and `G_REGEX_NEWLINE_CRLF`
 * compile options, and with `G_REGEX_MATCH_NEWLINE_ANY`,
 * `G_REGEX_MATCH_NEWLINE_CR`, `G_REGEX_MATCH_NEWLINE_LF` and
 * `G_REGEX_MATCH_NEWLINE_CRLF` match options. These settings are also
 * relevant when compiling a pattern if `G_REGEX_EXTENDED` is set, and an
 * unescaped `#` outside a character class is encountered. This indicates
 * a comment that lasts until after the next newline.
 *
- * Creating and manipulating the same `GRegex` structure from different
+ * An efficient way to match with this pattern is to start examining at 
- * threads is not a problem as `GRegex` does not modify its internal
+ * `start_position` 6 in the string `healthcare` or `health-care`.
- * state between creation and destruction, on the other hand `GMatchInfo`
+ *
- * is not threadsafe.
+ * ``` { .c }
 * const char *regex_pattern = "\\Bcare\\s";
 * const char *string_to_search = "healthcare provider";
 * g_autoptr(GMatchInfo) match_info = NULL;
 * g_autoptr(GRegex) regex = NULL;
 *
- * The regular expressions low-level functionalities are obtained through
+ * regex = g_regex_new (
 *   regex_pattern,
 *   G_REGEX_DEFAULT,
 *   G_REGEX_MATCH_DEFAULT,
 *   NULL);
 * g_assert (regex != NULL);
 * 
 * g_regex_match_full (
 *   regex, 
 *   string_to_search, 
 *   -1,
 *   6, // position of 'c' in the test string.
 *   G_REGEX_MATCH_DEFAULT, 
 *   &match_info,
 *   NULL);
 * ```
 * 
 * The method [method@GLib.Regex.match_full] (and other methods that take a
 * `start_position` argument) allows looking behind the start position to
 * determine if the previous character satisfies an assertion.
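
The look-behind behaviour described above can be modelled without a regex engine. The sketch below is a simplified, ASCII-only approximation of the `\B` assertion and is not how PCRE2 or `GRegex` implement it. `is_word_char` and `not_word_boundary` are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ASCII-only approximation of a regex "word" character: [A-Za-z0-9_]. */
static bool
is_word_char (char c)
{
  return isalnum ((unsigned char) c) || c == '_';
}

/* Hypothetical model of \B at byte offset pos of subject.
 * \B holds when pos is NOT a word boundary, that is, when the characters
 * on both sides of pos are of the same kind. Positions outside the
 * subject count as non-word characters, so \B fails at offset 0 of a
 * subject that starts with a letter. */
static bool
not_word_boundary (const char *subject, size_t pos)
{
  size_t len;
  bool before, after;

  assert (subject != NULL);
  len = strlen (subject);
  assert (pos <= len);

  /* Looking at subject[pos - 1] is the "look behind" step. */
  before = pos > 0 && is_word_char (subject[pos - 1]);
  after = pos < len && is_word_char (subject[pos]);
  return before == after;
}
```

With the full string, `not_word_boundary ("healthcare provider", 6)` is true because the `h` at offset 5 is visible. With a shortened copy, `not_word_boundary ("care provider", 0)` is false because nothing precedes the `c`. This mirrors the difference between passing a start position and passing a truncated string.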
 *
 * Unless you set [flags@GLib.RegexCompileFlags.RAW] as one of the
 * `GRegexCompileFlags`, all the strings passed to `GRegex` methods must
 * be encoded in UTF-8. The lengths and the positions inside the strings are
 * in bytes and not in characters, so, for instance, `\xc3\xa0` (i.e., `à`)
 * is two bytes long but it is treated as a single character. If you set
 * `G_REGEX_RAW`, the strings can be invalid UTF-8 and each byte is
 * treated as a character, so `\xc3\xa0` is two bytes and two characters long.
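
The byte-versus-character distinction can be checked in plain C. The sketch below assumes valid UTF-8 input. `count_utf8_chars` is a hypothetical helper written for this illustration; GLib's own `g_utf8_strlen()` serves the same purpose in real code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts code points in a valid UTF-8 string by
 * skipping continuation bytes, which all have the bit pattern 10xxxxxx. */
static size_t
count_utf8_chars (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}
```

For the string `"\xc3\xa0"`, `strlen()` reports 2 bytes while `count_utf8_chars()` reports 1 character. Positions returned by `GRegex` follow the byte count, so a match that starts after `à` starts at byte offset 2, not 1.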
 *
 * Regarding line endings, `\n` matches a `\n` character, and `\r` matches 
 * a `\r` character. More generally, `\R` matches all typical line endings: 
 * CR + LF (`\r\n`), LF (linefeed, U+000A, `\n`), VT (vertical tab, U+000B, 
 * `\v`), FF (formfeed, U+000C, `\f`), CR (carriage return, U+000D, `\r`), 
 * NEL (next line, U+0085), LS (line separator, U+2028), and PS (paragraph 
 * separator, U+2029).
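
The list of sequences matched by `\R` can be written out as a small classifier. This is a sketch only, assuming UTF-8 input (that is, `G_REGEX_RAW` is not set). `newline_sequence_len` is a hypothetical helper and not part of GLib or PCRE2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the byte length of the newline sequence
 * that starts at s (with len bytes available), or 0 if none starts there. */
static size_t
newline_sequence_len (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  if (len == 0)
    return 0;

  switch (s[0])
    {
    case '\r': /* CR, or CR + LF taken as one two-byte sequence */
      return (len >= 2 && s[1] == '\n') ? 2 : 1;
    case '\n': /* LF, U+000A */
    case '\v': /* VT, U+000B */
    case '\f': /* FF, U+000C */
      return 1;
    case 0xC2: /* NEL, U+0085, encoded as C2 85 */
      return (len >= 2 && s[1] == 0x85) ? 2 : 0;
    case 0xE2: /* LS U+2028 (E2 80 A8) and PS U+2029 (E2 80 A9) */
      return (len >= 3 && s[1] == 0x80 && (s[2] == 0xA8 || s[2] == 0xA9)) ? 3 : 0;
    default:
      return 0;
    }
}
```

Note that CR + LF is reported as a single two-byte sequence, while the Unicode separators occupy two or three bytes each in UTF-8.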
 * 
 * The behaviour of the dot, circumflex, and dollar metacharacters is
 * affected by newline characters. By default, `GRegex` recognizes any newline
 * sequence matched by `\R`. You can change which newlines are recognized by
 * specifying the [flags@GLib.RegexCompileFlags.NEWLINE_CR],
 * [flags@GLib.RegexCompileFlags.NEWLINE_LF], and
 * [flags@GLib.RegexCompileFlags.NEWLINE_CRLF] compile options, and
 * with the [flags@GLib.RegexMatchFlags.NEWLINE_ANY],
 * [flags@GLib.RegexMatchFlags.NEWLINE_CR],
 * [flags@GLib.RegexMatchFlags.NEWLINE_LF] and
 * [flags@GLib.RegexMatchFlags.NEWLINE_CRLF] match options.
 * These settings are also relevant when compiling a pattern if 
 * [flags@GLib.RegexCompileFlags.EXTENDED] is set and an unescaped 
 * `#` outside a character class is encountered. This indicates a comment 
 * that lasts until after the next newline.
 * 
 * Because `GRegex` does not modify its internal state between creation and
 * destruction, you can share and use the same `GRegex` instance across
 * different threads. In contrast, [struct@GLib.MatchInfo] is not thread-safe.
 * 
 * The low-level regular expression functionality is provided by
 * the excellent [PCRE](http://www.pcre.org/) library written by Philip Hazel.
 *
 * Since: 2.14