# README for configuration files used by org.unicode.icu.tool.cldrtoicu.regex.RegexTransformer. # # © 2019 and later: Unicode, Inc. and others. # # CLDR data files are interpreted according to the LDML specification (http://unicode.org/reports/tr35/) # For terms of use, see http://www.unicode.org/copyright.html ====== Basics ====== The RegexTransformer class converts CLDR paths and values to ICU Resource Bundle paths and values, based on a set of transformation rules typically loaded from a text file (e.g. ldml2icu_locale.txt). The basic format of transformation rules is: ; [; =]* A simple example of a transformation rule is: //ldml/localeDisplayNames/keys/key[@type="(%A)"] ; /Keys/$1 which transforms CLDR values whose path matches the path specification, and emits: * A resource bundle path "/Keys/xx", where 'xx' is the captured type attribute. * A resource bundle value, which is just the CLDR value's base value. A path specification can be thought of as a regular expression which matches the CLDR path and can capture some element names or attribute values; however unlike a regular expression, the '[',']' characters are treated as literals, similar to XPath expressions. If a single CLDR value should produce more than one resource bundle path/value, then it should be written: ; [; ]* ; [; ]* ===================== Argument Substitution ===================== Before a rule can be matched, any %-variables must be substituted. These are defined in the same configuration file as the rules, and look something like: %W=[\w\-]++ or: %D=//ldml/numbers/defaultNumberingSystem The first case can be thought of as just a snippet of regular expression (in this case something that matches hyphen separated words) and, importantly, here '[' and ']' are treated as regular expression metacharacters. These arguments are static and wil be substituted exactly as-is into the regular expression to be used for matching. The second case (used exactly once) is a dynamic argument which references a CLDR value in the set of data being transformed. This is simply indicated by the fact that it starts with '//'. This path is resolved and the value is substituted just prior to matching. Variable names are limited to a single upper-case letter (A-Z). =========================== Implicit Argument Splitting =========================== This is a (somewhat non-obvious) mechanism which allows for a single rule to generate multiple results from a single input path when a argument is a list of tokens. Consider the rule: //supplementalData/timeData/hours[@allowed="(%W)"][@preferred="(%W)"][@regions="(%W)"] ; /timeData/$3/allowed ; values=$1 ; /timeData/$3/preferred ; values=$2 where the "regions" attributes (which is captured as '$3') contains a whitespace separated list of region codes (e.g. "US GB AU NZ"). In this case the rule is applied once for each region, producing paths such as "/timeData/US/allowed" or "/timeData/NZ/preferred". Note that there is no explicit instruction to do this, it just happens. The rule is that the first unquoted argument in the resource bundle path is always treated as splittable. To suppress this behaviour, the argument must be quoted (e.g. /timeData/"$3"/allowed). Now, if there were another following unquoted argument, that would become implicitly splittable (but only one argument is ever splittable). ============ Instructions ============ Additional instructions can be supplied to control value transformation and specify fallback values. The set of instructions is: * values: The most common instruction which defines how values are transformed. * fallback: Defines a fallback value to be used if this rule was not matched. There are two other special case instructions which should (if at all possible) not be used, and might be removed at some point: * group: Causes values to be grouped as sub-arrays for very specific use cases (prefer using "Hidden Labels" where possible). * base_xpath: Allows deduplication of results between multiple different rules (this is a hack to work around limitations in how matching is performed). ------------------- values= ------------------- The "values" instruction defines an expression whose evaluated result becomes the output resource bundle value(s). Unless quoting is present, this evaluated expression is split on whitespace and can become multiple values in the resulting resource bundle. Examples: * values=$1 $2 $3 Produces three separate values in the resource bundle for the first three captured arguments. * values="$1 $2" $3 Produces two values in the resource bundle, the first of which is two captured values separated by a space character. * values={value} Substitutes the CLDR value, but then performs whitespace splitting on the result. This differs from the behaviour when no "values" instructions is present (which does not split the results). * values="{value}" $1 Produces two values, the first of which is the unsplit CLDR value, and the second is a captured argument. * values=&func($1, {value}) Invokes a transformation function, passing in a captured argument and the CLDR value, and the result is then split. The set of functions available to a transformer is configured when it is created. Note that in the above examples, it is assumed that the $N arguments do not contain spaces. If they did, it would result in more output values. To be strict about things, every value which should not be split must be quoted (e.g. values="$1" "$2" "$3") but since captured values are often IDs or other tokens, this is not what is seen in practice, so it is not reflected in these examples. --------------------- fallback= --------------------- The fallback instruction provides a way for default values to be emitted for a path that was not matched. Fallbacks are useful when several different rules produce values for the same resource bundle. In this case the output path produced by one rule can be used as the "key" for any unmatched rules with fallback values (to "fill in the gaps"). Consider the two rules which can emit the same resource bundle path: //ldml/numbers/currencies/currency[@type="(%W)"]/symbol ; /Currencies/$1 ; fallback=$1 //ldml/numbers/currencies/currency[@type="(%W)"]/displayName ; /Currencies/$1 ; fallback=$1 These rules, if both matched, will produce two values for the same resource bundle path. Consider the CLDR values: //ldml/numbers/currencies/currency[@type="USD"]/symbol ==> "$" //ldml/numbers/currencies/currency[@type="USD"]/displayName ==> "US Dollar" After matching both of these paths, the values for the resource bundle "/Currencies/USD" will be the array { "$", "US Dollar" }. However, if only one value were present to be converted, the converter could use the matched path "/Currencies/XXX" and infer the missing fallback value, ensuring that the output array (it if was emitted at all) was always two values. Note that in order for this to work, the fallback value must be derivable only from the matched path. E.g. it cannot contain arguments that are not also present in the matched path, and obviously cannot reference the "{value}" at all. Thus the following would not be permitted: //ldml/foo/bar[@type="(%W)"][@region=(%A)] ; /Foo/$1 ; fallback=$2 However the fallback value can reference existing CLDR or resource bundle paths (expected to be present from other rules). For example: fallback=/weekData/001:intvector[0] or: fallback=//ldml/numbers/symbols[@numberSystem="%D"]/decimal The latter case is especially complex because it also uses the "dynamic" argument: %D=//ldml/numbers/defaultNumberingSystem So determining the resulting value will require: 1) resolving "//ldml/numbers/defaultNumberingSystem" to, for example, "arab" 2) looking up the value of "//ldml/numbers/symbols[@numberSystem="arab"]/decimal" ----------------- base_xpath= ----------------- The base_xpath instruction allows a rule to specify a proxy path which is used in place of the originally matched path in the returned result. This is a useful hack for cases where values are derived from information in a path prefix. Because path matching for transformation happens only on full paths, it is possible that several distinct CLDR paths might effectively generate the same result if they share the same prefix (i.e. paths in the same "sub hierarchy" of the CLDR data). If this happens, then you end up generating "the same" result from different paths. To fix this, a "surrogate" CLDR path can be specified as a proxy for the source path, allowing several results to appears to have come from the same source, which results in deduplication of the final value. For example, the two rules : //supplementalData/territoryInfo/territory[...][@writingPercent="(%N)"][@populationPercent="(%N)"][@officialStatus="(%W)"](?:[@references="%W"])? ; /territoryInfo/$1/territoryF:intvector ; values=&exp($2) &exp($3,-2) &exp($4) ; base_xpath=//supplementalData/territoryInfo/territory[@type="$1"] //supplementalData/territoryInfo/territory[...][@writingPercent="(%N)"][@populationPercent="(%N)"](?:[@references="%W"])? ; /territoryInfo/$1/territoryF:intvector ; values=&exp($2) &exp($3,-2) &exp($4) ; base_xpath=//supplementalData/territoryInfo/territory[@type="$1"] Produce the same results for different paths (with or without the "officialStatus" attribute) but only one such result is desired. By specifying the same base_xpath on both rules, the conversion logic can deduplicate these to produce only one result. When using base_xpath, it is worth noting that: 1) Base xpaths must be valid "distinguishing" paths (but are never matched to any rule). 2) Base xpaths can use arguments to achieve the necessary level of uniqueness. 3) Rules which share the same base xpath must always produce the same values. Note however that this is a still very much a hack because since two rules are responsible for generating the same result, there is no well defined "line number" to use for ordering of values. Thus this mechanism should only be used for rules which produce "single" values, and must not be used in cases where the ordering of values in arrays is important. This mechanism only exists because there is currently no mechanism for partial matching or a way to match one path against multiple rules. ----- group ----- The "group" instruction should be considered a "last resort" hack for controlling value grouping, in cases where "hidden labels" are not suitable (see below). ============================== Value Arrays and Hidden Labels ============================== In the simplest case, one rule produces one or more output path/values per matched CLDR value (i.e. one-to-one or one-to-many). If that happens, then output ordering of the resource bundle paths is just the natural resource bundle path ordering. However it is also possible for several rules to produce values for a single output path (i.e. many-to-one). When this happens there are some important details about how results are grouped and ordered. ------------ Value Arrays ------------ If several rules produce results for the same resource bundle path, the values produced by the rules are always ordered according to the order of the rule in the configuration rule (and it is best practice to group any such rules together for clarity). If each rule produces multiple values, then depending on grouping, those values can either be concatenated together in a single array or grouped individually to create an array of arrays. In the example below, there are four rules producing values for the same path ( //.../firstDay[@day="(%W)"][@territories="(%W)"] ; /weekData/$2:intvector ; values=&day_number($1) //.../minDays[@count="(%N)"][@territories="(%W)"] ; /weekData/$2:intvector ; values=$1 //.../weekendStart[@day="(%W)"][@territories="(%W)"] ; /weekData/$2:intvector ; values=&day_number($1) 0 //.../weekendEnd[@day="(%W)"][@territories="(%W)"] ; /weekData/$2:intvector ; values=&day_number($1) 86400000 The first two rules produce one value each, and the last two produce two values each. This results in the resource bundle "/weekData/xxx:intvector" having a single array consisting of six values. In the real configuration, these rules also use fallback instructions to ensure that the resulting array of values is always six values, even if some CLDR paths are not present. ------------- Hidden Labels ------------- Sometimes rules should produce separate "sub-arrays" of values, rather than having all the values appended to a single array. Consider the following path/value pairs: x/y: a x/y: b x/y: c Which produce the resource bundle "x/y" with three values: x{ y{ "a", "b", "c" } } Now suppose we want to make a resource bundle where the values are grouped into their own sub-array: x{ y{ { "a", "b", "c" } } } We can think of this as coming from the path/value pairs: x/y/-: a x/y/-: b x/y/-: c where to represent the sub-array we introduce the idea of an empty path element '-'. In a transformation rule, these "empty elements" are represent as "hidden labels", and look like "". They are treated as "normal" path elements for purposes of ordering and grouping, but are treated as empty when the paths are written to the ICU data files. For example the rule: //.../currencyCodes[@type="(%W)"][@numeric="(%N)"].* ; /codeMappingsCurrency/<$1> ; values=$1 $2 Generates a series of grouped, 2-element sub-arrays split by the captured type attribute. codeMappingCurrency{ { type-1, numeric-1 } { type-2, numeric-2 } { type-3, numeric-3 } } is a special hidden label which is substituted for in incrementing counting when sorting paths. It ensures that values in the same array are sorted in the order that they were encountered. However this mechanism imposes a strict requirement that the ordering of CLDR values to be transformed matches the expected ICU value order, so it should be avoided where possible to avoid this implicit, subtle dependency. Note that this mechanism is currently only enabled for the transformation of "supplemental data" and may eventually be removed. Hidden labels are a neat solution which permits the generation of sub-array values, but they don't quite work in every case. For example if you need to produce a resource bundle with a mix of values and sub-arrays, like: x{ y{ "a", { "b", "c" } "d" } } which can be thought of as coming from the path/value pairs: x/y: a x/y/: b x/y/: c x/y: d we find that, after sorting the resource bundle paths, we end up with: x/y: a x/y: d x/y/: b x/y/: c which produces the wrong result. This happens because values with different paths are sorted primarily by their path. I cases like this, where a mix of values and sub-arrays are required, the "group" instruction can be used instead. For example: //ldml/numbers/currencies/currency[@type="(%W)"]/symbol ; /Currencies/$1 //ldml/numbers/currencies/currency[@type="(%W)"]/displayName ; /Currencies/$1 //ldml/numbers/currencies/currency[@type="(%W)"]/pattern ; /Currencies/$1 ; group //ldml/numbers/currencies/currency[@type="(%W)"]/decimal ; /Currencies/$1 ; group //ldml/numbers/currencies/currency[@type="(%W)"]/group ; /Currencies/$1 ; group Produces resource bundles which look like: Currencies{ xxx{ "", "", { "", "", "" } } }