Using UTF-8 as internal representation
Some notes on using UTF-8 as the internal representation for ECMAScript strings when surrogate pairs can be combined.
Current representation
Current internal representation is a union of:
- CESU-8: to support full 16-bit codepoint sequences without limitations. In particular, individual and unpaired surrogates must work without interpretation or conversion.
- UTF-8: to support non-BMP characters, if they are created from C code or e.g. using String.fromCharCode(0x12345).
- Extended UTF-8: to support codepoints up to U+FFFFFFFF. This is now only needed by the regexp bytecode, which uses extended UTF-8 as its internal representation and needs to represent long offsets as codepoints.
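For concreteness, here is how a single non-BMP codepoint (U+12345, also used in the examples below) looks in the first two encodings. The byte values are just standard CESU-8/UTF-8, nothing Duktape specific:

```c
/* Byte-level illustration: U+12345 as a surrogate pair in CESU-8 vs.
 * encoded directly in UTF-8.
 */
static const unsigned char u12345_cesu8[6] = {
    0xED, 0xA0, 0x88,   /* U+D808, high surrogate */
    0xED, 0xBD, 0x85    /* U+DF45, low surrogate */
};
static const unsigned char u12345_utf8[4] = {
    0xF0, 0x92, 0x8D, 0x85   /* U+12345 encoded directly */
};
```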
C API problem with current representation
One concrete problem with this arrangement is that non-BMP strings are internally represented as CESU-8:
- If source code contains a non-BMP character, the ECMAScript specification requires that such a character be decoded into surrogates; from https://www.ecma-international.org/ecma-262/5.1/#sec-6:
  - "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16."
- This means that x = '\u{12345}' and x = '\ud808\udf45' MUST be treated identically. For example, for both inputs:
  - The string's .length must be 2.
  - x[0] must be 0xd808, and x[1] must be 0xdf45.
  - RegExps must be able to match the individual surrogates, and one must be able to e.g. backtrack each surrogate separately.
  - It must be possible to take a substring whose one end is between the surrogate codepoints.
- In the current C API such a string will appear CESU-8 encoded because that's the internal representation used for surrogate codepoints.
- Applications dealing natively with UTF-8 would often prefer to see UTF-8 rather than CESU-8, thus avoiding the need to transcode CESU-8 to UTF-8.
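As a rough illustration (a sketch against the public Duktape API; error checking omitted), a non-BMP string created from script code currently comes back through the C API as six CESU-8 bytes rather than four UTF-8 bytes:

```c
#include <stdio.h>
#include "duktape.h"

int main(void) {
    duk_context *ctx = duk_create_heap_default();
    duk_size_t i, len;
    const char *p;

    /* Conceptually U+12345, expressed as a surrogate pair in source code. */
    duk_eval_string(ctx, "'\\ud808\\udf45'");
    p = duk_get_lstring(ctx, -1, &len);

    /* With the current representation this prints the CESU-8 bytes
     * ed a0 88 ed bd 85, not the UTF-8 bytes f0 92 8d 85.
     */
    for (i = 0; i < len; i++) {
        printf("%02x ", (unsigned char) p[i]);
    }
    printf("\n");

    duk_destroy_heap(ctx);
    return 0;
}
```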
The ECMAScript specification doesn't (and cannot) mandate any specific internal representation, nor does it impose any requirements on how a C API must represent strings. The current convention of using CESU-8 for standard ECMAScript strings is thus not really mandatory. However, if an alternative representation is used, it MUST behave identically as far as script code is concerned.
Automatically combining surrogates in internal representation
One alternative to the current internal representation is to:
- Keep the current CESU-8 + UTF-8 + extended UTF-8 as the base representation.
- When a conceptual ECMAScript string contains correctly paired surrogates, combine the surrogates into the actual non-BMP codepoint, which is then encoded as valid UTF-8 rather than CESU-8 (see the sketch after this list).
- When a non-paired surrogate is found, encode it as CESU-8 as before.
- This process must be applied to all inputs, both script code and C code, so that a certain conceptual ECMAScript string has a unique duk_hstring representation. (If this is not the case, string comparison using an interned string pointer would no longer be valid, which would lead to a lot of complications.)
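A minimal sketch of the pairing rule itself (plain C, not actual Duktape internals): a valid high/low surrogate pair combines into a single codepoint which would then be UTF-8 encoded, while anything that does not pair stays CESU-8 encoded as before.

```c
#include <stdint.h>

/* Combine a surrogate pair into a non-BMP codepoint, or return -1 if the
 * two values don't form a valid pair (in which case 'hi' is encoded as
 * CESU-8 and decoding continues at 'lo').
 */
static int32_t combine_surrogates(uint32_t hi, uint32_t lo) {
    if (hi >= 0xD800UL && hi <= 0xDBFFUL &&
        lo >= 0xDC00UL && lo <= 0xDFFFUL) {
        return (int32_t) (0x10000UL + ((hi - 0xD800UL) << 10) + (lo - 0xDC00UL));
    }
    return -1;
}
```

For example, combine_surrogates(0xd808, 0xdf45) evaluates to 0x12345, matching the example above.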
This would have the upside that:
- Valid Unicode strings (codepoints U+0000 to U+10FFFF, excluding the surrogate range U+D800 to U+DFFF) would appear as valid UTF-8 (not CESU-8) in the C API.
- Pushing UTF-8 strings would produce strings that behave like standard ECMAScript strings, i.e. they would conceptually have surrogate pairs in place of non-BMP codepoints.
And a few downsides:
- All the internal code would need to maintain an "as if" illusion: such strings must appear as uninterpreted 16-bit codepoint sequences, and all 16-bit codepoint sequences must still work without difference as far as script code is concerned. This is not trivial; more on this below.
- One would no longer be able to push an arbitrary byte sequence as a string (duk_push_string()) and then read it back as is: the automatic surrogate combination means the output might differ, with CESU-8 encoded surrogate pairs combined into UTF-8 (see the sketch below). This is a loss of current functionality which has been useful for some applications; one can e.g. push ISO-8859-1 strings as is and read them back. Script code will see such strings as being somewhat broken, but they have previously passed through without modification.
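A sketch of the behavior change (hypothetical, since automatic combining is not implemented; only documented Duktape API calls are used):

```c
#include "duktape.h"

static void push_and_read_back(duk_context *ctx) {
    /* Push U+D808 U+DF45 as raw CESU-8 bytes. */
    duk_push_lstring(ctx, "\xED\xA0\x88\xED\xBD\x85", 6);

    /* Today duk_get_string() returns the same six bytes.  With automatic
     * surrogate combination it would instead return the four UTF-8 bytes
     * f0 92 8d 85, so the byte-exact round trip is lost.
     */
    (void) duk_get_string(ctx, -1);
    duk_pop(ctx);
}
```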
Some internals where the "as if" illusion must be maintained:
- A string's .length must count non-BMP codepoints as 2 codepoints to get the standard length.
- String.charCodeAt() and all other String functions must use an index scheme that references the conceptual 16-bit codepoint sequence (where each non-BMP codepoint counts as two indices), and allow reading, substringing, etc. each surrogate of a pair individually.
- There's no longer an easy "char offset to byte offset" internal primitive. Currently such a conversion maps an integer to an integer (or an error). For non-BMP characters the result would now be a tuple: an integer pointing to the start of the codepoint, and a flag indicating whether the high or the low surrogate is wanted. All places maintaining a "current offset" must track that additional flag somehow; it could perhaps be encoded as the high bit of a 32-bit unsigned value (see the offset sketch after this list).
- When doing string replacements, the code must always check whether the replacement created valid surrogate pairs from previously unpaired surrogates; such pairs must be merged to maintain a unique string representation. These surrogates may appear at the edges of replacement strings.
- When combining strings, the code must check for previously unpaired surrogates at the string join point (see the join-point sketch after this list).
- RegExp matching must match non-BMP codepoints as two surrogates individually as far as patterns are concerned. It must be possible to capture only one of the surrogates, backtrack each surrogate individually, match start offset must try both surrogates as starting points, etc.
- RegExp /u mode would work trivially with this internal representation, as the codepoints are already combined.
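For the "char offset to byte offset" issue above, one possible encoding (purely an assumption, not an existing Duktape primitive) packs the surrogate-half flag into the high bit of a 32-bit value, as speculated in the list:

```c
#include <stdint.h>

/* Hypothetical packed "char offset to byte offset" result: the low 31 bits
 * are the byte offset of the (possibly combined) codepoint, the high bit
 * selects the low surrogate of a combined non-BMP codepoint.
 */
#define OFFSET_LOW_SURROGATE_FLAG 0x80000000UL

static uint32_t pack_offset(uint32_t byte_offset, int want_low_surrogate) {
    return (uint32_t) (byte_offset | (want_low_surrogate ? OFFSET_LOW_SURROGATE_FLAG : 0UL));
}

static uint32_t unpack_byte_offset(uint32_t packed) {
    return packed & 0x7FFFFFFFUL;
}

static int unpack_is_low_surrogate(uint32_t packed) {
    return (packed & OFFSET_LOW_SURROGATE_FLAG) != 0UL;
}
```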
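For the string join point check, a minimal sketch (hypothetical helpers, not Duktape internals) of detecting a CESU-8 high surrogate at the end of the left part and a low surrogate at the start of the right part; when both hold, the concatenation would need to re-encode that pair as a single UTF-8 codepoint to keep the representation unique:

```c
#include <stddef.h>

/* CESU-8 encodes U+D800..U+DBFF as ED A0..AF xx and U+DC00..U+DFFF as
 * ED B0..BF xx, so the surrogate halves can be detected from the first
 * two bytes of the three-byte sequence.
 */
static int ends_with_high_surrogate(const unsigned char *p, size_t len) {
    return len >= 3 && p[len - 3] == 0xEDU &&
           p[len - 2] >= 0xA0U && p[len - 2] <= 0xAFU;
}

static int starts_with_low_surrogate(const unsigned char *p, size_t len) {
    return len >= 3 && p[0] == 0xEDU &&
           p[1] >= 0xB0U && p[1] <= 0xBFU;
}
```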