Closed Bug 320500 Opened 19 years ago Closed 9 years ago

Add \u{xxxxxx} string literals for non-BMP Unicode characters

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox40 --- fixed

People

(Reporter: daumling, Assigned: arai)

Details

(Keywords: dev-doc-complete, intl)

Attachments

(1 file)

The Chinese government requires support for Unicode characters > 0xFFFF. SpiderMonkey should at least support the definition of large Unicode character constants. Adobe ExtendScript has the \Uxxxxxxxx (capital U, 8 hex digits) notation, which generates a surrogate pair. We should add this capability to the SpiderMonkey parser as well.

Is extended Unicode character support planned for JS 2.0?
(In reply to comment #0)
> Adobe ExtendScript has the \Uxxxxxxxx (capital U, 8 hex digits)
> notation that generates a surrogate pair. We should add this capability to the
> SpiderMonkey parser as well.

Does it support only \Uxxxxxxxx and not, say, \U{xxxxxxxx}? The latter is much easier to read and allows writing \U{xxxxx} with five hex digits, which should cover, AFAIK, all currently defined characters.
Let me modify the proposal according to your idea:

Let's allow for {xxx} as a generic hex escape sequence. Make \u equivalent to \x:

\u{12345} == \x{12345}
I don't think we can move unilaterally. I'm being too lazy to dig up related ECMA activities in this area. Something must have been going on...
Keywords: intl
OS: Windows XP → All
Hardware: PC → All
Summary: REQUEST: Add \Uxxxxxxxx string literals for 32-bit glyphs → Add \Uxxxxxxxx string literals for non-BMP Unicode characters
(In reply to comment #3)
> I don't think we can move unilaterally. I'm being too lazy to dig up related
> ECMA activities in this area. Something must have been going on...

Yes, ECMA TG1 is meeting (see my blog in a little bit for an update).

We could use some i18n advice.  There are obvious problems with ECMA-262 Edition 3 (e.g., RegExp character classes are generally ASCII-only).  Jungshik, do you know of lists of problems, or bugs on file, that we can collate?

Michael, can you take assignment of this bug?

/be
The Ecma TC 39 meeting in May 2012 decided to use \u{xxxxxx}, with up to six hex digits and values up to 10FFFF.
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#Escapes
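
For illustration only, here is a minimal sketch of how the agreed \u{xxxxxx} syntax behaves in an engine that implements it. The sample code point U+1F600 is my own choice, not something taken from the proposal.

```javascript
// A braced escape names a whole code point (value at most 0x10FFFF).
const braced = "\u{1F600}";
// The same character written as an explicit UTF-16 surrogate pair.
const pair = "\uD83D\uDE00";

console.log(braced === pair);                     // true
console.log(braced.length);                       // 2 (two UTF-16 code units)
console.log(braced.codePointAt(0).toString(16));  // "1f600"
console.log("\u{41}" === "A");                    // true: BMP values work too
```

The string still stores two code units; the escape only changes how the literal is written.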
Summary: Add \Uxxxxxxxx string literals for non-BMP Unicode characters → Add \u{xxxxxx} string literals for non-BMP Unicode characters
Assignee: general → nobody
Blocks: 1135377
There seems to be no restriction on the length of HexDigits [1] (though there is a restriction on its MV [2]), so I added a test with a very long run of leading zeros. Is that correct, or did I overlook something?

  assertEq(eval(`"\\u{${"0".repeat(Math.pow(2, 28) - 20) + "1234"}}"`), String.fromCodePoint(0x1234));

Green on try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f389f9debf15

[1] http://people.mozilla.org/~jorendorff/es6-draft.html#sec-literals-string-literals
[2] http://people.mozilla.org/~jorendorff/es6-draft.html#sec-string-literals-static-semantics-early-errors
Assignee: nobody → arai.unmht
Attachment #8591431 - Flags: review?(sstangl)
Comment on attachment 8591431 [details] [diff] [review]
Add \u{xxxxxx} string literals.

Forwarding review to Waldo, who likely has an opinion on the matter.
Attachment #8591431 - Flags: review?(sstangl) → review?(jwalden+bmo)
Comment on attachment 8591431 [details] [diff] [review]
Add \u{xxxxxx} string literals.

Review of attachment 8591431 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1681,5 @@
>  
>  bool
> +TokenStream::getBracedUnicode(uint32_t* cp)
> +{
> +    skipChars(1);

Use consumeKnownChar('{'); instead.

@@ +1687,5 @@
> +    bool first = true;
> +    int32_t c;
> +    uint32_t code = 0;
> +    while (true) {
> +        c = getCharIgnoreEOL();

I'd kind of prefer explicit treatment of |c == EOF| meaning return false directly, rather than through the JS7_ISHEX further down.

@@ +1750,5 @@
> +                    if (!getBracedUnicode(&code)) {
> +                        reportError(JSMSG_MALFORMED_ESCAPE, "Unicode");
> +                        return false;
> +                    }
> +

Add MOZ_ASSERT(code <= 0x10FFFF) here.

@@ +1751,5 @@
> +                        reportError(JSMSG_MALFORMED_ESCAPE, "Unicode");
> +                        return false;
> +                    }
> +
> +                    if (code >= 0x10000) {

Consistent with UTF16Encoding(cp) in the spec, I'd prefer if these two arms were reversed -- single code unit first, two code units second.

@@ +1754,5 @@
> +
> +                    if (code >= 0x10000) {
> +                        if (!tokenbuf.append((code - 0x10000) / 1024 + 0xD800))
> +                            return false;
> +                        c = (code - 0x10000) % 1024 + 0xDC00;

Mild preference for bracing the %, tho I think most readers would probably parse it as it executes.
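
As a rough illustration of the ordering and arithmetic discussed in the comments above, here is a sketch in JavaScript rather than the patch's C++. The helper name utf16Encode is hypothetical. It handles the single-code-unit case first, in the order of the spec's UTF16Encoding(cp), and parenthesizes the remainder.

```javascript
// Hypothetical helper, not the SpiderMonkey code: encode one code point as
// UTF-16 code units, with the single-unit case handled first.
function utf16Encode(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("code point out of range: " + cp);
  }
  if (cp <= 0xFFFF) {
    return [cp];                                     // one code unit
  }
  const offset = cp - 0x10000;
  const lead = Math.floor(offset / 1024) + 0xD800;   // high surrogate
  const trail = (offset % 1024) + 0xDC00;            // low surrogate
  return [lead, trail];
}

console.log(utf16Encode(0x1234).map(u => u.toString(16)));   // [ '1234' ]
console.log(utf16Encode(0x1F600).map(u => u.toString(16)));  // [ 'd83d', 'de00' ]
```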

::: js/src/tests/ecma_6/String/unicode-braced.js
@@ +39,5 @@
> +assertEq("\u{00}", String.fromCodePoint(0x0));
> +assertEq("\u{00000000000000000}", String.fromCodePoint(0x0));
> +assertEq("\u{00000000000001000}", String.fromCodePoint(0x1000));
> +
> +assertEq(eval(`"\\u{${"0".repeat(Math.pow(2, 28) - 20) + "1234"}}"`), String.fromCodePoint(0x1234));

512MB allocation here seems a bit much.  :-)  Math.pow(2, 24) should be more than adequate.  (Actually I'm a little surprised you could [in theory] allocate a string that large.  I thought our length limits were lower than that.)

@@ +52,5 @@
> +assertThrowsInstanceOf(() => eval(`"\\u{"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{110000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{00110000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{100000000000000000000000000000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF}"`), SyntaxError);

Add some tests with spaces before, after, and intermixed in the HexDigits.

Also a test with 100000001 or somesuch, to verify the absence of overflow wrapping around back to "\u0001", would be nice.
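
To make the cases in this review concrete, here is a sketch of a braced-escape parser in JavaScript. It is not the patch itself, and the function name parseBracedUnicode is hypothetical. It assumes the input is the text right after "\u" and rejects as soon as the value exceeds 0x10FFFF, so an arbitrarily long digit run can never wrap around.

```javascript
// Hypothetical sketch: parse the text following "\u", starting at "{".
// Returns the code point, or null when the escape is malformed.
function parseBracedUnicode(src) {
  if (src[0] !== "{") return null;
  let code = 0;
  let digits = 0;
  for (let i = 1; i < src.length; i++) {
    const ch = src[i];
    if (ch === "}") {
      return digits > 0 ? code : null;             // "{}" is malformed
    }
    if (!/^[0-9A-Fa-f]$/.test(ch)) return null;    // spaces and other non-hex
    code = code * 16 + parseInt(ch, 16);
    if (code > 0x10FFFF) return null;              // reject early, no wraparound
    digits++;
  }
  return null;                                     // input ended before "}"
}

console.log(parseBracedUnicode("{00000000000001000}"));  // 4096
console.log(parseBracedUnicode("{100000001}"));          // null
console.log(parseBracedUnicode("{ 41}"));                // null
```

Leading zeros keep the running value at 0, so any number of them is accepted without a separate length limit.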
Attachment #8591431 - Flags: review?(jwalden+bmo) → review+
https://hg.mozilla.org/mozilla-central/rev/d31dfe0f365a
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40