Closed Bug 918879 Opened 11 years ago Closed 10 years ago

Implement String#codePointAt and String.fromCodePoint

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla29

People

(Reporter: evilpie, Assigned: evilpie)

References

(Blocks 1 open bug)

Details

(Keywords: dev-doc-complete, feature)

Attachments

(1 file, 2 obsolete files)

Currently working on some tests for this. There is at least one unresolved issue, about whether codePointAt outside the range should return NaN or undefined.
I guess this is a duplicate of bug 888093 (or vice versa).

Tests I wrote today:
* https://github.com/mathiasbynens/String.fromCodePoint/blob/master/tests/tests.js
* https://github.com/mathiasbynens/String.prototype.codePointAt/blob/master/tests/tests.js

Feel free to re-use these! Oh, and let me know if I missed anything important.

Looking at the example code in Norbert’s original proposal (http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/#String), I think the intention is to return `undefined`. Keep in mind that my tests are based on that assumption. See https://bugs.ecmascript.org/show_bug.cgi?id=1153.
Comment on attachment 807822 [details] [diff] [review]
Implementation of fromCodePoint and codePointAt

Review of attachment 807822 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/String.js
@@ +128,5 @@
> +            ThrowError(JSMSG_NOT_A_CODEPOINT, ToString(nextCP));
> +
> +        // Step 5f.
> +        // Inlined UTF-16 Encoding
> +        if (nextCP < 65535) {

Shouldn’t this be 65536 (0x010000, i.e. 0xFFFF + 1)?
You are right, I missed the <=.
Assignee: general → evilpies
Mathias the tests you landed for v8 looked quite good, I think we could use them.
Attached patch v1 (obsolete) — Splinter Review
Attachment #807822 - Attachment is obsolete: true
Attachment #812453 - Flags: review?(jwalden+bmo)
(In reply to Tom Schuster [:evilpie] from comment #5)
> Mathias the tests you landed for v8 looked quite good, I think we could use
> them.

Happy to hear. This totally made my day :) Yay!

By the way, I like that in the new patch you’re using HexIntegerLiterals for code point values — much more readable that way IMHO.
Comment on attachment 812453 [details] [diff] [review]
v1

Review of attachment 812453 [details] [diff] [review]:
-----------------------------------------------------------------

Stealing review.

This looks good overall, but I have a few requests I'd like to see addressed.

As mentioned in-line below, I'm not too happy with codePointAt calling charCodeAt and thus duplicating all the conversion and checking overhead. You know better than I do how much work would be involved in exposing a performant String_getChar intrinsic to be used instead. If it's not too much work, I'd like for that to be done here. If it is, please file a follow-up bug about it.

The test files should get license header. Since the tests use MIT in Mathias' repo, that's what we'll have to use, too.

::: js/src/builtin/String.js
@@ +20,5 @@
> +    if (position < 0 || position >= size)
> +        return undefined;
> +
> +    // Steps 8-9.
> +    var first = callFunction(std_String_charCodeAt, S, position);

It's really unfortunate that we effectively repeat steps 1-7 by calling charCodeAt. How about introducing a String_getChar intrinsic? That should be easy enough to do. OTOH, I don't know whether we inline charCodeAt in Ion. OTOOH, having an inlinable String_getChar would probably allow us to self-host more String methods.

@@ +21,5 @@
> +        return undefined;
> +
> +    // Steps 8-9.
> +    var first = callFunction(std_String_charCodeAt, S, position);
> +    if (first < 0xD800 || first > 0xDBFF || position + 1 == size)

Nit: ===

@@ +30,5 @@
> +    if (second < 0xDC00 || second > 0xDFFF)
> +        return first;
> +
> +    // Step 12.
> +    return ((first - 0xD800) * 0x400) + (second - 0xDC00) + 0x10000;

Nit: over-bracing for the multiplication

@@ +107,5 @@
> +    // Step 2.
> +    var length = arguments.length;
> +
> +    // Step 3.
> +    var elements = new List();

While List is the right spec type, we can use NewDenseArray(length) for more efficiency, here.

@@ +127,5 @@
> +
> +        // Step 5f.
> +        // Inlined UTF-16 Encoding
> +        if (nextCP <= 0xFFFF) {
> +            elements.push(nextCP);

You can use UnsafePutElements(elements, nextIndex, nextCP) here after making the change to NewDenseArray above.

@@ +131,5 @@
> +            elements.push(nextCP);
> +            continue;
> +        }
> +
> +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);

coercing to int with |0 should be faster than Math.floor. Or at least tax the jit less heavily, I guess.

@@ +132,5 @@
> +            continue;
> +        }
> +
> +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> +        elements.push(((nextCP - 0x10000) % 0x400) + 0xDC00);

UnsafePutElements can be used with more than one index/value pair, so these two calls can be combined.
Attachment #812453 - Flags: review?(jwalden+bmo) → feedback+
(In reply to Till Schneidereit [:till] from comment #8)
> Comment on attachment 812453 [details] [diff] [review]
> v1
> 
> Review of attachment 812453 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Stealing review.
> 
> This looks good overall, but I have a few requests I'd like to see addressed.
> 
> As mentioned in-line below, I'm not too happy with codePointAt calling
> charCodeAt and thus duplicating all the conversion and checking overhead.
> You know better than I do how much work would be involved in exposing a
> performant String_getChar intrinsic to be used instead. If it's not too much
> work, I'd like for that to be done here. If it is, please file a follow-up
> bug about it.
> 
> The test files should get license header. Since the tests use MIT in
> Mathias' repo, that's what we'll have to use, too.
> 
> ::: js/src/builtin/String.js
> @@ +20,5 @@
> > +    if (position < 0 || position >= size)
> > +        return undefined;
> > +
> > +    // Steps 8-9.
> > +    var first = callFunction(std_String_charCodeAt, S, position);
> 
> It's really unfortunate that we effectively repeat steps 1-7 by calling
> charCodeAt. How about introducing a String_getChar intrinsic? That should be
> easy enough to do. OTOH, I don't know whether we inline charCodeAt in Ion.
> OTOOH, having an inlinable String_getChar would probably allow us to
> self-host more String methods.
> 
We already do a pretty good job at inlining charCodeAt. But I guess with an unsafe macro we could avoid some length check, unless that one is already folded with our length check :)
> @@ +21,5 @@
> > +        return undefined;
> > +
> > +    // Steps 8-9.
> > +    var first = callFunction(std_String_charCodeAt, S, position);
> > +    if (first < 0xD800 || first > 0xDBFF || position + 1 == size)
> 
> Nit: ===
> 
> @@ +30,5 @@
> > +    if (second < 0xDC00 || second > 0xDFFF)
> > +        return first;
> > +
> > +    // Step 12.
> > +    return ((first - 0xD800) * 0x400) + (second - 0xDC00) + 0x10000;
> 
> Nit: over-bracing for the multiplication
> 
> @@ +107,5 @@
> > +    // Step 2.
> > +    var length = arguments.length;
> > +
> > +    // Step 3.
> > +    var elements = new List();
> 
> While List is the right spec type, we can use NewDenseArray(length) for more
> efficiency, here.
> 
> @@ +127,5 @@
> > +
> > +        // Step 5f.
> > +        // Inlined UTF-16 Encoding
> > +        if (nextCP <= 0xFFFF) {
> > +            elements.push(nextCP);
> 
> You can use UnsafePutElements(elements, nextIndex, nextCP) here after making
> the change to NewDenseArray above.
> 
Is that safe when nextIndex == elements.length?
> @@ +131,5 @@
> > +            elements.push(nextCP);
> > +            continue;
> > +        }
> > +
> > +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> 
> coercing to int with |0 should be faster than Math.floor. Or at least tax
> the jit less heavily, I guess.
> 
v8 uses the FLOOR macro here, but I had no luck finding out how that is implemented.
> @@ +132,5 @@
> > +            continue;
> > +        }
> > +
> > +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> > +        elements.push(((nextCP - 0x10000) % 0x400) + 0xDC00);
> 
> UnsafePutElements can be used with more than one index/value pair, so these
> two calls can be combined.

Some general observation in terms of speed: We might want to use something that allows us to construct the resulting string without having to call fromCharCode. Maybe something like an Uint16Array where we can use that as the backing storage for the new string.
(In reply to Tom Schuster [:evilpie] from comment #9)
> > It's really unfortunate that we effectively repeat steps 1-7 by calling
> > charCodeAt. How about introducing a String_getChar intrinsic? That should be
> > easy enough to do. OTOH, I don't know whether we inline charCodeAt in Ion.
> > OTOOH, having an inlinable String_getChar would probably allow us to
> > self-host more String methods.
> > 
> We already do a pretty good job at inlining charCodeAt. But I guess with an
> unsafe macro we could avoid some length check, unless that one is already
> folded with our length check :)

Yeah, you know more about this than I do, so if you think this doesn't cost much, that works for me. :)

> > 
> > You can use UnsafePutElements(elements, nextIndex, nextCP) here after making
> > the change to NewDenseArray above.
> > 
> Is that safe when nextIndex == elements.length?

It's not. But IIUC, you know the length beforehand, no? You'd call NewDenseArray with the length and never run into this situation. That's how, e.g., Array#map works.

> > @@ +131,5 @@
> > > +            elements.push(nextCP);
> > > +            continue;
> > > +        }
> > > +
> > > +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> > 
> > coercing to int with |0 should be faster than Math.floor. Or at least tax
> > the jit less heavily, I guess.
> > 
> v8 uses the FLOOR macro here, but I had no luck finding out how that is
> implemented.

`|0` definitely works. Technically, you could also replace the division with `>> 10`, but that might be too hard to understand later. If you do that, add a comment explaining what's going on.

> > @@ +132,5 @@
> > > +            continue;
> > > +        }
> > > +
> > > +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> > > +        elements.push(((nextCP - 0x10000) % 0x400) + 0xDC00);
> > 
> > UnsafePutElements can be used with more than one index/value pair, so these
> > two calls can be combined.
> 
> Some general observation in terms of speed: We might want to use something
> that allows us to construct the resulting string without having to call
> fromCharCode. Maybe something like an Uint16Array where we can use that as
> the backing storage for the new string.

That would be very nice to have, indeed. You know more about how our strings work: do you think it'd make sense to create an intrinsic along the lines of StringFromTypedArray?
Typed array element storage isn't memory-layout-interconvertible with string memory layout.  Intrinsics to create a string of a given length, with chars fillable by another intrinsic, is totally crazy and unreasonable, but not unreasonable.  *handwaving*  All but certainly should be followup-land, of course.
Till would you mind if we did this, with the List for the moment? We can optimize this somehow in the future.
(In reply to Tom Schuster [:evilpie] from comment #13)
> Till would you mind if we did this, with the List for the moment? We can
> optimize this somehow in the future.

Not at all. In fact, we should have done that much earlier - sorry for holding that up.

If you r? me on a patch with the other feedback applied, I'll be quick with the review.
Attached patch v2Splinter Review
Attachment #812453 - Attachment is obsolete: true
Attachment #8351211 - Flags: review?(till)
Comment on attachment 8351211 [details] [diff] [review]
v2

Review of attachment 8351211 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!
Attachment #8351211 - Flags: review?(till) → review+
https://hg.mozilla.org/mozilla-central/rev/2411714cd058
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Keywords: feature
Is there any need for manual QA here (given the existing automation coverage)?
Flags: needinfo?(evilpies)
Flags: in-testsuite+
This feature has enough tests that manual QA seems unnecessary.
Flags: needinfo?(evilpies)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: