Closed Bug 918987 Opened 11 years ago Closed 10 years ago

Implement String.prototype.normalize

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla31

People

(Reporter: evilpie, Assigned: arai)

References

(Blocks 1 open bug, )

Details

(Keywords: dev-doc-complete, feature, relnote, Whiteboard: [js:p2:fx31][DocArea=JS][qa-])

Attachments

(1 file, 3 obsolete files)

* String.prototype.normalize(form = "NFC") - section 21.1.3.12 
  Returns a Unicode normalization form of the string. [2]

https://github.com/rwldrn/tc39-notes/blob/master/es6/2012-11/nov-27.md#string-normalization
OS: Linux → All
Hardware: x86_64 → All
I guess I'll take this up then. :)
Assignee: general → mz_mhs-ctb
Assignee: mz_mhs-ctb → nobody
Comment on attachment 8369130 [details] [diff] [review]
implement String.prototype.normalize and its test created from NormalizationTest.txt

I think Waldo is the more appropriate reviewer for this patch.
Attachment #8369130 - Flags: review?(luke) → review?(jwalden+bmo)
Whiteboard: [js:p2:fx30]
Comment on attachment 8369130 [details] [diff] [review]
implement String.prototype.normalize and its test created from NormalizationTest.txt

Review of attachment 8369130 [details] [diff] [review]:
-----------------------------------------------------------------

Tests are tl;dr for now.  :-)  (And honestly, I don't know a thing about Unicode normalization, except haziness about the idea.  I'm happy to read up and learn, but it might be worth having someone knowledgeable about normalization look at them.)  That said, if you generated these test files, we want the script that created them included in the patch.  Far easier to review such a script (and more easily assume correctness of the input data), than to review its output and be forced to assume the generation was done correctly.

::: js/src/js.msg
@@ +223,5 @@
>  MSG_DEF(JSMSG_NEGATIVE_REPETITION_COUNT, 170, 0, JSEXN_RANGEERR, "repeat count must be non-negative")
>  MSG_DEF(JSMSG_INVALID_FOR_OF_INIT,    171, 0, JSEXN_SYNTAXERR, "for-of loop variable declaration may not have an initializer")
>  MSG_DEF(JSMSG_INVALID_MAP_ITERABLE,   172, 0, JSEXN_TYPEERR, "iterable for map should have array-like objects")
>  MSG_DEF(JSMSG_NOT_A_CODEPOINT,        173, 1, JSEXN_RANGEERR, "{0} is not a valid code point")
> +MSG_DEF(JSMSG_INVALID_NORMALIZE_FORM, 174, 0, JSEXN_RANGEERR, "form must be one of \"NFC\", \"NFD\", \"NFKC\", or \"NFKD\"")

Single-quote the nested strings here for readability.

::: js/src/jsstr.cpp
@@ +52,5 @@
>  #include "vm/Interpreter-inl.h"
>  #include "vm/String-inl.h"
>  #include "vm/StringObject-inl.h"
>  
> +#include "unicode/unorm.h"

This needs

#if EXPOSE_INTL_API
#include ...
#endif

as long as we support building without ICU and the Internationalization API.  (Which is the case right now as regards Firefox for Android/FxOS, although eventually we'll fix that.)

@@ +839,5 @@
>      return true;
>  }
>  #endif
>  
> +/* ES6 20140210 draft 21.1.3.12. */

And probably this whole method should be in similar ifdefs.  (Thus String.prototype.normalize would only be present if Intl is also present -- doesn't seem unreasonable to me in the short run, while ICU isn't built everywhere.)

@@ +845,5 @@
> +str_normalize(JSContext *cx, unsigned argc, Value *vp)
> +{
> +    CallArgs args = CallArgsFromVp(argc, vp);
> +
> +    // Steps 1, 2, and 3

Minor nitpicks:

// Step 1.
// Steps 2-4.
// Steps 5-6.

and so on for all of these, periods at end, dashes for ranges.

@@ +854,5 @@
> +    // Step 4
> +    UNormalizationMode form;
> +    if (!args.hasDefined(0))
> +        form = UNORM_NFC;
> +    else {

Style -- if one arm of an if is braced, both arms are supposed to be:

    if (!args.hasDefined(0)) {
        form = UNORM_NFC;
    } else {

@@ +863,5 @@
> +
> +        // Step 7
> +        if (StringEqualsAscii(formStr, "NFC"))
> +            form = UNORM_NFC;
> +        else if (StringEqualsAscii(formStr, "NFD"))

Instead of using StringEqualsASCII, please add all these strings to vm/CommonPropertyNames.h, then compare against cx->names().NFC and such.

Also, these one-line bodies need braces, same as above.

@@ +878,5 @@
> +    }
> +
> +    // Step 8
> +    JSString *ns;
> +    int32_t srcLen = str->length();

Use #include "mozilla/Casting.h" and mozilla::SafeCast<int32_t>(str->length()) here, please.

@@ +880,5 @@
> +    // Step 8
> +    JSString *ns;
> +    int32_t srcLen = str->length();
> +    if (srcLen > 0) {
> +        const UChar *srcChars = reinterpret_cast<const UChar *>(str->getChars(cx));

You need to null-check this and return false if so.  As it is now, if |this| is a rope string, this could return null, and that would be Bad.

Also, #include "builtin/Intl.h" and use JSCharToUChar.

@@ +883,5 @@
> +    if (srcLen > 0) {
> +        const UChar *srcChars = reinterpret_cast<const UChar *>(str->getChars(cx));
> +        UErrorCode status = U_ZERO_ERROR;
> +        int32_t nsLen = unorm_normalize(srcChars, srcLen, form, 0,
> +                                        nullptr, 0, &status);

This probably works, but I have to imagine it's pretty slow to do a full normalization pass just to get ultimate length, then allocate memory, then do *another* normalization pass to write out the data.  So we should normalize into a StringBuffer to attempt to avoid two-pass slowness -- see intl_FormatNumber for how that would work.  (Duplicate the static const size_t SB_LENGTH = 32; thing for now -- we might make it a constant in StringBuffer, but that can be followup cleanups to what you're doing.)

@@ +898,5 @@
> +        }
> +        ns = js_NewStringCopyN<CanGC>(cx, nsChars, nsLen);
> +        js_free(nsChars);
> +    } else
> +        ns = cx->runtime()->emptyString;

Don't bother with the empty-string optimization here, just normalize no matter what.

@@ +3862,5 @@
>  #else
>      JS_FN("localeCompare",     str_localeCompare,     1,JSFUN_GENERIC_NATIVE),
>  #endif
>      JS_SELF_HOSTED_FN("repeat", "String_repeat",      1,0),
> +    JS_FN("normalize",         str_normalize,         0,JSFUN_GENERIC_NATIVE),

And this entry as well.
Attachment #8369130 - Flags: review?(jwalden+bmo)
Oh, sorry for the delay here.  :-(  Haven't been keeping on top of reviews well lately, which is totally my fault for holding you up here.
Attached patch addressing review comments (obsolete) — Splinter Review
Thank you for reviewing!

Added make-string-normalize-tests.py which makes test input data,
and string-normalize-input.js as the result of the script,
which was string-normalize-part*.js in last patch.

Also added empty string test and JSRope test in string-normalize.js
Attachment #8369130 - Attachment is obsolete: true
Attachment #8387429 - Flags: review?(jwalden+bmo)
Keywords: feature
Whiteboard: [js:p2:fx30] → [js:p2:fx31]
Whiteboard: [js:p2:fx31] → [js:p2:fx31][DocArea=JS]
Comment on attachment 8387429 [details] [diff] [review]
addressing review comments

Review of attachment 8387429 [details] [diff] [review]:
-----------------------------------------------------------------

Blah, I fail again at speedy review turnaround.  :-(  I promise I'll do better next time, seeing as I actually understand what's being done in this patch (well, in the tests) at this point and won't have to take hours to grok it.  Brief summary is, the code looks good excepting trivial style nits, but I'd like some changes to the test to enhance readability and to not burden our testing infrastructure quite so hard.  Should be pretty quick to fix.

::: js/src/jit-test/tests/basic/make-string-normalize-tests.py
@@ +11,5 @@
> +import re, sys
> +
> +sep_pat = re.compile(' +')
> +def to_code_list(codes):
> +    return ', '.join(map(lambda x: '0x{0}'.format(x), re.split(sep_pat, codes)))

Hmm.  I guess writing out numbers is okay to avoid the whole non-BMP UTF-16-ification issue here.  But we should really be careful here -- all those statically-unnecessary bits of conversion seem ripe for removal if this test needs speeding up.

@@ +35,5 @@
> +                    outf.write('[')
> +                    for i in range(1, 6):
> +                        if i > 1:
> +                            outf.write(', ')
> +                        outf.write('[{list}]'.format(list=to_code_list(m.group(i))))

While writing this data out as an array is all well and good, I would prefer if it were written out as an object for greater clarity of use.  Something like:

pat = '{{ source: {source}, NFC: {NFC}, NFD: {NFD}, NFKC: {NFKC}, NFKD: {NFKD} }}'
outf.write(pat.format(source=to_code_list(m.group(1)),
                      NFC=to_code_list(m.group(2)),
                      NFD=to_code_list(m.group(3)),
                      NFKC=to_code_list(m.group(4)),
                      NFKD=to_code_list(m.group(5))))

@@ +65,5 @@
> +
> +if __name__ == '__main__':
> +    dir = '../../../../../'
> +    if len(sys.argv) > 1:
> +        dir = sys.argv[1]

For simplicity, let's just always require that a directory be specified on the command line.  No need to worry about the cwd when running the script, then.

::: js/src/jit-test/tests/basic/string-normalize.js
@@ +1,1 @@
> +/* Test String.prototype.normalize */

Could you rename this file to string-normalize-generateddata.js?  There may well be more normalization tests over time, and I really don't want people piling on this single test every time -- or making it harder to find the test for some specific aspect of String.prototype.normalize.

And, a somewhat more important concern.  Right now, this test runs as a jit-test.  That means that on Tinderbox, it runs a bunch of times, one for each JIT flags variant.  For a very large test like this one, that's pretty wasteful.  Please generate this file into js/src/tests/ecma_6/String/ (create the directory, with empty shell.js/browser.js and the appropriate jstests.list in it, if needed).  You'll have to mark it as shell-only using "// |reftest| skip-if(!xulRuntime.shell) -- uses shell load() function" at the top of the file and |if (typeof reportCompare === "function") reportCompare(true, true);| at the end, to make it a valid jstest.

@@ +3,5 @@
> +load(libdir + 'asserts.js');
> +load('tests/basic/string-normalize-input.js');
> +
> +function runTest(test) {
> +  let [c1, c2, c3, c4, c5] = test.map(t => t.map(x => String.fromCodePoint(x)).join(""));

With the object-ification mentioned in the test-data file, this becomes

  function codePointsToString(points)
  {
    return points.map(x => String.fromCodePoint(x)).join("");
  }
  let source = codePointsToString(test.source);
  let NFC = codePointsToString(test.NFC);
  let NFD = codePointsToString(test.NFD);
  let NFKC = codePointsToString(test.NFKC);
  let NFKD = codePointsToString(test.NFKD);

Less compact, sure.  But I think a whole lot more readable.  (The c1/r1 names are pure artifacts of the original data source -- we absolutely shouldn't preserve them in this test source.)

@@ +4,5 @@
> +load('tests/basic/string-normalize-input.js');
> +
> +function runTest(test) {
> +  let [c1, c2, c3, c4, c5] = test.map(t => t.map(x => String.fromCodePoint(x)).join(""));
> +  let [r1, r2, r3, r4, r5] = test.map(t => t.map(x => x.toString(16)).join(","));

And then these become

  function stringify(points) { return points.map(x => x.toString(16)).join(); }
  let sourceStr = stringify(test.source);
  let nfcStr = stringify(test.NFC);
  let nfdStr = stringify(test.NFD);
  let nfkcStr = stringify(test.NFKC);
  let nfkdStr = stringify(test.NFKD);

@@ +62,5 @@
> +  runTest(test);
> +}
> +
> +/* not listed in Part 1 */
> +for (let x = 0; x <= 0x2FFFF; x ++) {

Get rid of the space between x and ++.

So.  This test is overall O(0x2FFFF) runtime, which is kinda big.  Granted each individual item there is smallish.  But overall, is this going to end up being a relatively slow test overall?

@@ +67,5 @@
> +  if (part1.has(x)) {
> +    continue;
> +  }
> +  let c = String.fromCodePoint(x);
> +  assertEq(c.normalize(), c, "NFC of " + x.toString(16));

Add a |let xstr = x.toString(16);| and use that in all these asserts.

::: js/src/jsstr.cpp
@@ +38,5 @@
>  #include "jstypes.h"
>  #include "jsutil.h"
>  
>  #include "builtin/RegExp.h"
> +#include "builtin/Intl.h"

I before R, so one line earlier.

@@ +844,5 @@
>  }
>  #endif
>  
> +#if EXPOSE_INTL_API
> +static const int32_t SB_LENGTH = 32;

Blank line after this, please -- and make it size_t, not int, because that's the natural type for a size/count.

@@ +887,5 @@
> +    Rooted<JSFlatString*> flatStr(cx, str->ensureFlat(cx));
> +    if (!flatStr)
> +        return false;
> +    const UChar *srcChars = JSCharToUChar(flatStr->chars());
> +    int32_t srcLen = mozilla::SafeCast<int32_t>(flatStr->length());

Add |using mozilla::SafeCast;| amidst the other using-mozilla::* at the start of the file, alphabetically, please, so you can kill the mozilla:: prefix.

@@ +901,5 @@
> +            return false;
> +        status = U_ZERO_ERROR;
> +        unorm_normalize(srcChars, srcLen, form, 0,
> +                        JSCharToUChar(chars.begin()), size,
> +                        &status);

Let's do

#ifdef DEBUG
        int32_t finalSize =
#endif
            unorm_normalize(...);
            MOZ_ASSERT(size == finalSize || U_FAILURE(status), "unorm_normalize behaved inconsistently");

to catch any ICU bugs here.
Attachment #8387429 - Flags: review?(jwalden+bmo) → feedback+
Attached patch addressing review comments (obsolete) — Splinter Review
Thank you for your reviewing and letting me know about readability!

> Hmm.  I guess writing out numbers is okay to avoid the whole non-BMP
> UTF-16-ification issue here.  But we should really be careful here -- all
> those statically-unnecessary bits of conversion seem ripe for removal if
> this test needs speeding up.

Yeah, converting code points to UTF-16 string may speed up "runTest" function.
However, to compare code point in "not listed in Part 1" tests,
raw code point may be better than UTF-16 string.

> Could you rename this file to string-normalize-generateddata.js?

Since those test files are moved into "String" directory,
also removed "string-" prefix from their filenames,
as other tests does (e.g. codePointAt.js).

> So.  This test is overall O(0x2FFFF) runtime, which is kinda big.  Granted
> each individual item there is smallish.  But overall, is this going to end
> up being a relatively slow test overall?

Measured "real" time taken to run whole jstest on Mac OS X,
results are following:

opt build
  original:                       1m5.168s
  add all normalize test:         1m5.969s  : increase about 0.8s (1.2%)
  without "not listed in Part 1": 1m5.596s  : increase about 0.4s (0.6%)
debug build
  original:                       11m45.046s
  add all normalize test:         11m56.427s : increase about 10s (1.6%)
  without "not listed in Part 1": 11m48.009s : increase about 3s  (0.4%)

Is this acceptable?
Attachment #8387429 - Attachment is obsolete: true
Attachment #8399904 - Flags: review?(jwalden+bmo)
Oh good grief no.  :-)  That's an order of magnitude or two too slow.  We can't have tests taking a full minute *anywhere*.  It is absolutely imperative that this test be split up.  Right now one of our longest tests, on my machine (new, good laptop, unplugged), takes 35.8s in a debug build.  That's probably pretty close to an upper bound on how long a test can take.  So this test is going to need splitting up for sure, at least for now.  (Maybe ICU fixes will let us claw it back, sometime.)  I wonder if it's normalization that's inherently slow, ICU's implementation of it, or what, exactly, here.

Now, how to split it up.  You don't have obvious cut points here.  We may have to revert to the strategy of some of the tests in ecma_5/Object/, and have one shared file that a bunch of people include, that has a function that lets the execution space be partitioned reasonably.  Or perhaps you have better ideas -- if so, go for 'em.

I have more comments to post on the patch, but I should get this out in advance of them, in case you've got down time to burn now.  :-)
Sorry, my report was not clear.

Normalization test does not take 1 minute or 11 minutes,
those values are time taken to running all js tests (so, running 6225 tests).

Single "normalize-generateddata.js" test takes 0.8 seconds with opt build, and 10 seconds with debug build.

However, those values depend on machine speed,
so I compared whole time with and without "normalize" test.
Comment on attachment 8399904 [details] [diff] [review]
addressing review comments

Review of attachment 8399904 [details] [diff] [review]:
-----------------------------------------------------------------

Ten seconds for one test is just fine.  Whew!  I was a little surprised at the times you were claiming; perhaps I should have been even more surprised and suspected something off.

Just to note, because of the file moves and all, it's pretty hard to review this.  :-)  I had to manually munge patches to remove the generated-data file, and to deal with the file name changes.  Not a big deal, just something to keep in mind when posting patches with new-file name changes in them, and for super-large patches where the generated parts of them dwarf the non-generated parts whose changes are the truly important parts.

Things look good here!  I'll make the one noted change, push this to try, then land if that pans out.

::: js/src/tests/ecma_6/String/make-normalize-generateddata-input.py
@@ +3,5 @@
> +""" Usage: make-normalize-generateddata-input.py PATH_TO_MOZILLA_CENTRAL
> +
> +    This script generates test input data for String.prototype.normalize
> +    from intl/icu/source/data/unidata/NormalizationTest.txt
> +    to js/src/jit-test/tests/basic/normalize-generateddata-input.js

This path's wrong.
Attachment #8399904 - Flags: review?(jwalden+bmo) → review+
Replaced double quotation in '#include "unicode/unorm.h"' with angle bracket, in jsstr.cpp.

I've just learned of check_spidermonkey_style.py.
It seems that I should run it before sending patch, in addition to jstests.
Attachment #8399904 - Attachment is obsolete: true
Attachment #8404261 - Flags: review?(jwalden+bmo)
Comment on attachment 8404261 [details] [diff] [review]
use angle bracket in include

Review of attachment 8404261 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsstr.cpp
@@ +55,5 @@
>  #include "vm/String-inl.h"
>  #include "vm/StringObject-inl.h"
>  
> +#if EXPOSE_INTL_API
> +#include <unicode/unorm.h>

<>-style includes will pick up system headers, but this isn't a system header, it's part of our build system.  The real solution isn't to do this, but to modify the style-checking script for this file.  See config/check_spidermonkey_style.py.  I've done that locally, things work, will push a patch with that change.  (I also moved the #include into alphabetical order among the other "x/Y.h" includes, for semi-consistency with builtin/Intl.cpp.  I'm not sure we're 100% on board with that style, but might as well start a trend.)
Attachment #8404261 - Flags: review?(jwalden+bmo)
https://hg.mozilla.org/integration/mozilla-inbound/rev/da17b1da2a39

Thanks for the patch!

We need to update https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String and https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize to account for this addition.  Any chance you could start those pages off, then we can get people who wordsmith regularly to polish them to perfection?
Status: NEW → ASSIGNED
Keywords: relnote
QA Contact: arai_a
Thank you for your assistance!

I updated those 2 pages, without compatibility table in "normalize" page.
I'll update it and other pages (ES6 support, release notes, ...) after the patch was merged to mozilla-central.
https://hg.mozilla.org/mozilla-central/rev/da17b1da2a39
Assignee: nobody → arai_a
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Updated following pages:
  https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
    (compatibility table)
  https://developer.mozilla.org/en-US/docs/Web/JavaScript/ECMAScript_6_support_in_Mozilla
  https://developer.mozilla.org/en-US/Firefox/Releases/31

By the way, it seems that I'm assigned to QA Contact, what should I do for it?
Just verifying the functionality is available in Nightly?
I don't think this needs any testing.
QA Contact: arai_a
Whiteboard: [js:p2:fx31][DocArea=JS] → [js:p2:fx31][DocArea=JS][qa-]
(In reply to Tooru Fujisawa [:arai] from comment #19)
> Updated following pages:

Looking good!  I tend to leave dev-doc-needed on these things so that MDN peoples can make sure such changes get hooked up in all the right places.  (You may have found all of them already, but I'm not confident of that.)  So let's leave it here, and if more is needed, they can poke us (and if not they'll just change the keyword themselves).

> By the way, it seems that I'm assigned to QA Contact, what should I do for
> it?
> Just verifying the functionality is available in Nightly?

My mistake, I meant to set Assignee.  You're good here!
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: