Closed Bug 345823 Opened 18 years ago Closed 9 months ago

Implement Unicode word breaking (UAX #29, section 4)

Categories

(Core :: Internationalization, defect)

RESOLVED DUPLICATE of bug 1719535

People

(Reporter: uriber, Assigned: m_kato)

References

(Blocks 1 open bug)

Details

(Keywords: intl)

We currently use two separate algorithms to determine word boundaries (for Ctrl+Left/Right, double-click, etc.).
One handles "ASCII"-only (really, Latin-1-only) text and is implemented directly in nsTextTransformer. The other is a very simplistic algorithm implemented by nsSampleWordBreaker and used for any text that contains non-"ASCII" characters.

We should replace both (or at least nsSampleWordBreaker) with a word breaker that implements the Unicode word breaking algorithm, described in section 4 of UAX #29 (see URL).
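
To illustrate the difference, here is a rough sketch in Python (not Mozilla code). naive_boundaries is a hypothetical stand-in for a simplistic breaker that splits wherever a coarse character class changes; it is an assumption for illustration, not a description of what nsSampleWordBreaker actually does. uax29_like_boundaries covers only a small subset of the UAX #29 word rules (WB3d, WB5 through WB12, WB999), and the MID_LETTER and MID_NUM sets are hand-picked approximations of the real MidLetter, MidNum, MidNumLet and Single_Quote properties.

```python
import unicodedata

# Hand-picked approximations of the UAX #29 "mid" properties (assumption).
MID_LETTER = {"'", "\u2019", "."}
MID_NUM = {"'", "\u2019", ".", ","}


def char_class(ch):
    """Coarse class: 'space', 'letter', 'digit' or 'other'."""
    if ch.isspace():
        return "space"
    cat = unicodedata.category(ch)
    if cat[0] in ("L", "M"):  # marks lumped with letters: a rough shortcut
        return "letter"
    if cat == "Nd":
        return "digit"
    return "other"


def naive_boundaries(text):
    """Break wherever the coarse character class changes."""
    breaks = [0]
    for i in range(1, len(text)):
        if char_class(text[i - 1]) != char_class(text[i]):
            breaks.append(i)
    if text:
        breaks.append(len(text))
    return breaks


def uax29_like_boundaries(text):
    """Break everywhere except where a (partial) UAX #29 rule forbids it."""
    n = len(text)
    breaks = [0]
    for i in range(1, n):
        prev, cur = text[i - 1], text[i]
        pc, cc = char_class(prev), char_class(cur)
        before = char_class(text[i - 2]) if i >= 2 else None
        after = char_class(text[i + 1]) if i + 1 < n else None
        # WB5, WB8, WB9, WB10: no break inside runs of letters and digits.
        if pc in ("letter", "digit") and cc in ("letter", "digit"):
            continue
        # WB6 / WB11: no break before a mid character between letters/digits.
        if pc == "letter" and cur in MID_LETTER and after == "letter":
            continue
        if pc == "digit" and cur in MID_NUM and after == "digit":
            continue
        # WB7 / WB12: no break after such a mid character.
        if cc == "letter" and prev in MID_LETTER and before == "letter":
            continue
        if cc == "digit" and prev in MID_NUM and before == "digit":
            continue
        # WB3d: keep runs of spaces together.
        if pc == "space" and cc == "space":
            continue
        breaks.append(i)  # WB999: otherwise, break.
    if text:
        breaks.append(n)
    return breaks


def segments(text, breaks):
    """Slice text at the given boundary offsets."""
    return [text[a:b] for a, b in zip(breaks, breaks[1:])]
```

With the input "can't cost 3.14 abc123", the class-change approach splits "can't" into three pieces and "3.14" into three pieces, while the UAX #29-style rules keep each of them as one word, which is the behavior one would expect from Ctrl+Left/Right or double-click.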

See bug 56652 for the equivalent line-breaking issue.
Another related bug is bug 229896 (for grapheme clusters).

This would be better filed under i18n, because it could potentially be used in places other than layout.
Component: Layout: Fonts and Text → Internationalization
Assignee: nobody → smontagu
QA Contact: layout.fonts-and-text → amyy
Blocks: word-select
QA Contact: amyy → i18n
Assignee: smontagu → m_kato
Severity: normal → S3

We've integrated the ICU4X word segmenter, which is UAX #29 compatible, in bug 1719535.

Status: NEW → RESOLVED
Closed: 9 months ago
Duplicate of bug: 1719535
Resolution: --- → DUPLICATE