Closed Bug 760050 Opened 12 years ago Closed 10 years ago

Character encoding auto-detect fails on UTF-8 text file (regarded as TIS-620)

Status:

RESOLVED WORKSFORME

People

(Reporter: vincent-moz, Unassigned)

Attachments

(2 files)

Text file showing the bug (12 years ago, Vincent Lefevre): 10 bytes, text/plain; charset=
Same file with BOM (11 years ago, Gingerbread Man): 13 bytes, text/plain

Vincent Lefevre

Reporter

Description

12 years ago

Attached file: Text file showing the bug

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/15.0 Firefox/15.0a1
Build ID: 20120530144327

Steps to reproduce:

Open a UTF-8 text file (same as attached) with the "file:" URL scheme, containing:
Test: §9

Character encoding choice is set to Auto-Detect → Universal.


Actual results:

Firefox regards the file as encoded in TIS-620 (Thai).


Expected results:

Firefox should have regarded the file as encoded in UTF-8. Since UTF-8 is the most standard encoding, the Universal detector should always try it first.
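The "try UTF-8 first" behaviour requested above can be sketched as follows. This is only an illustration under stated assumptions, not Firefox's detector: `decode_utf8_first` is a hypothetical helper, the Windows-1252 fallback is an arbitrary choice, and the sample bytes assume the attachment is "Test: §9" followed by a newline (which matches the 10-byte size listed for it).

```python
from typing import Tuple

# Assumed attachment content: "Test: §9" plus a newline (10 bytes in UTF-8).
SAMPLE = "Test: §9\n".encode("utf-8")


def decode_utf8_first(data: bytes, fallback: str = "windows-1252") -> Tuple[str, str]:
    """Return (text, encoding), trying strict UTF-8 before the fallback.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" so the fallback never raises on undefined bytes.
        return data.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    print(decode_utf8_first(SAMPLE))
    # Read as TIS-620, the two bytes of "§" (0xC2 0xA7) become two Thai letters.
    print(repr(SAMPLE.decode("tis_620")))
```

Strict UTF-8 decoding rarely succeeds by accident on text in a legacy 8-bit encoding that contains non-ASCII bytes, which is why trying it first is a cheap and fairly reliable heuristic.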

Vincent Lefevre

Reporter

Comment 1

12 years ago

Note: the bug seems to occur only with the "file:" URL scheme, not with "http:" (so do not open the attached text file directly; save it first).

Simon Montagu :smontagu

Updated

12 years ago

Attachment #628670 - Attachment mime type: text/plain → text/plain; charset=

Gingerbread Man

Comment 2

11 years ago

Attached file: Same file with BOM

Your file is missing the byte order mark (BOM). All browsers I've tried open it as Western (Windows-1252). Notepad++ identifies it as "ANSI as UTF8".

"Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in."
— http://en.wikipedia.org/wiki/Byte_order_mark

Saving it with BOM makes all browsers treat it as UTF-8.
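For reference, the difference between the two attachments can be reproduced in a few lines. This is a sketch that assumes the file content is "Test: §9" plus a newline; `has_utf8_bom` is a hypothetical helper name, and the Windows-1252 line only illustrates the misreading described above.

```python
import codecs

# Assumed attachment content: "Test: §9" plus a newline.
plain = "Test: §9\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain  # BOM_UTF8 is b"\xef\xbb\xbf"


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the three-byte UTF-8 signature."""
    return data.startswith(codecs.BOM_UTF8)


if __name__ == "__main__":
    # Sizes match the attachment list: 10 bytes without BOM, 13 with.
    print(len(plain), len(with_bom))
    print(has_utf8_bom(plain), has_utf8_bom(with_bom))
    # "utf-8-sig" strips a leading BOM; plain "utf-8" keeps it as U+FEFF.
    print(repr(with_bom.decode("utf-8-sig")))
    print(repr(with_bom.decode("utf-8")))
    # With no signature, a Windows-1252 reading turns the bytes C2 A7 into "Â§".
    print(repr(plain.decode("windows-1252")))
```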

Vincent Lefevre

Reporter

Comment 3

11 years ago

The user shouldn't be forced to use the BOM, as various Unix tools (less, xterm, the whole concept of streams and redirections...) can't handle it nicely. Recall that the BOM is optional for UTF-8 files; see also the text under "The reasons the standard does not advocate the UTF-8 BOM" on Wikipedia. So the drawbacks would be significant, while a good user-level choice would be sufficient in practice. In particular, when the browser runs under a UTF-8 based locale and the files are viewed locally (with the "file:" URL scheme), UTF-8 should be preferred to other encodings by default (and/or this should be configurable).

Zack Weinberg (:zwol)

Comment 4

11 years ago

http://www.unicode.org/faq/utf_bom.html discourages use of BOMs on UTF-8 (with the rationale that it may interfere with other expectations for the initial characters in the file, e.g. #!).

On today's internet, the encoding detector really ought to assume UTF-8 until proven otherwise.
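The #! concern can be illustrated with a byte-level check. This is a minimal sketch, assuming a consumer that looks only at the first two bytes of a file; the script content and the `starts_with_shebang` helper are made up for the example.

```python
import codecs

# Hypothetical shell script, with and without a UTF-8 BOM prepended.
script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script


def starts_with_shebang(data: bytes) -> bool:
    """Check whether the very first two bytes are '#!'."""
    return data[:2] == b"#!"


if __name__ == "__main__":
    print(starts_with_shebang(script))
    # The BOM pushes "#!" to byte offset 3, so the check no longer matches.
    print(starts_with_shebang(script_with_bom))
```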

Henri Sivonen (:hsivonen)

Comment 5

10 years ago

This has been resolved by no longer offering the "Universal" detector.

Status: UNCONFIRMED → RESOLVED

Closed: 10 years ago

Resolution: --- → WORKSFORME

Vincent Lefevre

Reporter

Comment 6

10 years ago

The original bug is no longer there. However, Firefox now interprets UTF-8 files as ISO-8859-1 when no charset information is available (e.g. with the "file:" URL scheme). This isn't satisfactory either.

Note: I'm not talking about HTTP, where there can be some default (it was ISO-8859-1 in the past).

Henri Sivonen (:hsivonen)

Comment 7

10 years ago

See bug 815551 comment 5.

FWIW, I'm very much against autodetecting UTF-8 on http[s] URLs but I think we should do it for file: URLs, since in the file: case we don't have HTTP headers but do have all the bytes in advance.

Feel free to argue for UTF-8 autodetection on file: URLs in bug 815551.
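The point about having all the bytes in advance can be sketched like this. It is an illustration only: `is_valid_utf8` is a hypothetical helper, and the 1024-byte prefix is an arbitrary stand-in for what a consumer of a network stream might have seen when it has to commit to an encoding.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the complete byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical document: a long ASCII prefix, then one Windows-1252 byte (0xE9).
document = b"a" * 2048 + b"caf\xe9\n"
prefix = document[:1024]  # arbitrary cut-off for "bytes seen so far"

if __name__ == "__main__":
    # The prefix alone looks like UTF-8 (it is pure ASCII) ...
    print(is_valid_utf8(prefix))
    # ... but validating the whole file, as is possible for file: URLs, rejects it.
    print(is_valid_utf8(document))
```

With a local file, the whole content can be validated before choosing an encoding, so a late non-UTF-8 byte cannot invalidate a decision that has already been made.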

Vincent Lefevre

Reporter

Comment 8

10 years ago

Bug 815551 is about the HTML parser. This bug is about text/plain files opened with the "file:" URL scheme, for which there is no way to specify the encoding (except possibly a BOM, which has major drawbacks in other contexts and so is not used in practice, as already said). This is very different from HTML files.

Vincent Lefevre

Reporter

Comment 9

10 years ago

Bug 815551 was actually for HTML files served via HTTP and wontfixed for this reason (see the first comments). So, I've submitted a new bug for text/plain files with unknown charset: Bug 1071816.

Categories

(Core :: Internationalization, defect)