Closed Bug 760050 Opened 12 years ago Closed 10 years ago

Character encoding auto-detect fails on UTF-8 text file (regarded as TIS-620)

Categories

(Core :: Internationalization, defect)

x86_64
Linux
defect
Not set
normal

RESOLVED WORKSFORME

People

(Reporter: vincent-moz, Unassigned)

Details

Attachments

(2 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/15.0 Firefox/15.0a1
Build ID: 20120530144327

Steps to reproduce:

Open a UTF-8 text file (same as attached) with the "file:" URL scheme, containing:
Test: §9

Character encoding choice is set to Auto-Detect → Universal.


Actual results:

Firefox regards the file as encoded in TIS-620 (Thai).
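
The misdetection is easier to follow at the byte level. The sketch below is only an illustration: it assumes the file holds exactly the text "Test: §9" with no trailing newline (the attachment is not reproduced here), and it uses Python's built-in codecs rather than Firefox's detector.

```python
# Assumption: the file holds exactly this text, encoded as UTF-8.
raw = "Test: §9".encode("utf-8")

# "§" (U+00A7) becomes the two bytes 0xC2 0xA7 in UTF-8.
print(raw)

# The same bytes are also valid in several single-byte encodings,
# which is why a heuristic detector can settle on the wrong one.
for encoding in ("utf-8", "tis_620", "cp1252"):
    print(f"{encoding:8} -> {raw.decode(encoding)}")
```

All three decodings succeed without error: TIS-620 turns the two bytes into two Thai letters, and Windows-1252 turns them into "Â§". The bytes alone therefore do not rule out the legacy encodings.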


Expected results:

Firefox should have regarded the file as encoded in UTF-8. Since this is the most standard encoding, it should always be tried first with Universal.
Note: the bug seems to occur only with the "file:" URL scheme, not with "http:". To reproduce it, do not open the attached text file directly; save it first and open the local copy.
Attachment #628670 - Attachment mime type: text/plain → text/plain; charset=
Attached file Same file with BOM
Your file is missing the byte order mark (BOM). All browsers I've tried open it as Western (Windows-1252). Notepad++ identifies it as "ANSI as UTF8".

"Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in."
— http://en.wikipedia.org/wiki/Byte_order_mark

Saving it with BOM makes all browsers treat it as UTF-8.
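
For reference, the UTF-8 BOM is the three-byte prefix EF BB BF. A minimal sketch of checking for it, assuming the same "Test: §9" content as in the report (the helper name `has_utf8_bom` is made up for this illustration):

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)  # b"\xef\xbb\xbf"


# Assumption: same content as in the report, with and without a BOM.
without_bom = "Test: §9".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

print(has_utf8_bom(without_bom))
print(has_utf8_bom(with_bom))

# The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if present.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8"))
```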
The user shouldn't be forced to use the BOM, as various Unix tools (less, xterm, the whole concept of streams and redirections...) can't handle it nicely. Recall that the BOM is optional for UTF-8 files; see also the "The reasons the standard does not advocate the UTF-8 BOM" text on Wikipedia. Requiring a BOM would therefore have significant drawbacks, while a sensible user-level default would be sufficient in practice. In particular, when the browser runs under a UTF-8 locale and the files are viewed locally (with the "file:" URL scheme), UTF-8 should be preferred over other encodings by default (and/or this should be configurable).
http://www.unicode.org/faq/utf_bom.html discourages use of BOMs on UTF-8 (with the rationale that it may interfere with other expectations for the initial characters in the file, e.g. #!).

On today's internet, the encoding detector really ought to assume UTF-8 until proven otherwise.
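
"Assume UTF-8 until proven otherwise" can be sketched as a strict decode attempt with a legacy fallback. This is a hypothetical illustration, not Firefox's detector; the function name `guess_encoding` and the Windows-1252 fallback are assumptions made for the example.

```python
def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Prefer UTF-8; use the fallback only if the bytes are not valid UTF-8."""
    try:
        data.decode("utf-8")  # strict mode: raises on any malformed sequence
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


# Valid UTF-8 (0xC2 0xA7 encodes "§") is reported as UTF-8.
print(guess_encoding(b"Test: \xc2\xa79"))

# A lone 0xA7 byte is not valid UTF-8, so the fallback is used.
print(guess_encoding(b"Test: \xa79"))
```

Non-ASCII text in a legacy single-byte encoding only rarely happens to form valid UTF-8 sequences, which is the usual argument for trying UTF-8 first.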
This has been resolved by no longer offering the "Universal" detector.
Status: UNCONFIRMED → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
The original bug is no longer there. However, Firefox now interprets UTF-8 files as ISO-8859-1 when no charset information is available (e.g. with the "file:" URL scheme). This isn't satisfactory either.

Note: I'm not talking about HTTP, where there can be some default (it was ISO-8859-1 in the past).
See bug 815551 comment 5.

FWIW, I'm very much against autodetecting UTF-8 on http[s] URLs but I think we should do it for file: URLs, since in the file: case we don't have HTTP headers but do have all the bytes in advance.

Feel free to argue for UTF-8 autodetection on file: URLs in bug 815551.
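
The distinction drawn above between http[s] and file: URLs comes down to whether every byte has been seen before a decision is made. A rough sketch using Python's incremental decoder (the helper name `is_valid_utf8` and the sample chunks are made up for illustration): a prefix of a stream can be valid UTF-8 even though the complete content is not, so only a check over all the bytes is conclusive.

```python
import codecs
from typing import Iterable


def is_valid_utf8(chunks: Iterable[bytes]) -> bool:
    """Check that the concatenation of all chunks is well-formed UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        for chunk in chunks:
            decoder.decode(chunk)  # buffers sequences split across chunks
        decoder.decode(b"", final=True)  # rejects a truncated trailing sequence
    except UnicodeDecodeError:
        return False
    return True


# The first chunks are valid UTF-8, even with "§" split across two chunks.
prefix = [b"Test: ", b"\xc2", b"\xa7"]
# A later chunk ends in a Latin-1 style 0xE9 byte, which is not valid UTF-8 here.
rest = [b"9 caf\xe9"]

print(is_valid_utf8(prefix))
print(is_valid_utf8(prefix + rest))
```

For a local file the whole content can be checked like this before rendering starts, whereas a network stream would have to be judged on a prefix or buffered in full.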
Bug 815551 is about the HTML parser. This bug is about text/plain files opened with the "file:" URL scheme, for which there is no way to specify the encoding (except possibly a BOM, which has major drawbacks in other contexts and so is not used in practice, as already said). This is very different from HTML files.
Bug 815551 was actually for HTML files served via HTTP and wontfixed for this reason (see the first comments). So, I've submitted a new bug for text/plain files with unknown charset: Bug 1071816.