Codepage recognition not working (UTF-8 vs. 1252)

Posts: 3
Joined: Thu Mar 02, 2017 5:31 am

Post by pkrabbit »

I've set Codepage recognition to Western European, but the auto-detection isn't working. It seems to always use the Default codepage (be it UTF-8 or 1252).

Steps to reproduce:
- Create two text files containing, for example, the word "désolé".
- Save the first file using UTF-8 (without BOM) and the second one using 1252.
- Rename both files to anything else (so that AkelPad doesn't know the encoding beforehand).
- Open each file in AkelPad. Whichever one doesn't match the Default codepage setting will show corrupted characters.

I believe detection should be possible, since this 1252 file isn't valid UTF-8, and it would be a very nice feature.
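
For what it's worth, here's a quick Python sketch (not AkelPad's actual detection code, just an illustration) showing that even a tiny 1252 file like this is distinguishable from UTF-8:

Code:

    # The CP1252 bytes for "désolé" are b'd\xe9sol\xe9': 0xE9 is a
    # UTF-8 three-byte lead, but the 's' (0x73) after it is not a
    # continuation byte, so a strict UTF-8 decoder must reject it.
    data = "désolé".encode("cp1252")
    try:
        data.decode("utf-8")
        print("valid UTF-8")
    except UnicodeDecodeError:
        print("not valid UTF-8, so a legacy codepage")  # this branch runs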

Site Admin
Posts: 6311
Joined: Thu Jul 06, 2006 7:20 am

Post by Instructor »

Maybe the file is too small. Characters for recognition must be greater than 11.

Posts: 3
Joined: Thu Mar 02, 2017 5:31 am

Post by pkrabbit »

"Maybe the file is too small. Characters for recognition must be greater than 11."
That does seem to be the case! It causes problems with smaller Portuguese files, though: relatively few words in the language use diacritics, so a .txt with fewer than 11 accented characters is quite common.

For example, a 1252 file that begins with "désolé" followed by lots of English text is also misdetected, since it contains only 2 special characters.

It'd be nice to fall back to 1252 as soon as invalid UTF-8 is detected.

Posts: 3
Joined: Thu Mar 02, 2017 5:31 am

Post by pkrabbit »

Why does codepage detection require more than 11 special characters?

I've tested in Notepad++, and it detects the 1252 "désolé" file correctly.

I think it shouldn't be detected as UTF-8, since the file isn't valid UTF-8. A suggestion: try UTF-8 by default, and if an invalid UTF-8 byte sequence is found, fall back to 1252.
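
A rough Python sketch of that heuristic (the function name is hypothetical, and AkelPad would presumably do this on raw bytes in C, but the logic is the same):

Code:

    def detect_and_decode(data: bytes) -> str:
        # Suggested fallback: try strict UTF-8 first; only if
        # validation fails, reinterpret the bytes as Windows-1252.
        try:
            return data.decode("utf-8")    # rejects invalid sequences
        except UnicodeDecodeError:
            return data.decode("cp1252")

    print(detect_and_decode("désolé".encode("utf-8")))   # valid UTF-8 wins
    print(detect_and_decode("désolé".encode("cp1252")))  # fallback kicks in

The only file this would misjudge is 1252 text whose accented bytes happen to form a valid UTF-8 sequence by coincidence, which is rare in practice.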