Codepage recognition not working (UTF-8 vs. 1252)

Posts: 3
Joined: Thu Mar 02, 2017 5:31 am

Post by pkrabbit »

I've set Codepage recognition to Western European, but the auto-detection isn't working. It seems to always use the Default codepage (be it UTF-8 or 1252).

Steps to reproduce:
- Create two text files containing, for example, the word "désolé".
- Save the first file using UTF-8 (without BOM) and the second one using 1252.
- Rename both files to anything else (so that AkelPad doesn't know the encoding beforehand).
- Open each file in AkelPad. Whichever one doesn't match the Default codepage setting will show corrupted characters.

I believe detection should be possible, since this 1252 file isn't valid UTF-8, and it would be a very nice feature.
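
For what it's worth, here's a quick Python sketch (not AkelPad's actual detection code, just an illustration) showing that even a tiny 1252 file like this is distinguishable from UTF-8:

Code:

    # The CP1252 bytes for "désolé" are b'd\xe9sol\xe9': 0xE9 is a
    # UTF-8 three-byte lead, but the 's' (0x73) after it is not a
    # continuation byte, so a strict UTF-8 decoder must reject it.
    data = "désolé".encode("cp1252")
    try:
        data.decode("utf-8")
        print("valid UTF-8")
    except UnicodeDecodeError:
        print("not valid UTF-8, so a legacy codepage")  # this branch runs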

Site Admin
Posts: 6311
Joined: Thu Jul 06, 2006 7:20 am

Post by Instructor »

Maybe the file is too small. Characters for recognition must be greater than 11.

Posts: 3
Joined: Thu Mar 02, 2017 5:31 am

Post by pkrabbit »

"Maybe the file is too small. Characters for recognition must be greater than 11."
That does seem to be the case! It causes problems with smaller Portuguese files, though: relatively few words in the language use diacritics, so a .txt with fewer than 11 accented characters is quite common.

For example, a 1252 file that begins with "désolé" followed by lots of English text is also misdetected, since it contains only 2 special characters.

It'd be nice to fall back to 1252 as soon as invalid UTF-8 is detected.

Posts: 3
Joined: Thu Mar 02, 2017 5:31 am

Post by pkrabbit »

Why does codepage detection require more than 11 special characters?

I've tested in Notepad++, and it detects the 1252 "désolé" file correctly.

I think it shouldn't be detected as UTF-8, since the file isn't valid UTF-8. A suggestion: try UTF-8 by default, and if an invalid UTF-8 byte sequence is found, fall back to 1252.
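
A rough Python sketch of that heuristic (the function name is hypothetical, and AkelPad would presumably do this on raw bytes in C, but the logic is the same):

Code:

    def detect_and_decode(data: bytes) -> str:
        # Suggested fallback: try strict UTF-8 first; only if
        # validation fails, reinterpret the bytes as Windows-1252.
        try:
            return data.decode("utf-8")    # rejects invalid sequences
        except UnicodeDecodeError:
            return data.decode("cp1252")

    print(detect_and_decode("désolé".encode("utf-8")))   # valid UTF-8 wins
    print(detect_and_decode("désolé".encode("cp1252")))  # fallback kicks in

The only file this would misjudge is 1252 text whose accented bytes happen to form a valid UTF-8 sequence by coincidence, which is rare in practice.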