codepage recognition broken?

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

For an hour or so I have tried in vain to get AP to recognise the codepages used by different text files, a mix of CP1252, ISO-8859-1 and UTF-8. The recognition (and the redetection logic) always returns the codepage defined as Default Codepage in the Settings. (The only exception is when I open a UTF-8 file with a BOM. However, many UTF-8 files I use do not have a BOM, and I do not intend to force its usage.)

I have also played around with various values for Buffer in the Settings page, none of which changed anything.

Hints anyone?

Offline
Site Admin
Posts: 6311
Joined: Thu Jul 06, 2006 7:20 am

Post by Instructor »


Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

Nope.

CP recognition for files in UTF-8 without BOM is definitely broken. If I create a new file with another editor, put in some text including a few UTF-8 characters, and save it without a BOM under a name never used before, AP is unable to recognize it as UTF-8. (My settings are as described in the FAQ, with a buffer size of 16384, so it's not a question of that.)

UTF-8 with a BOM is recognized, but that's no great feat. :-)

Offline
Posts: 1161
Joined: Sun Oct 20, 2013 11:44 am

Post by Skif_off »

tmsg
1. ISO-8859-1 is identical to CP1252 except for the code points 128-159 (0x80-0x9F).
2. The first 128 code points (0x00-0x7F) of ISO-8859-1/CP1252 are identical to UTF-8.
So if your text contains only the digits 0-9, letters of the English alphabet, basic punctuation and control codes, then UTF-8 == ISO-8859-1/CP1252 (assuming CP1252 is your default system non-Unicode codepage).

AkelPad uses letter-frequency analysis (https://en.m.wikipedia.org/wiki/Letter_frequency), and the buffer size is the portion of the file used for the analysis (in bytes: bytes 0 through buffer size, or the entire file if the file size < buffer size).

Also check the options under General > Codepage recognition: if you use CP1252, try "Western European (1252, OEM, UTF-8)".
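
To illustrate point 2 with a minimal sketch (my own, not AkelPad's code): a buffer of pure 7-bit ASCII decodes to the same text under all three encodings, so no detector can tell them apart from the bytes alone.

#include <stddef.h>

/* Minimal illustration (not AkelPad code): if every byte is < 0x80,
   ISO-8859-1, CP1252 and UTF-8 all decode the buffer to the same text,
   so the encodings are indistinguishable from the bytes alone. */
static int is_pure_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] > 0x7F)
            return 0;
    return 1;
}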

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

@Skif_off: The newly created file mentioned in yesterday's post DOES contain special characters (i.e. umlauts and the like). Yet it is consistently "recognized" as CP1252.

The problem lies with the missing BOM. In this case AP seems to assume that any special characters (values > 127) signal CP1252-encoded text. Such a file COULD of course be CP1252, but that is improbable when the byte sequences form valid UTF-8 characters. Deciding between CP1252 and UTF-8 needs a heuristic approach, which will sometimes produce the wrong result, but AP CONSISTENTLY opens UTF-8 files w/o BOM as CP1252.

As far as I can see, there's no way for AP to recognize a UTF-8 file w/o BOM.
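
To make the ambiguity concrete (a small demo of my own, not AP internals): the UTF-8 encoding of 'ü' is the byte pair C3 BC, which is also perfectly legal CP1252 text ("Ã¼"), just very unlikely in real prose.

#include <stdio.h>

int main(void)
{
    /* "für" encoded as UTF-8: 'ü' (U+00FC) becomes the byte pair C3 BC. */
    const unsigned char utf8[] = { 'f', 0xC3, 0xBC, 'r', 0 };

    /* Interpreted as CP1252, the very same bytes read as "fÃ¼r".
       Both interpretations are legal, which is why only a heuristic
       can decide; but lead-byte/continuation-byte pairs like C3 BC
       are rare in genuine CP1252 prose. */
    for (const unsigned char *p = utf8; *p; p++)
        printf("%02X ", *p);   /* prints: 66 C3 BC 72 */
    printf("\n");
    return 0;
}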

Offline
Site Admin
Posts: 6311
Joined: Thu Jul 06, 2006 7:20 am

Post by Instructor »

tmsg wrote:...put in some text including a few UTF-8 characters...
FAQ wrote:2. Maybe the file is too small. Characters for recognition must be greater than 11.

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

@Instructor: The file itself is bigger than 11 characters. I assume(d) the 11-character limit refers to the file length and NOT to the number of UTF-8 characters?

EDIT: Hm... it seems the 11-character limit indeed means there have to be at least that many UTF-8 characters in the file. That is unfortunate, because some of my files have no BOM and contain fewer than 11 UTF-8 characters. I have no problem trying to change that limit and recompiling AP... could you give me a hint where to look?

Offline
Site Admin
Posts: 6311
Joined: Thu Jul 06, 2006 7:20 am

Post by Instructor »

Edit.h wrote://Minimum detect characters representation
#define DETECTCHARS_REQUIRED 10
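
A guess at how a threshold like this is typically applied (the function below is invented for illustration; it is not AkelPad's actual call site): below the minimum sample, a frequency-based detector has too little evidence and falls back to the default codepage, which matches the behaviour tmsg observes on small files.

#include <stddef.h>

#define DETECTCHARS_REQUIRED 10   /* from AkelPad's Edit.h */

/* Invented illustration, not AkelPad's real logic: count non-ASCII
   bytes as "evidence" and refuse to guess when there are too few,
   since frequency analysis needs a minimum sample to be meaningful. */
static int enough_chars_to_detect(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] > 0x7F)
            count++;
    return count > DETECTCHARS_REQUIRED;  /* ">10", i.e. at least 11,
                                             matching the FAQ wording */
}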

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

@Instructor: Thanks a lot. Will check that out and perhaps (!) adapt it to my requirements.

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

So I've recompiled AP with other values for DETECTCHARS_REQUIRED, with no difference at all. The required number of characters stays at 10.

At any rate, I think the strategy of requiring a certain minimum number of characters is misguided anyway. I'd probably implement this along the following lines (see the sketch after this list):
1. If the file in question has a BOM, open it according to the BOM. (That's the easy bit.)
2. If the file has no BOM, it could be in any codepage or it could be UTF-8. Given the sheer number of possible codepages, my first step would be to try to read the file as UTF-8. Every clean ASCII file will open fine, as will every clean UTF-8 file.
3. If reading the file as UTF-8 succeeds, I'd assume it is UTF-8. If not, it's probably in some codepage, and a heuristic and/or a user-given precedence list could then "guess" the codepage.
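
A sketch of steps 2 and 3 (my own illustration of standard strict UTF-8 validation, not AP code): if the whole buffer passes this check, treat the file as UTF-8; otherwise fall back to codepage heuristics.

#include <stddef.h>

/* Strict UTF-8 validation: returns 1 if buf[0..len) is well-formed UTF-8.
   Rejects stray continuation bytes, truncated sequences, overlong forms,
   UTF-16 surrogates and code points above U+10FFFF. Pure ASCII passes. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t n;           /* continuation bytes expected */
        unsigned long cp;   /* decoded code point */

        if (b < 0x80) { i++; continue; }            /* ASCII byte */
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; }
        else return 0;                              /* invalid lead byte */

        if (i + n >= len) return 0;                 /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if ((n == 1 && cp < 0x80) ||                /* overlong forms */
            (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||       /* surrogates */
            cp > 0x10FFFF)
            return 0;
        i += n + 1;
    }
    return 1;
}

Note that a pure-ASCII file also passes, so "valid UTF-8" really means "UTF-8 or plain ASCII"; that is harmless, since the two encodings coincide there.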

Offline
Posts: 61
Joined: Thu Feb 04, 2016 5:27 am

Post by c-sanchez »

Hey @tmsg, can you post a link to the recompiled AP?
To be honest, I don't like using a script to detect the file encoding properly; I think this should be a built-in feature of the editor.
(from this thread http://akelpad.sourceforge.net/forum/vi ... 2645#32645)

On the other hand, I'm curious about something.
BabelPad 10 announces:
http://www.babelstone.co.uk/Software/BabelPad.html
"Support for the most recent version of the Unicode Standard, currently Unicode 10.0 (released June 2017)."
Is the same true for AP?

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

@c-sanchez: I'd rather not post a link to the executable, as it contains numerous other patches I've added over time, all of which work for me but have not been tested as thoroughly as an unmodified AP.

At any rate, the modified AP exe behaves exactly like the original, as already stated in the post above.