codepage recognition broken?

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

For an hour or so I have tried in vain to get AP to recognise the codepages used by different text files, a mix of CP1252, ISO-8859-1 and UTF-8. The recognition (and the redetection logic) always returns the codepage defined as Default Codepage in the Settings. (The only exception is when I open a UTF-8 file with a BOM. However, many UTF-8 files I use do not have a BOM, and I do not intend to force its usage.)

I have also played around with various values for Buffer in the Settings page, none of which changed anything.

Hints anyone?

Offline
Site Admin
Posts: 6311
Joined: Thu Jul 06, 2006 7:20 am

Post by Instructor »


Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

Nope.

CP recognition for files in UTF-8 without BOM is definitely broken. If I create a new file with another editor, put in some text including a few UTF-8 characters, and save it without a BOM under a name never used before, AP is unable to recognize it as UTF-8. (My settings are as described in the FAQ, with a buffer size of 16384, so it's not a question of that.)

UTF-8 with a BOM is recognized, but that's no great feat. :-)

Offline
Posts: 1161
Joined: Sun Oct 20, 2013 11:44 am

Post by Skif_off »

tmsg
1. ISO-8859-1 is identical to CP1252 except for the code points 128-159 (0x80-0x9F).
2. The first 128 code points (0x00-0x7F) of ISO-8859-1/CP1252 are identical to UTF-8.
So if your text contains only the digits 0-9, letters of the English alphabet, basic punctuation and control codes, then UTF-8 == ISO-8859-1/CP1252 (assuming CP1252 is your default system non-Unicode codepage).

AkelPad uses letter-frequency analysis (https://en.m.wikipedia.org/wiki/Letter_frequency), and the buffer size is the portion of the file used for the analysis (in bytes: bytes 0 through buffer size, or the entire file if the file size < buffer size).

Also check the options under General > Codepage recognition: if you use CP1252, try "Western European (1252, OEM, UTF-8)".
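
To illustrate point 2 with a minimal sketch (my own, not AkelPad's code): a buffer of pure 7-bit ASCII decodes to the same text under all three encodings, so no detector can tell them apart from the bytes alone.

#include <stddef.h>

/* Minimal illustration (not AkelPad code): if every byte is < 0x80,
   ISO-8859-1, CP1252 and UTF-8 all decode the buffer to the same text,
   so the encodings are indistinguishable from the bytes alone. */
static int is_pure_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] > 0x7F)
            return 0;
    return 1;
}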

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

@Skif_off: The newly created file mentioned in yesterday's post DOES contain special characters (i.e. umlauts and the like). Yet it is consistently "recognized" as CP1252.

The problem lies with the missing BOM. In this case AP seems to assume that any special characters (values > 127) signal CP1252-encoded text. Such a file COULD of course be CP1252, but that is improbable when the byte sequences form valid UTF-8 characters. Deciding between CP1252 and UTF-8 needs a heuristic approach, which will sometimes produce the wrong result, but AP CONSISTENTLY opens UTF-8 files w/o BOM as CP1252.

As far as I can see, there's no way for AP to recognize a UTF-8 file w/o BOM.
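
To make the ambiguity concrete (a small demo of my own, not AP internals): the UTF-8 encoding of 'ü' is the byte pair C3 BC, which is also perfectly legal CP1252 text ("Ã¼"), just very unlikely in real prose.

#include <stdio.h>

int main(void)
{
    /* "für" encoded as UTF-8: 'ü' (U+00FC) becomes the byte pair C3 BC. */
    const unsigned char utf8[] = { 'f', 0xC3, 0xBC, 'r', 0 };

    /* Interpreted as CP1252, the very same bytes read as "fÃ¼r".
       Both interpretations are legal, which is why only a heuristic
       can decide; but lead-byte/continuation-byte pairs like C3 BC
       are rare in genuine CP1252 prose. */
    for (const unsigned char *p = utf8; *p; p++)
        printf("%02X ", *p);   /* prints: 66 C3 BC 72 */
    printf("\n");
    return 0;
}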

Offline
Site Admin
Posts: 6311
Joined: Thu Jul 06, 2006 7:20 am

Post by Instructor »

tmsg wrote:...put in some text including a few UTF-8 characters...
FAQ wrote:2. Maybe the file is too small. Characters for recognition must be greater than 11.

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

@Instructor: The file itself is bigger than 11 characters. I assume(d) the 11-character limit refers to the file length and NOT to the number of UTF-8 characters?

EDIT: Hm... it seems the 11-character limit indeed means there have to be at least that many UTF-8 characters in the file. That is unfortunate, because some of my files have no BOM and contain fewer than 11 UTF-8 characters. I have no problem trying to change that limit and recompiling AP... could you give me a hint where to look?

Offline
Site Admin
Posts: 6311
Joined: Thu Jul 06, 2006 7:20 am

Post by Instructor »

Edit.h wrote://Minimum detect characters representation
#define DETECTCHARS_REQUIRED 10
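
A guess at how a threshold like this is typically applied (the function below is invented for illustration; it is not AkelPad's actual call site): below the minimum sample, a frequency-based detector has too little evidence and falls back to the default codepage, which matches the behaviour tmsg observes on small files.

#include <stddef.h>

#define DETECTCHARS_REQUIRED 10   /* from AkelPad's Edit.h */

/* Invented illustration, not AkelPad's real logic: count non-ASCII
   bytes as "evidence" and refuse to guess when there are too few,
   since frequency analysis needs a minimum sample to be meaningful. */
static int enough_chars_to_detect(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] > 0x7F)
            count++;
    return count > DETECTCHARS_REQUIRED;  /* ">10", i.e. at least 11,
                                             matching the FAQ wording */
}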

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

@Instructor: Thanks a lot. Will check that out and perhaps (!) adapt it to my requirements.

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

So I've recompiled AP with other values for DETECTCHARS_REQUIRED, with no difference at all. The required number of characters stays at 10.

At any rate, I think the strategy of requiring a certain minimum number of characters is misguided anyway. I'd probably implement this along the following lines (see the sketch after this list):
1. If the file in question has a BOM, open it according to the BOM. (That's the easy bit.)
2. If the file has no BOM, it could be in any codepage or it could be UTF-8. Given the sheer number of possible codepages, my first step would be to try to read the file as UTF-8. Every clean ASCII file will open fine, as will every clean UTF-8 file.
3. If reading the file as UTF-8 succeeds, I'd assume it is UTF-8. If not, it's probably in some codepage, and a heuristic and/or a user-given precedence list could then "guess" the codepage.
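
A sketch of steps 2 and 3 (my own illustration of standard strict UTF-8 validation, not AP code): if the whole buffer passes this check, treat the file as UTF-8; otherwise fall back to codepage heuristics.

#include <stddef.h>

/* Strict UTF-8 validation: returns 1 if buf[0..len) is well-formed UTF-8.
   Rejects stray continuation bytes, truncated sequences, overlong forms,
   UTF-16 surrogates and code points above U+10FFFF. Pure ASCII passes. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t n;           /* continuation bytes expected */
        unsigned long cp;   /* decoded code point */

        if (b < 0x80) { i++; continue; }            /* ASCII byte */
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; }
        else return 0;                              /* invalid lead byte */

        if (i + n >= len) return 0;                 /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if ((n == 1 && cp < 0x80) ||                /* overlong forms */
            (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||       /* surrogates */
            cp > 0x10FFFF)
            return 0;
        i += n + 1;
    }
    return 1;
}

Note that a pure-ASCII file also passes, so "valid UTF-8" really means "UTF-8 or plain ASCII"; that is harmless, since the two encodings coincide there.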

Offline
Posts: 61
Joined: Thu Feb 04, 2016 5:27 am

Post by c-sanchez »

Hey @tmsg, can you post a link to the recompiled AP?
To be honest, I don't like using a script to detect the file encoding properly; I think this should be a built-in feature of the editor.
(from this thread http://akelpad.sourceforge.net/forum/vi ... 2645#32645)

On the other hand, I'm curious about something.
BabelPad 10 announces:
http://www.babelstone.co.uk/Software/BabelPad.html
"Support for the most recent version of the Unicode Standard, currently Unicode 10.0 (released June 2017)."
Is the same true for AP?

Offline
Posts: 60
Joined: Tue Aug 21, 2012 11:17 am
Location: UK

Post by tmsg »

@c-sanchez: I'd rather not post a link to the executable, as it contains numerous other patches I've added over time, all of which work for me but have not been tested as thoroughly as an unmodified AP.

At any rate, the modified AP exe behaves exactly like the original, as already stated in the post above.