codepage recognition broken?
Posted: Sat May 06, 2017 11:18 am
by tmsg
I have for an hour or so tried in vain to get AP to recognise the codepages used for different text files, a mix of CP1252, ISO-8859-1 and UTF-8. The recognition (and the redetection logic) always returns the codepage that is defined as Default Codepage in the Settings. (The only exception is when I open a UTF-8 file with a BOM. However, many UTF-8 files I use do not have a BOM, and I do not intend to force its usage.)
I have also played around with various values for Buffer on the Settings page, none of which changed anything.
Hints anyone?
Posted: Tue May 23, 2017 6:19 am
by Instructor
Posted: Fri Jul 07, 2017 3:04 pm
by tmsg
Nope.
CP recognition for files in UTF-8 without BOM is definitely broken. If I create a new file with another editor, put in some text including a few UTF-8 characters, and save it w/o BOM under a name never used before, AP is unable to recognize it as UTF-8. (I have the settings as described in the FAQ; the buffer size is even 16384. So it's not a question of that.)
UTF-8 with BOM is recognized, but that's no great feat :-)
Posted: Fri Jul 07, 2017 7:30 pm
by Skif_off
tmsg
1. ISO-8859-1 is identical to CP1252 except for the code points 128-159 (0x80-0x9F).
2. The first 128 code points (0x00-0x7F) of ISO-8859-1/CP1252 are identical to UTF-8.
So if your UTF-8 file contains only the digits 0-9, letters of the English alphabet, basic punctuation and control codes, then UTF-8 == ISO-8859-1/CP1252 (assuming CP1252 is your default system non-Unicode codepage).
AkelPad uses letter-frequency analysis (https://en.m.wikipedia.org/wiki/Letter_frequency), and the buffer size setting determines how much of the file is analysed: the first "buffer size" bytes, or the entire file if the file size is less than the buffer size.
Also check the options under General > Codepage recognition: if you use CP1252, try "Western European (1252, OEM, UTF-8)".
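As a rough illustration of what frequency-based detection can look like (this is my own hypothetical sketch, not AkelPad's actual code), a detector can decode the buffer with each candidate codepage and score the result against the most frequent English letters, keeping the codepage with the highest score:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical scoring helper: count occurrences of the six most
   frequent English letters (e, t, a, o, i, n) in decoded text.
   Text decoded with the wrong codepage tends to score lower because
   mis-decoded bytes turn into rare symbols instead of common letters. */
static int english_score(const char *text, size_t len) {
    int score = 0;
    for (size_t i = 0; i < len; i++) {
        switch (tolower((unsigned char)text[i])) {
        case 'e': case 't': case 'a': case 'o': case 'i': case 'n':
            score++;
            break;
        default:
            break;
        }
    }
    return score;
}
```

A real detector would use full per-language frequency tables rather than a fixed letter set, which is presumably why the linked Wikipedia article matters here.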
Posted: Sat Jul 08, 2017 10:39 am
by tmsg
@Skif_off: The newly created file mentioned in yesterday's post DOES contain special characters (i.e. umlauts and the like). Yet it is consistently "recognized" as CP1252.
The problem lies with the missing BOM. In this case AP seems to assume that any special characters (values > 127) signal CP1252-coded characters. Such a file could of course be CP1252, but that is improbable if the byte sequences form valid UTF-8-coded characters. Deciding whether it's CP1252 or UTF-8 needs a heuristic approach and as such will sometimes produce the wrong result, but AP CONSISTENTLY opens UTF-8 files w/o BOM as CP1252.
As far as I can see, there's no way for AP to recognize a UTF-8 file w/o BOM.
Posted: Sun Jul 09, 2017 11:16 am
by Instructor
tmsg wrote:...put in some text including a few UTF-8 characters...
FAQ wrote:2. Maybe the file is too small. Characters for recognition must be greater than 11.
Posted: Sun Jul 09, 2017 12:45 pm
by tmsg
@Instructor: The file itself is bigger than 11 characters. I assume(d) the 11-character limit means file length and NOT the number of UTF-8 characters?
EDIT: Hm... it seems the 11-character limit indeed means there have to be at least that many UTF-8 characters in the file. That is unfortunate, because some of my files have no BOM and will have fewer than 11 UTF-8 characters. I have no problem trying to change that limit and recompile AP... could you give me a hint where to look?
Posted: Sun Jul 09, 2017 4:02 pm
by Instructor
Edit.h wrote://Minimum detect characters representation
#define DETECTCHARS_REQUIRED 10
Posted: Sun Jul 09, 2017 4:47 pm
by tmsg
@Instructor: Thanks a lot. Will check that out and perhaps (!) change to my requirements.
Posted: Wed Jul 12, 2017 3:02 pm
by tmsg
So I've tried recompiling AP with other values for DETECTCHARS_REQUIRED, with no difference at all: the required number of characters stays at 10.
At any rate, I think the strategy of requiring a certain minimum number of characters is misguided anyway. I'd probably try to implement this along these lines:
1. If the file in question has a BOM, open it according to the BOM. (That's the easy bit.)
2. If the file in question has no BOM, it could be in any codepage, or it could be UTF-8. Given the sheer number of possible codepages, my first step would be to try to read the file as UTF-8. Every clean ASCII file will open OK, as will every clean UTF-8 file.
3. If reading the file as UTF-8 succeeds, I'd assume it is UTF-8. If not, it's probably in some codepage, and then some heuristic and/or a user-given precedence list could "guess" the codepage.
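Steps 2 and 3 hinge on a strict UTF-8 well-formedness check. A minimal sketch in C (the function `is_valid_utf8` is hypothetical, not AkelPad code): plain ASCII passes trivially, valid UTF-8 passes, and typical CP1252 text with umlauts fails almost immediately, because a lone byte like 0xFC is not a valid UTF-8 lead byte.

```c
#include <stddef.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t n;            /* continuation bytes expected */
        unsigned long cp;    /* decoded code point */
        if (b < 0x80) { i++; continue; }             /* ASCII */
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; }
        else return 0;   /* stray continuation or invalid lead byte */
        if (i + n >= len) return 0;                  /* truncated */
        for (size_t k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        /* reject overlong forms, surrogates, and out-of-range values */
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000) || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += n + 1;
    }
    return 1;
}
```

For example, "Grüße" as UTF-8 (47 72 C3 BC C3 9F 65) validates, while the same word as CP1252 (47 72 FC DF 65) is rejected at the 0xFC byte. The probability that longer natural-language CP1252 text accidentally forms valid UTF-8 is tiny, which is what makes this heuristic so reliable in practice.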
Posted: Wed Jul 12, 2017 11:59 pm
by c-sanchez
Hey @tmsg, can you post a link to the recompiled AP?
To be honest, I don't like having to use a script to detect file encodings properly.
I think this should be a built-in feature of the editor.
(from this thread
http://akelpad.sourceforge.net/forum/vi ... 2645#32645)
On the other hand, I'm curious about something.
BabelPad 10 announces:
http://www.babelstone.co.uk/Software/BabelPad.html
Support the most recent version of the Unicode Standard, currently Unicode 10.0 (released June 2017).
Is the same true for AP?
Posted: Thu Jul 13, 2017 9:49 am
by tmsg
@c-sanchez: I'd rather not post a link to the executable as it contains numerous other patches I've added over time -- all of which do work for me but have not been tested as thoroughly as an unmodified AP.
At any rate, the modified AP exe behaves exactly like the original, as already stated in the post above.