
utf-8 recognition

Posted: Wed Jan 14, 2009 7:55 am
by harfman
Hi,

Thanks for the new feature in the latest version:

Added: Chinese recognition (UTF-8).

But automatic recognition of Korean UTF-8 characters is still unavailable.

If you want to test it, visit http://www.cineast.co.kr/ and click View Source.

The Korean UTF-8 characters will then be displayed broken.

Posted: Wed Jan 14, 2009 9:36 am
by Instructor
Test version for Japanese and Korean codepage recognition.

Posted: Wed Jan 14, 2009 10:53 am
by lupin1984
It doesn't work :(

If the default codepage is UTF-8, codepage recognition doesn't work (you can choose None, Cyrillic, Latin or Chinese).

BOM-less UTF-8 text can be auto-recognized :D

But I don't always use UTF-8 :)

Posted: Wed Jan 14, 2009 11:04 am
by harfman
Thanks for your fast reply.

Test version 4.14 still doesn't work for Korean UTF-8 characters.

Posted: Wed Jan 14, 2009 11:19 am
by Instructor
lupin1984 & harfman
1. Turn on "Options->Settings...->General->Codepage recognition->Chinese or Korean".
2. Turn off "Options->Settings...->Registry->Remember code page" (not necessary, but it gives clean results).
3. Change the default codepage back to your native one (if you changed it). Don't use UTF-8 as your default ANSI codepage.
"Options->Settings...->General->Default codepage"
4. Open the file again.

Note:
The file must not be too small.
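
Why a very small file can defeat detection can be illustrated with a minimal sketch. This is not AkelPad's actual algorithm: the names `looks_like_utf8` and `count_multibyte_sequences` and the `min_sequences` threshold are hypothetical, chosen only for illustration. The idea is that BOM-less bytes are accepted as UTF-8 only when they decode cleanly and contain enough multi-byte sequences to make a coincidence unlikely.

```python
def count_multibyte_sequences(data: bytes) -> int:
    """Count lead bytes of multi-byte UTF-8 sequences (0xC2..0xF4).

    Only meaningful for data that is already known to be valid UTF-8.
    """
    return sum(1 for b in data if 0xC2 <= b <= 0xF4)


def looks_like_utf8(data: bytes, min_sequences: int = 4) -> bool:
    """Hypothetical heuristic: valid UTF-8 plus a minimum amount of evidence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # invalid byte sequence, so not UTF-8
    # Too few non-ASCII characters: not enough evidence to decide.
    return count_multibyte_sequences(data) >= min_sequences


if __name__ == "__main__":
    print(looks_like_utf8("한국어 문자 인식 테스트".encode("utf-8")))
    print(looks_like_utf8("ok 한".encode("utf-8")))
```

With a threshold like this, a file holding a single Korean character is left undecided, while a longer one is accepted. The same text saved in a legacy codepage such as EUC-KR fails the strict UTF-8 decode and is rejected outright.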

Posted: Wed Jan 14, 2009 12:46 pm
by harfman
Ok it works well, thanks for your efforts

Posted: Fri Jan 16, 2009 3:32 pm
by u_u86
Since the work is going this way, is it possible to add recognition of the Turkish codepage (ANSI 1254)? If you need any information about it, feel free to ask.

Posted: Fri Jan 16, 2009 4:48 pm
by Instructor
u_u86
Test version "Turkish (OEM, UTF-8)".

Posted: Fri Jan 16, 2009 5:27 pm
by u_u86
Instructor wrote:u_u86
Test version "Turkish (OEM, UTF-8)".
It doesn't work for me. And what about ANSI 1254? Example: Turkish.rc, the AkelPad language resource file, is in cp1254. When I open it (with the default codepage set to cp1251 or cp1252), I want it to open automatically in cp1254.

The possible workaround of setting the default codepage to cp1254 and recognizing cp1251 doesn't work either: the text is always recognized as cp1251.

File with Turkish text: http://www.box.net/shared/iio2nq3dum
The only difference between 1254 and 1252 is ~6 characters.
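
The "~6 characters" remark can be checked with a short sketch. It relies on Python's built-in `cp1252` and `cp1254` codecs, not on anything from AkelPad, and the helper name `differing_bytes` is made up for this example. It lists every byte value that both codepages define but map to different characters.

```python
from typing import Dict, Tuple


def differing_bytes(cp_a: str, cp_b: str) -> Dict[int, Tuple[str, str]]:
    """Return byte values that decode to different characters in two codepages."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        try:
            a = raw.decode(cp_a)
            b = raw.decode(cp_b)
        except UnicodeDecodeError:
            continue  # byte is undefined in at least one of the codepages
        if a != b:
            diffs[value] = (a, b)
    return diffs


if __name__ == "__main__":
    for value, (a, b) in differing_bytes("cp1252", "cp1254").items():
        print(f"0x{value:02X}: cp1252={a!r} cp1254={b!r}")
```

Because so few byte values differ, a detector that looks only at which bytes occur has very little to go on when choosing between these two codepages.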

Posted: Fri Jan 16, 2009 6:43 pm
by Instructor
u_u86 wrote:Don't work for me.
I hope you understand that you must turn on "Options->Settings...->General->Codepage recognition->Turkish (OEM, UTF-8)", as I wrote earlier in this thread.
u_u86 wrote:... i want to automaticaly open it in cp1254
It will work as you want only if you set 1254 as your default codepage.

Posted: Sat Jan 17, 2009 3:23 am
by u_u86
Of course I set it.
If I set cp1254 as the default, it always opens all files in that codepage, so what is the recognition for? Only to recognize UTF-8 and OEM?

Cyrillic recognition works well when the default codepage is set to 1252 or 1254, and files are recognized as 1251. Maybe it uses some algorithm?

Posted: Sat Jan 17, 2009 5:26 am
by Instructor
u_u86 wrote:Only to recognize UTF-8 and OEM?
Yes.

Posted: Sat Jan 17, 2009 6:12 am
by u_u86
OK, understood. Thanks for the implementation!

Posted: Fri Jan 23, 2009 2:50 am
by lupin1984
You can test these two programs: no UTF-8 recognition problems, perfect.

But AkelPad is faster, lighter and more efficient :D

They are open source software, thanks.

Notepad++
http://notepad-plus.sourceforge.net/uk/site.htm

notepad2
http://www.flos-freeware.ch/notepad2.html

Posted: Fri Jan 23, 2009 3:42 am
by Instructor
lupin1984
Did you get my answers to your emails? Try reading them...
lupin1984 wrote:I tested on XP and Vista.

Vista is OK, but XP...

Thanks
This one is detected correctly. Make sure you follow all these steps on XP:
1. Turn on "Options->Settings...->General->Codepage recognition->Chinese".
2. Turn off "Options->Settings...->Registry->Remember code page" (not necessary, but it gives clean results).
3. Change the default codepage back to your native one (if you changed it). Don't use UTF-8 as your default ANSI codepage.
"Options->Settings...->General->Default codepage"
4. Open the file again.
lupin1984 wrote:This text file can't be detected.
This one is too small (it doesn't contain many Chinese characters) to be detected as UTF-8. Try duplicating the contents and it will be detected correctly:

Code:

测试文本thanks谢谢
18:42 2009/1/13
测试文本thanks谢谢
18:42 2009/1/13
lupin1984 wrote:It's Firefox's Simplified Chinese language package.

They are all BOM-less text files, and pageInfo.properties can't be auto-detected. You can test it.
Increase the recognition buffer, for example to 8096:

"Options->Settings...->General->Buffer"