UTF8 versus ANSI in Lisp

Hi everyone

I've written a language translation utility. It is a function, (tra "string-lang-1"), that returns "string-lang-2". The function has a file-reading part that reads a translation file in CSV format and stores it as a list. The function itself "assocs" "string-lang-1" in that list in order to return "string-lang-2".
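A minimal sketch of the shape of it (hypothetical names, and assuming a simple two-column CSV without quoted fields):

    ;; Sketch only - hypothetical names, naive CSV handling
    ;; (comma as the only separator, no quoted fields).
    (defun tra-load (file / f line pos result)
      (if (setq f (open file "r"))
        (progn
          (while (setq line (read-line f))
            (if (setq pos (vl-string-search "," line))
              (setq result (cons (cons (substr line 1 pos)      ; string-lang-1
                                       (substr line (+ pos 2))) ; string-lang-2
                                 result))))
          (close f)))
      (reverse result))

    (setq *tra-list* (tra-load "translations.csv"))

    (defun tra (str)
      (cond ((cdr (assoc str *tra-list*)))  ; found: return the translation
            (str)))                         ; not found: return the input unchanged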

I default to UTF-8, but I noticed that special characters from the translation file came back wrong. When I changed the encoding of the CSV file to ANSI, everything worked fine. I developed this on Windows, by the way.

Problem solved, but...

Can someone shine a light?

Comments

  • Torsten Moses
    edited January 2019

    Dear Wiebe,

    traditionally, AutoLISP files are always ANSI ... the characters 0x00...0x7F are the lower 128 characters, strictly ASCII-defined; any characters above 0x7F, i.e. 0x80...0xFF (the upper 128 characters, ANSI), are codepage dependent.

    The UTF rules would work if the consortium had not made a major mistake :-)
    for UTF16 and UTF32, special lead bytes in the text file are mandatory (the so-called "BOM", Byte Order Mark) - while such a BOM is only optional for UTF8 ... and that is the major flaw.

    The majority of editors that allow saving as UTF8 usually do NOT store that optional BOM sequence at the beginning of the text file ...
    in the end, there is a dilemma: when a text file is opened, there is no reliable way to distinguish whether it is ANSI (no encoding used) or UTF8 ...
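    When the BOM is present, it can at least be detected; a rough sketch (hypothetical function name, and with the caveat that (read-char) may itself already convert bytes above 0x7F, depending on the engine and codepage - which is exactly the dilemma):

        ;; Sketch: probe a file for the optional UTF8 BOM, the byte
        ;; sequence 0xEF 0xBB 0xBF (decimal 239 187 191).
        (defun has-utf8-bom-p (file / f bytes)
          (if (setq f (open file "r"))
            (progn
              (setq bytes (list (read-char f) (read-char f) (read-char f)))
              (close f)
              (equal bytes '(239 187 191)))))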

    Some editors, and also BricsCAD for some support files, try to determine whether ANSI or UTF8 is used by the presence of certain characters >= 0xCE, which then indicates ANSI; but conversely, the absence of characters >= 0xCE does not mean it is UTF8 :smile:

    So long story short - best to save as ANSI, no encoding at all ...
    the Lisp engine cannot really know whether some characters in your text file are
    a) UTF8 by intention
    b) seemingly UTF8, just by coincidence, but intended as ANSI (for example, the byte pair 0xC3 0xA9 is "é" under UTF8, but "Ã©" under the Windows-1252 ANSI codepage)
    c) ANSI by intention
    if you want to use translations, some characters, for some languages, will then work, but only under a given "context", meaning the GUI codepage of the system.

    hope this helps a bit ... many greetings !

  • Thanks for clarifying!

  • Just a small addition:
    The 'open' function in BricsCAD allows an optional UTF/Unicode specifier. See the LDSP (LISP Developer Support Package) documentation.

  • Yes, there were some requests from 3rd party developers to be able to read/write binary and UTF encoded files ...
    and I also encountered some (more or less undocumented) features of the AutoCAD AutoLISP (open) function to optionally specify / support UTF and Unicode text files :smile:
    Hence, we added such support as well ...
    so far, that seems to work fine for reading/writing data files ...

    For LISP code files, ANSI is still preferred ....
    many greetings !

  • To get some things clear:

    CAD Lisp has limited support for Unicode/UTF. This makes the use of code pages necessary.

    ASCII: a fixed 7-bit, 128-character set.

    Extended ASCII: 8-bit, the fixed ASCII set plus 128 variable, localized characters (0x80-0xFF). Use extended ASCII for strings in user output like (princ ...). The code page is the map that defines those upper 128 characters.

    In DOS, chcp tells you which extended ASCII code page is in use - for example, code page "437" is the original IBM PC character set. Trivium: the lower half of "437" is not entirely equal to ASCII, but the main printable part is.

    In Windows, "ANSI" code pages are used, which are also extended ASCII. Windows-1252 is the most used code page; it covers the code page "437" characters in general. Trivium: Windows-1252 was never actually approved by ANSI, so the name is a misnomer.
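    A small illustration of that codepage dependence (the actual output depends on the active code page, so this is only indicative):

        ;; One and the same 8-bit code under two different code pages:
        (princ (chr 233))  ; code 0xE9 prints as "é" under Windows-1252,
                           ; but as the Greek "Θ" under DOS code page 437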

    In Unices, the default encoding is UTF-8; echo $LANG tells you more. How CAD Lisp handles this is unknown to me so far. How is that handled?

  • Dear Wiebe,

    The Windows GUI these days uses Unicode as 2-byte Unicode; Linux uses 4-byte Unicode, hence covers many more characters.
    Linux usually prefers UTF8 for encoding characters into 8-bit sequences: as Unicode allows character values up to the 4-byte range, UTF8 encodes such "numbers" above the ASCII range (above 0x7F) into 2...3...4 byte sequences, where every byte of such a sequence has its high bit set.
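    To make those 2...3...4 byte sequences concrete, a small sketch (hypothetical function name, limited here to code points up to 0xFFFF):

        ;; Sketch: encode a Unicode code point (0...65535 only) into
        ;; its UTF8 byte values.
        (defun utf8-bytes (cp)
          (cond
            ((< cp 128)                 ; 1 byte: plain ASCII, unchanged
             (list cp))
            ((< cp 2048)                ; 2 bytes: 110xxxxx 10xxxxxx
             (list (+ 192 (lsh cp -6))
                   (+ 128 (logand cp 63))))
            (t                          ; 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
             (list (+ 224 (lsh cp -12))
                   (+ 128 (logand (lsh cp -6) 63))
                   (+ 128 (logand cp 63))))))

        ;; (utf8-bytes 233) -> (195 169), i.e. 0xC3 0xA9 for "é" (U+00E9)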

    And that's the dilemma: for UTF16 or UTF32 that "BOM" identifier is mandatory, but not for UTF8;
    and if the "BOM" header is missing, you have no real chance to know upfront whether a text file is ANSI or UTF8 (potentially containing multi-byte sequences) ... any prediction of that, i.e. "by file content", is somewhat stochastic, not deterministic (the crystal ball).

    Do not confuse the GUI with text files ... the problem is with text files only :-)

    Our Lisp engine tries to be somewhat "smart" ... if a Lisp file is loaded and a character is encountered that the Windows/Linux/Mac host cannot convert to a GUI codepage value, then our Lisp simply keeps the numerical value of that character. This usually results in GUI garbage, but the content remains intact (say, if that file is saved again), rather than damaging the file content.

    Therefore, Lisp code containing extended characters (>= 0x80) is always GUI codepage dependent.

    For translations of Lisp programs: usually, any GUI-related string should be placed into an external file (which can even be a Lisp code file - I did this years ago with (setq keyMsg-1 "whatever") in external files, per language, keeping messages "as code"). Depending on the environment, the particular language code file was loaded, resulting in proper GUI texts, even for Korean, Thai, Japanese etc.
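    A minimal sketch of that pattern (the file names and the *app-lang* variable are hypothetical):

        ;; Contents of a per-language message file, e.g. "messages-en.lsp"
        ;; (file and variable names are hypothetical):
        (setq keyMsg-1 "Select an object: "
              keyMsg-2 "Nothing selected.")

        ;; At startup, load the message file for the current language,
        ;; chosen here via a (hypothetical) configuration variable:
        (setq *app-lang* "en")
        (load (strcat "messages-" *app-lang* ".lsp"))

        ;; GUI code then uses the variables, independent of the language:
        (princ keyMsg-1)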

    This works because "data files" are loaded the normal way, where the file content read goes through the host CRT (the C runtime functions of the OS), which usually does the necessary conversions for multi-byte sequences.


    In general, there is a lot of confusion mixing "UTF8/16 in the GUI" with "UTF encoding in text files"; the two are virtually unrelated :smile:
    many greetings !