Some thoughts on Unicode.

Stephen J. Turnbull stephen at xemacs.org
Tue Aug 31 23:24:10 EDT 2004


>>>>> "Aidan" == Aidan Kehoe <kehoea at parhasard.net> writes:

    Aidan> My understanding (yes, I don't read Japanese, so it's
    Aidan> probably flawed) of what UTF-2000 did was that they were
    Aidan> keeping around _all_ the old-school Mule charsets, as well
    Aidan> as having Unicode.

Yes.

    Aidan> That's what made keeping around so many tables necessary,
    Aidan> and what ballooned it to 30MB.

The tables were not loaded at runtime; they were dumped as Lisp.
That's what ballooned it, and it's no longer relevant in 21.5.

    Aidan> But if we're not de-unifying Han characters, that shouldn't
    Aidan> be a huge deal.

Even deunifying Han is a maximum of 2 x 2 x 21k x 3 bytes < 256 kB,
not 30MB.  Altogether the Unicode tables in XEmacs 21.5 occupy, uh,
817600 bytes (configure with --memory-usage-stats, M-x
show-memory-usage).
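
For anyone who wants to check the arithmetic, here is a minimal sketch
in Python.  The factors are copied verbatim from the estimate above;
reading "21k" as 21,000 and "256 kB" as 256 * 1024 bytes are my
assumptions, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the size bound quoted above.
# Assumption: "21k" means 21,000 and "256 kB" means 256 * 1024 bytes.
deunified_han_bytes = 2 * 2 * 21_000 * 3
bound_bytes = 256 * 1024
measured_table_bytes = 817_600      # figure reported by show-memory-usage
thirty_mb = 30 * 1024 ** 2

print(deunified_han_bytes)                 # 252000
print(deunified_han_bytes < bound_bytes)   # True
print(measured_table_bytes < thirty_mb)    # True
```

The bound also holds if "256 kB" is read as 256,000 bytes.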

    >> Let's not go there.  Just go straight to UTF-32, and deal with
    >> space efficiency later.  The structures for widechar
    >> representations are already in place, but the code for widechar
    >> buffers still needs to be written, I think.

    Aidan> Or do you mean having Ibytes sixteen bits wide?

Yes, or 32 bits wide.

    Aidan> But Mule's fast enough as-is, and if we are unifying Han,
    Aidan> the maximum size of a UTF-8 encoded character is four octets,
    Aidan> as it is for current Ibytes.

I don't think it is, actually, and especially not in an XEmacs
instrumented for debugging.  There are a lot of potentially n^2
algorithms caused by the fact that checking char positions requires
counting from the beginning of the buffer.  (With error-checking that
could be n^3.)  We do some caching of positions, and optimization of
the counting process, but this turns out to not be very effective.

Also, the code required to deal with caching buffer positions
immensely complicates the buffer motion code and is, I believe, the
source of a large number of bugs.
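
To make the cost concrete, here is a minimal sketch in Python.  It is
not the XEmacs code: UTF-8 stands in for any variable-width internal
encoding, and both function names are hypothetical.  With a
variable-width buffer, turning a character position into a byte offset
means scanning from the start; with a fixed-width buffer it is a
multiplication.

```python
def char_to_byte_utf8(buf: bytes, char_index: int) -> int:
    """Byte offset of character number char_index, found by a linear scan."""
    seen = 0
    for offset, byte in enumerate(buf):
        # Any byte that is not a UTF-8 continuation byte starts a character.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:
        return len(buf)             # position just past the last character
    raise IndexError(char_index)


def char_to_byte_fixed(char_index: int, width: int = 4) -> int:
    """Same question for a fixed-width (UTF-32-like) buffer: O(1)."""
    return char_index * width


# "naive" with a diaeresis on the i, a space, then three kanji.
text = "na\u00efve \u65e5\u672c\u8a9e"
utf8 = text.encode("utf-8")

print(char_to_byte_utf8(utf8, 7))   # 10
print(char_to_byte_fixed(7))        # 28
```

Any loop that asks this question once per character does O(n) work
per lookup in the first case, which is where the n^2 behavior comes
from.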

    >> In the next release, we do NOT want fonts to be determined by
    >> charset, we want them determined by "culture" (probably ==
    >> language).

    Aidan> Okay. I had thought that the reason Mule originally
    Aidan> separated its Han charsets was discernment, and thinking
    Aidan> things through; if it's incompatibility, for the sake of
    Aidan> it, then moving away is fine.

Well, not entirely for the sake of it.  But the fact is that
monolingual usage is the normal case, not multilingual usage.
Penalties should be paid for multilingual use, not the other way
around.

    Aidan> I know [latin-unity is] a kludge--I still think its API
    Aidan> should be preserved. An efficient way to ask "can this
    Aidan> buffer be encoded in iso-8859-1 without losing data" would
    Aidan> be, and is, worthwhile.

What's wrong with Just Doing It, and dealing with the error if data
loss would occur?  In general, this is going to require only a tiny
bit more CPU (a memory store per character) than not doing the
translation, and you're going to need the output buffer eventually
(unless it's purely for curiosity's sake that you're asking ;-), so no
space saving.  In the vast majority of cases, you will end up only
trying one coding system, so why do a lookup per character twice?

I don't see a need for this API; I guess if it turns out that there's
a lot of curiosity about really big buffers, we could add a "dry-run"
flag to the encode-coding-region API and save on the output buffer
space.  (This is more or less your second interpretation of what I
wrote.)
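
A minimal sketch of the "Just Do It" approach, in Python rather than
Emacs Lisp.  It uses Python's own str.encode as a stand-in for the
XEmacs codec; encode_or_report and can_encode are hypothetical names,
not an existing or proposed XEmacs API.

```python
def encode_or_report(text: str, coding_system: str):
    """Encode outright; return (bytes, None) on success, (None, error) on loss."""
    try:
        return text.encode(coding_system), None
    except UnicodeEncodeError as err:
        # err.start is the index of the first character that cannot be encoded.
        return None, err


def can_encode(text: str, coding_system: str) -> bool:
    """Hypothetical 'dry-run' query: same pass over the text, output discarded."""
    _, err = encode_or_report(text, coding_system)
    return err is None


data, err = encode_or_report("caf\u00e9", "iso-8859-1")
print(data, err)                    # b'caf\xe9' None

data, err = encode_or_report("a\u65e5", "iso-8859-1")
print(data, err.start)              # None 1
```

In this sketch the dry-run variant still builds the output before
dropping it, so it only illustrates the interface; a real dry-run flag
would skip the output buffer.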

Note that the problem that latin-unity addresses is not efficiency,
but rather that our current ISO-8859-X, X != 1, coding systems simply
use ISO 2022 extensions to encode non-ISO-8859-X characters (typically
not what is wanted), while ISO-8859-1 (aka binary) just throws away
anything that isn't ISO-8859-1 and replaces it with ~.

With that context, do you still see a need for a testing API?

    >> An alternative experiment:

    Aidan> (Really, at this point, in August 2004, we should be past
    Aidan> experimenting :-()

Somebody should, but it's not clear why it should be us.  Unicode
simply has not been high on anybody's priority list, not the
developers and not the users, either.  (That is, in the Emacs
community.)

    Aidan> How does that perform with iso-2022-based coding systems?
    Aidan> Is there much of a hit? I suppose, though, that's not going
    Aidan> to be done that often, compared to converting to UCS-4 for
    Aidan> redisplay.

I haven't finished the implementation to the point where I can use it
at all, otherwise it would be in CVS.  ;-)  However, it's the same
strategy used by XEmacs/UTF-2000, and there was not a noticeable hit
there.

I'll see if I can get something into CVS, #ifdef'd if working, on a
branch if not, in a few days.

    >> The next step after that is to change the internal
    >> representation to UTF-8 and use unicode-to-char to index into
    >> the fonts (except for Xft).

    Aidan> ? Oh, as in, we can call XftDrawStringUtf8, after checking
    Aidan> whether a given code point is available in the font.

Something like that, although actually we'll probably use the 16-bit
widechar interface.  (XEmacs redisplay does not work in terms of
Mule-encoded strings; it uses arrays of Ichars.)

    >> Finally we can cache font indices in a Unicode chartable.

    >> The next step after that is to arrange for the codecs to set
    >> extents in the buffer for the charsets decoded.

    Aidan> What does that give us? Except, perhaps, caching
    Aidan> information for writing things out again. Hmm.

Technically, yes, it allows us to deal with David's request for
invertible codecs.

Most politically important, though, is Han disunity ;-).  In general,
different languages will prefer different fonts, and we should cater
to that, not by changing faces, but by providing faces that map
languages (not charsets) to fonts.  In pig-xml,

<extent lang="fr_FR">C'est la vie</extent> is a French phrase.

would be preferable to

<extent face="italic">C'est la vie</extent> is a French phrase.

as an implementation of the English convention that foreign words
should be displayed as emphasized.  Or (I'm told) the German umlaut is
positioned at a different height above the base character than some
other languages' diaeresis.
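
A minimal sketch of what "map languages (not charsets) to fonts" could
look like, in Python.  Everything here is hypothetical: the table, the
font names, and font_for are illustrations, not XEmacs face or
specifier code.

```python
# Hypothetical font table keyed by language tag rather than by charset.
FONTS_BY_LANG = {
    "ja": "example-japanese-font",
    "zh_CN": "example-simplified-chinese-font",
    "fr_FR": "example-italic-font",
}
DEFAULT_FONT = "example-default-font"


def font_for(lang):
    """Pick a font from a language tag, trying 'll_CC' first and then 'll'."""
    if lang is None:
        return DEFAULT_FONT
    if lang in FONTS_BY_LANG:
        return FONTS_BY_LANG[lang]
    base = lang.split("_")[0]
    return FONTS_BY_LANG.get(base, DEFAULT_FONT)


# Runs of text tagged with a language, in the spirit of the extents above.
# The last two runs hold the same code point, U+76F4, under different tags.
runs = [
    ("C'est la vie", "fr_FR"),
    (" is a French phrase.", None),
    ("\u76f4", "ja_JP"),
    ("\u76f4", "zh_CN"),
]

for text, lang in runs:
    print(repr(text), lang, font_for(lang))
```

The point of the last two runs is that the font choice follows the
language tag on the run, so one unified Han code point can still be
drawn with different fonts.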

Not that I really expect anyone will ever use it, but I'd like it to
be there as sort of an "Easter egg".  ;-)

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.



