goofy charset selection for Unicode pastes
Aidan Kehoe
kehoea at parhasard.net
Sat Aug 21 07:19:51 EDT 2004
On the twentieth of August, Jamie Zawinski wrote:
> I'm almost afraid to ask, but how does Mozilla end up displaying that
> mdash properly?
Mozilla has Unicode support from the ground up; Mule doesn't. Recent Mozilla
and Firefox use Xft, which is also Unicode from the ground up; there, they just
pass the string, encoded as (for example) UTF-8, to the GTK wrapper around
XftDrawString8. (After doing some magic to make sure that the font contains the
characters in the string; Xft doesn't do any internal magic to this effect, and
happily displays blocks if the font doesn't contain the appropriate characters.
The Mozilla people, rightly, think this is a bad thing.)
With old-school server-side fonts, Mozilla seems to maintain a mapping from
each of the normal X11 font registries to Mozilla's internal Unicode character
set, and works out which X11 fonts can be used to display a given character.
Mule needs a Unicode character set. There are many Unicode characters that are
not contained in the repertoires of the existing Mule character sets; when
XEmacs sees them, they get trashed. Stephen tells me adding such a Mule character
set is on the table for 21.5, in a mail to this list that isn't in the
archives.
> Really, I just wish mdash (and ldquo, and all that other Windows crap)
> got turned into the roughly-corresponding Latin1 characters on paste...
That's the wrong thing. You asked for a UTF-8 string, you should get the actual
Unicode characters. In the situation where XEmacs has a Unicode character set,
working out which font's codepoint we can use for it becomes much easier, in
the case where an *-iso10646-1 font exists with the corresponding codepoint.
In the situation where we've moved to Xft (which Stephen made noises about
finishing recently--patches exist, but they don't DTRT with Mule), then it
becomes easier again; to display any character, canonicalise it to Unicode
(from its Chinese or Central European or Ethiopic or IPA or whatever Mule
charset), and pass it to Xft.
> As long as we're on the topic -- how do I search a buffer for
> "problematic" characters? I used to do
>
> (re-search-forward "[\000-\010\013-\037\177-\377]")
>
> but that does not match unicrud. The best guess I've been able to
> come up with is
>
> (while (and (not (eobp))
>             (eq 'ascii (charset-after (point))))
>   (forward-char 1))
>
> but that really doesn't smell right.
(let ((charsets (delq 'ascii (charsets-in-region (point-min)
                                                 (point-max)))))
  (when charsets
    ;; Okay, we've got non-ASCII; do extra processing here
    ))
is what VM does in its MIME handling. You may want to separate that out from
the checks for control characters; what are you doing with the "problematic"
characters, exactly?
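
If you do want to step through the buffer and keep the two cases apart, something
along these lines might do as a rough sketch. The function name and the returned
symbols are made up for illustration, and it assumes a Mule build where
char-charset is available; note that characters in the control-1 charset would be
reported as non-ASCII here rather than as control characters.

```elisp
(defun my-next-problematic-char ()
  "Move point to the next control or non-ASCII character.
Return the symbol `control' or `non-ascii' describing the character at
point, or nil if the end of the buffer was reached first.
Hypothetical helper, for illustration only."
  ;; Skip what the quoted regexp left alone: tab, newline, and
  ;; printable ASCII (space through tilde).
  (skip-chars-forward "\011\012\040-\176")
  (cond ((eobp) nil)
        ;; Still in the ascii charset, so it must be a control character.
        ((eq 'ascii (char-charset (char-after (point)))) 'control)
        ;; Anything else belongs to some non-ASCII charset.
        (t 'non-ascii)))
```

Call it in a loop, doing (forward-char 1) after each hit, to visit every such
character in turn.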
--
Like the early Christians, Marx expected the millennium very soon; like
their successors, his have been disappointed--once more, the world has shown
itself recalcitrant to a tidy formula embodying the hopes of some section of
mankind. (Russell)