goofy charset selection for Unicode pastes

Aidan Kehoe kehoea at parhasard.net
Sat Aug 21 07:19:51 EDT 2004


 On the twentieth of August, Jamie Zawinski wrote:

 > I'm almost afraid to ask, but how does Mozilla end up displaying that
 > mdash properly?  

Mozilla has Unicode support from the ground up; Mule doesn't. Recent Mozilla
and Firefox use Xft, which is also Unicode from the ground up; there, they just
pass the encoded string (UTF-8, say) to the GTK wrapper around the Xft drawing
calls (XftDrawStringUtf8 is the one that takes UTF-8; XftDrawString8 expects
one byte per character). Before that, they do some magic to make sure that the
font contains the characters in the string; Xft does no such checking itself,
and happily displays blocks if the font lacks the appropriate characters.
The Mozilla people, rightly, think this is a bad thing.

With old-school server-side fonts, Mozilla seems to maintain a mapping from
each of the normal X11 font registries to Mozilla's internal Unicode character
set, and works out which X11 fonts can be used to display a given character.
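A rough sketch of that kind of registry-to-Unicode table, in Python so it can
be run outside XEmacs. This is not Mozilla's code; the table contents, and the
names REGISTRY_CODECS, build_coverage and registries_for, are assumptions made
up for illustration, and Python's codecs stand in for the real per-registry
mapping tables.

```python
# Hypothetical table: X11 font registry -> Python codec with the same repertoire.
REGISTRY_CODECS = {
    "iso8859-1": "iso8859_1",
    "iso8859-2": "iso8859_2",
    "koi8-r": "koi8_r",
}


def build_coverage(codec):
    """Return the set of Unicode characters that a single-byte codec can encode."""
    covered = set()
    for byte in range(256):
        try:
            covered.add(bytes([byte]).decode(codec))
        except UnicodeDecodeError:
            pass  # this byte is unassigned in the codec
    return covered


# Registry -> set of Unicode characters a font with that registry could hold.
COVERAGE = {reg: build_coverage(codec) for reg, codec in REGISTRY_CODECS.items()}


def registries_for(char):
    """List the registries whose fonts could display CHAR, in table order."""
    return [reg for reg in REGISTRY_CODECS if char in COVERAGE[reg]]
```

With this table, registries_for("\u2014") (the em dash) comes back empty,
which is the case where nothing but an iso10646-1 font will do.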

Mule needs a Unicode character set. There are many Unicode characters that are
not contained in the repertoires of the existing Mule character sets; when
XEmacs sees them, they get trashed. Stephen tells me that adding such a Mule
character set is on the table for 21.5, in a mail to this list that isn't in
the archives.

 > Really, I just wish mdash (and ldquo, and all that other Windows crap)
 > got turned into the roughly-corresponding Latin1 characters on paste...

That's the wrong thing. You asked for a UTF-8 string, so you should get the
actual Unicode characters. Once XEmacs has a Unicode character set, working out
which font's codepoint we can use for a character becomes much easier, at least
where an *-iso10646-1 font exists that contains the corresponding codepoint.

In the situation where we've moved to Xft (which Stephen made noises about
finishing recently--patches exist, but they don't DTRT with Mule), then it
becomes easier again; to display any character, canonicalise it to Unicode
(from its Chinese or Central European or Ethiopic or IPA or whatever Mule
charset), and pass it to Xft.
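To make the canonicalisation step concrete, here is a small Python sketch. It
is not XEmacs or Xft code: the CHARSET_CODECS table and both function names
are hypothetical, and Python's codecs stand in for Mule's own conversion
tables. The bytes it produces are what a UTF-8 drawing call would be handed.

```python
# Hypothetical table: Mule-style charset name -> Python codec.
CHARSET_CODECS = {
    "latin-iso8859-1": "iso8859_1",
    "latin-iso8859-2": "iso8859_2",
    "cyrillic-iso8859-5": "iso8859_5",
}


def canonicalise(charset, octets):
    """Return the Unicode string for OCTETS interpreted in CHARSET."""
    return octets.decode(CHARSET_CODECS[charset])


def utf8_for_renderer(charset, octets):
    """Canonicalise to Unicode, then encode as UTF-8 for the drawing layer."""
    return canonicalise(charset, octets).encode("utf-8")
```

For example, octet 0xB3 in latin-iso8859-2 canonicalises to U+0142, which goes
to the renderer as the two bytes 0xC5 0x82, whatever charset it started in.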

 > As long as we're on the topic -- how do I search a buffer for
 > "problematic" characters?  I used to do
 > 
 >    (re-search-forward "[\000-\010\013-\037\177-\377]")
 > 
 > but that does not match unicrud.  The best guess I've been able to
 > come up with is
 > 
 >    (while (and (not (eobp))
 >                (eq 'ascii (charset-after (point))))
 >      (forward-char 1))
 > 
 > but that really doesn't smell right.

(let ((charsets (delq 'ascii (charsets-in-region (point-min)
                                                 (point-max)))))
  (when charsets
    ;; Okay, we've got non-ASCII; do extra processing here.
    ))

is what VM does in its MIME handling. You may want to separate that out from
the checks for control characters; what are you doing with the "problematic"
characters, exactly?
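If the goal is simply to find the first character that is neither printable
ASCII, tab nor newline, the original character class can be inverted so that
it also catches everything above \177. A sketch in Python on a plain string,
not on an XEmacs buffer; the names PROBLEMATIC and find_problematic are made
up here, and treating tab and newline as harmless is an assumption carried
over from the quoted regexp.

```python
import re

# Anything that is not tab, newline or printable ASCII (space through tilde).
PROBLEMATIC = re.compile(r"[^\t\n\x20-\x7e]")


def find_problematic(text):
    """Return the offset of the first problematic character, or None."""
    match = PROBLEMATIC.search(text)
    return match.start() if match else None
```

This lumps control characters and non-ASCII together; keeping the two checks
separate, as suggested above, would mean two character classes instead of one.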

-- 
Like the early Christians, Marx expected the millennium very soon; like
their successors, his have been disappointed--once more, the world has shown
itself recalcitrant to a tidy formula embodying the hopes of some section of
mankind. (Russell)



