goofy charset selection for Unicode pastes
Aidan Kehoe
kehoea at parhasard.net
Mon Aug 23 13:27:15 EDT 2004
[This lacks a bit of context at present; I imagine Glynn's original will get
de-spam-trapped presently.]
On the twenty-third of August, Glynn Clements wrote:
> The problems with that are:
>
> 1. It requires Mule.
[...]
Oh, and it's a stupid hack, too :-). That's a problem.
Personally, I'm not interested in non-Mule any more. I need to use Unicode;
Mule-UCS, while broken, gives me better Unicode than non-Mule; so I'm not using
non-Mule, end of story. Maybe that disqualifies me from talking about non-Mule.
> 2. While it would display correctly, you would still end up with a
> Chinese character in the buffer, which will do the wrong thing if you
> save the buffer, copy the data from the buffer, send it to a process
> or network stream etc.
Eh. You'd end up with an em dash in the buffer. Which is only a Chinese
character if that's what the coding system you're using thinks it is. If you're
using UTF-8, for example, it's western.
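To make that concrete, here is a small sketch in Python rather than Emacs Lisp, purely as an illustration; Python's codecs are my stand-in for coding systems and have nothing to do with how Mule stores characters internally. It shows that one and the same em dash turns into different bytes (or fails to encode) depending on the coding system used on the way out.

```python
# Illustrative only: Python's codecs stand in for Emacs coding systems here.
EM_DASH = "\u2014"  # the character under discussion

# One character in the buffer, different bytes depending on the coding system.
as_utf8 = EM_DASH.encode("utf-8")
as_cp1252 = EM_DASH.encode("cp1252")

print(as_utf8)    # b'\xe2\x80\x94'
print(as_cp1252)  # b'\x97'

# Latin-1 has no em dash at all, so writing it out under Latin-1 fails.
try:
    EM_DASH.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```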
> > (I'm trying to get Windows-1252 to look readable with
> >
> > (progn
> > (standard-display-ascii ?\x80 (ucs-to-char #x20AC)) ;; EURO SIGN
> > (standard-display-ascii ?\x82 (ucs-to-char #x201A)) ;; SINGLE LOW-9 QUOTATION MARK
> > ...)
> >
> > but it's not working for me. Is the stuff in disp-table.el non-Mule only?)
>
> The problem here is that both of the ucs-to-char calls return nil.
That was _part_ of the problem; my ellipsis above was intended to mean that
things had been elided. Here's the full progn:
(progn
(standard-display-ascii ?\x80 (ucs-to-char #x20AC)) ;; EURO SIGN
(standard-display-ascii ?\x82 (ucs-to-char #x201A)) ;; SINGLE LOW-9 QUOTATION MARK
(standard-display-ascii ?\x83 (ucs-to-char #x0192)) ;; LATIN SMALL LETTER F WITH HOOK
(standard-display-ascii ?\x84 (ucs-to-char #x201E)) ;; DOUBLE LOW-9 QUOTATION MARK
(standard-display-ascii ?\x85 (ucs-to-char #x2026)) ;; HORIZONTAL ELLIPSIS
(standard-display-ascii ?\x86 (ucs-to-char #x2020)) ;; DAGGER
(standard-display-ascii ?\x87 (ucs-to-char #x2021)) ;; DOUBLE DAGGER
(standard-display-ascii ?\x88 (ucs-to-char #x02C6)) ;; MODIFIER LETTER CIRCUMFLEX ACCENT
(standard-display-ascii ?\x89 (ucs-to-char #x2030)) ;; PER MILLE SIGN
(standard-display-ascii ?\x8A (ucs-to-char #x0160)) ;; LATIN CAPITAL LETTER S WITH CARON
(standard-display-ascii ?\x8B (ucs-to-char #x2039)) ;; SINGLE LEFT-POINTING ANGLE QUOTATION MARK
(standard-display-ascii ?\x8C (ucs-to-char #x0152)) ;; LATIN CAPITAL LIGATURE OE
(standard-display-ascii ?\x8E (ucs-to-char #x017D)) ;; LATIN CAPITAL LETTER Z WITH CARON
(standard-display-ascii ?\x91 (ucs-to-char #x2018)) ;; LEFT SINGLE QUOTATION MARK
(standard-display-ascii ?\x92 (ucs-to-char #x2019)) ;; RIGHT SINGLE QUOTATION MARK
(standard-display-ascii ?\x93 (ucs-to-char #x201C)) ;; LEFT DOUBLE QUOTATION MARK
(standard-display-ascii ?\x94 (ucs-to-char #x201D)) ;; RIGHT DOUBLE QUOTATION MARK
(standard-display-ascii ?\x95 (ucs-to-char #x2022)) ;; BULLET
(standard-display-ascii ?\x96 (ucs-to-char #x2013)) ;; EN DASH
(standard-display-ascii ?\x97 (ucs-to-char #x2014)) ;; EM DASH
(standard-display-ascii ?\x98 (ucs-to-char #x02DC)) ;; SMALL TILDE
(standard-display-ascii ?\x99 (ucs-to-char #x2122)) ;; TRADE MARK SIGN
(standard-display-ascii ?\x9A (ucs-to-char #x0161)) ;; LATIN SMALL LETTER S WITH CARON
(standard-display-ascii ?\x9B (ucs-to-char #x203A)) ;; SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
(standard-display-ascii ?\x9C (ucs-to-char #x0153)) ;; LATIN SMALL LIGATURE OE
(standard-display-ascii ?\x9E (ucs-to-char #x017E)) ;; LATIN SMALL LETTER Z WITH CARON
(standard-display-ascii ?\x9F (ucs-to-char #x0178)) ;; LATIN CAPITAL LETTER Y WITH DIAERESIS
)
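As a cross-check on that table, the same byte-to-code-point pairs can be derived mechanically. The sketch below uses Python's built-in cp1252 codec, not anything from XEmacs or Mule-UCS; the helper name cp1252_c1_table is made up for the example, and I'm assuming Python's codec matches the Windows-1252 table I typed in by hand.

```python
# Hypothetical helper, not part of XEmacs or Mule-UCS: derive the
# Windows-1252 mappings for bytes 0x80-0x9F from Python's cp1252 codec.
def cp1252_c1_table():
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = ord(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Byte is unassigned in Windows-1252; leave it out.
            pass
    return table

table = cp1252_c1_table()
unassigned = sorted(set(range(0x80, 0xA0)) - set(table))

# Print each assigned mapping in the same shape as the progn above.
for byte, code_point in sorted(table.items()):
    print("(standard-display-ascii ?\\x%02X (ucs-to-char #x%04X))"
          % (byte, code_point))
```

The bytes that end up in unassigned are the ones with no entry in the progn, which is why the list there has gaps.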
The ucs-to-char call succeeds for many of those (#x2018, #x2014 etc.), but the
corresponding characters still display as control characters for me.
Ah, but this is probably because I'm in a TTY, my terminal-coding-system has
been set to 'utf-8, and the conversion to UTF-8 means modifications to the
display-table are something I never see. That makes sense.
> If I replace (ucs-to-char #x20AC) with a literal Euro sign (AltGr+4 on
> XFree86 4.3.0 with a UK keyboard), it works, i.e.:
>
> (standard-display-ascii ?\x80 ?\<euro-sign>)
>
> [where <euro-sign> is a literal Euro sign; VM doesn't seem to handle
> it correctly.]
It wouldn't. I've got a patch to almost make it do so, but the news server I
posted it to (as well as to bug-vm) evidently didn't pass it to the wider
world, so I can't give you a URI to it now.
Eh, I can, though. http://parhasard.net/vm-mime-western-europe.diff . This
fails for the Euro sign on 21.4, though, because of the issue in the next
paragraph.
> I'm not sure why ucs-to-char #x20AC fails; mule-ucs has the
> appropriate entry in reldata/uiso8859-15.el.
Within Mule-UCS, unicode-basic-translation-charset-order-list is
(ascii latin-iso8859-1 latin-iso8859-2 latin-iso8859-3 latin-iso8859-4
cyrillic-iso8859-5 greek-iso8859-7 hebrew-iso8859-8 latin-iso8859-9
latin-iso8859-14 latin-iso8859-15 ipa japanese-jisx0208 japanese-jisx0212
chinese-gb2312 chinese-cns11643-1 chinese-cns11643-2 chinese-cns11643-3
chinese-cns11643-4 chinese-cns11643-5 chinese-cns11643-6 chinese-cns11643-7
chinese-big5-1 chinese-big5-2 korean-ksc5601 latin-jisx0201
katakana-jisx0201 thai-tis620 ethiopic vietnamese-viscii-lower
vietnamese-viscii-upper)
(encode-coding-string (make-char 'hebrew-iso8859-8 #xA1) 'utf-8) fails, as does
trying to encode from any charset after hebrew in that list. I haven't
figured out why, yet.
> For the reverse, e.g.:
>
> (standard-display-ascii <euro-sign> "EUR")
>
> you first have to enlarge the display table (make-display-table
> creates a 256-element vector), e.g.
>
> (add-spec-to-specifier current-display-table
> (make-vector 10000 nil)
> 'global
> nil
> 'remove-locale)
>
> Then, the above standard-display-ascii call works.
Okay. So on Mule, with the display table extended as above, and without
a terminal-coding-system in force,
(standard-display-ascii (ucs-to-char #x2014) "--")
will display em dash as two minuses?
(The display-table may have to be enlarged a little further; (char-int
(ucs-to-char #x2014)) gives me 102583.)
--
Like the early Christians, Marx expected the millennium very soon; like
their successors, his have been disappointed--once more, the world has shown
itself recalcitrant to a tidy formula embodying the hopes of some section of
mankind. (Russell)