goofy charset selection for Unicode pastes

Aidan Kehoe kehoea at parhasard.net
Mon Aug 23 13:27:15 EDT 2004


[This lacks a bit of context at present; I imagine Glynn's original will get
de-spam-trapped presently.]

 On the twenty-third of August, Glynn Clements wrote: 

 > The problems with that are:
 > 
 > 1. It requires Mule.

[...]

Oh, and it's a stupid hack, too :-). That's a problem. 

Personally, I'm not interested in non-Mule any more. I need to use Unicode, and
Mule-UCS, while broken, gives me better Unicode than non-Mule does. So I'm not
using non-Mule, end of story. Maybe that disqualifies me from talking about it.

 > 2. While it would display correctly, you would still end up with a
 > Chinese character in the buffer, which will do the wrong thing if you
 > save the buffer, copy the data from the buffer, send it to a process
 > or network stream etc.

Eh. You'd end up with an em dash in the buffer, which is only a Chinese
character if that's what the coding system you're using thinks it is. If you're
using UTF-8, for example, it's a Western character. 

 > > (I'm trying to get Windows-1252 to look readable with 
 > > 
 > >  (progn 
 > >   (standard-display-ascii ?\x80	(ucs-to-char #x20AC))	;; EURO SIGN
 > >   (standard-display-ascii ?\x82	(ucs-to-char #x201A))	;; SINGLE LOW-9 QUOTATION MARK
 > >   ...) 
 > > 
 > > but it's not working for me. Is the stuff in disp-table.el non-Mule only?)
 > 
 > The problem here is that both of the ucs-to-char calls return nil. 

That was _part_ of the problem; my ellipsis above was intended to mean that
things had been elided. Here's the full progn:

(progn
  (standard-display-ascii ?\x80	(ucs-to-char #x20AC))	;; EURO SIGN
  (standard-display-ascii ?\x82	(ucs-to-char #x201A))	;; SINGLE LOW-9 QUOTATION MARK
  (standard-display-ascii ?\x83	(ucs-to-char #x0192))	;; LATIN SMALL LETTER F WITH HOOK
  (standard-display-ascii ?\x84	(ucs-to-char #x201E))	;; DOUBLE LOW-9 QUOTATION MARK
  (standard-display-ascii ?\x85	(ucs-to-char #x2026))	;; HORIZONTAL ELLIPSIS
  (standard-display-ascii ?\x86	(ucs-to-char #x2020))	;; DAGGER
  (standard-display-ascii ?\x87	(ucs-to-char #x2021))	;; DOUBLE DAGGER
  (standard-display-ascii ?\x88	(ucs-to-char #x02C6))	;; MODIFIER LETTER CIRCUMFLEX ACCENT
  (standard-display-ascii ?\x89	(ucs-to-char #x2030))	;; PER MILLE SIGN
  (standard-display-ascii ?\x8A	(ucs-to-char #x0160))	;; LATIN CAPITAL LETTER S WITH CARON
  (standard-display-ascii ?\x8B	(ucs-to-char #x2039))	;; SINGLE LEFT-POINTING ANGLE QUOTATION MARK
  (standard-display-ascii ?\x8C	(ucs-to-char #x0152))	;; LATIN CAPITAL LIGATURE OE
  (standard-display-ascii ?\x8E	(ucs-to-char #x017D))	;; LATIN CAPITAL LETTER Z WITH CARON
  (standard-display-ascii ?\x91	(ucs-to-char #x2018))	;; LEFT SINGLE QUOTATION MARK
  (standard-display-ascii ?\x92	(ucs-to-char #x2019))	;; RIGHT SINGLE QUOTATION MARK
  (standard-display-ascii ?\x93	(ucs-to-char #x201C))	;; LEFT DOUBLE QUOTATION MARK
  (standard-display-ascii ?\x94	(ucs-to-char #x201D))	;; RIGHT DOUBLE QUOTATION MARK
  (standard-display-ascii ?\x95	(ucs-to-char #x2022))	;; BULLET
  (standard-display-ascii ?\x96	(ucs-to-char #x2013))	;; EN DASH
  (standard-display-ascii ?\x97	(ucs-to-char #x2014))	;; EM DASH
  (standard-display-ascii ?\x98	(ucs-to-char #x02DC))	;; SMALL TILDE
  (standard-display-ascii ?\x99	(ucs-to-char #x2122))	;; TRADE MARK SIGN
  (standard-display-ascii ?\x9A	(ucs-to-char #x0161))	;; LATIN SMALL LETTER S WITH CARON
  (standard-display-ascii ?\x9B	(ucs-to-char #x203A))	;; SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
  (standard-display-ascii ?\x9C	(ucs-to-char #x0153))	;; LATIN SMALL LIGATURE OE
  (standard-display-ascii ?\x9E	(ucs-to-char #x017E))	;; LATIN SMALL LETTER Z WITH CARON
  (standard-display-ascii ?\x9F	(ucs-to-char #x0178))	;; LATIN CAPITAL LETTER Y WITH DIAERESIS
)

The ucs-to-char call succeeds for many of those (#x2018, #x2014 etc.), but the
corresponding characters still display as control characters for me. 

Ah, but this is probably because I'm in a TTY, my terminal-coding-system has
been set to 'utf-8, and the conversion to UTF-8 means modifications to the
display-table are something I never see. That makes sense. 
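
A minimal sketch, untested, of the same progn driven from a table, so that a
nil return from ucs-to-char is skipped and reported instead of being handed to
standard-display-ascii. The names cp1252-map and failed are made up for
illustration; only standard-display-ascii and ucs-to-char come from this
thread. On a TTY with terminal-coding-system set, the entries would presumably
still go unseen, for the reason given above.

```elisp
;; Sketch only: a few rows of the Windows-1252 table from the progn above,
;; as (BYTE-CHAR . UCS-CODE-POINT) pairs.  The other rows go the same way.
(let ((cp1252-map '((?\x80 . #x20AC)    ; EURO SIGN
                    (?\x82 . #x201A)    ; SINGLE LOW-9 QUOTATION MARK
                    (?\x91 . #x2018)    ; LEFT SINGLE QUOTATION MARK
                    (?\x92 . #x2019)    ; RIGHT SINGLE QUOTATION MARK
                    (?\x97 . #x2014)))  ; EM DASH
      (failed nil))
  (while cp1252-map
    (let* ((entry (car cp1252-map))
           (ch (ucs-to-char (cdr entry))))
      (if ch
          ;; Same call as in the progn: display the byte as character CH.
          (standard-display-ascii (car entry) ch)
        ;; ucs-to-char returned nil; remember the code point it rejected.
        (setq failed (cons (cdr entry) failed))))
    (setq cp1252-map (cdr cp1252-map)))
  ;; The value of the form is the list of code points that could not be
  ;; converted, in table order.
  (nreverse failed))
```

Evaluating the form returns the code points for which ucs-to-char gave nil,
which makes it easier to tell a conversion failure from a display problem.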

 > If I replace (ucs-to-char #x20AC) with a literal Euro sign (AltGr+4 on
 > XFree86 4.3.0 with a UK keyboard), it works, i.e.:
 >
 > 	(standard-display-ascii ?\x80 ?\<euro-sign>)
 > 
 > [where <euro-sign> is a literal Euro sign; VM doesn't seem to handle
 > it correctly.]

It wouldn't. I've got a patch to almost make it do so, but the news server I
posted it to (as well as to bug-vm) evidently didn't pass it to the wider
world, so I can't give you a URI to it now. 

Eh, I can, though. http://parhasard.net/vm-mime-western-europe.diff . This
fails for the Euro sign on 21.4, though, because of the issue in the next
paragraph.

 > I'm not sure why ucs-to-char #x20AC fails; mule-ucs has the
 > appropriate entry in reldata/uiso8859-15.el. 

Within Mule-UCS, unicode-basic-translation-charset-order-list is

(ascii latin-iso8859-1 latin-iso8859-2 latin-iso8859-3 latin-iso8859-4
cyrillic-iso8859-5 greek-iso8859-7 hebrew-iso8859-8 latin-iso8859-9
latin-iso8859-14 latin-iso8859-15 ipa japanese-jisx0208 japanese-jisx0212
chinese-gb2312 chinese-cns11643-1 chinese-cns11643-2 chinese-cns11643-3
chinese-cns11643-4 chinese-cns11643-5 chinese-cns11643-6 chinese-cns11643-7
chinese-big5-1 chinese-big5-2 korean-ksc5601 latin-jisx0201
katakana-jisx0201 thai-tis620 ethiopic vietnamese-viscii-lower
vietnamese-viscii-upper)

(encode-coding-string (make-char 'hebrew-iso8859-8 #xA1) 'utf-8) fails, as does
trying to encode from any charset after hebrew-iso8859-8 in that list. I haven't
figured out why yet.

 > For the reverse, e.g.:
 > 
 > 	(standard-display-ascii <euro-sign> "EUR")
 > 
 > you first have to enlarge the display table (make-display-table
 > creates a 256-element vector), e.g.
 > 
 > 	(add-spec-to-specifier current-display-table
 > 			       (make-vector 10000 nil)
 > 			       'global
 > 			       nil
 > 			       'remove-locale)
 > 
 > Then, the above standard-display-ascii call works.

Okay. So on Mule, with the display table extended as above, and without a
terminal-coding-system in force,

(standard-display-ascii (ucs-to-char #x2014) "--") 

will display em dash as two minuses? 

(The display-table may have to be enlarged a little further; (char-int
(ucs-to-char #x2014)) gives me 102583.)
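
A rough sketch of sizing the table from the character itself instead of a
fixed 10000. It is untested and assumes a display table that large is accepted
at all. It reuses Glynn's add-spec-to-specifier call unchanged apart from the
vector length; em-dash is just a local name for illustration.

```elisp
;; Sketch only: make the display table long enough to index the em dash,
;; then map the em dash to two minus signs.
(let ((em-dash (ucs-to-char #x2014)))
  ;; ucs-to-char can return nil, so do nothing in that case.
  (when em-dash
    (add-spec-to-specifier current-display-table
                           ;; One slot past the character's integer value.
                           (make-vector (1+ (char-int em-dash)) nil)
                           'global
                           nil
                           'remove-locale)
    (standard-display-ascii em-dash "--")))
```

Since the call installs a fresh vector as the global spec, any entries set
earlier would presumably need to be re-applied afterwards.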

-- 
Like the early Christians, Marx expected the millennium very soon; like
their successors, his have been disappointed--once more, the world has shown
itself recalcitrant to a tidy formula embodying the hopes of some section of
mankind. (Russell)



