[COMMIT] Generally make the language environments and coding systems a little more sane.

Aidan Kehoe kehoea at parhasard.net
Tue Aug 28 04:51:08 EDT 2007


 On the twenty-fifth day of August, Aidan Kehoe wrote: 

 > [...]
 >
 > +  ;; Make them available to user code.
 > +  (defvar unicode-error-sequence-zero
 > +    (aref (decode-coding-string "\xd8\x00\x01\x00" 'utf-16-be) 3)
 > +    "The XEmacs character representing an invalid zero octet in Unicode.
 > +
 > +Subtract this character from each XEmacs character in an invalid sequence to
 > +get the octet on disk. E.g.
 > +
 > +\(- (aref (decode-coding-string ?\\x80 'utf-8) 0)
 > +   unicode-error-characters-zero)
 > +=> ?\\x80
 > +
 > +You can search for invalid sequences using
 > +`unicode-error-sequence-regexp-range', which see.  ")
 > [...]

Note to everyone: the subtraction described in this docstring doesn’t work,
since the integer values of the relevant characters are not numerically
contiguous. For example, on my build, 

  (decode-coding-string "\xD0" 'utf-8)

gives (U+2000D0 jit-ucs-charset-0 35 32), numerical value 1069472, while 

  (decode-coding-string "\xCF" 'utf-8)

gives (U+2000CF jit-ucs-charset-0 34 127), numerical value 1069439, 33
numeric values apart instead of one. 
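
The gap of 33 is what you would expect if each of the two position octets
occupies its own 7-bit field in the character's integer value. That layout is
an assumption here, inferred from the figures above rather than taken from
the XEmacs sources, and my-position-offset is a made-up helper name. A small
sketch of the arithmetic:

```elisp
;; Sketch only: the 7-bit field layout is an assumption inferred from the
;; values quoted in this message.
(defun my-position-offset (octet1 octet2)
  "Hypothetical helper: pack two charset position octets into 7-bit fields."
  (+ (* octet1 128) octet2))

;; Distance between the characters decoded from "\xD0" and "\xCF".
(- (my-position-offset 35 32)
   (my-position-offset 34 127))
;; => 33

;; Both quoted integer values leave the same remainder once the position
;; offset is removed, which is consistent with the assumed layout.
(- 1069472 (my-position-offset 35 32))
;; => 1064960
(- 1069439 (my-position-offset 34 127))
;; => 1064960
```

Stepping from position (34 127) to (35 32) moves to the next character in the
charset, but the integer value skips the unused codes in between, so a plain
subtraction cannot recover the octet.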

The approach I intend to take to fix this is to create a char table mapping
from the error octets to characters with the on-disk values, and advise
using this char table with translate-region to get the octets. 
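
A rough, untested sketch of that char table idea follows. It assumes that
make-char-table, put-char-table and int-char behave as documented for XEmacs
21.5, and that every octet from #x80 to #xFF, decoded on its own as UTF-8,
yields exactly one error character (this message only shows that for \xD0 and
\xCF). The name my-error-octet-table is hypothetical and not part of the
patch.

```elisp
;; Hypothetical name; not part of the patch.
(defvar my-error-octet-table
  (let ((table (make-char-table 'generic))
        (octet #x80))
    (while (<= octet #xFF)
      ;; Key: the error character produced by decoding this lone octet.
      ;; Value: the character whose integer value is the octet itself.
      (put-char-table
       (aref (decode-coding-string (string (int-char octet)) 'utf-8) 0)
       (int-char octet)
       table)
      (setq octet (1+ octet)))
    table)
  "Sketch: map Unicode error characters to characters with the on-disk values.")

;; Possible usage on a region known to contain an invalid sequence:
;; (translate-region start end my-error-octet-table)
```

Because the table is built by decoding each octet, it does not depend on how
the error characters happen to be numbered in a given build.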

-- 
On the quay of the little Black Sea port, where the rescued pair came once
more into contact with civilization, Dobrinton was bitten by a dog which was
assumed to be mad, though it may only have been indiscriminating. (Saki)


