[COMMIT] Generally make the language environments and coding
systems a little more sane.
Aidan Kehoe
kehoea at parhasard.net
Tue Aug 28 04:51:08 EDT 2007
On the twenty-fifth of August, Aidan Kehoe wrote:
> [...]
>
> + ;; Make them available to user code.
> + (defvar unicode-error-sequence-zero
> + (aref (decode-coding-string "\xd8\x00\x01\x00" 'utf-16-be) 3)
> + "The XEmacs character representing an invalid zero octet in Unicode.
> +
> +Subtract this character from each XEmacs character in an invalid sequence to
> +get the octet on disk. E.g.
> +
> +\(- (aref (decode-coding-string ?\\x80 'utf-8) 0)
> + unicode-error-characters-zero)
> +=> ?\\x80
> +
> +You can search for invalid sequences using
> +`unicode-error-sequence-regexp-range', which see. ")
> [...]
Note to everyone: the subtraction described in this docstring doesn’t work,
since the integer values of the relevant characters are not numerically
contiguous. For example, for me on this build,
(decode-coding-string "\xD0" 'utf-8)
gives (U+2000D0 jit-ucs-charset-0 35 32), numerical value 1069472, while
(decode-coding-string "\xCF" 'utf-8)
gives (U+2000CF jit-ucs-charset-0 34 127), numerical value 1069439. The two
values are 33 apart instead of one.
To fix this, I intend to create a char table mapping each error character to
the character whose value is the corresponding on-disk octet, and to advise
using that char table with translate-region to recover the octets.
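A minimal sketch of what such a char table might look like, not the committed
implementation. It assumes an XEmacs 21.5 build where each lone octet in the
range #x80-#xFF decodes under utf-8 to a single error character, as in the
examples above. The names my-unicode-error-octet-table and
my-unicode-error-chars-to-octets are hypothetical and exist only for
illustration.

```elisp
;; Sketch only: hypothetical names, XEmacs-specific functions.
;; Assumes every lone octet #x80..#xFF decodes to exactly one error character.
(defvar my-unicode-error-octet-table
  (let ((table (make-char-table 'generic))
        (octet #x80))
    (while (<= octet #xFF)
      ;; Key: the error character produced by decoding the lone octet.
      ;; Value: the character whose integer value is the on-disk octet.
      (put-char-table
       (aref (decode-coding-string (string (int-char octet)) 'utf-8) 0)
       (int-char octet)
       table)
      (setq octet (1+ octet)))
    table)
  "Hypothetical char table mapping error characters to on-disk octets.")

(defun my-unicode-error-chars-to-octets (start end)
  "Replace error characters between START and END with their on-disk octets.
Characters with no entry in the table are expected to be left alone."
  (translate-region start end my-unicode-error-octet-table))
```

Because the table is keyed on the error characters themselves, the lookup does
not depend on their integer values being contiguous, which is the problem the
subtraction ran into.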
--
On the quay of the little Black Sea port, where the rescued pair came once
more into contact with civilization, Dobrinton was bitten by a dog which was
assumed to be mad, though it may only have been indiscriminating. (Saki)