unicode internal

Hrvoje Niksic hniksic at xemacs.org
Wed Oct 5 06:23:33 EDT 2005


Ben Wing <ben at 666.com> writes:

> -- [1] at some point, use extent properties to track the language of a
> text.  this is well-recognized.

I'm a bit hazy on the concept of tracking "language".  How is that
supposed to work, exactly?  I mean, a word processor can do it because
it has a chance to save its markup when saving the document.  Emacs
works, in most cases, with bare characters, or with charset (not
language) annotations, as is the case with coding cookies or with Gnus
processing MIME messages.

> -- [4] the perl regexp \p syntax should be adopted for referencing
> charsets. (char categories just suck.) for that matter, we should move
> in the direction of being as perl-compatible as possible with our
> regexps, since that is where the world is going. (cf java, python,
> ruby, c#, ...)

It's true that the world is moving to Perl-compatible regexps.  Note,
however, that everyone chooses a subset they like -- implementing the
whole thing is next to impossible.  Also note that Perl itself is
moving *away* from Perl regexps: see Apocalypse 5, the Perl 6 design
document on pattern matching.

> the big problem here is \( and (, which are backwards.  the only
> reasonable solutions i can see are [a] a global variable to control
> which kinds of regexps are used; [b] a double set of all functions
> that take regexps.  comments?

The problem with [a] is that library functions can and do use regexps,
and setting the variable to something they don't expect will break
them.  This is already the case with case-fold-search, but that one is
well-known to library authors.  A new variable would break huge
amounts of existing code that was written without it in mind.
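
To make the failure mode concrete, here is a minimal sketch, written in
Python rather than Lisp so that it runs standalone.  Everything in it
is hypothetical: CONFIG["regexp_syntax"] stands in for the proposed
global variable, to_python handles only the \( \) \| swap, and
library_first_word plays the part of existing library code that was
written for today's syntax.

```python
import re

# Hypothetical stand-in for the proposed global syntax switch.
CONFIG = {"regexp_syntax": "emacs"}


def to_python(pattern):
    # Translate a tiny subset of Emacs syntax to Python syntax:
    # \( \) \| become ( ) |, and bare ( ) | become literals.
    # In "perl" mode the pattern is passed through untouched.
    if CONFIG["regexp_syntax"] == "perl":
        return pattern
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            out.append(nxt if nxt in "()|" else ch + nxt)
            i += 2
        else:
            out.append("\\" + ch if ch in "()|" else ch)
            i += 1
    return "".join(out)


def search(pattern, text):
    # Every regexp function consults the global, as in option [a].
    return re.search(to_python(pattern), text)


def library_first_word(text):
    # Library code written for Emacs syntax: \( ... \) is a group.
    m = search(r"\([a-z]+\)", text)
    return m.group(1) if m else None
```

With the default setting, library_first_word("hello world") returns
"hello".  Once a caller sets CONFIG["regexp_syntax"] to "perl", the
same call returns None, because the library's \( is now read as a
literal parenthesis.  The library did nothing wrong and gets no error.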

I agree with Stephen that The Right Thing would be to expose "compiled
regexps" to Lisp.  Python's "re" module provides an example of how
this can be done.
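
For reference, a short sketch of what that looks like on the Python
side.  The pattern, the FROM_LINE name and the sender helper are made
up for illustration; re.compile, the flag constants and the search
method are the real "re" API.  The point is that the options are fixed
when the object is created and travel with it, so no global setting
can change how it matches later.

```python
import re

# Syntax options are chosen once, at compile time, and are stored
# in the compiled object itself.
FROM_LINE = re.compile(r"^from: (\S+)", re.IGNORECASE | re.MULTILINE)


def sender(message):
    # The compiled object is an ordinary value: it can be kept in a
    # variable, passed around, and used without any global state.
    m = FROM_LINE.search(message)
    return m.group(1) if m else None


print(sender("Subject: test\nFROM: someone@example.org\n"))
# prints "someone@example.org"
```

A Lisp equivalent would presumably let a caller ask for either syntax
when compiling, which avoids both the global variable and the doubled
set of functions.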



