lucene-dev mailing list archives

From Yonik Seeley <>
Subject Re: "Advanced" query language
Date Tue, 06 Dec 2005 16:03:45 GMT
On 12/6/05, Erik Hatcher <> wrote:
> > example:  <tag>&#0;</tag> is not valid XML
> Can you give an example of a query that needs binary information?

It's never an absolute need - one could always work around the
problem, for sure.  The issue was more a desire to be able to
represent everything that *currently* works in lucene (as far as
queries go).

- hacking the bits of numerics directly into chunks (7 or 15 bits for example)
  (I actually do this)
- representing separation of values or sentences with a null byte

Previously, all I had to watch out for was UTF-16 surrogates: as long
as I stayed below 0xD800, everything worked fine.
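
To make the 15-bit chunking concrete, here is a minimal sketch. The class name ChunkedInt and the 2+15+15 bit layout are assumptions for illustration, not anything in Lucene. Every resulting char stays below 0x8000, so it never collides with the surrogate range, but a chunk of all zero bits becomes the char 0.

```java
// Hypothetical helper (not part of Lucene): packs a 32-bit int into
// three chars carrying 2 + 15 + 15 bits. Each char is at most 0x7FFF,
// so none of them can fall in the surrogate range that starts at 0xD800.
final class ChunkedInt {

    static String encode(int value) {
        char[] out = new char[3];
        out[0] = (char) (value >>> 30);             // top 2 bits
        out[1] = (char) ((value >>> 15) & 0x7FFF);  // middle 15 bits
        out[2] = (char) (value & 0x7FFF);           // low 15 bits
        return new String(out);
    }

    static int decode(String s) {
        return (s.charAt(0) << 30) | (s.charAt(1) << 15) | s.charAt(2);
    }
}
```

With this layout, ChunkedInt.encode(0) is three NUL chars, which is exactly the kind of term text that cannot be written directly into an XML document.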

> Also I'd be curious to see a problem with Unicode code points in XML,
> if you have one handy.

The definition of valid XML 1.0 characters:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

The simplest example is code point 0.  It's a valid Unicode character,
but it's not a valid XML character (even when you replace it with a
numeric character reference).
Example: <tag>NullTerminated&#0;</tag>  is not valid XML
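
The character ranges quoted above translate directly into a check. This is a sketch only; the class and method names (XmlChars, isValidXml10) are made up for illustration.

```java
// Hypothetical helper: tests code points against the XML 1.0 Char ranges
// quoted above (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF).
final class XmlChars {

    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Walks the string by code point, so a well-formed surrogate pair is
    // judged as one supplementary character; a lone surrogate is rejected.
    static boolean isValidXml10(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isValidXml10(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

Under this check, the term text "NullTerminated" followed by a NUL char is rejected, while the same text without the trailing NUL passes.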

> (must register to see the full article, unfortunately)
> I'm confident that XML can accommodate our needs just fine, and any
> other text transmission would have to re-solve many issues that XML
> has already solved.

Agreed.  It wasn't a blocker, but it was something I wanted to see
tackled up front.  It means adding a little more application logic to
handle escaping/unescaping.
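
One possible shape for that extra logic, purely as a sketch: the backslash-based escape format and the TermEscaper name are assumptions, not an agreed syntax. Term text is escaped before being placed in the XML and unescaped after parsing.

```java
// Hypothetical application-level escaping (not a Lucene API).
// A backslash is doubled; any char that is unsafe to carry in XML text is
// written as a backslash, the letter u, and four uppercase hex digits.
final class TermEscaper {

    // Control chars below 0x20 are all treated as unsafe here, including
    // tab, CR and LF, because XML parsers may normalize that whitespace.
    private static boolean isXmlSafe(char c) {
        return (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD);
    }

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                sb.append("\\\\");
            } else if (isXmlSafe(c)) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    // Assumes well-formed input produced by escape(); malformed input
    // will throw an exception rather than being repaired.
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                sb.append(c);
            } else if (s.charAt(i + 1) == '\\') {
                sb.append('\\');
                i += 1;  // skip the second backslash
            } else {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 5;  // skip the u and the four hex digits
            }
        }
        return sb.toString();
    }
}
```

Surrogates are escaped one char at a time, so supplementary characters round-trip as two escapes. The cost is that every producer and consumer of the XML has to agree on this convention.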

The bottom line is I want to be able to represent the perfectly valid
lucene query new TermQuery(new Term("field","\u0000")).

