lucene-java-user mailing list archives

From Trejkaz <trej...@trypticon.org>
Subject Re: Dubious error message?
Date Fri, 05 Aug 2016 06:13:21 GMT
On Fri, Aug 5, 2016 at 2:51 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> Question 2: Not that I know of
>
> Question 2.1. It's actually pretty difficult to understand why a single _term_
> can be over 32K and still make sense. This is not to say that a
> single _text_ field can't be over 32K, each term within that field
> is (usually) much less than that.
>
> Do you have a real-world use-case where you have a 115K term
> that can _only_ be matched by searching for exactly that
> sequence of 115K characters? Not substrings. Not wildcards. A
> "string" type (as opposed to anything based on solr.Textfield).

This particular field is used to store unique addresses. For precision
reasons we wanted to search for addresses without tokenising them: if
you tokenise them, bob@example.com can accidentally match
bob@example.com.au, even though they're two different people. Leaving
the field untokenised also makes statistics faster to calculate.
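
To make the false-match concern concrete, here is a minimal sketch in
plain Java with no Lucene dependency. The tokens() helper is a
hypothetical stand-in for a tokenising analyzer (lowercase, split on
anything that is not a letter or digit); which tokens a real Lucene
analyzer produces depends on the analyzer, so treat this only as an
illustration of the idea.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class AddressMatch {
    // Hypothetical stand-in for a tokenising analyzer:
    // lowercase, then split on every non-alphanumeric character.
    static Set<String> tokens(String s) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Tokenised matching: every token of the query occurs in the document.
    static boolean tokenisedMatch(String query, String doc) {
        return tokens(doc).containsAll(tokens(query));
    }

    // Untokenised matching: the whole value is a single term.
    static boolean exactMatch(String query, String doc) {
        return query.equals(doc);
    }

    public static void main(String[] args) {
        String query = "bob@example.com";
        String doc = "bob@example.com.au";
        System.out.println("tokenised: " + tokenisedMatch(query, doc)); // true
        System.out.println("exact: " + exactMatch(query, doc));         // false
    }
}
```

Under this stand-in the query tokens (bob, example, com) are all
present in the longer address, so the tokenised comparison reports a
match while the whole-string comparison does not.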

Now, addresses in SMTP email are fairly short, limited to something
like 254 characters, but sometimes you get data that violates the
standard. We also store more than just that one kind of address, and
one of the other kinds may well be longer.

In this situation, it isn't clear whether you can truncate the data,
because once you truncate it, two addresses that are not the same
string are considered equal. Then again, if the old version of Lucene
was already truncating it, people might be fine with it being
truncated in the new version. But if they didn't know that was
happening, there would definitely be someone who objects.
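
The truncation problem can be sketched in plain Java as follows. The
MAX_TERM_BYTES constant is assumed to mirror Lucene's term limit of
32766 UTF-8 bytes, and the class, the helper names and the idea of
substituting a SHA-256 digest for over-long values are all
illustrative assumptions on my part, not something Lucene does for
you and not what our code currently does.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class LongTermKey {
    // Assumed to mirror Lucene's maximum term length, in UTF-8 bytes.
    static final int MAX_TERM_BYTES = 32766;

    // True if the value is short enough to be indexed as a single term.
    static boolean fits(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_BYTES;
    }

    // Naive truncation: distinct values sharing a prefix become equal.
    // (May split a surrogate pair; acceptable for this sketch.)
    static String truncate(String value, int maxChars) {
        return value.length() <= maxChars ? value : value.substring(0, maxChars);
    }

    // Hypothetical alternative: keep short values as-is and replace
    // over-long ones with a digest, so distinct inputs stay distinct
    // (up to hash collisions).
    static String indexKey(String value) {
        if (fits(value)) {
            return value;
        }
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder("sha256:");
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Helper for building long test values.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String a = repeat('a', 115000) + "x";
        String b = repeat('a', 115000) + "y";
        System.out.println(truncate(a, 1000).equals(truncate(b, 1000)));
        System.out.println(indexKey(a).equals(indexKey(b)));
    }
}
```

Two 115K values that differ only in their last character compare
equal after truncation but get different digest keys. The trade-off is
that a digested value can only be found by exact lookup through the
same indexKey() step, and the original string has to be kept somewhere
else if it needs to be displayed.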

So I'm not really saying that the term "makes sense" - I'm just saying
we encountered it in real-world data, and an error occurred. Someone
then complained about the error.

> As far as the error message is concerned, that does seem somewhat opaque.
> Care to raise a JIRA on it (and, if you're really ambitious attach a patch)?

I'll see. :)

TX
