lucene-solr-user mailing list archives

From Benson Margulies <bimargul...@gmail.com>
Subject Re: Tracking down the input that hits an analysis chain bug
Date Sun, 05 Jan 2014 02:24:10 GMT
I rather assumed that there was some log4j-ish config to be set that
would do this for me. Lacking one, I guess I'll end up there.
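(For reference, a minimal sketch of the kind of UpdateProcessor being discussed: it wraps processAdd() in a try/catch and logs the failing document's uniqueKey before rethrowing. The class name and the "id" field name are illustrative assumptions, not existing Solr code.)

```java
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogFailingDocProcessorFactory extends UpdateRequestProcessorFactory {
  private static final Logger log =
      LoggerFactory.getLogger(LogFailingDocProcessorFactory.class);

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        try {
          // Run the rest of the chain, including the actual indexing step
          // where the tokenizer's IllegalArgumentException is thrown.
          super.processAdd(cmd);
        } catch (RuntimeException e) {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          // Log the uniqueKey so the offending document can be identified.
          Object id = (doc == null) ? null : doc.getFieldValue("id");
          log.error("Analysis failed for document id=" + id, e);
          throw e;  // preserve the original failure behavior
        }
      }
    };
  }
}
```

(The factory would then be registered in an updateRequestProcessorChain in solrconfig.xml, ahead of RunUpdateProcessorFactory.)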

On Fri, Jan 3, 2014 at 8:23 PM, Michael Sokolov
<msokolov@safaribooksonline.com> wrote:
> Have you considered using a custom UpdateProcessor to catch the exception
> and provide more context in the logs?
>
> -Mike
>
>
> On 01/03/2014 03:33 PM, Benson Margulies wrote:
>>
>> Robert,
>>
>> Yes, if the problem was not data-dependent, indeed I wouldn't need to
>> index anything. However, I've run a small mountain of data through our
>> tokenizer on my machine, and never seen the error, but my customer
>> gets these errors in the middle of a giant spew of data. As it
>> happens, I _was_ missing that call to clearAttributes() (and the
>> usual implementation of end()), but I found and fixed that problem
>> precisely by creating a random-data test case using checkRandomData().
>> Unfortunately, fixing that didn't make the customer's errors go away.
>>
>> So I'm left needing to help them identify the data that provokes this,
>> because I've so far failed to come up with any.
>>
>> --benson
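(A minimal sketch of the random-data test described above, using the checkRandomData() helper from BaseTokenStreamTestCase; MyAnalyzer is a hypothetical placeholder for the custom analysis chain under test:)

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;

public class TestMyAnalyzer extends BaseTokenStreamTestCase {
  public void testRandomStrings() throws Exception {
    Analyzer a = new MyAnalyzer();  // hypothetical custom analyzer
    // Pushes many iterations of random text through the full chain,
    // checking offsets, clearAttributes() discipline, and end() behavior.
    checkRandomData(random(), a, 1000 * RANDOM_MULTIPLIER);
  }
}
```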
>>
>>
>> On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>> This exception comes from OffsetAttributeImpl (i.e. you don't need to
>>> index anything to reproduce it).
>>>
>>> Maybe you have a missing clearAttributes() call (your tokenizer
>>> 'returns true' without calling that first)? This could explain it, if
>>> something like a StopFilter is also present in the chain: basically
>>> the offsets overflow.
>>>
>>> The test infrastructure in BaseTokenStreamTestCase should be able to
>>> detect this as well...
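(A sketch of the contract Robert describes: clearAttributes() must be the first call on every path where incrementToken() returns true, and end() must set the final offset. The tokenizer subclass and its helper fields here are illustrative placeholders, not the actual code under discussion:)

```java
@Override
public boolean incrementToken() throws IOException {
  clearAttributes();        // reset all attributes left over from the previous token
  if (!readNextToken()) {   // hypothetical helper that fills buffer/start/length
    return false;
  }
  termAtt.copyBuffer(buffer, 0, length);
  // correctOffset() maps tokenizer offsets through any CharFilters in the chain.
  offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
  return true;
}

@Override
public void end() throws IOException {
  super.end();
  // The final offset must point past the last character consumed.
  offsetAtt.setOffset(correctOffset(finalOffset), correctOffset(finalOffset));
}
```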
>>>
>>> On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <benson@basistech.com>
>>> wrote:
>>>>
>>>> Using Solr Cloud with 4.3.1.
>>>>
>>>> We've got a problem with a tokenizer that manifests as calling
>>>> OffsetAttribute.setOffset() with invalid inputs. OK, so, we want to
>>>> figure out what input provokes our code into getting into this pickle.
>>>>
>>>> The problem happens on SolrCloud nodes.
>>>>
>>>> The problem manifests as this sort of thing:
>>>>
>>>> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>>>> SEVERE: java.lang.IllegalArgumentException: startOffset must be
>>>> non-negative, and endOffset must be >= startOffset,
>>>> startOffset=-1811581632,endOffset=-1811581632
>>>>
>>>> How could we get a document ID so that we can tell which document
>>>> was being processed?
>
>
