lucene-solr-user mailing list archives

From "Stéphane Delprat" <stephane.delp...@blogspirit.com>
Subject Re: segment gets corrupted (after background merge ?)
Date Tue, 18 Jan 2011 14:57:47 GMT
I ran more tests: when I execute checkIndex on the master I get random 
errors, but when I scp the files to another server (exactly the same 
software) no error occurs...

We will start using another server.


Just one question concerning checkIndex:

What does "tokens" mean?
How is it possible that the number of tokens changes while the files were 
not modified at all? (This is from the faulty server; on the other 
server the token counts do not change at all.)
(Solr was stopped during the whole checkIndex process.)
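As I understand CheckIndex's report (this is a sketch of how the counters relate, not Lucene's actual code): "terms" counts unique terms in the segment, "terms/docs pairs" counts each (term, document) posting once, and "tokens" sums the per-document term frequencies over those postings. The postings map below is a toy stand-in:

```java
import java.util.HashMap;
import java.util.Map;

public class PostingsStats {
    // Returns {unique terms, term/doc pairs, tokens} for a toy postings map:
    // term -> (docId -> term frequency in that doc).
    static long[] stats(Map<String, Map<Integer, Integer>> postings) {
        long terms = postings.size();      // unique terms
        long pairs = 0;                    // one per (term, doc) posting
        long tokens = 0;                   // sum of all term frequencies
        for (Map<Integer, Integer> perDoc : postings.values()) {
            pairs += perDoc.size();
            for (int freq : perDoc.values()) tokens += freq;
        }
        return new long[] { terms, pairs, tokens };
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        Map<Integer, Integer> lucene = new HashMap<>();
        lucene.put(1, 2);  // "lucene" occurs twice in doc 1
        lucene.put(3, 1);  // and once in doc 3
        postings.put("lucene", lucene);
        Map<Integer, Integer> solr = new HashMap<>();
        solr.put(1, 1);    // "solr" occurs once in doc 1
        postings.put("solr", solr);

        long[] s = stats(postings);
        System.out.println(s[0] + " terms; " + s[1]
                + " terms/docs pairs; " + s[2] + " tokens");
        // prints: 2 terms; 3 terms/docs pairs; 4 tokens
    }
}
```

Since all three numbers are pure functions of the bytes on disk, a token count that changes between runs over unchanged files does point below Lucene.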


#diff 20110118_141257_checkIndex.log 20110118_142356_checkIndex.log
15c15
<     test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 58236510 tokens]
---
>     test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 58236582 tokens]
43c43
<     test: terms, freq, prox...OK [3947589 terms; 34468256 terms/docs pairs; 36740496 tokens]
---
>     test: terms, freq, prox...OK [3947589 terms; 34468256 terms/docs pairs; 36740533 tokens]
85c85
<     test: terms, freq, prox...OK [2600874 terms; 21272098 terms/docs pairs; 10862212 tokens]
---
>     test: terms, freq, prox...OK [2600874 terms; 21272098 terms/docs pairs; 10862221 tokens]
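One way to double-check the "files were not modified" claim independently of CheckIndex is to digest every index file before and after a run; identical bytes must produce identical digests, so a mismatch on one box but not the other would point at hardware or the I/O path. A hypothetical sketch, where the temp file stands in for a real segment file under the index directory:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class IndexDigest {
    // Hex MD5 of one file; identical bytes must yield an identical digest.
    static String md5(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(Files.readAllBytes(file))) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an index file; on the real server you would digest
        // every file under the index dir before and after each CheckIndex run.
        Path f = Files.createTempFile("segment", ".bin");
        Files.write(f, "abc".getBytes("UTF-8"));
        String before = md5(f);
        String after = md5(f);  // re-read: a healthy disk returns the same bytes
        System.out.println(before.equals(after) ? "stable" : "CHANGED");
        Files.delete(f);
    }
}
```

Comparing the two digest lists from the faulty and healthy servers would show whether the bytes themselves drift or only what the faulty machine reads.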


Thanks,


On 14/01/2011 12:59, Michael McCandless wrote:
> Right, but removing a segment out from under a live IW (when you run
> CheckIndex with -fix) is deadly, because that other IW doesn't know
> you've removed the segment, and will later commit a new segment infos
> still referencing that segment.
>
> The nature of this particular exception from CheckIndex is very
> strange... I think it can only be a bug in Lucene, a bug in the JRE or
> a hardware issue (bits are flipping somewhere).
>
> I don't think an error in the IO system can cause this particular
> exception (it would cause others), because the deleted docs are loaded
> up front when SegmentReader is init'd...
>
> This is why I'd really like to see if a given corrupt index always
> hits precisely the same exception if you run CheckIndex more than
> once.
>
> Mike
>
> On Thu, Jan 13, 2011 at 10:56 PM, Lance Norskog<goksron@gmail.com>  wrote:
>> 1) CheckIndex is not supposed to change a corrupt segment, only remove it.
>> 2) Are you using local hard disks, or do you run on a common SAN or remote
>> file server? I have seen corruption errors on SANs, where existing
>> files have random changes.
>>
>> On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless
>> <lucene@mikemccandless.com>  wrote:
>>> Generally it's not safe to run CheckIndex if a writer is also open on the index.
>>>
>>> It's not safe because CheckIndex could hit FNFE's on opening files,
>>> or, if you use -fix, CheckIndex will change the index out from under
>>> your other IndexWriter (which will then cause other kinds of
>>> corruption).
>>>
>>> That said, I don't think the corruption that CheckIndex is detecting
>>> in your index would be caused by having a writer open on the index.
>>> Your first CheckIndex has a different deletes file (_phe_p3.del, with
>>> 44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
>>> 44828 deleted docs), so it must somehow have to do with that change.
>>>
>>> One question: if you have a corrupt index, and run CheckIndex on it
>>> several times in a row, does it always fail in the same way?  (Ie the
>>> same term hits the below exception).
>>>
>>> Is there any way I could get a copy of one of your corrupt cases?  I
>>> can then dig...
>>>
>>> Mike
>>>
>>> On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
>>> <stephane.delprat@blogspirit.com>  wrote:
>>>> I understand less and less what is happening to my solr.
>>>>
>>>> I did a checkIndex (without -fix) and there was an error...
>>>>
>>>> So I did another checkIndex with -fix, and then the error was gone. The
>>>> segment was alright.
>>>>
>>>>
>>>> During checkIndex I do not shut down the Solr server; I just make sure no
>>>> client connects to the server.
>>>>
>>>> Should I shut down the Solr server during checkIndex?
>>>>
>>>>
>>>>
>>>> first checkIndex :
>>>>
>>>>   4 of 17: name=_phe docCount=264148
>>>>     compound=false
>>>>     hasProx=true
>>>>     numFiles=9
>>>>     size (MB)=928.977
>>>>     diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>>> java.vendor=Sun Microsystems Inc.}
>>>>     has deletions [delFileName=_phe_p3.del]
>>>>     test: open reader.........OK [44824 deleted docs]
>>>>     test: fields..............OK [51 fields]
>>>>     test: field norms.........OK [51 fields]
>>>>     test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs
>>>> seen 0 + num docs deleted 0]
>>>> java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 +
>>>> num docs deleted 0
>>>>         at
>>>> org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>>>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>     test: stored fields.......OK [7206878 total field count; avg 32.86 fields
>>>> per doc]
>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>> vector fields per doc]
>>>> FAILED
>>>>     WARNING: fixIndex() would remove reference to this segment; full
>>>> exception:
>>>> java.lang.RuntimeException: Term Index test failed
>>>>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>
>>>>
>>>> a few minutes later:
>>>>
>>>>   4 of 18: name=_phe docCount=264148
>>>>     compound=false
>>>>     hasProx=true
>>>>     numFiles=9
>>>>     size (MB)=928.977
>>>>     diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>>> java.vendor=Sun Microsystems Inc.}
>>>>     has deletions [delFileName=_phe_p4.del]
>>>>     test: open reader.........OK [44828 deleted docs]
>>>>     test: fields..............OK [51 fields]
>>>>     test: field norms.........OK [51 fields]
>>>>     test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs;
>>>> 28919124 tokens]
>>>>     test: stored fields.......OK [7206764 total field count; avg 32.86 fields
>>>> per doc]
>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>> vector fields per doc]
>>>>
>>>>
>>>> On 12/01/2011 16:50, Michael McCandless wrote:
>>>>>
>>>>> Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted
>>>>> 0?
>>>>>
>>>>> It looks like new deletions were flushed against the segment (del file
>>>>> changed from _ncc_22s.del to _ncc_24f.del).
>>>>>
>>>>> Are you hitting any exceptions during indexing?
>>>>>
>>>>> Mike
>>>>>
>>>>> On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
>>>>> <stephane.delprat@blogspirit.com>    wrote:
>>>>>>
>>>>>> I got another corruption.
>>>>>>
>>>>>> It sure looks like it's the same type of error. (on a different field)
>>>>>>
>>>>>> It's also not linked to a merge, since the segment size did not change.
>>>>>>
>>>>>>
>>>>>> *** good segment :
>>>>>>
>>>>>>   1 of 9: name=_ncc docCount=1841685
>>>>>>     compound=false
>>>>>>     hasProx=true
>>>>>>     numFiles=9
>>>>>>     size (MB)=6,683.447
>>>>>>     diagnostics = {optimize=false, mergeFactor=10,
>>>>>> os.version=2.6.26-2-amd64,
>>>>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>>>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>>>>> java.vendor=Sun Microsystems Inc.}
>>>>>>     has deletions [delFileName=_ncc_22s.del]
>>>>>>     test: open reader.........OK [275881 deleted docs]
>>>>>>     test: fields..............OK [51 fields]
>>>>>>     test: field norms.........OK [51 fields]
>>>>>>     test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs
>>>>>> pairs;
>>>>>> 204561440 tokens]
>>>>>>     test: stored fields.......OK [45511958 total field count; avg 29.066
>>>>>> fields per doc]
>>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>>> vector fields per doc]
>>>>>>
>>>>>>
>>>>>> a few hours later:
>>>>>>
>>>>>> *** broken segment :
>>>>>>
>>>>>>   1 of 17: name=_ncc docCount=1841685
>>>>>>     compound=false
>>>>>>     hasProx=true
>>>>>>     numFiles=9
>>>>>>     size (MB)=6,683.447
>>>>>>     diagnostics = {optimize=false, mergeFactor=10,
>>>>>> os.version=2.6.26-2-amd64,
>>>>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>>>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>>>>> java.vendor=Sun Microsystems Inc.}
>>>>>>     has deletions [delFileName=_ncc_24f.del]
>>>>>>     test: open reader.........OK [278167 deleted docs]
>>>>>>     test: fields..............OK [51 fields]
>>>>>>     test: field norms.........OK [51 fields]
>>>>>>     test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num
>>>>>> docs seen 0 + num docs deleted 0]
>>>>>> java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs seen
>>>>>> 0 + num docs deleted 0
>>>>>>         at
>>>>>> org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>>>>>         at
>>>>>> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>>>>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>>>     test: stored fields.......OK [45429565 total field count; avg 29.056
>>>>>> fields per doc]
>>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>>> vector fields per doc]
>>>>>> FAILED
>>>>>>     WARNING: fixIndex() would remove reference to this segment; full
>>>>>> exception:
>>>>>> java.lang.RuntimeException: Term Index test failed
>>>>>>         at
>>>>>> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>>>>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>>>
>>>>>>
>>>>>> I'll activate infoStream for next time.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>> On 12/01/2011 00:49, Michael McCandless wrote:
>>>>>>>
>>>>>>> When you hit corruption is it always this same problem?:
>>>>>>>
>>>>>>>    java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
>>>>>>> num docs seen 0 + num docs deleted 0
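For readers of the archive, this is what the exception asserts (my simplified paraphrase of CheckIndex's term test, not Lucene's actual code): for every term, the docFreq stored in the term dictionary must equal the live documents enumerated from the postings plus the deleted ones, so flipped bits in the deletes file can make a posting vanish from the count. A toy model:

```java
import java.util.HashSet;
import java.util.Set;

public class DocFreqInvariant {
    // Simplified version of the per-term consistency check: the stored
    // docFreq must equal live docs seen plus deleted docs seen.
    static boolean consistent(int storedDocFreq, int[] postedDocs,
                              Set<Integer> deleted) {
        int seen = 0, del = 0;
        for (int doc : postedDocs) {
            if (deleted.contains(doc)) del++; else seen++;
        }
        return storedDocFreq == seen + del;
    }

    public static void main(String[] args) {
        Set<Integer> deleted = new HashSet<>();
        // healthy case: term posted in doc 7, docFreq=1, nothing deleted
        System.out.println(consistent(1, new int[] { 7 }, deleted)); // true
        // corrupt case matching the report above:
        // docFreq=1 != num docs seen 0 + num docs deleted 0
        System.out.println(consistent(1, new int[0], deleted));      // false
    }
}
```

The "seen 0 + deleted 0" shape means the posting itself went missing, not merely that the doc was marked deleted, which fits the suspicion that something below Lucene is corrupting bytes.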
>>>>>>>
>>>>>>> Can you run with Lucene's IndexWriter infoStream turned on, and catch
>>>>>>> the output leading to the corruption?  If something is somehow messing
>>>>>>> up the bits in the deletes file that could cause this.
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
>>>>>>> <stephane.delprat@blogspirit.com>      wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We are using :
>>>>>>>> Solr Specification Version: 1.4.1
>>>>>>>> Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
>>>>>>>> Lucene Specification Version: 2.9.3
>>>>>>>> Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
>>>>>>>>
>>>>>>>> # java -version
>>>>>>>> java version "1.6.0_20"
>>>>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>>>>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>>>>>>>
>>>>>>>> We want to index 4M docs in one core (and when it works fine we will add
>>>>>>>> other cores with 2M on the same server) (1 doc ~= 1kB)
>>>>>>>>
>>>>>>>> We use Solr replication every 5 minutes to update the slave server
>>>>>>>> (queries are executed on the slave only)
>>>>>>>>
>>>>>>>> Documents are changing very quickly; during a normal day we will have
>>>>>>>> approximately:
>>>>>>>> * 200 000 updated docs
>>>>>>>> * 1000 new docs
>>>>>>>> * 200 deleted docs
>>>>>>>>
>>>>>>>>
>>>>>>>> I attached the last good checkIndex : solr20110107.txt
>>>>>>>> And the corrupted one : solr20110110.txt
>>>>>>>>
>>>>>>>>
>>>>>>>> This is not the first time a segment has gotten corrupted on this server;
>>>>>>>> that's why I ran frequent checkIndex passes. (But as you can see, the
>>>>>>>> first segment is 1,800,000 docs and it works fine!)
>>>>>>>>
>>>>>>>>
>>>>>>>> I can't find any "SEVERE", "FATAL" or "exception" entries in the Solr
>>>>>>>> logs.
>>>>>>>>
>>>>>>>>
>>>>>>>> I also attached my schema.xml and solrconfig.xml
>>>>>>>>
>>>>>>>>
>>>>>>>> Is there something wrong with what we are doing? Do you need other
>>>>>>>> info?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
