lucene-java-user mailing list archives

From David Allouche <da...@allouche.net>
Subject Re: Live index upgrading
Date Fri, 21 Jun 2019 17:11:50 GMT
Unfortunately, I cannot assume SolrCloud, because our software predates Solr.

So I would either need to switch to Solr or reimplement a work-around for the lack of index
migration. I am reluctant to switch to Solr because it increases the operational complexity.

I understand the argument: if the algorithm fₙ() used to derive index data iₙ from the
raw data rₙ changes [iₙ=fₙ(rₙ)], the index data iₙ₊₁ may not be derivable from
iₙ [∃n ∄g : iₙ₊₁=g(iₙ)].
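
A toy illustration of that argument, not Lucene code: the two analyzers below (`analyze_v1` and `analyze_v2` are hypothetical names) stand in for fₙ and fₙ₊₁. Two different raw texts yield the same old index data but different new index data, so no function g could map the old data to the new.

```python
def analyze_v1(raw: str) -> tuple:
    """Hypothetical old analyzer f_n: lowercases, then splits on whitespace."""
    return tuple(raw.lower().split())


def analyze_v2(raw: str) -> tuple:
    """Hypothetical new analyzer f_{n+1}: splits on whitespace, keeps case."""
    return tuple(raw.split())


raw_a = "Apple pie"
raw_b = "apple pie"

# Both raw texts collapse to the same old index data...
same_old = analyze_v1(raw_a) == analyze_v1(raw_b)
# ...but the new analyzer tells them apart, so the new index data
# cannot be a function of the old index data alone.
same_new = analyze_v2(raw_a) == analyze_v2(raw_b)
```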

On the application level, one could store the non-tokenized content in the index (I guess
that is why Elasticsearch has .raw fields), traverse the index, and use the non-tokenized
content to build a new index. I already have index traversal code that I use for garbage
collection of old entries. The progress of the conversion could then be recorded as the
index into LeafReader.getLiveDocs().

https://lucene.apache.org/core/8_1_1/core/org/apache/lucene/index/LeafReader.html#getLiveDocs--

Alternatively, since I do not have all the non-tokenized content in the index now, I could
use the external document id to retrieve the original document text.

Is there a convenient place to store the getLiveDocs index across process interruptions? Or
should I use something stupid like a file to store the counter?
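
For what it is worth, the file-based option can be made reasonably robust. Below is a minimal sketch, assuming the progress is a single integer cursor; `save_checkpoint`, `load_checkpoint`, and the JSON layout are hypothetical names chosen for illustration. It writes to a temporary file and renames it into place so that an interruption does not leave a half-written counter.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, position: int) -> None:
    """Atomically record how far the conversion has progressed."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump({"position": position}, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the file in one step on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path: str) -> int:
    """Return the saved position, or 0 if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["position"]
    except FileNotFoundError:
        return 0
```

A conversion loop would call `load_checkpoint` once at startup and `save_checkpoint` after each batch. As a possible alternative to a separate file, IndexWriter has a setLiveCommitData method that attaches a small string map to a commit; whether that fits here is worth checking against the Javadoc for the Lucene version in use.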

That is still a lot of hassle, but I understand why Lucene considers that index migration
should be handled further up the stack.


> On 21 Jun 2019, at 18:06, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> Assuming SolrCloud, reindex from scratch into a new collection, then use collection aliasing
when you are ready to switch. You don’t need to stop your clients when you use CREATEALIAS.
> 
> Prior to writing the marker, Lucene would appear to work with older indexes, but there
would be subtle errors because the information needed to score docs just wasn’t there.
> 
> Here are two quotes from people who know that crystallized the problem Lucene faces for
me:
> 
> From Robert Muir: 
> 
> “I think the key issue here is Lucene is an index not a database. Because it is a lossy
index and does not retain all of the user's data, it's not possible to safely migrate some
things automagically. In the norms case IndexWriter needs to re-analyze the text ("re-index")
and compute stats to get back the value, so it can be re-encoded. The function is y = f(x)
and if x is not available it's not possible, so Lucene can't do it.”
> 
> From Mike McCandless:
> 
> “This really is the difference between an index and a database: we do not store, precisely,
the original documents.  We store an efficient derived/computed index from them.  Yes, Solr/ES
can add database-like behavior where they hold the true original source of the document and
use that to rebuild Lucene indices over time.  But Lucene really is just a "search index"
and we need to be free to make important improvements with time.”
> 
> Best,
> Erick
> 
>> On Jun 21, 2019, at 7:10 AM, David Allouche <david@allouche.net> wrote:
>> 
>> Wow. That is annoying. What is the reason for this?
>> 
>> I assumed there was a smooth upgrade path, but apparently, by design, one has to
rebuild the index at least once every two major releases.
>> 
>> So, my question becomes, what is the recommended way of dealing with reindex-from-scratch
without service interruption? 
>> 
>> So I guess the upgrade path looks something like:
>> - Create Lucene6 index
>> - Update Lucene6 index
>> - Create Lucene7 index
>> - Separately keep track of which documents are indexed in Lucene7 and Lucene6 indexes
>> - Make updates to Lucene6 index, concurrently build Lucene7 index from scratch, use
Lucene6 index for search.
>> - When Lucene7 index is fully built, remove Lucene6 index and use Lucene7 index for
search.
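
The steps above can be sketched as a toy model. This is only an illustration under stated assumptions: plain dicts stand in for the Lucene6 and Lucene7 indexes, `DualIndex` and `fetch_original` are hypothetical names, and live updates are written to both indexes (a design choice of this sketch, so the new index does not fall behind while it is being built).

```python
class DualIndex:
    """Toy model: serve searches from the old index while rebuilding a new one."""

    def __init__(self, old_index: dict):
        self.old = old_index            # currently used for search
        self.new = {}                   # being rebuilt from scratch
        self.pending = set(old_index)   # doc ids not yet in the new index
        self.switched = False

    def add_document(self, doc_id, text):
        """Live update: write to both indexes until the switch has happened."""
        if not self.switched:
            self.old[doc_id] = text
        self.new[doc_id] = text
        self.pending.discard(doc_id)

    def rebuild_step(self, fetch_original, batch_size=1000):
        """Re-index up to batch_size pending documents from the original source."""
        for doc_id in list(self.pending)[:batch_size]:
            self.new[doc_id] = fetch_original(doc_id)
            self.pending.discard(doc_id)
        if not self.pending and not self.switched:
            self.old = None             # old index can now be removed
            self.switched = True

    def search_index(self):
        """Return whichever index should currently serve searches."""
        return self.new if self.switched else self.old
```

Calling `rebuild_step` repeatedly from a background task keeps each unit of work small, which is the property asked for here: no single expensive operation, and no service interruption at the switch.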
>> 
>> Rinse and repeat every major version.
>> 
>> Really, isn't there something simpler already to handle Lucene major version upgrades?
>> 
>> 
>>> On 17 Jun 2019, at 18:04, Erick Erickson <erickerickson@gmail.com> wrote:
>>> 
>>> Let’s back up a bit. What version of Lucene are you using? Starting with Lucene
8, any index that’s ever been touched by Lucene 6 will not open. It does not matter if the
index has been completely rewritten. It does not matter if it’s been run through IndexUpgrader,
which just does a forceMerge to 1 segment. A marker is preserved when a segment is created,
and the earliest one is preserved across merges. So say you have two segments, one created
with 6 and one with 7. The Lucene 6 marker is preserved when they are merged.
>>> 
>>> Now, if any segment has the Lucene 6 marker, the index will not be opened by
Lucene.
>>> 
>>> If you’re using Lucene 7, then this error implies that one or more of your
segments was created with Lucene 5 or earlier.
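
The marker behaviour described above can be modelled in a few lines. This is a sketch of the rule as stated in this message, not Lucene's actual implementation; `merged_min_version` and `can_open` are hypothetical names.

```python
def merged_min_version(segment_min_versions):
    """A merged segment keeps the earliest creating version among its inputs."""
    return min(segment_min_versions)


def can_open(reader_major: int, segment_min_versions) -> bool:
    """Toy rule: refuse the index if any marker is older than the previous major."""
    return min(segment_min_versions) >= reader_major - 1


# Two segments, created with Lucene 6 and Lucene 7, merged into one.
merged = merged_min_version([6, 7])

opens_in_8 = can_open(8, [merged])   # the Lucene 6 marker survives the merge
opens_in_7 = can_open(7, [merged])
```

Under this model, forceMerge and IndexUpgrader cannot help, because rewriting segments never raises the marker.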
>>> 
>>> So you probably need to re-index from scratch on whatever version of Lucene you
want to use.
>>> 
>>> Best,
>>> Erick
>>> 
>>> 
>>> 
>>>> On Jun 17, 2019, at 8:41 AM, David Allouche <david@allouche.net> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I use Lucene with PyLucene on a public-facing web application. We have a
moderately large index (~24M documents, ~11GB index data), with a constant stream of new documents.
>>>> 
>>>> I recently upgraded to PyLucene 7.
>>>> 
>>>> When trying to test the new release of PyLucene 8, I encountered an IndexFormatTooOld
error because my index conversion from Lucene6 to Lucene7 was not complete.
>>>> 
>>>> I found IndexUpgrader, and I had a look at its implementation. I would very
much like to avoid taking down the service during the index upgrade, so I believe I cannot
use IndexUpgrader because I need the write lock to be held by the web application to index
new documents.
>>>> 
>>>> So I figure I could get the desired result with an IndexWriter.forceMerge(1).
But the documentation says "This is a horribly costly operation, especially when you pass
a small maxNumSegments; usually you should only call this if the index is static (will no
longer be changed)." https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/index/IndexWriter.html#forceMerge-int-
>>>> 
>>>> And indeed, forceMerge tends to be killed by the kernel OOM killer on my development
VM. I want to avoid this failure mode in production. I could increase the VM's memory until it
works, but I would rather have a less brutal approach to upgrading a live index. Something that
could run in the background with reasonable amounts of anonymous memory.
>>>> 
>>>> What is the recommended approach to upgrading a live index?
>>>> 
>>>> How can I know from the code that the index needs upgrading at all? I could
add a manual knob to start an upgrade, but it would be better if it occurred transparently
when I upgrade PyLucene.
>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>> 