lucene-dev mailing list archives

From Andrzej Białecki ...@getopt.org>
Subject SOLR-12259: online index schema modification - adding docValues to existing indexed data?
Date Tue, 18 Dec 2018 12:45:57 GMT
Hi,

I'm working on a use case where an existing Solr setup needs to migrate to a schema that uses
docValues for faceting, instead of uninversion. This case fits into a broader subject of SOLR-12259
(Robustly upgrade indexes). However, in this case there are two major requirements for this
migration process:

* data cannot be reindexed from scratch - I need to work with the already indexed documents
(which do contain the values needed for faceting, but these values are simply indexed and
not stored as doc values)

* indexing can’t be stopped while the schema is being changed (the conversion process needs
to work on-the-fly while the collection is online, both for searching and for updates). Collection
reloads / reopening is ok but it’s not ok to take the collection offline for several minutes
(or hours).

Together with Erick Erickson we implemented a solution that uses MergePolicy (actually MergePolicyFactory
in Solr) to enforce re-writing of segments that no longer match the schema, i.e. segments that don’t
contain docValues in a field where the new schema requires them. This merge policy determines which
segments need this conversion and then forces the “merging” (actually re-writing) of these segments
by first wrapping them in UninvertingReader to supply docValues where they are required
by the new schema but are actually missing in the segment’s data. This “AddDocValuesMergePolicy”
(ADVMP for short) is supposed to deal with the following types of segments:

* old segments created before the schema change - these don’t contain any docValues in the
target fields and so they are wrapped in UninvertingReader for merging (and for searching)
according to the new schema.

* new segments created after the schema change - if the FieldInfo-s for these fields claim that
the segment already contains docValues where it should, then the segment is passed as-is to
merging; otherwise it’s wrapped again. An optimisation was also added here to “mark” the
already converted segments using a marker in the SegmentInfo diagnostics map so that we can avoid
re-checking and re-converting already converted data.
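
The per-segment decision described in the two bullets above can be modelled roughly as follows. This is a simplified sketch in plain Java, not the code on the branch: SegmentModel, MARKER_KEY and decide() are hypothetical names chosen for illustration, and the real policy works on Lucene's SegmentInfo and FieldInfos rather than on these stand-ins.

```java
import java.util.Map;
import java.util.Set;

/** Simplified model of the per-segment decision; NOT the actual ADVMP code. */
public class SegmentWrapDecision {

    /** Hypothetical diagnostics key marking an already converted segment. */
    static final String MARKER_KEY = "advmp.converted";

    enum Action { PASS_AS_IS, WRAP_IN_UNINVERTING_READER }

    /** Stand-in for the parts of SegmentInfo / FieldInfos the decision needs. */
    static final class SegmentModel {
        final Map<String, String> diagnostics;
        final Set<String> fieldsWithDocValues;

        SegmentModel(Map<String, String> diagnostics, Set<String> fieldsWithDocValues) {
            this.diagnostics = diagnostics;
            this.fieldsWithDocValues = fieldsWithDocValues;
        }
    }

    /** Decide how a segment is handed to the merge, given the new schema. */
    static Action decide(SegmentModel segment, Set<String> fieldsRequiringDocValues) {
        // Already converted and marked: skip the re-check entirely.
        if ("true".equals(segment.diagnostics.get(MARKER_KEY))) {
            return Action.PASS_AS_IS;
        }
        // Field metadata claims docValues for every required field.
        if (segment.fieldsWithDocValues.containsAll(fieldsRequiringDocValues)) {
            return Action.PASS_AS_IS;
        }
        // Otherwise docValues must be synthesized by uninversion.
        return Action.WRAP_IN_UNINVERTING_READER;
    }
}
```

Note that both checks in this model are per segment, not per document: a segment whose field metadata claims docValues is passed through even if only some of its documents actually carry them.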

So, long story short, this process works very well when there’s no concurrent indexing activity
- all old segments are properly wrapped and re-written, and merging with new segments works
as intended. With concurrent indexing, however, it works only for a short while. At some point
the conversion process seems to lose a large percentage of the docValues, even though the source
segments appear to be properly wrapped at every point - the ADVMP merge policy adds a lot of
debugging information to track the source and type of segments across many levels of merging,
and whether they were wrapped or not.

My working theory is that somehow this schema change produces “franken-segments” (while
they still haven’t been flushed) where only some of the most recent docs have the docValues
and earlier ones don’t. As I understand it, this should not happen in Solr because a schema
change results in a core reload. The tracking information from ADVMP seems to indicate that
all generations of segments, both those that were flushed and those that were merged earlier,
have been properly wrapped.
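
One way to test the “franken-segment” theory would be to compare, per segment, which docs have an indexed value in the target field against which docs have a docValues entry. The sketch below shows only that comparison in plain Java; DocValuesCoverage and its methods are hypothetical names, and it assumes the two doc-id sets have already been collected from the segment by some other means.

```java
import java.util.BitSet;

/** Illustrative coverage check; the doc-id sets are assumed to be given. */
public class DocValuesCoverage {

    /** Fraction of docs with an indexed value that lack a docValues entry. */
    static double lostFraction(BitSet docsWithIndexedValue, BitSet docsWithDocValues) {
        // Docs that should have docValues but don't.
        BitSet missing = (BitSet) docsWithIndexedValue.clone();
        missing.andNot(docsWithDocValues);
        int expected = docsWithIndexedValue.cardinality();
        return expected == 0 ? 0.0 : (double) missing.cardinality() / expected;
    }

    /** True if some, but not all, of the expected docs have docValues. */
    static boolean isPartiallyCovered(BitSet docsWithIndexedValue, BitSet docsWithDocValues) {
        double lost = lostFraction(docsWithIndexedValue, docsWithDocValues);
        return lost > 0.0 && lost < 1.0;
    }
}
```

A segment for which isPartiallyCovered returns true would match the description of a franken-segment; a segment that is either fully covered or not covered at all would not.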

My alternate theory is that there’s some bug in the doc values merging process when UninvertingReader
is involved, because this problem also occurs when we modify ADVMP to always force the wrapping
of all segments in UninvertingReader-s. The percentage of lost doc values is sometimes quite
large, up to 50%. Perhaps there’s a bug somewhere in the code that accounts for the presence
of doc values in FieldCacheImpl?

Together with Erick we implemented a bunch of tests that illustrate this issue - both the
tests and the code can be found on branch "jira/solr-12259":

* code.tests.AddDVMPLuceneTest2 - this is a pure Lucene test that shows how doc values are
lost after several rounds of merging while concurrent indexing is going on. This failure is
100% reproducible.

* code.tests.AddDvStress - this is a Solr test that repeatedly creates a collection without
doc values, starts the indexing, changes the config to use ADVMP, makes the schema change
to turn doc values on, and verifies the number of facets on the target field. This test also
fails after a while with the same symptoms as the Lucene one, so I think that solving the
Lucene test failure should solve this failure too.

Any suggestions or insights are very much appreciated - I'm running out of ideas to try...

—

Andrzej Białecki

