lucene-dev mailing list archives

From Adrien Grand <jpou...@gmail.com>
Subject Re: [DISCUSS] Opening old indices for reading
Date Thu, 31 Jan 2019 14:23:03 GMT
This looks reasonable to me.

On Tue, Jan 29, 2019 at 4:23 PM Simon Willnauer
<simon.willnauer@gmail.com> wrote:
>
> thanks folks,
>
> these are all good points. I created a first cut of what I had in mind
> [1]. It's relatively simple, and from a Java visibility perspective
> the only change that a user can take advantage of is this [2] and this
> [3] respectively. This would allow opening indices back to Lucene 7.0
> given that the codecs and postings formats are available. From a
> documentation perspective I added [4]. This is a pure read-only change
> and doesn't allow opening these indices for writing. You can't merge
> them, nor would you be able to open an index writer on top of them. I
> still need to add support to CheckIndex, but that's basically it.
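
As an illustration of the conditions described above, here is a minimal sketch in plain Java. It does not use Lucene and is not the code from the linked commit: the class `OldIndexGate`, its method names, and the codec name strings are hypothetical. It only models the stated rules: reading is possible when the segment's major version is at least the supported minimum and its codec is available, while writing stays limited to the usual N-1 policy.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model, not Lucene code: it mirrors the conditions described in the message.
class OldIndexGate {
    private final int minSupportedMajor;        // e.g. 7 for "back to Lucene 7.0"
    private final Set<String> availableCodecs;  // codec names assumed to be loadable

    OldIndexGate(int minSupportedMajor, Set<String> availableCodecs) {
        this.minSupportedMajor = minSupportedMajor;
        this.availableCodecs = availableCodecs;
    }

    // Reading is allowed only if the version is recent enough and the codec can be loaded.
    boolean canOpenForReading(int segmentMajor, String codecName) {
        return segmentMajor >= minSupportedMajor && availableCodecs.contains(codecName);
    }

    // Writing (and therefore merging) requires a segment from the current or previous major.
    boolean canOpenForWriting(int segmentMajor, int currentMajor) {
        return segmentMajor >= currentMajor - 1;
    }

    public static void main(String[] args) {
        // The codec names and the current major version (9) are illustrative assumptions.
        Set<String> codecs = new HashSet<>(Arrays.asList("Lucene70", "Lucene80"));
        OldIndexGate gate = new OldIndexGate(7, codecs);
        System.out.println(gate.canOpenForReading(7, "Lucene70"));
        System.out.println(gate.canOpenForReading(6, "Lucene62"));
        System.out.println(gate.canOpenForWriting(7, 9));
    }
}
```

The point of keeping the two checks separate is that a segment can pass the read gate while still failing the write gate, which is the read-only behaviour the message describes.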
>
> lemme know what you think,
>
> simon
> [1] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752
> [2] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e0352098b027d6f41a17c068ad8d7ef0R689
> [3] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e3ccf9ee90355b10f2dd22ce2da6c73cR306
> [4] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-1bedf4d0d52ff88ef8a16a6788ad7684R86
>
> On Fri, Jan 25, 2019 at 3:14 PM Michael McCandless
> <lucene@mikemccandless.com> wrote:
> >
> > Another example: long ago Lucene allowed pos=-1 to be indexed, and it
> > caused all sorts of problems. We also stopped allowing positions close
> > to Integer.MAX_VALUE (https://issues.apache.org/jira/browse/LUCENE-6382).
> > Yet another is allowing negative vInts, which are possible but horribly
> > inefficient (https://issues.apache.org/jira/browse/LUCENE-3738).
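
To make the inefficiency of negative vInts concrete, here is a minimal sketch of the usual variable-length int scheme (7 payload bits per byte, high bit set when more bytes follow). It is a standalone re-implementation for illustration, assuming that scheme; the class and method names are hypothetical and this is not Lucene's own code.

```java
import java.io.ByteArrayOutputStream;

// Illustrative re-implementation of a 7-bits-per-byte variable-length int encoding.
class VIntSize {
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // While bits above the low 7 remain, emit 7 bits with the continuation flag set.
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so negative values terminate after 5 bytes
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeVInt(1).length);   // 1 byte
        System.out.println(encodeVInt(128).length); // 2 bytes
        System.out.println(encodeVInt(-1).length);  // 5 bytes
    }
}
```

Because a negative int always has its top bit set, it needs all five bytes, which is more than the four bytes of a fixed-width int.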
> >
> > We do need to be free to fix these problems and then know, after N+2
> > releases, that no index can have the issue.
> >
> > I like the idea of providing an "expert" / best-effort / limited way of
> > carrying forward such ancient indices, but I think the huge challenge for
> > someone using that tool on an important index will be enumerating the
> > list of issues that might "matter" (the 3 Adrien listed + the 3 I listed
> > above is a start for this list) and taking appropriate steps to "correct"
> > the index if so. E.g. on a norms encoding change, somehow these expert
> > tools must decode norms the old way, encode them the new way, and then
> > rewrite the norms files. Or if the index has pos=-1, changing that to
> > pos=0. Or if it has negative vInts, ... etc.
> >
> > Or maybe the "special" DirectoryReader only reads stored fields? And so
> > you would enumerate your _source and reindex into the latest format ...
> >
> > > Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> > > help make it harder to introduce corrupt data in an index.
> >
> > +1
> >
> > Every time we catch something like "don't allow pos = -1 into the index"
> > we need to somehow remember to go and add the check in addIndexes as well.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Fri, Jan 25, 2019 at 3:52 AM Adrien Grand <jpountz@gmail.com> wrote:
> >>
> >> Agreed with Michael that setting expectations is going to be
> >> important. The thing that I would like to make sure is that we would
> >> never refrain from moving Lucene forward because of this feature. In
> >> particular, lucene-core should be free to make assumptions that are
> >> valid for N and N-1 indices without worrying about the fact that we
> >> have this super-expert feature that allows opening older indices. Here
> >> are some assumptions that I have in mind which have not always been
> >> true:
> >>  - norms might be encoded in a different way (this changed in 7)
> >>  - all index files have a checksum (only true since Lucene 5)
> >>  - offsets are always going forward (only enforced since Lucene 7)
> >>
> >> This means that carrying indices over by just merging them with the
> >> new version to move them to a new codec won't work all the time. For
> >> instance if your index has backward offsets and new codecs assume that
> >> offsets are going forward, then merging might fail or corrupt offsets
> >> - I'd like to make sure that we would not consider this a bug.
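
The "offsets are always going forward" assumption can be sketched as a simple validation pass. This is an illustrative check only, assuming each token carries a (startOffset, endOffset) pair; the class and method names are hypothetical, and it is not the check Lucene itself performs at index time.

```java
// Illustrative check: offsets must be non-negative, end >= start, and starts must not go backwards.
class OffsetCheck {
    /** Returns the index of the first token with invalid offsets, or -1 if all are valid. */
    static int firstBadOffset(int[][] offsets) {
        int lastStart = 0;
        for (int i = 0; i < offsets.length; i++) {
            int start = offsets[i][0];
            int end = offsets[i][1];
            if (start < 0 || end < start || start < lastStart) {
                return i;
            }
            lastStart = start;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[][] forward = {{0, 3}, {4, 7}, {8, 12}};
        int[][] backward = {{0, 3}, {4, 7}, {2, 5}}; // third token starts before the second
        System.out.println(firstBadOffset(forward));
        System.out.println(firstBadOffset(backward));
    }
}
```

An old index that contains a sequence like `backward` would violate what newer codecs assume, which is why a plain merge into a new codec cannot be relied on to carry it over.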
> >>
> >> Erick, I don't think this feature would be suitable for "robust index
> >> upgrades". To me it is really a best effort and shouldn't be trusted
> >> too much.
> >>
> >> I think some users will be tempted to wrap old readers to make them
> >> look good and then add them back to an index using addIndexes?
> >> Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> >> help make it harder to introduce corrupt data in an index.
> >>
> >> On Wed, Jan 23, 2019 at 3:11 PM Simon Willnauer
> >> <simon.willnauer@gmail.com> wrote:
> >> >
> >> > Hey folks,
> >> >
> >> > tl;dr; I want to be able to open an indexreader on an old index if the
> >> > SegmentInfo version is supported and all segment codecs are available.
> >> > Today that's not possible even if I port old formats to current
> >> > versions.
> >> >
> >> > Our BWC policy for quite a while has been N-1 major versions. That's
> >> > good and I think we should keep it that way. Only recently, because of
> >> > changes to how we encode/decode norms, we also hard-enforce the
> >> > index-version-created in several places, as well as the version a
> >> > segment was written with. These are great enforcements and I
> >> > understand why. My request here is whether we can find consensus on
> >> > somehow allowing (via a special DirectoryReader, for instance) such an
> >> > index to be opened for reading only, without the guarantee that our
> >> > high-level APIs decode norms correctly, for instance. This would be
> >> > enough to, for instance, consume stored fields etc. for reindexing, or,
> >> > if users are aware of it, to do their norms decoding in the codec. I am
> >> > happy to work on a proposal for how this would work. It would still
> >> > enforce no writing or anything like this. I am also all for putting
> >> > such a reader into misc and marking it experimental.
> >> >
> >> > simon
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: dev-help@lucene.apache.org
> >> >
> >>
> >>
> >> --
> >> Adrien


-- 
Adrien


