lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-5611) Simplify the default indexing chain
Date Sun, 20 Apr 2014 20:36:14 GMT


Michael McCandless commented on LUCENE-5611:

bq. I think the specializations in the default chain just work around lack of field reuse?
Maybe we should rethink this for Lucene 5, some way that makes it easier and more intuitive
so that this reuse isn't necessary for good performance.

Likely Field re-use would get most of that performance gain too?  But
Field reuse is a hassle for apps ... 

This patch is already big enough, and I'd like to focus on simplifying
the indexing chain, so I'll remove the specializations here and open a
follow-on issue ...

bq. As far as the LuceneTestCase nocommit, we have some similar situations elsewhere, like
RandomPF/RandomCodec where we "remember" for a field for that test class and are consistent.
I think that's enough for good coverage? If we want to mix things up, a test can do that manually.

Ahh right, I'll pull that same logic over.
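
A minimal sketch of that "remember per field" idea, assuming nothing about
the real LuceneTestCase or RandomCodec internals: the class and method names
below are hypothetical. The first time a field name is seen a random choice
is made, and every later call for that name returns the same answer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not the actual LuceneTestCase code: picks a random
// setting the first time a field name is seen and returns the same answer
// for that name afterwards.
final class PerFieldChoice {
  private final Random random;
  private final Map<String, Boolean> remembered = new HashMap<>();

  PerFieldChoice(long seed) {
    this.random = new Random(seed);
  }

  // Returns the remembered choice for this field, choosing randomly on first use.
  synchronized boolean storeTermVectors(String fieldName) {
    Boolean choice = remembered.get(fieldName);
    if (choice == null) {
      choice = random.nextBoolean();
      remembered.put(fieldName, choice);
    }
    return choice;
  }
}
```

One instance per test class would give consistent settings within that class
while still varying across seeds.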

bq. I keep going back and forth on the StoredFieldsWriter codec api change: I can live with
it (assuming javadocs are fixed, heh), and I think it's ok for a step (to prevent bogus passes
on the fields), but it reminds me of the old postings API... perhaps a pull model is warranted,
where the writer actually just uses the visitor API or something simple like that. It might
actually make it cleaner, for example uncompressed stored fields wouldn't need to buffer up
in a RAMOutputStream, it could just do the bogus pass IW was doing before.

OK I'll fix the javadocs and open a new issue that we should try the
visitor API for stored fields?
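
A rough sketch of what such a pull model might look like, as I read the
quoted suggestion. Every type here (SketchStoredVisitor, SketchDocument,
PullingWriter) is a hypothetical stand-in, not the codec API or the patch:
the writer asks the document to replay its stored fields through a visitor,
so a format that needs the field count up front can make one counting pass
and then a writing pass, without buffering.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical visitor: receives one stored field at a time.
interface SketchStoredVisitor {
  void stringField(String name, String value);
}

// Hypothetical document: can replay its stored fields as often as asked.
interface SketchDocument {
  void visitStoredFields(SketchStoredVisitor visitor);
}

// Hypothetical writer that pulls fields instead of having them pushed to it.
final class PullingWriter {
  // Stands in for the on-disk output so the sketch stays self-contained.
  final List<String> written = new ArrayList<>();

  void writeDocument(SketchDocument doc) {
    // First pass: only count the stored fields.
    final int[] count = {0};
    doc.visitStoredFields((name, value) -> count[0]++);
    written.add("count=" + count[0]);

    // Second pass: write each field directly, with no intermediate buffer.
    doc.visitStoredFields((name, value) -> written.add(name + "=" + value));
  }
}
```

A format that does not need the count could skip the first pass entirely,
which is the appeal of letting the writer drive.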

bq. As far as the vectors change, I think it's an ok tradeoff. If there are concerns maybe
o.a.l.document could help. But I don't think it makes sense to use conflicting vectors values
for the same field name... in the same doc.

Yeah, I think it's really strange how Lucene auto-upgrades the TV
(term vector) settings for all field instances with the same field
name ... this is probably unexpected, and on upgrading users would see
that it's happening and have to be explicit about it themselves.  I
think that's a good thing ...
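
To make the stricter behavior concrete, here is a minimal sketch, assuming a
simplified model where a document is just parallel arrays of field names and
term vector flags. TermVectorConsistency is a hypothetical name and this is
not the patch's code: instead of silently upgrading, the check rejects a
document whose same-named fields disagree.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a per-document check; names are illustrative only.
final class TermVectorConsistency {

  // names[i] and storeTermVectors[i] describe the i-th field of one document.
  static void checkDocument(String[] names, boolean[] storeTermVectors) {
    Map<String, Boolean> seen = new HashMap<>();
    for (int i = 0; i < names.length; i++) {
      Boolean previous = seen.put(names[i], storeTermVectors[i]);
      if (previous != null && previous.booleanValue() != storeTermVectors[i]) {
        // No auto-upgrade: the caller has to make the settings agree.
        throw new IllegalArgumentException(
            "all instances of field \"" + names[i]
                + "\" must have the same term vector settings");
      }
    }
  }
}
```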

bq. Are the new checks in field mandatory? What happens if a custom IndexableField does this
(tries to index vectors when not indexed)?

Good question, I'll add a test & make sure the indexer catches it for
a custom IF (IndexableField).
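
The concern, sketched with hypothetical stand-in types (SketchField and
IndexerSideCheck are not Lucene classes): a check that lives only in a
concrete field class's constructor is bypassed by any custom implementation
of the interface, so the indexer has to validate again at indexing time.

```java
// Hypothetical stand-in for an indexable field abstraction.
interface SketchField {
  String name();
  boolean indexed();
  boolean storeTermVectors();
}

// Hypothetical indexer-side validation, applied to every field regardless
// of which class implements the interface.
final class IndexerSideCheck {
  static void validate(SketchField field) {
    if (field.storeTermVectors() && !field.indexed()) {
      throw new IllegalArgumentException(
          "cannot store term vectors for a field that is not indexed (field=\""
              + field.name() + "\")");
    }
  }
}
```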

> Simplify the default indexing chain
> -----------------------------------
>                 Key: LUCENE-5611
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.9, 5.0
>         Attachments: LUCENE-5611.patch
> I think Lucene's current indexing chain has too many classes /
> hierarchy / abstractions, making it look much more complex than it
> really should be, and discouraging users from experimenting/innovating
> with their own indexing chains.
> Also, if it were easier to understand/approach, then new developers
> would more likely try to improve it ... it really should be simpler.
> So I'm exploring a pared back indexing chain, and have a starting patch
> that I think is looking ok: it seems more approachable than the
> current indexing chain, or at least has fewer strange classes.
> I also thought this could give some speedup for tiny documents (a more
> common use of Lucene lately), and it looks like, with the evil
> optimizations, this is a ~25% speedup for Geonames docs.  Even without
> those evil optos it's a bit faster.
> This is very much a work in progress / nocommits, and there are some
> behavior changes e.g. the new chain requires all fields to have the
> same TV options (rather than auto-upgrading all fields by the same
> name, as the current chain does)...
