lucene-solr-user mailing list archives

From Alexander Aristov <alexander.aris...@gmail.com>
Subject Re: solr keep old docs
Date Thu, 29 Dec 2011 05:41:22 GMT
Yes, I have been warned that querying the index each time before adding a doc
to the index might be resource consuming. I will check it.

As for the overwrite parameter, I think the name is not the best, then.
People outside the "business", like me, misuse it and assume what I wrote.
Overwrite should mean what it says.

But I understand what it actually does, so my way forward is to write a
custom update processor plugin.

Best Regards
Alexander Aristov


On 28 December 2011 22:16, Chris Hostetter <hossman_lucene@fucit.org> wrote:

>
> : That said, writing your own update request handler
> : that detected this case isn't very difficult,
> : extend UpdateRequestProcessorFactory/UpdateRequestProcessor
> : and use it as a plugin.
>
> i can't find the thread at the moment, but the general issue that has
> caused people headaches with this type of approach in the past has been
> that the performance of doing a query on every update (to see if the doc
> is already in the index) can slow things down quite a bit -- in your
> use case it may not be a significant bottleneck, but that's the general
> issue that has come up in the past.
>
> If you look at systems (like nutch) that do large scale crawling, they
> treat the crawl phase as independent from the indexing phase precisely for
> reasons like this -- so the crawler can dedup the documents (by unique
> URL) and eliminate duplication before ever even adding them to the index.
>
> : >> > I wonder why simple the overwrite parameter doesn't work here.
>        ...
> : >> > 2. overwrite=false and uniqueID exists then newer doc must be skipped
> : >> > since old exists.
>
> that is not what overwrite=false does (or was ever designed to do).
> overwrite=false is a way to tell Solr that you are already certain that
> the documents being added do not exist in the index, therefore Solr can
> save time by not attempting to overwrite an existing document.  It is
> intended for situations where you are bulk loading documents, ie: doing an
> initial build of an index from a system of record (ie: a single pass over
> a database that uses the same unique key) or importing documents from a
> new system of record with a completely different id space.
>
>
>
> -Hoss
>
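
The dedup-before-indexing idea quoted above can be sketched in plain Java.
This is only a minimal illustration under stated assumptions, not Solr or
Nutch code: the Doc class, the keepFirst method, and the sample ids are all
hypothetical, and the batch is assumed to fit in memory. It keeps the first
document seen for each unique key and drops later ones, which is the "keep
old docs" behavior asked for in this thread, without querying the index on
every add.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeepFirstDedup {
    /** Hypothetical document: a unique key plus some content. */
    public static final class Doc {
        public final String id;
        public final String body;

        public Doc(String id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    /**
     * Returns the batch with duplicates removed, keeping the FIRST document
     * seen for each id. Later documents with the same id are skipped.
     */
    public static List<Doc> keepFirst(List<Doc> incoming) {
        // LinkedHashMap preserves the order in which ids were first seen.
        Map<String, Doc> byId = new LinkedHashMap<>();
        for (Doc d : incoming) {
            byId.putIfAbsent(d.id, d); // no-op if the id is already present
        }
        return new ArrayList<>(byId.values());
    }

    public static void main(String[] args) {
        List<Doc> batch = List.of(
                new Doc("a", "old"),
                new Doc("b", "x"),
                new Doc("a", "new")); // duplicate id, should be dropped

        for (Doc d : keepFirst(batch)) {
            System.out.println(d.id + " -> " + d.body);
        }
        // prints:
        // a -> old
        // b -> x
    }
}
```

The deduplicated list is what would then be sent to the index. This only
handles duplicates within one batch; documents already in the index would
still need a lookup or a custom update processor as discussed above.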
