lucene-solr-user mailing list archives

From Alexander Aristov <alexander.aris...@gmail.com>
Subject Re: solr keep old docs
Date Thu, 29 Dec 2011 19:52:49 GMT
Well, the first results are ready. I have implemented a custom update
processor following your suggestion, using a low-level index reader and
TermDocs.

I launched scripts that add about 10,000 docs. Indexing took about 1 minute
including the commit, which is quite good for me. I don't have larger
datasets, so I won't be able to test under heavier load.

If anyone is interested, I can send over the jar file with my update
processor.

As I said, I am ready to contribute it to Solr, but I will get back to it in
the New Year, after 10 Jan.

Thanks, everybody.

Best Regards
Alexander Aristov


On 29 December 2011 18:12, Erick Erickson <erickerickson@gmail.com> wrote:

> I'd guess it would be much faster, assuming that
> the search savings wouldn't be swamped by the
> additional transmission time over the wire and
> parsing the request (although SolrJ uses a binary
> format, so parsing request probably isn't all
> that expensive).
>
> You could even do a hybrid approach. Pack up all
> of the IDs you are about to update, send them to
> your special *request* handler and have your
> request handler respond with the documents that
> were already in the index...
>
> Hmmm, scratch all that. Start with just stringing
> together a long set of <uniqueKeys> and just
> search for them. Something like
> q=id:(1 2 47 09873............)&fl=id
> The response should be a minimal set of data
> returned (just the ID). Then you can remove
> each document ID returned from your
> next update. No custom Solr components
> required.
>
> Solr defaults to a maxBooleanClauses count
> of 1024, so your packets should have fewer IDs
> than this or you should bump that config setting.
>
> This should pretty much do what I was thinking
> with custom code without having to write
> anything..
>
> Best
> Erick
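
A minimal sketch of the batched lookup Erick describes above, in plain Java with no Solr classes so that it runs standalone. The class and method names are hypothetical, and the sketch only builds the `q` value and filters the pending IDs; sending the query (with `fl=id`) and reading the returned IDs is left to whatever client is in use. It assumes the default OR operator, and it quotes each ID, which differs slightly from the unquoted form in the example query.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: builds id:(...) lookup queries in bounded batches. */
final class IdBatchQuery {

    /** Splits ids into consecutive batches of at most batchSize entries. */
    static List<List<String>> partition(List<String> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }

    /** Renders one batch as the value of the q parameter, e.g. id:("1" "2" "47"). */
    static String toQuery(String keyField, List<String> batch) {
        StringBuilder q = new StringBuilder(keyField).append(":(");
        for (int i = 0; i < batch.size(); i++) {
            if (i > 0) {
                q.append(' ');
            }
            // Quote each ID and escape backslashes and quotes so it stays one clause.
            String escaped = batch.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            q.append('"').append(escaped).append('"');
        }
        return q.append(')').toString();
    }

    /** Drops every pending ID that the lookup reported as already indexed. */
    static List<String> withoutAlreadyIndexed(List<String> pending,
                                              Collection<String> returnedIds) {
        Set<String> indexed = new HashSet<>(returnedIds);
        List<String> toSend = new ArrayList<>();
        for (String id : pending) {
            if (!indexed.contains(id)) {
                toSend.add(id);
            }
        }
        return toSend;
    }
}
```

With the default limit of 1024 clauses, a batch size somewhat below that (say 1000) keeps each query under the limit without touching the config.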
>
> On Thu, Dec 29, 2011 at 8:15 AM, Alexander Aristov
> <alexander.aristov@gmail.com> wrote:
> > I have never developed for Solr before and don't know much about its
> > internals, but today I tried one approach with a searcher.
> >
> > In my update processor I get a searcher and search for the ID. It works,
> > but I need to load-test it. Will index traversal be faster (less
> > resource-consuming) than a search?
> >
> > Best Regards
> > Alexander Aristov
> >
> >
> > On 29 December 2011 17:03, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> >> Hmmm, we're not communicating <G>...
> >>
> >> The update processor wouldn't search in the
> >> classic sense. It would just use lower-level
> >> index traversal to determine if the doc (identified
> >> by your unique key) was already in the index
> >> and skip indexing that document if it was. No real
> >> *searching* involved (see TermDocs.seek for one
> >> approach).
> >>
> >> The price would be that you are transmitting the
> >> document over to the Solr instance and then
> >> throwing it away.
> >>
> >> Best
> >> Erick
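
The skip-if-present decision described above can be sketched in plain Java, without Solr classes, so that it runs standalone. Everything here is hypothetical naming: `existsInIndex` stands in for the low-level unique-key lookup, and in a real plugin this check would presumably sit in `UpdateRequestProcessor.processAdd(AddUpdateCommand)`, forwarding with `super.processAdd(cmd)` only when the document should be kept. The extra in-batch set is an assumption of this sketch: an index reader normally sees only committed documents, so repeats inside one uncommitted batch would otherwise slip through.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical model of a "keep the old document" check for an update chain. */
final class KeepOldDocsFilter {

    // Stand-in for the index lookup by unique key (e.g. a term lookup on the id field).
    private final Predicate<String> existsInIndex;

    // IDs already accepted since the last commit; not yet visible to the index lookup.
    private final Set<String> seenSinceCommit = new HashSet<>();

    KeepOldDocsFilter(Predicate<String> existsInIndex) {
        this.existsInIndex = existsInIndex;
    }

    /** Returns true if the document should continue down the chain and be indexed. */
    boolean shouldIndex(String id) {
        if (existsInIndex.test(id)) {
            return false; // an older copy is already indexed: drop the new one
        }
        // Set.add returns false when the ID was already accepted in this batch.
        return seenSinceCommit.add(id);
    }

    /** Call after a commit, once the accepted documents are visible to the lookup. */
    void onCommit() {
        seenSinceCommit.clear();
    }
}
```

As noted in the message above, the document is still transmitted to Solr before it is thrown away; this only avoids the index write.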
> >>
> >> On Thu, Dec 29, 2011 at 12:52 AM, Mikhail Khludnev
> >> <mkhludnev@griddynamics.com> wrote:
> >> > Alexander,
> >> >
> >> > I have two ideas for implementing fast dedupe externally, assuming your
> >> > PKs don't fit in a java.util.*Map:
> >> >
> >> >   - your crawler can use an in-process RDBMS (Derby, H2) to track dupes;
> >> >   - if your crawler is stateless (it doesn't track PKs which have
> >> >   already been crawled), you can retrieve them from Solr via
> >> >   http://wiki.apache.org/solr/TermsComponent . That's blazingly fast,
> >> >   but there might be a problem with removed documents (I'm not sure).
> >> >   It can also lead to an OutOfMemoryError (if you have too many PKs).
> >> >   Let me know if you need a workaround for one of these problems.
> >> >
> >> > If you choose internal dedupe (UpdateProcessor), please let me know if
> >> > querying one-by-one turns out to be too slow for you and you need to do
> >> > it page-by-page. I have done some paging like that, and will do
> >> > something similar soon, so I'm interested in it.
> >> >
> >> > Regards
> >> >
> >> > On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov <
> >> > alexander.aristov@gmail.com> wrote:
> >> >
> >> >> Unfortunately I have a lot of duplicates, and given that searching
> >> >> might suffer, I will try implementing an update processor.
> >> >>
> >> >> But your idea is interesting and I will consider it, thanks.
> >> >>
> >> >> Best Regards
> >> >> Alexander Aristov
> >> >>
> >> >>
> >> >> On 28 December 2011 19:12, Tanguy Moal <tanguy.moal@gmail.com> wrote:
> >> >>
> >> >> > Hello Alexander,
> >> >> >
> >> >> > I don't know much about your requirements in terms of size and
> >> >> > performance, but I've had a similar use case and found a pretty
> >> >> > simple workaround.
> >> >> > If your duplicate rate is not too high, you can have the
> >> >> > SignatureProcessor generate a fingerprint of each document (you
> >> >> > already did that).
> >> >> >
> >> >> > Simply turn off overwriting of duplicates; you can then rely on
> >> >> > Solr's grouping / field collapsing to group your search results by
> >> >> > fingerprint. You'll then have one document group per "real"
> >> >> > document. You can use group.sort to sort the documents within each
> >> >> > group by indexing date ascending, and group.limit=1 to keep only
> >> >> > the oldest one.
> >> >> > You can even use group.format=simple to serve results as if no
> >> >> > collapsing occurred, and use group.ngroups (/!\ could be expensive
> >> >> > /!\) to get the real number of deduplicated documents.
> >> >> >
> >> >> > Of course the index will be larger; as I said, I made no assumptions
> >> >> > regarding your operating requirements. And search can be a bit
> >> >> > slower, depending on the average rate of duplicated documents.
> >> >> > But you've got your issue addressed by configuration tuning only...
> >> >> > Depending on your project's sizing, it could be time-saving.
> >> >> >
> >> >> > The advantage is that you have the precious information of what
> >> >> > content is duplicated from where :-)
> >> >> >
> >> >> > Hope this helps,
> >> >> >
> >> >> > --
> >> >> > Tanguy
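
A minimal sketch of the request parameters Tanguy lists above, assembled into a URL query string in plain Java. The class name and the field names passed in (`signature` for the fingerprint, `indexed_at` for the indexing date) are assumptions for illustration; only the `group.*` parameter names come from the message.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: query parameters for dedupe-at-search-time via grouping. */
final class GroupedDedupeQuery {

    /** Collects the grouping parameters in a stable order. */
    static Map<String, String> params(String userQuery, String signatureField, String dateField) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", userQuery);
        p.put("group", "true");
        p.put("group.field", signatureField);     // one group per fingerprint
        p.put("group.sort", dateField + " asc");  // oldest document first within a group
        p.put("group.limit", "1");                // keep only that oldest document
        p.put("group.format", "simple");          // flat result list, as if ungrouped
        p.put("group.ngroups", "true");           // number of groups; can be expensive
        return p;
    }

    /** URL-encodes the parameters into key=value pairs joined by '&'. */
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

The resulting string would be appended to the select handler URL of the core being queried.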
> >> >> >
> >> >> > On 28/12/2011 15:45, Alexander Aristov wrote:
> >> >> >
> >> >> >> Thanks Erick,
> >> >> >>
> >> >> >> that gives me a direction. I will write a new plugin and get back
> >> >> >> to the dev forum with results, and then we will decide on next
> >> >> >> steps.
> >> >> >>
> >> >> >> Best Regards
> >> >> >> Alexander Aristov
> >> >> >>
> >> >> >>
> >> >> >> On 28 December 2011 18:08, Erick Erickson <erickerickson@gmail.com> wrote:
> >> >> >>
> >> >> >>> Well, the short answer is that nobody else has
> >> >> >>> 1>  had a similar requirement
> >> >> >>> AND
> >> >> >>> 2>  not found a suitable work around
> >> >> >>> AND
> >> >> >>> 3>  implemented the change and contributed it back.
> >> >> >>>
> >> >> >>> So, if you'd like to volunteer<G>.....
> >> >> >>>
> >> >> >>> Seriously. If you think this would be valuable and are
> >> >> >>> willing to work on it, hop on over to the dev list and
> >> >> >>> discuss it, open a JIRA and make it work. I'd start
> >> >> >>> by opening a discussion on the dev list before
> >> >> >>> opening a JIRA, just to get a sense of where the
> >> >> >>> snags would be to changing the Solr code, but that's
> >> >> >>> optional.
> >> >> >>>
> >> >> >>> That said, writing your own update request processor
> >> >> >>> that detects this case isn't very difficult:
> >> >> >>> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
> >> >> >>> and use it as a plugin.
> >> >> >>>
> >> >> >>> Best
> >> >> >>> Erick
> >> >> >>>
> >> >> >>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
> >> >> >>> <alexander.aristov@gmail.com>  wrote:
> >> >> >>>
> >> >> >>>> The problem with dedupe (SignatureUpdateProcessor) is that it
> >> >> >>>> REPLACES old docs. I have tried it already.
> >> >> >>>>
> >> >> >>>> Best Regards
> >> >> >>>> Alexander Aristov
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> On 28 December 2011 13:04, Lance Norskog <goksron@gmail.com> wrote:
> >> >> >>>>
> >> >> >>>>> The SignatureUpdateProcessor is for exactly this problem:
> >> >> >>>>>
> >> >> >>>>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
> >> >> >>>>>
> >> >> >>>>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
> >> >> >>>>> <alexander.aristov@gmail.com> wrote:
> >> >> >>>>>
> >> >> >>>>>> I get docs from external sources and the only place I keep
> >> >> >>>>>> them is the Solr index. I have no database or other means to
> >> >> >>>>>> track indexed docs (my personal opinion is that it might be a
> >> >> >>>>>> huge headache).
> >> >> >>>>>>
> >> >> >>>>>> Some docs might change slightly in their original sources, but
> >> >> >>>>>> I don't need those changes. In fact, I need the original data
> >> >> >>>>>> only.
> >> >> >>>>>>
> >> >> >>>>>> So I have no other option but to either check whether a
> >> >> >>>>>> document is already in the index before I put it into the
> >> >> >>>>>> SolrJ array (read: query Solr), or develop my own update chain
> >> >> >>>>>> processor, implement the ID check there, and skip such docs.
> >> >> >>>>>>
> >> >> >>>>>> Maybe this is the wrong place to argue, and it has probably
> >> >> >>>>>> been discussed before, but I wonder why the simple overwrite
> >> >> >>>>>> parameter doesn't work here.
> >> >> >>>>>>
> >> >> >>>>>> In my opinion it suits this case perfectly. In combination
> >> >> >>>>>> with the unique ID it can cover all possible variants.
> >> >> >>>>>>
> >> >> >>>>>> Cases:
> >> >> >>>>>>
> >> >> >>>>>> 1. overwrite=true and the uniqueID exists: the newer doc
> >> >> >>>>>> should overwrite the old one.
> >> >> >>>>>>
> >> >> >>>>>> 2. overwrite=false and the uniqueID exists: the newer doc must
> >> >> >>>>>> be skipped, since the old one exists.
> >> >> >>>>>>
> >> >> >>>>>> 3. the uniqueID doesn't exist: the newer doc just gets added,
> >> >> >>>>>> regardless of whether an old one exists or not.
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>> Best Regards
> >> >> >>>>>> Alexander Aristov
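
The three cases proposed above can be written down as a small decision function. This is a sketch of the semantics Alexander is asking for, not of what Solr does (per the thread, overwrite=false currently yields a duplicate rather than a skip). The names are hypothetical, and reading case 3 as "no document with this ID is in the index" is an interpretation.

```java
/** Hypothetical model of the proposed overwrite semantics; not Solr's behavior. */
final class OverwritePolicy {

    enum Action { REPLACE_OLD, SKIP_NEW, ADD_NEW }

    /**
     * @param overwrite        value of the overwrite parameter on the add command
     * @param idAlreadyIndexed whether a document with the same unique key is indexed
     */
    static Action decide(boolean overwrite, boolean idAlreadyIndexed) {
        if (!idAlreadyIndexed) {
            return Action.ADD_NEW;          // case 3: nothing to collide with
        }
        return overwrite
                ? Action.REPLACE_OLD        // case 1: newer doc wins
                : Action.SKIP_NEW;          // case 2: older doc wins
    }
}
```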
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>> On 27 December 2011 22:53, Erick Erickson <erickerickson@gmail.com> wrote:
> >> >> >>>>>>
> >> >> >>>>>>> Mikhail is right as far as I know; the assumption built into
> >> >> >>>>>>> Solr is that duplicate IDs (when <uniqueKey> is defined)
> >> >> >>>>>>> should trigger the old document to be replaced.
> >> >> >>>>>>>
> >> >> >>>>>>> What is your system-of-record? By that I mean, what does your
> >> >> >>>>>>> SolrJ program do to send data to Solr? Is there any way you
> >> >> >>>>>>> could just *not* send documents that are already in the Solr
> >> >> >>>>>>> index based on, for instance, any timestamp associated with
> >> >> >>>>>>> your system-of-record and the last time you did an
> >> >> >>>>>>> incremental index?
> >> >> >>>>>>>
> >> >> >>>>>>> Best
> >> >> >>>>>>> Erick
> >> >> >>>>>>>
> >> >> >>>>>>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
> >> >> >>>>>>> <alexander.aristov@gmail.com> wrote:
> >> >> >>>>>>>
> >> >> >>>>>>>> Hi
> >> >> >>>>>>>>
> >> >> >>>>>>>> I am not using a database. All the needed data is in the
> >> >> >>>>>>>> Solr index; that's why I want to skip excessive checks.
> >> >> >>>>>>>>
> >> >> >>>>>>>> I will check DIH, but I'm not sure it helps.
> >> >> >>>>>>>>
> >> >> >>>>>>>> I am fluent in Java and it's not a problem for me to write a
> >> >> >>>>>>>> class or so, but I want to check first whether there are any
> >> >> >>>>>>>> ways (workarounds) to make it work without coding, just by
> >> >> >>>>>>>> playing around with configuration and params. I don't want
> >> >> >>>>>>>> to move away from the default Solr implementation.
> >> >> >>>>>>>>
> >> >> >>>>>>>> Best Regards
> >> >> >>>>>>>> Alexander Aristov
> >> >> >>>>>>>>
> >> >> >>>>>>>>
> >> >> >>>>>>>> On 27 December 2011 09:33, Mikhail Khludnev <mkhludnev@griddynamics.com> wrote:
> >> >> >>>>>>>>
> >> >> >>>>>>>>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov
> >> >> >>>>>>>>> <alexander.aristov@gmail.com> wrote:
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>> Hi people,
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> I urgently need your help!
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> I have Solr 3.3 configured and running. I do incremental
> >> >> >>>>>>>>>> indexing 4 times a day using bulk updates. Some documents
> >> >> >>>>>>>>>> are identical to some extent and I wish to skip them, not
> >> >> >>>>>>>>>> index them.
> >> >> >>>>>>>>>> But here is the problem: I could not find a way to tell
> >> >> >>>>>>>>>> Solr to ignore new duplicate docs and keep the old indexed
> >> >> >>>>>>>>>> docs. I don't care that it's new. Just determine by ID
> >> >> >>>>>>>>>> that such a document is in the index already, and that's
> >> >> >>>>>>>>>> it.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> I use SolrJ for indexing. I have tried setting
> >> >> >>>>>>>>>> overwrite=false and the dedupe approach, but nothing
> >> >> >>>>>>>>>> helped. Either a newer doc overwrites the old one or I get
> >> >> >>>>>>>>>> a duplicate.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> I think it's a very simple and basic feature and it must
> >> >> >>>>>>>>>> exist. What did I do wrong or fail to do?
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>> I guess it's because the mainstream approach is
> >> >> >>>>>>>>> delta-import, where you have "updated" timestamps in your
> >> >> >>>>>>>>> DB and a "last-import" timestamp stored somewhere. You can
> >> >> >>>>>>>>> check how it works in DIH.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>> I tried Google but couldn't find a solution there,
> >> >> >>>>>>>>>> although many people have encountered this problem.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>> It can definitely be done by overriding
> >> >> >>>>>>>>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand),
> >> >> >>>>>>>>> but I suggest starting by implementing your own
> >> >> >>>>>>>>> http://wiki.apache.org/solr/UpdateRequestProcessor - search
> >> >> >>>>>>>>> for the PK and bypass the chain call if it's found. Then,
> >> >> >>>>>>>>> if you hit performance issues querying your PKs one by one
> >> >> >>>>>>>>> (but only after that), you can batch your searches; there
> >> >> >>>>>>>>> are a couple of optimization techniques for huge
> >> >> >>>>>>>>> disjunction queries like PK:(2 OR 4 OR 5 OR 6).
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>> I am starting to think that I must query the index to
> >> >> >>>>>>>>>> check whether a doc to be added is in the index already
> >> >> >>>>>>>>>> and not add it to the array, but I have so many docs that
> >> >> >>>>>>>>>> I am afraid it's not a good solution.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> Best Regards
> >> >> >>>>>>>>>> Alexander Aristov
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> --
> >> >> >>>>>>>>> Sincerely yours
> >> >> >>>>>>>>> Mikhail Khludnev
> >> >> >>>>>>>>> Lucid Certified
> >> >> >>>>>>>>> Apache Lucene/Solr Developer
> >> >> >>>>>>>>> Grid Dynamics
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>
> >> >> >>>>>
> >> >> >>>>> --
> >> >> >>>>> Lance Norskog
> >> >> >>>>> goksron@gmail.com
> >> >> >>>>>
> >> >> >>>>>
> >> >> >
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Sincerely yours
> >> > Mikhail Khludnev
> >> > Lucid Certified
> >> > Apache Lucene/Solr Developer
> >> > Grid Dynamics
> >> >
> >> > <http://www.griddynamics.com>
> >> >  <mkhludnev@griddynamics.com>
> >>
>
