lucene-java-user mailing list archives

From Dmitry Emets <emet...@gmail.com>
Subject Re: Deduplication of search result with custom with custom sort
Date Fri, 09 Oct 2020 13:26:26 GMT
6_500_000 is the total count of groups in the entire collection. I only
return the top 1000 to users.
I use Lucene, and my documents can share the same docvalue; I want to
deduplicate these documents by that docvalue during search.
Also, I sort my documents by multiple fields, so I can't use
DiversifiedTopDocsCollector, which works with relevance score only.
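
For illustration only, here is a minimal plain-Java sketch of the kind of collection logic such a deduplicating top-N collector might use. It is not Lucene code and not an existing Lucene API: the names DedupTopN, Hit, SORT and collect, and the sort fields price and date, are hypothetical stand-ins. It assumes doc ids are unique and that the dedup key can be read as a long. The idea is to keep a bounded sorted queue of candidates plus a map from key to the candidate currently representing that key, so memory stays proportional to the requested top N rather than to the total number of groups.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: top-N hits under a custom sort, at most one hit per dedup key. */
final class DedupTopN {

    /** One hit: doc id, dedup key (stand-in for the docvalue), and two example sort fields. */
    record Hit(int docId, long key, long price, long date) {}

    /** Example multi-field sort: price ascending, date descending, doc id as final tie-break. */
    static final Comparator<Hit> SORT = Comparator
            .comparingLong(Hit::price)
            .thenComparing(Comparator.comparingLong(Hit::date).reversed())
            .thenComparingInt(Hit::docId);

    /** Returns the best n hits in sort order, keeping only the best hit for each key. */
    static List<Hit> collect(Iterable<Hit> hits, int n, Comparator<Hit> sort) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        TreeSet<Hit> queue = new TreeSet<>(sort);   // current candidates, best first
        Map<Long, Hit> byKey = new HashMap<>();     // key -> its candidate in the queue
        for (Hit hit : hits) {
            Hit sameKey = byKey.get(hit.key());
            if (sameKey != null) {
                // Key already represented: keep only the better of the two hits.
                if (sort.compare(hit, sameKey) < 0) {
                    queue.remove(sameKey);
                    queue.add(hit);
                    byKey.put(hit.key(), hit);
                }
            } else if (queue.size() < n) {
                // Queue not full yet: any new key is a candidate.
                queue.add(hit);
                byKey.put(hit.key(), hit);
            } else if (sort.compare(hit, queue.last()) < 0) {
                // New key beats the current worst candidate: evict the worst.
                Hit worst = queue.pollLast();
                byKey.remove(worst.key());
                queue.add(hit);
                byKey.put(hit.key(), hit);
            }
        }
        return new ArrayList<>(queue);
    }
}
```

Calling DedupTopN.collect(hits, 1000, DedupTopN.SORT) would return at most 1000 hits with distinct keys. Note that this sketch cannot report the total number of distinct groups, which is the part that setAllGroups(true) computes.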

On Fri, Oct 9, 2020 at 16:02, Erick Erickson <erickerickson@gmail.com> wrote:

> This is going to be fairly painful. You need to keep a list 6.5M
> items long, sorted.
>
> Before diving in there, I’d really back up and ask what the use-case
> is. Returning 6.5M docs to a user is useless, so are you doing
> some kind of analytics, maybe? In which case, and again
> assuming you’re using Solr, Streaming Aggregation might
> be a better option.
>
> This really sounds like an XY problem. You’re trying to solve problem X
> and asking how to accomplish it with Y. What I’m questioning
> is whether Y (grouping) is a good approach or not. Perhaps if
> you explained X there’d be a better suggestion.
>
> Best,
> Erick
>
> > On Oct 9, 2020, at 8:19 AM, Dmitry Emets <emetsds@gmail.com> wrote:
> >
> > I have 12_000_000 documents, 6_500_000 groups
> >
> > With sort: It takes around 1 sec without grouping, 2 sec with grouping
> > and 12 sec with setAllGroups(true)
> > Without sort: It takes around 0.2 sec without grouping, 0.6 sec with
> > grouping and 10 sec with setAllGroups(true)
> >
> > Thank you, Erick, I will look into it
> >
> > On Fri, Oct 9, 2020 at 14:32, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> >> At the Solr level, see CollapsingQParserPlugin:
> >>
> >> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
> >>
> >> You could perhaps steal some ideas from that if you
> >> need this at the Lucene level.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarelli4@bloomberg.net> wrote:
> >>>
> >>> Is the field that you are using to dedupe stored as a docvalue?
> >>>
> >>> From: java-user@lucene.apache.org At: 10/09/20 12:18:04
> >>> To: java-user@lucene.apache.org
> >>> Subject: Deduplication of search result with custom with custom sort
> >>>
> >>> Hi,
> >>> I need to deduplicate search results by a specific field and I have no
> >>> idea how to implement this properly.
> >>> I have tried grouping with setGroupDocsLimit(1) and it gives me the
> >>> expected results, but the performance is not very good.
> >>> I think that I need something like DiversifiedTopDocsCollector, but
> >>> suitable for collecting TopFieldDocs.
> >>> Is there any possibility to achieve deduplication with existing Lucene
> >>> components, or do I need to implement my own
> >>> DiversifiedTopFieldsCollector?
> >>>
> >>>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
