lucene-java-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: Multi-IDF for a single term possible?
Date Tue, 03 Dec 2019 14:03:55 GMT
>
> it is enough to give each its own field.
>

I kind of over-simplified the problem at hand. Apologies.

DOC_TYPE is just one aspect of the problem. The other is that this is
actually a shared index with multiple users (100-3000 users per index).
There are many hundreds of such shared indexes in our cluster.

Search happens per user, so a single IDF doesn't make sense. Ideally, we
are looking for some Lucene extensions/tricks to store & retrieve IDF
per <User/DOC_TYPE> pair.

> Is there any reason why you are not storing each DOC_TYPE in its own index?


There are some common fields across all DOC_TYPES (e.g. content,
attachment, etc.), and to provide unified search for a user, we colocate
them in a single index.
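
To make the per-<User/DOC_TYPE> idea concrete, here is a minimal sketch in
plain Java of the bookkeeping involved. It is not Lucene code: the class
GroupedIdf and its methods are hypothetical names made up for illustration,
and the formula 1 + ln((docCount + 1) / (docFreq + 1)) is assumed to match
the classic TF-IDF similarity. How such values would be fed into Lucene
scoring (for example through a custom Similarity, or by overriding the
statistics the searcher reports) is deliberately left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: tracks document frequencies per (user, docType) group. */
class GroupedIdf {
    // Number of documents seen per group key, e.g. "u1/MAIL".
    private final Map<String, Integer> docCount = new HashMap<>();
    // Per group key: number of documents that contain each term.
    private final Map<String, Map<String, Integer>> docFreq = new HashMap<>();

    private static String groupKey(String user, String docType) {
        return user + "/" + docType;
    }

    /** Registers one document; a term counts once per document, however often it occurs. */
    void addDocument(String user, String docType, Iterable<String> terms) {
        String key = groupKey(user, docType);
        docCount.merge(key, 1, Integer::sum);
        Map<String, Integer> df = docFreq.computeIfAbsent(key, k -> new HashMap<>());
        Set<String> unique = new HashSet<>();
        for (String t : terms) {
            unique.add(t);
        }
        for (String t : unique) {
            df.merge(t, 1, Integer::sum);
        }
    }

    /** IDF of a term inside one group; assumed formula: 1 + ln((N + 1) / (df + 1)). */
    double idf(String user, String docType, String term) {
        String key = groupKey(user, docType);
        int n = docCount.getOrDefault(key, 0);
        Map<String, Integer> df = docFreq.get(key);
        int f = (df == null) ? 0 : df.getOrDefault(term, 0);
        return 1.0 + Math.log((n + 1.0) / (f + 1.0));
    }

    public static void main(String[] args) {
        GroupedIdf stats = new GroupedIdf();
        stats.addDocument("u1", "MAIL", Arrays.asList("invoice", "paid"));
        stats.addDocument("u1", "MAIL", Arrays.asList("meeting"));
        stats.addDocument("u2", "MAIL", Arrays.asList("invoice"));
        // The same term gets a different IDF depending on the group.
        System.out.println(stats.idf("u1", "MAIL", "invoice"));
        System.out.println(stats.idf("u2", "MAIL", "invoice"));
    }
}
```

The point of the sketch is only that the same term ends up with a different
IDF in each group, which is what a single index-wide value cannot express.
With 100-3000 users per index, the size of such a side table is the obvious
cost to weigh.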

--
Ravi

On Tue, Dec 3, 2019 at 6:30 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
dceccarelli4@bloomberg.net> wrote:

> Hi Ravi,
> Can you give more details on how you store an entity into lucene? what is
> a doc type?
> what fields do you have?
>
> Cheers
>
> From: java-user@lucene.apache.org At: 12/03/19 12:50:40 To:
> java-user@lucene.apache.org
> Subject: Multi-IDF for a single term possible?
>
> Hello,
>
> We are using TF-IDF for scoring (yet to migrate to BM25). Different
> entities (DOC_TYPES) are crunched & stored together in a single index.
>
> When it comes to IDF, I find that there is a single value computed across
> documents & stored as part of TermStats, whereas our documents are not
> homogeneous. So, a single IDF value doesn't work for us.
>
> We would like to compute IDF for each <Term/DOC_TYPE> pair, store it &
> later use the paired-IDF values during query time. Is something like this
> possible via Codecs or other mechanisms?
>
> Any help is much appreciated
>
> --
> Ravi
>
>
>
