lucene-dev mailing list archives

From "ASF subversion and git services (JIRA)" <>
Subject [jira] [Commented] (LUCENE-7854) Indexing custom term frequencies
Date Fri, 09 Jun 2017 21:53:18 GMT


ASF subversion and git services commented on LUCENE-7854:

Commit 5844ed4ac95373cbdb512e84b8ad08f78c2baf57 in lucene-solr's branch refs/heads/master
from [~thetaphi]

LUCENE-7854: Add a new DelimitedTermFrequencyTokenFilter that allows to mark tokens with a
custom term frequency
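
As a rough illustration of the idea in the commit message, the plain-Java sketch below splits a
token such as "lucene|5" into a term and a frequency. It does not use Lucene and is not the
filter's actual code: the class name DelimitedFreqSketch, the '|' delimiter, and the default
frequency of 1 for tokens without a delimiter are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; NOT Lucene's DelimitedTermFrequencyTokenFilter.
class DelimitedFreqSketch {

    /** A term together with the frequency that was attached to it. */
    static final class TermFreq {
        final String term;
        final int freq;

        TermFreq(String term, int freq) {
            this.term = term;
            this.freq = freq;
        }
    }

    /** Splits "term|freq" at the last delimiter; a token without one gets frequency 1 (assumed). */
    static TermFreq parse(String token, char delimiter) {
        int idx = token.lastIndexOf(delimiter);
        if (idx < 0) {
            return new TermFreq(token, 1);
        }
        // Integer.parseInt throws NumberFormatException if the suffix is not a number.
        int freq = Integer.parseInt(token.substring(idx + 1));
        if (freq < 1) {
            throw new IllegalArgumentException("frequency must be >= 1: " + token);
        }
        return new TermFreq(token.substring(0, idx), freq);
    }

    /** Parses every whitespace-separated token in the given text. */
    static List<TermFreq> parseAll(String text, char delimiter) {
        List<TermFreq> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(parse(token, delimiter));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (TermFreq tf : parseAll("lucene|5 search|2 index", '|')) {
            System.out.println(tf.term + " -> " + tf.freq);
        }
    }
}
```

In this sketch the frequency travels with the token itself, so no positions or payloads are
needed to carry it.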

> Indexing custom term frequencies
> --------------------------------
>                 Key: LUCENE-7854
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0)
>         Attachments: LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch
>
> When you index a field with {{IndexOptions.DOCS_AND_FREQS}}, Lucene will store just the
> docID and term frequency (how many times that term occurred in that document) for all
> documents that have a given term.
>
> We compute that term frequency by counting how many times a given token appeared in the
> field during analysis.
>
> But it can be useful, in expert use cases, to customize what Lucene stores as the term
> frequency, e.g. to hold custom scoring signals that are a function of term and document
> (this is my use case).  Users have also asked for this before, e.g. see
>
> One way to do this today is to stuff your custom data into a {{byte[]}} payload.  But
> that's quite inefficient, forcing you to index positions, and pay the overhead of
> retrieving payloads at search time.
>
> Another approach is "token stuffing": just enumerate the same token N times where N is
> the custom number you want to store, but that's also inefficient when N gets high.
>
> I think we can make this simple to do in Lucene.  I have a working version, using my
> own custom indexing chain, but the required changes are quite simple so I think we can
> add it to Lucene's default indexing chain?
>
> I created a new token attribute, {{TermDocFrequencyAttribute}}, and tweaked the indexing
> chain to use that attribute's value as the term frequency if it's present, and if the
> index options are {{DOCS_AND_FREQS}} for that field.
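
The difference between "token stuffing" and an attribute-supplied frequency, as described in
the quoted issue, can be sketched in plain Java. This is a toy model of a per-document term
frequency table, not Lucene's indexing chain: the names TermFreqChainSketch, Token, termFreqs
and stuff are hypothetical, and summing the custom values when a term repeats is an assumption
of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a per-document term-frequency table; NOT Lucene's indexing chain.
class TermFreqChainSketch {

    /** A token with an optional custom frequency; null means "count as one occurrence". */
    static final class Token {
        final String term;
        final Integer customFreq;

        Token(String term, Integer customFreq) {
            this.term = term;
            this.customFreq = customFreq;
        }
    }

    /**
     * Builds term -> frequency for one document. A token that carries a custom
     * frequency contributes that value; any other token contributes 1.
     * Repeated terms are summed (an assumption of this sketch).
     */
    static Map<String, Integer> termFreqs(Iterable<Token> tokens) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (Token t : tokens) {
            int increment = (t.customFreq != null) ? t.customFreq : 1;
            if (increment < 1) {
                throw new IllegalArgumentException("frequency must be >= 1 for term " + t.term);
            }
            int previous = freqs.getOrDefault(t.term, 0);
            // addExact fails loudly on int overflow instead of wrapping around.
            freqs.put(t.term, Math.addExact(previous, increment));
        }
        return freqs;
    }

    /** Token stuffing: emits the same term n times, so the cost grows linearly with n. */
    static List<Token> stuff(String term, int n) {
        List<Token> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(new Token(term, null));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> stuffed = stuff("signal", 1000);            // 1000 tokens to analyze
        List<Token> custom = new ArrayList<>();
        custom.add(new Token("signal", 1000));                  // a single token
        System.out.println(termFreqs(stuffed).equals(termFreqs(custom)));
    }
}
```

Both inputs produce the same frequency table in this model, but the stuffed version has to
emit and process N tokens while the attribute version processes one.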
