lucene-solr-user mailing list archives

From Ben Heuwing
Subject Re: Sort Facet Values by "Interestingness"?
Date Wed, 03 Aug 2016 15:22:10 GMT
Hi Joel,

thank you, this sounds great!

As to your first proposal: I am a bit out of my depth here, as I have 
not worked with streaming expressions so far. But I will try out your 
example using the facet() expression on a simple use case as soon as you 
publish it.

Using the TermsComponent directly, would that imply that I have to 
retrieve all possible candidates and then send them back as a 
terms.list to get their df? However, I assume that this would still be 
faster than having 2 repeated facet-calls. Or did you suggest to use the 
component in a customized RequestHandler?
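
A minimal sketch of the client-side variant asked about above, assuming the facet counts for the query and the docFreq/numDocs values have already been fetched in two separate steps (for example, facet counts first, then a TermsComponent lookup for the candidate terms). The function name, the dictionary inputs, and the smoothed idf formula are illustrative assumptions, not Solr's API or its exact scoring:

```python
import math


def rank_by_interestingness(bucket_counts, doc_freqs, num_docs):
    """Rank terms by (count in result set) * (idf in the whole collection).

    bucket_counts: term -> facet count within the current result set
    doc_freqs:     term -> document frequency in the whole index
    num_docs:      number of documents in the whole index
    All three inputs are hypothetical, pre-fetched values.
    """
    scores = {}
    for term, count in bucket_counts.items():
        df = doc_freqs.get(term, 0)
        # Smoothed idf; the +1 terms avoid division by zero. This is one
        # common variant, chosen here as an assumption.
        idf = math.log((num_docs + 1) / (df + 1)) + 1.0
        scores[term] = count * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up numbers: "the" is frequent everywhere, "blood" mostly in the result set.
ranked = rank_by_interestingness(
    {"blood": 40, "the": 90},
    {"blood": 50, "the": 9000},
    10000,
)
```

With these made-up counts, "blood" ranks above "the" even though "the" has the higher raw facet count, which is the effect described in the original question.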



On 03.08.2016 at 14:57, Joel Bernstein wrote:
> Also the TermsComponent now can export the docFreq for a list of terms and
> the numDocs for the index. This can be used as a general purpose mechanism
> for scoring facets with a callback.
> Joel Bernstein
> On Wed, Aug 3, 2016 at 8:52 AM, Joel Bernstein<>  wrote:
>> What you're describing is implemented with Graph aggregations in this
>> ticket using tf-idf. Other scoring methods can be implemented as well.
>> I'll update this thread with a description of how this can be used with
>> the facet() streaming expression as well as with graph queries later today.
>> Joel Bernstein
>> On Wed, Aug 3, 2016 at 8:18 AM,<>  wrote:
>>> Dear everybody,
>>> as the JSON-API now makes configuration of facets and sub-facets easier,
>>> there appears to be a lot of potential to enable instant calculation of
>>> facet-recommendations for a query, that is, to sort facets by their
>>> relative importance/interestingness/significance for a current query relative
>>> to the complete collection or relative to a result set defined by a
>>> different query.
>>> An example would be to show the most typical terms which are used in
>>> descriptions of horror-movies, in contrast to the most popular ones for
>>> this query, as these may include terms that occur as often in other genres.
>>> This feature has been discussed earlier in the context of solr:
>>> *
>>> *
>>> In elasticsearch, the specific feature that I am looking for is called
>>> Significant Terms Aggregation:
>>> As of now, I have two questions:
>>> a) Are there workarounds in the current solr-implementation or known
>>> patches that implement such a sort-option for fields with a large number of
>>> possible values, e.g. text-fields? (for smaller vocabularies it is easy to
>>> do this client-side with two queries)
>>> b) Are there plans to implement this in facet.pivot or in the
>>> facet.json-API?
>>> The first step could be to define "interestingness" as a sort-option for
>>> facets and to define interestingness as facet-count in the result-set as
>>> compared to the complete collection: documentfrequency_termX(bucket) *
>>> inverse_documentfrequency_termX(collection)
>>> As an extension, the JSON-API could be used to change the domain used as
>>> base for the comparison. Another interesting option would be to compare
>>> facet-counts against a current parent-facet for nested facets, e.g. the 5
>>> most interesting terms by genre for a query on 70s movies, returning the
>>> terms specific to horror, comedy, action etc. compared to all terminology
>>> at the time (i.e. in the parent-query).
>>> A call-back-function could be used to define other measures of
>>> interestingness such as the log-likelihood-ratio. Most
>>> measures need at least the following 4 values: document-frequency for a
>>> term for the result-set, document-frequency for the result-set,
>>> document-frequency for a term in the index (or base-domain),
>>> document-frequency in the index (or base-domain).
>>> I guess, this feature might be of interest for those who want to do some
>>> small-scale term-analysis in addition to search, e.g. as in my case in
>>> digital humanities projects. But it might also be an interesting navigation
>>> device, e.g. when searching on job-offers to show the skills that are most
>>> distinctive for a category.
>>> It would be great to know, if others are interested in this feature. If
>>> there are any implementations out there or if anybody else is working on
>>> this, a pointer would be a great start. In the absence of existing
>>> solutions: Perhaps somebody has some idea on where and how to start
>>> implementing this?
>>> Best regards,
>>> Ben
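
The four values listed in the original message are enough to compute a log-likelihood ratio on the client. The following is a minimal sketch, assuming the result set is a subset of the index (or base domain) and using the standard G² statistic over a 2x2 contingency table; the function and parameter names are hypothetical:

```python
import math


def log_likelihood_ratio(term_in_subset, subset_size, term_in_index, index_size):
    """G^2 statistic for a term, from the four document frequencies.

    term_in_subset: documents in the result set containing the term
    subset_size:    documents in the result set
    term_in_index:  documents in the index (or base domain) containing the term
    index_size:     documents in the index (or base domain)
    Assumes the result set is contained in the index.
    """
    a = term_in_subset                    # term present, inside result set
    b = term_in_index - a                 # term present, outside result set
    c = subset_size - a                   # term absent, inside result set
    d = index_size - subset_size - b      # term absent, outside result set
    n = index_size

    g2 = 0.0
    cells = (
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    )
    for observed, row_total, col_total in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            expected = row_total * col_total / n
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Made-up counts: a term spread evenly vs. a term concentrated in the result set.
even = log_likelihood_ratio(10, 100, 100, 1000)
skewed = log_likelihood_ratio(40, 100, 50, 1000)
```

A term that is spread evenly across the index scores close to zero, while a term concentrated in the result set scores high. Note that G² is symmetric, so a term that is unusually rare in the result set also scores high; a caller who only wants over-represented terms would need an additional direction check.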


Ben Heuwing, Dr. phil.
Research Associate
Institut für Informationswissenschaft und Sprachtechnologie
Universität Hildesheim

Universitätsplatz 1
D-31141 Hildesheim

Lübeckerstraße 3
Raum L017

+49(0)5121 883-30316

Dissertation published: /Usability-Ergebnisse als 
Wissensressource in Organisationen/
