lucene-solr-user mailing list archives

From Ben Heuwing <heuw...@uni-hildesheim.de>
Subject Re: Sort Facet Values by "Interestingness"?
Date Wed, 03 Aug 2016 15:22:10 GMT
Hi Joel,

thank you, this sounds great!

As to your first proposal: I am a bit out of my depth here, as I have 
not worked with streaming expressions so far. But I will try out your 
example using the facet() expression on a simple use case as soon as you 
publish it.

If I used the TermsComponent directly, would that mean that I have to
retrieve all possible candidates first and then send them back as a
terms.list to get their df? I assume that this would still be faster
than having two repeated facet calls. Or were you suggesting using the
component in a customized RequestHandler?
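
To make the two-step idea concrete, here is a minimal sketch rather than a tested integration. It assumes the terms.list and terms.stats parameters described in SOLR-9243 and a hypothetical field named "description". The helper names, the smoothed idf, and all counts are made up for illustration. Step one takes the facet counts from the query, step two asks for the docFreq of those candidates and numDocs, and the scoring then happens client-side.

```python
import math
from urllib.parse import urlencode


def terms_params(field, candidates):
    """Build the query string for a TermsComponent request that asks for the
    docFreq of each candidate term plus the index-wide numDocs.
    Assumes the terms.list / terms.stats parameters from SOLR-9243."""
    return urlencode({
        "terms": "true",
        "terms.fl": field,
        "terms.list": ",".join(candidates),
        "terms.stats": "true",
        "wt": "json",
    })


def rank_by_tfidf(bucket_counts, doc_freqs, num_docs):
    """Score each term as (count in result set) * (idf in collection) and
    return (term, score) pairs, highest score first."""
    scored = []
    for term, count in bucket_counts.items():
        df = doc_freqs.get(term, 0)
        idf = math.log((num_docs + 1) / (df + 1))  # smoothed idf, an assumption
        scored.append((term, count * idf))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative numbers only: facet counts from the query, docFreq from /terms.
facet_counts = {"blood": 40, "the": 95, "haunted": 30}
collection_df = {"blood": 800, "the": 9900, "haunted": 60}
ranking = rank_by_tfidf(facet_counts, collection_df, num_docs=10000)

print(terms_params("description", list(facet_counts)))
print([term for term, _ in ranking])
```

With these made-up counts the very frequent term "the" drops to the bottom although it has the highest facet count, which is the behaviour I am after.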

Regards,

Ben

On 03.08.2016 at 14:57, Joel Bernstein wrote:
> Also the TermsComponent now can export the docFreq for a list of terms and
> the numDocs for the index. This can be used as a general purpose mechanism
> for scoring facets with a callback.
>
> https://issues.apache.org/jira/browse/SOLR-9243
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Aug 3, 2016 at 8:52 AM, Joel Bernstein <joelsolr@gmail.com> wrote:
>
>> What you're describing is implemented with Graph aggregations in this
>> ticket using tf-idf. Other scoring methods can be implemented as well.
>>
>> https://issues.apache.org/jira/browse/SOLR-9193
>>
>> I'll update this thread with a description of how this can be used with
>> the facet() streaming expression as well as with graph queries later today.
>>
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Wed, Aug 3, 2016 at 8:18 AM, <heuwing@uni-hildesheim.de> wrote:
>>
>>> Dear everybody,
>>>
>>> as the JSON-API now makes configuration of facets and sub-facets easier,
>>> there appears to be a lot of potential to enable instant calculation of
>>> facet-recommendations for a query, that is, to sort facets by their
>>> relative importance/interestingness/significance for a current query relative
>>> to the complete collection or relative to a result set defined by a
>>> different query.
>>>
>>> An example would be to show the most typical terms which are used in
>>> descriptions of horror-movies, in contrast to the most popular ones for
>>> this query, as these may include terms that occur as often in other genres.
>>>
>>> This feature has been discussed earlier in the context of Solr:
>>> * http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
>>> * http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html
>>>
>>> In Elasticsearch, the specific feature that I am looking for is called
>>> Significant Terms Aggregation:
>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation
>>>
>>> As of now, I have two questions:
>>>
>>> a) Are there workarounds in the current solr-implementation or known
>>> patches that implement such a sort-option for fields with a large number of
>>> possible values, e.g. text-fields? (for smaller vocabularies it is easy to
>>> do this client-side with two queries)
>>> b) Are there plans to implement this in facet.pivot or in the
>>> facet.json-API?
>>>
>>> The first step could be to define "interestingness" as a sort-option for
>>> facets and to define interestingness as facet-count in the result-set as
>>> compared to the complete collection: documentfrequency_termX(bucket) *
>>> inverse_documentfrequency_termX(collection)
>>>
>>> As an extension, the JSON-API could be used to change the domain used as
>>> base for the comparison. Another interesting option would be to compare
>>> facet-counts against a current parent-facet for nested facets, e.g. the 5
>>> most interesting terms by genre for a query on 70s movies, returning the
>>> terms specific to horror, comedy, action etc. compared to all terminology
>>> at the time (i.e. in the parent-query).
>>>
>>> A call-back-function could be used to define other measures of
>>> interestingness such as the log-likelihood-ratio (
>>> http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). Most
>>> measures need at least the following 4 values: the document frequency of
>>> the term in the result set, the number of documents in the result set,
>>> the document frequency of the term in the index (or base domain), and
>>> the number of documents in the index (or base domain).
>>>
>>> I guess this feature might be of interest for those who want to do some
>>> small-scale term-analysis in addition to search, e.g. as in my case in
>>> digital humanities projects. But it might also be an interesting navigation
>>> device, e.g. when searching on job-offers to show the skills that are most
>>> distinctive for a category.
>>>
>>> It would be great to know if others are interested in this feature. If
>>> there are any implementations out there or if anybody else is working on
>>> this, a pointer would be a great start. In the absence of existing
>>> solutions: Perhaps somebody has some idea on where and how to start
>>> implementing this?
>>>
>>> Best regards,
>>>
>>> Ben
>>>
>>>
>>>
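
As a rough illustration of the log-likelihood-ratio measure mentioned in the original post above, the following sketch computes a Dunning-style G^2 statistic from the four values listed there. The function and parameter names are hypothetical, the numbers are invented, and the sketch assumes that the result set is a subset of the index (or base domain).

```python
import math


def log_likelihood_ratio(df_term_result, n_result, df_term_index, n_index):
    """G^2 statistic for the 2x2 table built from the four counts.
    Rows: documents with / without the term.
    Columns: documents inside / outside the result set."""
    observed = [
        [df_term_result, df_term_index - df_term_result],
        [n_result - df_term_result,
         (n_index - n_result) - (df_term_index - df_term_result)],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    g2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / n_index
                g2 += count * math.log(count / expected)
    return 2.0 * g2


# Invented example: 100 matching documents in an index of 1000 documents,
# and a term that occurs in 100 documents of the whole index.
print(log_likelihood_ratio(10, 100, 100, 1000))  # term spread evenly
print(log_likelihood_ratio(50, 100, 100, 1000))  # term concentrated in results
```

Note that the statistic is two-sided: a term that is strongly under-represented in the result set also scores high, so a ranking of "interesting" terms would additionally have to check that the term's share in the result set exceeds its share in the index.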

-- 

Ben Heuwing, Dr. phil.
Research Associate
Institut für Informationswissenschaft und Sprachtechnologie
Universität Hildesheim

Postal address:
Universitätsplatz 1
D-31141 Hildesheim


Office:
Lübeckerstraße 3
Room L017

+49(0)5121 883-30316
heuwing@uni-hildesheim.de
Homepage 
<https://www.uni-hildesheim.de/fb3/institute/iwist/mitglieder/heuwing/>

Dissertation published: /Usability-Ergebnisse als
Wissensressource in Organisationen/ - Print
<http://www.vwh-verlag.de/vwh/?p=995> | Online
<http://nbn-resolving.de/urn:nbn:de:gbv:hil2-opus4-3914>

