lucene-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] [Closed] (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema
Date Fri, 07 Aug 2015 11:42:45 GMT

     [ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl closed SOLR-1599.
-----------------------------
    Resolution: Won't Fix

Closing a very old issue that no longer appears to be a real problem. Please feel free to
re-open should anyone feel this issue needs a resolution.

With SolrCloud it is a no-brainer to create multiple collections for this particular use case.
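
As a rough illustration of the multiple-collections approach, the sketch below builds Collections API CREATE requests for one collection per entity type, both pointing at the same configset so the schema is defined only once. The host, the configset name "music", and the collection names "tracks" and "artists" are assumptions made for this example and do not come from the issue; the script only prints the URLs and sends nothing.

```python
from urllib.parse import urlencode

# Assumptions for illustration only: a local SolrCloud node and a configset
# named "music" that has already been uploaded to ZooKeeper.
SOLR_BASE = "http://localhost:8983/solr"
SHARED_CONFIG = "music"


def create_collection_url(name, num_shards=1, replication_factor=1):
    """Build a Collections API CREATE URL for one entity-type collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        # Both collections reference the same configset, so schema.xml and
        # solrconfig.xml are not copied per entity type.
        "collection.configName": SHARED_CONFIG,
    }
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


# One collection per entity type; each keeps its own term statistics.
urls = [create_collection_url(name) for name in ("tracks", "artists")]
for url in urls:
    print(url)
```

A "search for tracks" request would then go to the tracks collection only, so its document frequencies never include artist documents.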

> Improve IDF and relevance by separately indexing different entity types sharing a common
schema
> -----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1599
>                 URL: https://issues.apache.org/jira/browse/SOLR-1599
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Graham P
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the documents
in an index.  This introduces relevance problems when using a single schema to store multiple
entity types, for example to support "search for tracks" and "search for artists".   The ranking
for search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF
for the name field does not include counts from _artist_ entities.  The effect on ranking
would be most pronounced for query terms that have a low document frequency for _track_ entities
but a high frequency for _artist_ entities, or vice versa.
> The current work-around to make the IDF be entity-specific is to use a separate Solr
core for each entity type sharing the schema - and repeating the process of copying solrconfig.xml
and schema.xml to all the cores.  This would be more complicated with replication, and more
so with sharding, to maintain a core for _artists_ and a core for _tracks_ on each node.
> David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where
he suggests calculating _numDocs_ after the application of filters.  He recognises however
that the document frequency (DF_t) for each query term in a _track_ search would also need
to exclude _artist_ entities from the DF_t total to get the correct IDF_t=log(N/DF_t).   DF_t
must be calculated at index time, when Solr does not know what filters will be applied.
> I suggest having a metadata field _entitytype_ specified on submitting a batch of documents.
The schema would then specify a list of allowed entity types and a default entity type. For
example, a document could say either entitytype="track" or entitytype="artist".  Each entity
type has an independent set of document frequencies, so the term "foo" will have a DF for
entitytype="artist" and a different DF for entitytype="track".   This might be implemented
by instantiating a separate Lucene index for each configured entity type.  Filtering on entitytype="artist"
would be implemented by searching only the _artist_ index, analogous to searching only on
the _artist_ core in the multi-core workaround.
> With this solution (entity type metadata field implemented with separate Lucene indices)
a single Solr core can support many different entity types that share a common schema but
use partially overlapping subsets of fields, instead of configuring, replicating and sharding
a Solr core for every entity type.
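
To make the effect described in the issue concrete, here is a minimal sketch that keeps a separate document count and set of document frequencies per entitytype and compares the resulting IDF values. It uses the plain formula from the description, IDF_t = log(N/DF_t); Lucene's own similarity uses a smoothed variant, so the absolute numbers are illustrative only. The class name EntityTypeStats, the whitespace tokenizer, and the sample names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class EntityTypeStats:
    """Hypothetical helper: per-entity-type numDocs and document frequencies."""

    def __init__(self):
        self.num_docs = Counter()             # entitytype -> N
        self.doc_freq = defaultdict(Counter)  # entitytype -> term -> DF_t

    def add(self, entitytype, name):
        self.num_docs[entitytype] += 1
        # Count each term once per document (document frequency, not term frequency).
        for term in set(name.lower().split()):
            self.doc_freq[entitytype][term] += 1

    def idf(self, term, entitytype=None):
        """IDF_t = log(N / DF_t), over one entity type or over all documents."""
        if entitytype is None:
            n = sum(self.num_docs.values())
            df = sum(freqs[term] for freqs in self.doc_freq.values())
        else:
            n = self.num_docs[entitytype]
            df = self.doc_freq[entitytype][term]
        return math.log(n / df) if df else None  # undefined for unseen terms


stats = EntityTypeStats()
for name in ["Prince of Darkness", "Blue Moon", "Midnight Train", "Love Song"]:
    stats.add("track", name)
for name in ["Prince", "Prince Royce", "The Prince Band", "Moon Duo"]:
    stats.add("artist", name)

combined = stats.idf("prince")            # single shared index: log(8/4)
track_only = stats.idf("prince", "track")    # tracks only: log(4/1)
artist_only = stats.idf("prince", "artist")  # artists only: log(4/3)

# Adjusting only numDocs after filtering, while keeping the global DF,
# gives log(4/4): the term looks worthless in a track search.
global_df = sum(freqs["prince"] for freqs in stats.doc_freq.values())
filtered_n_only = math.log(stats.num_docs["track"] / global_df)

print(round(combined, 3), round(track_only, 3), round(artist_only, 3))
# prints: 0.693 1.386 0.288
print(filtered_n_only)
# prints: 0.0
```

In this toy data "prince" is rare among tracks but common among artists, so a shared index halves its weight in a track search, and correcting N without correcting DF_t makes the result worse rather than better.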



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
