lucene-dev mailing list archives

From "Alexander S. (JIRA)" <>
Subject [jira] [Commented] (SOLR-4787) Join Contrib
Date Mon, 17 Mar 2014 12:35:47 GMT


Alexander S. commented on SOLR-4787:

Nvm, there were 3 missing "}" at the end of, the build was successful,
testing now.

> Join Contrib
> ------------
>                 Key: SOLR-4787
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 4.2.1
>            Reporter: Joel Bernstein
>            Priority: Minor
>             Fix For: 4.8
>         Attachments: SOLR-4787-deadlock-fix.patch, SOLR-4787-pjoin-long-keys.patch, SOLR-4787.patch,
SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch,
SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch,
SOLR-4787.patch, SOLR-4797-hjoin-multivaluekeys-nestedJoins.patch, SOLR-4797-hjoin-multivaluekeys-trunk.patch
> This contrib provides a place where different join implementations can be contributed
to Solr. This contrib currently includes 3 join implementations. The initial patch was generated
from the Solr 4.3 tag. Because of changes in the FieldCache API this patch will only build
with Solr 4.2 or above.
> *HashSetJoinQParserPlugin aka hjoin*
> The hjoin provides a join implementation that filters results in one core based on the
results of a search in another core. This is similar in functionality to the JoinQParserPlugin
but the implementation differs in a couple of important ways.
> The first way is that the hjoin is designed to work with int and long join keys only.
So, in order to use hjoin, int or long join keys must be included in both the to and from
cores.
> The second difference is that the hjoin builds memory structures that are used to quickly
connect the join keys. So, the hjoin will need more memory than the JoinQParserPlugin to perform
the join.
> The main advantage of the hjoin is that it can scale to join millions of keys between
cores and provide sub-second response time. The hjoin should work well with up to two million
results from the fromIndex and tens of millions of results from the main query.
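> As a rough illustration of the idea described above (not the contrib's actual Java code), the sketch below collects the join keys from the "from" core results into a hash set and keeps only the main-query documents whose "to" key is in that set. The function names and the sample documents are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set


def build_from_keys(from_docs: Iterable[Dict[str, Any]], from_field: str) -> Set[int]:
    """Collect the integer join keys from the documents matched in the "from" core."""
    return {doc[from_field] for doc in from_docs if from_field in doc}


def hash_join(main_docs: Iterable[Dict[str, Any]], to_field: str,
              from_keys: Set[int]) -> List[Dict[str, Any]]:
    """Keep only main-query documents whose "to" key appears in the from-key set."""
    return [doc for doc in main_docs if doc.get(to_field) in from_keys]


# Hypothetical sample data standing in for the results of two cores.
from_core_hits = [{"id_i": 1}, {"id_i": 3}, {"id_i": 5}]
main_core_hits = [
    {"id_i": 1, "doc": 100},
    {"id_i": 2, "doc": 101},
    {"id_i": 5, "doc": 102},
]

keys = build_from_keys(from_core_hits, "id_i")
joined = hash_join(main_core_hits, "id_i", keys)
```

> The set is the extra memory structure mentioned above: it costs memory in proportion to the number of from-side keys, in exchange for constant-time membership checks on the main-query side.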
> The hjoin supports the following features:
> 1) Both lucene query and PostFilter implementations. A *"cost"* > 99 will turn on
the PostFilter. The PostFilter will typically outperform the Lucene query when the main query
results have been narrowed down.
> 2) With the lucene query implementation there is an option to build the filter with threads.
This can greatly improve the performance of the query if the main query index is very large.
The "threads" parameter turns on threading. For example *threads=6* will use 6 threads to
build the filter. This will set up a fixed threadpool with six threads to handle all hjoin
requests. Once the threadpool is created the hjoin will always use it to build the filter.
Threading does not come into play with the PostFilter.
> 3) The *size* local parameter can be used to set the initial size of the hashset used
to perform the join. If this is set above the number of results from the fromIndex then
you can avoid hashset resizing, which improves performance.
> 4) Nested filter queries. The local parameter "fq" can be used to nest a filter query
within the join. The nested fq will filter the results of the join query. This can point to
another join to support nested joins.
> 5) Full caching support for the lucene query implementation. The filterCache and queryResultCache
should work properly even with deep nesting of joins. Only the queryResultCache comes into
play with the PostFilter implementation because PostFilters are not cacheable in the filterCache.
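> The threaded filter build in feature 2 can be pictured roughly as follows. This is a sketch under assumptions: the shared pool, the slicing scheme and all names are hypothetical, and it shows only how the work might be partitioned, not how the contrib implements it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Set

# Assumption: one fixed pool, created once and reused by every join request.
_POOL = ThreadPoolExecutor(max_workers=6)


def _match_slice(to_keys: Sequence[int], start: int, end: int,
                 from_keys: Set[int]) -> List[int]:
    """Return the doc ids in [start, end) whose join key is in from_keys."""
    return [doc_id for doc_id in range(start, end) if to_keys[doc_id] in from_keys]


def build_filter(to_keys: Sequence[int], from_keys: Set[int], threads: int = 6) -> List[int]:
    """Split the doc id range into up to `threads` slices and match each slice in the pool."""
    n = len(to_keys)
    step = max(1, -(-n // threads))  # ceiling division, at least 1
    futures = [
        _POOL.submit(_match_slice, to_keys, start, min(start + step, n), from_keys)
        for start in range(0, n, step)
    ]
    matched: List[int] = []
    for future in futures:  # results are gathered in slice order, so doc ids stay sorted
        matched.extend(future.result())
    return matched


# Hypothetical data: to_keys[doc_id] is the "to" join key of that main-index document.
filter_doc_ids = build_filter([10, 20, 30, 40, 50, 60, 70], {20, 50, 70})
```

> Each slice only reads the shared key set, so the slices need no coordination until their results are merged.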
> The syntax of the hjoin is similar to the JoinQParserPlugin except that the plugin is
referenced by the string "hjoin" rather than "join".
> fq={!hjoin fromIndex=collection2 from=id_i to=id_i threads=6 fq=$qq}user:customer1&qq=group:5
> The example filter query above will search the fromIndex (collection2) for "user:customer1"
applying the local fq parameter to filter the results. The lucene filter query will be built
using 6 threads. This query will generate a list of values from the "from" field that will
be used to filter the main query. Only records from the main query whose "to" field value is
present in the "from" list will be included in the results.
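> When sending this filter query from a client, the local-params string has to be URL-encoded. A minimal sketch using only the Python standard library is shown below. The braces are written unescaped, as Solr local-params syntax expects, and the "q" value is an assumed main query that is not part of the example above.

```python
from urllib.parse import parse_qs, urlencode

# $qq dereferences the separate "qq" request parameter as the nested filter query.
params = {
    "q": "*:*",  # assumption: match-all main query, chosen only for illustration
    "fq": "{!hjoin fromIndex=collection2 from=id_i to=id_i threads=6 fq=$qq}user:customer1",
    "qq": "group:5",
}

# urlencode percent-encodes the braces, "!", "$" and ":" so the string is safe in a URL.
query_string = urlencode(params)

# Decoding gives back the original parameter values.
decoded = parse_qs(query_string)
```

> The resulting query_string would be appended to the select handler URL of the main query core.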
> The solrconfig.xml in the main query core must contain the reference to the hjoin.
> <queryParser name="hjoin" class="org.apache.solr.joins.HashSetJoinQParserPlugin"/>
> And the join contrib lib jars must be registered in the solrconfig.xml.
>  <lib dir="../../../contrib/joins/lib" regex=".*\.jar" />
> After issuing the "ant dist" command from inside the solr directory the joins contrib
jar will appear in the solr/dist directory. Place the solr-joins-4.*-.jar in the WEB-INF/lib
directory of the solr web application. This will ensure that the top level Solr classloader
loads these classes rather than the core's classloader.
> *BitSetJoinQParserPlugin aka bjoin*
> The bjoin behaves exactly like the hjoin but uses a BitSet instead of a HashSet to perform
the underlying join. Because of this the bjoin is much faster and can provide sub-second response
times on result sets of tens of millions of records from the fromIndex and hundreds of millions
of records from the main query.
> But there are limitations to how the bjoin can be used. The bjoin treats the join keys
as addresses in a BitSet and uses the Lucene OpenBitSet implementation which performs very
well but is not sparse. So the BitSet memory is dictated by the size of the join keys. For
example a bitset with a max join key of 200,000,000 will need 25 MB of memory. For this reason
the BitSet join does not support long join keys. In order to keep memory usage down the join
keys should also be packed at the low end, for example from 1 to 50,000,000. 
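> The memory figure above follows from one bit per possible key value. The sketch below works through that arithmetic and shows a minimal dense bitset join. The DenseBitSet class is a hypothetical stand-in written for illustration, not Lucene's OpenBitSet.

```python
def bitset_bytes(max_key: int) -> int:
    """Bytes needed for a dense bitset that can address keys 0..max_key."""
    return (max_key + 8) // 8  # same as ceil((max_key + 1) / 8)


class DenseBitSet:
    """Minimal dense bitset over a bytearray (illustrative, not OpenBitSet)."""

    def __init__(self, max_key: int) -> None:
        self.bits = bytearray(bitset_bytes(max_key))

    def set(self, key: int) -> None:
        self.bits[key >> 3] |= 1 << (key & 7)

    def get(self, key: int) -> bool:
        index = key >> 3
        # Keys beyond the allocated range are simply absent.
        return index < len(self.bits) and bool(self.bits[index] & (1 << (key & 7)))


# A max join key of 200,000,000 needs about 25 MB regardless of how many keys are set.
megabytes = bitset_bytes(200_000_000) / 1_000_000

# Hypothetical join: mark the from-side keys, then test the main-side keys.
bitset = DenseBitSet(50_000)
for key in (3, 50_000):
    bitset.set(key)
matched = [key for key in (1, 3, 50_000, 60_000) if bitset.get(key)]
```

> The same arithmetic shows why long keys are a poor fit: a single key near 2^40 would require roughly 137 GB for the bitset alone.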
> Below is a sample bjoin:
> fq={!bjoin fromIndex=collection2 from=id_i to=id_i threads=6 fq=$qq}user:customer1&qq=group:5
> To register the bjoin the solrconfig.xml in the main query core must contain the reference
to the bjoin.
> <queryParser name="bjoin" class="org.apache.solr.joins.BitSetJoinQParserPlugin"/>
> *ValueSourceJoinParserPlugin aka vjoin*
> The second implementation is the ValueSourceJoinParserPlugin aka "vjoin". This implements
a ValueSource function query that can return a value from a second core based on join keys
and a limiting query. The limiting query can be used to select a specific subset of data from
the join core. This allows customer-specific relevance data to be stored in a separate core
and then joined in the main query.
> The vjoin is called using the "vjoin" function query. For example:
> bf=vjoin(fromCore, fromKey, fromVal, toKey, query)
> This example shows "vjoin" being called by the edismax boost function parameter. This
example will return the "fromVal" from the "fromCore". The "fromKey" and "toKey" are used
to link the records from the main query to the records in the "fromCore". The "query" is used
to select a specific set of records to join with in fromCore.
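> Conceptually, the five arguments map onto a keyed lookup. The sketch below is an illustration only: the field names, the sample data and the default of 0.0 for unmatched keys are assumptions, not behaviour confirmed for the plugin.

```python
from typing import Any, Callable, Dict, Iterable

Doc = Dict[str, Any]


def build_join_map(from_docs: Iterable[Doc], from_key: str, from_val: str,
                   query: Callable[[Doc], bool]) -> Dict[int, float]:
    """Map fromKey -> fromVal for the from-core documents selected by the limiting query."""
    return {doc[from_key]: doc[from_val] for doc in from_docs if query(doc)}


def vjoin_value(main_doc: Doc, to_key: str, join_map: Dict[int, float],
                default: float = 0.0) -> float:
    """Value for one main-query document; `default` for unmatched keys is an assumption."""
    return join_map.get(main_doc[to_key], default)


# Hypothetical from-core holding per-customer relevance data.
from_core = [
    {"key_l": 1, "boost_f": 2.5, "customer": "customer1"},
    {"key_l": 2, "boost_f": 9.0, "customer": "customer2"},
    {"key_l": 3, "boost_f": 1.5, "customer": "customer1"},
]

join_map = build_join_map(from_core, "key_l", "boost_f",
                          lambda doc: doc["customer"] == "customer1")
values = [vjoin_value(doc, "id_l", join_map) for doc in ({"id_l": 1}, {"id_l": 2}, {"id_l": 3})]
```

> Here the limiting query keeps only the records for customer1, so the document with key 2 gets no joined value even though a record with that key exists in the from-core.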
> Currently the fromKey and toKey must be longs but this will change in future versions.
Like the pjoin, the "join" SolrCache is used to hold the join memory structures.
> To configure the vjoin you must register the ValueSource plugin in the solrconfig.xml
as follows:
> <valueSourceParser name="vjoin" class="org.apache.solr.joins.ValueSourceJoinParserPlugin" />

This message was sent by Atlassian JIRA
