lucene-dev mailing list archives

From "Joel Bernstein (JIRA)" <>
Subject [jira] [Assigned] (SOLR-6568) Join Discovery Contrib
Date Fri, 26 Sep 2014 15:28:33 GMT


Joel Bernstein reassigned SOLR-6568:

    Assignee: Joel Bernstein

> Join Discovery Contrib
> ----------------------
>                 Key: SOLR-6568
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>            Priority: Minor
>             Fix For: 5.0
> This contribution was commissioned by the *NCBI* (National Center for Biotechnology Information).

> The Join Discovery Contrib is a set of Solr plugins that support large-scale joins and
"join facets" between Solr cores.
> There are two different join implementations included in this contribution. Both implementations
are designed to work with integer join keys. It is very common in large bioinformatic and
genomic databases to use integer primary and foreign keys. Integer keys allow bioinformatic
and genomic search engines and discovery tools to perform complex operations on large data
sets very efficiently.
> The Join Discovery Contrib provides features that will be applicable to anyone working
with the freely available databases from the NCBI and likely a large number of other bioinformatic
and genomic databases. These features are not specific to bioinformatics and genomics, though;
they can be used with any dataset where integer keys are used to define the primary and
foreign keys.
> What is included in this contrib:
> 1) A new JoinComponent. This component is used instead of the standard QueryComponent.
It facilitates very large-scale relational joins between two Solr indexes (cores). The join
algorithm used in this component is known as a *parallel partitioned merge join*. The
algorithm partitions the results from both sides of the join, then sorts and merges
the partitions in parallel.
>  Below are some of its features:
> * Sub-second performance on very large joins. The parallel join algorithm is capable
of sub-second performance on joins with tens of millions of records on both sides of the join.
> * The JoinComponent returns "tuples" with fields from both sides of the join. The initial
release returns the primary keys from both sides of the join and the join key. 
> * The tuples also include, and are ranked by, a combined score from both sides of the join.
> * Special-purpose memory-mapped on-disk indexes to support *:* joins. This makes it
possible to join an entire index with a subset of another index with sub-second performance.

> * Support for very fast one-to-one, one-to-many and many-to-many joins. Fast many-to-many
joins make it possible to join between indexes on multi-valued fields.
> 2) A new JoinFacetComponent. This component provides facets for both indexes involved
in the join. 
> 3) The BitSetJoinQParserPlugin. A very fast parallel filter join based on bitsets that
supports arbitrarily deep nesting. It can be used as a filter query in combination with
the JoinComponent or with the standard query component.
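
To illustrate the partitioned merge join described under item 1, here is a minimal
Java sketch. It is not the contrib's code: the class name PartitionedMergeJoin, the
row layout {joinKey, primaryKey}, and the hash partitioning by key are all assumptions
made for illustration. It only shows the general shape of the technique: partition both
sides by join key, then sort and merge each pair of partitions in parallel.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical illustration only; not the JoinComponent implementation.
public class PartitionedMergeJoin {

    private static final Comparator<int[]> BY_KEY = (a, b) -> Integer.compare(a[0], b[0]);

    // Each row is {joinKey, primaryKey}. Rows with equal keys land in the same partition.
    static List<List<int[]>> partition(List<int[]> rows, int partitions) {
        List<List<int[]>> out = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            out.add(new ArrayList<>());
        }
        for (int[] row : rows) {
            out.get(Math.floorMod(row[0], partitions)).add(row);
        }
        return out;
    }

    // Sort both partitions by join key, then merge, emitting {joinKey, leftPk, rightPk}
    // for every matching pair (so many-to-many keys yield the full cross product).
    static List<int[]> sortMerge(List<int[]> left, List<int[]> right) {
        left.sort(BY_KEY);
        right.sort(BY_KEY);
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int lk = left.get(i)[0];
            int rk = right.get(j)[0];
            if (lk < rk) {
                i++;
            } else if (lk > rk) {
                j++;
            } else {
                // Find the run of right rows sharing this key.
                int jEnd = j;
                while (jEnd < right.size() && right.get(jEnd)[0] == lk) {
                    jEnd++;
                }
                // Pair every left row in the run with every right row in the run.
                while (i < left.size() && left.get(i)[0] == lk) {
                    for (int k = j; k < jEnd; k++) {
                        out.add(new int[] {lk, left.get(i)[1], right.get(k)[1]});
                    }
                    i++;
                }
                j = jEnd;
            }
        }
        return out;
    }

    // Partition both sides the same way, then sort-merge matching partitions in parallel.
    public static List<int[]> join(List<int[]> left, List<int[]> right, int partitions) {
        List<List<int[]>> lp = partition(left, partitions);
        List<List<int[]>> rp = partition(right, partitions);
        return IntStream.range(0, partitions)
                .parallel()
                .mapToObj(p -> sortMerge(lp.get(p), rp.get(p)))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }
}
```

Because both sides are partitioned by the same function of the join key, matching rows
always fall into the same partition, so the partitions can be processed independently.
Scoring, ranking, and the memory-mapped indexes mentioned above are left out of this sketch.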

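The bitset-based filter join from item 3 can be sketched in a similar way. Again, this
is an assumption-laden illustration rather than the BitSetJoinQParserPlugin itself: the
class BitSetFilterJoin and its methods are hypothetical, the sketch is sequential rather
than parallel, and it represents each index as a plain array mapping document id to a
non-negative integer join key.

```java
import java.util.BitSet;

// Hypothetical illustration only; not the BitSetJoinQParserPlugin implementation.
public class BitSetFilterJoin {

    // Mark every join key found in the matching "from" documents.
    // Keys must be non-negative, since BitSet rejects negative indexes.
    static BitSet collectKeys(int[] fromKeys) {
        BitSet keys = new BitSet();
        for (int key : fromKeys) {
            keys.set(key);
        }
        return keys;
    }

    // Select the "to" documents whose join key is present in the key set.
    // toKeyByDocId[doc] is the join key stored for document id doc.
    static BitSet filter(int[] toKeyByDocId, BitSet keys) {
        BitSet docs = new BitSet(toKeyByDocId.length);
        for (int doc = 0; doc < toKeyByDocId.length; doc++) {
            if (keys.get(toKeyByDocId[doc])) {
                docs.set(doc);
            }
        }
        return docs;
    }

    // Collect the keys of the selected documents so the result can feed another join.
    static BitSet keysOf(BitSet docs, int[] keyByDocId) {
        BitSet keys = new BitSet();
        for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
            keys.set(keyByDocId[doc]);
        }
        return keys;
    }
}
```

Nesting then amounts to chaining: the documents selected by one filter are turned back
into a key set with keysOf, which becomes the input to the next filter call.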
This message was sent by Atlassian JIRA
