lucene-solr-user mailing list archives
From "Dyer, James" <James.D...@ingramcontent.com>
Subject RE: DIH nested entities don't work
Date Fri, 16 Nov 2012 17:28:42 GMT
Maarten,

Here is a sample set-up that lets you build your caches in parallel and then index off the
caches in a subsequent step.  See below for the solrconfig.xml snippet and the text of the
4 data-config.xml files.  In this example a cache is built for the parent as well; this is
not strictly necessary, but it's cleaner to just cache everything so that the final step
works against caches only.

Here's how it works.  First, begin a full import for each of the cache builders by issuing
these commands all at once.  Each of these builds a cache:
/solrcore/dih-parent?command=full-import
/solrcore/dih-child1?command=full-import
/solrcore/dih-child2?command=full-import

You then need to poll each of these handlers' status screens and wait until they all finish.
Once done, issue this command, which reads the caches back and indexes the data to your Solr
core:
/solrcore/dih-master?command=full-import
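A minimal sketch of the polling step, assuming the stock DIH XML status response (which reports `<str name="status">idle</str>` once an import finishes); the URL layout and poll interval here are just illustrative:

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

def is_idle(status_xml: str) -> bool:
    """True when a DIH status response reports 'idle' (i.e. the import is done)."""
    root = ET.fromstring(status_xml)
    for el in root.iter("str"):
        if el.get("name") == "status":
            return el.text == "idle"
    return False

def wait_until_idle(handler_url: str, poll_seconds: int = 5) -> None:
    """Poll one handler's status page until it goes idle."""
    while True:
        with urllib.request.urlopen(handler_url + "?command=status") as resp:
            if is_idle(resp.read().decode("utf-8")):
                return
        time.sleep(poll_seconds)
```

You would call `wait_until_idle` once per cache-building handler before kicking off the master import.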

The tricky thing here is automating it all.  You'll need something that issues the commands
and then polls the responses, etc.  In my case, I ended up writing a fairly hacky program that
runs 12 cache-building handlers at once, starting a new one whenever one finishes, until all 50
or so are complete.  It then runs the master DIH handlers.  (An additional complexity in our
situation, not shown here: I use the DIH cache partitioning feature to create multiple
partitions, then run multiple master handlers that each index a slice of the data at the
same time, making the "master" step finish faster on a multi-processor machine.)
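The "run 12 at once, start a new one when one finishes" scheduling above can be sketched with a bounded thread pool; the `run_import` callable (issue full-import, then poll until idle) and the handler names are assumptions for illustration, not part of the patch:

```python
from concurrent.futures import ThreadPoolExecutor

def build_all_caches(handlers, run_import, max_parallel=12):
    """Run run_import(handler) for every cache-building handler,
    at most max_parallel at a time.

    ThreadPoolExecutor hands a queued handler to a worker as soon
    as one frees up, which matches "starting a new one when one
    finishes" until all of them are complete."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_import, handlers))
```

After `build_all_caches` returns, the master handler(s) can be triggered in a final step.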

Another confusing aspect of all this is that to build the caches, you send all the cache
parameters as request parameters (included in solrconfig.xml here), but for the master
indexing step these same parameters go on the entity in data-config.xml.  Should this feature
ever get committed, it would be better to allow all the configuration to live in
data-config.xml for both building caches and reading caches.

One last thing: you might want to open a JIRA issue about JdbcDataSource not honoring the
JDBC driver parameter you're trying to pass through.  https://issues.apache.org/jira/browse/SOLR
If you don't have an account, you'll need to create one to open a new issue.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

<!-- 4 handlers declared in solrconfig.xml -->
<requestHandler name="/dih-parent" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-parent.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">PARENT</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <!-- ID is Oracle's "number" type which the JDBC driver brings in as a BigDecimal.
 
         The field always contains an Integer so we can optimize for that case 
         See org.apache.solr.handler.dataimport.DIHCacheTypes
    -->
    <str name="persistCacheFieldNames">ID,                 SOME_DATA</str> 
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <!-- all bdb-je caches being built at the same time share this 100mb cache -->
    <str name="berkleyInternalCacheSize">100000000</str> 
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>
<requestHandler name="/dih-child1" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-child1.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">CHILD1</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <str name="persistCacheFieldNames">PARENT_ID,          CHILD_ONE_DATA</str>
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">PARENT_ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <str name="berkleyInternalCacheSize">100000000</str>
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>
<requestHandler name="/dih-child2" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-child2.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">CHILD2</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <str name="persistCacheFieldNames">PARENT_ID,          CHILD_TWO_DATA</str>
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">PARENT_ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <str name="berkleyInternalCacheSize">100000000</str>
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>
<requestHandler name="/dih-master" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-master.xml</str>
    <str name="clean">true</str>
    <str name="commit">true</str>
    <str name="optimize">false</str>
  </lst>
</requestHandler>


<!-- dataconfig-parent.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="PARENT">
    <entity name="PARENT" dataSource="zzz" query="SELECT ID, SOME_DATA FROM PARENT" />
  </document>
</dataConfig>

<!-- dataconfig-child1.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="CHILD1">
    <entity name="CHILD1" dataSource="zzz" query="SELECT PARENT_ID, CHILD_ONE_DATA FROM CHILD1" />
  </document>
</dataConfig>

<!-- dataconfig-child2.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="CHILD2">
    <entity name="CHILD2" dataSource="zzz" query="SELECT PARENT_ID, CHILD_TWO_DATA FROM CHILD2" />
  </document>
</dataConfig>

<!-- dataconfig-master.xml -->
<dataConfig>
  <document name="MASTER">
    <!-- all bdb-je caches share this 100mb cache -->
    <entity name="PARENT"
      processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
      cacheKey="ID"
      persistCacheBaseDir="/path/to/caches"
      persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
      persistCacheName="PARENT"
      berkleyInternalCacheSize="100000000"
      berkleyInternalShared="true"
    >
      <entity
        name="CHILD1"
        processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
        cacheKey="PARENT_ID"
        cacheLookup="PARENT.ID"        
        persistCacheBaseDir="/path/to/caches"
        persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
        persistCacheName="CHILD1"
        berkleyInternalCacheSize="100000000"
        berkleyInternalShared="true"        
      />
      <entity
        name="CHILD2"
        processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
        cacheKey="PARENT_ID"
        cacheLookup="PARENT.ID"        
        persistCacheBaseDir="/path/to/caches"
        persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
        persistCacheName="CHILD2"        
        berkleyInternalCacheSize="100000000"
        berkleyInternalShared="true"
      />
    </entity>
  </document>
</dataConfig>
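Conceptually, the master step is a keyed lookup join: the PARENT cache is iterated once, and each child cache is probed by key (per `cacheKey="PARENT_ID"` and `cacheLookup="PARENT.ID"` above). A rough Python analogy, with plain dicts standing in for the persisted Berkeley DB caches and field names taken from the configs above:

```python
# Dicts stand in for the persisted caches; a child key can map to many rows.
parent_cache = {1: {"ID": 1, "SOME_DATA": "p1"}}
child1_cache = {1: [{"PARENT_ID": 1, "CHILD_ONE_DATA": "a"},
                    {"PARENT_ID": 1, "CHILD_ONE_DATA": "b"}]}
child2_cache = {1: [{"PARENT_ID": 1, "CHILD_TWO_DATA": "x"}]}

def assemble_documents():
    """Iterate the parent cache and attach child rows by key lookup,
    mirroring the cacheKey/cacheLookup join in dataconfig-master.xml."""
    docs = []
    for pid, parent_row in parent_cache.items():
        doc = dict(parent_row)
        doc["CHILD_ONE_DATA"] = [r["CHILD_ONE_DATA"] for r in child1_cache.get(pid, [])]
        doc["CHILD_TWO_DATA"] = [r["CHILD_TWO_DATA"] for r in child2_cache.get(pid, [])]
        docs.append(doc)
    return docs
```

Since every lookup is by key against an already-built cache, no SQL runs during this step, which is why the master import only needs the caches on disk.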




-----Original Message-----
From: mroosendaal [mailto:mroosendaal@yahoo.com] 
Sent: Friday, November 16, 2012 8:19 AM
To: solr-user@lucene.apache.org
Subject: RE: DIH nested entities don't work

Hi,

You are correct about not wanting to index everything every day; however, for
this PoC I need a 'bootstrap' mechanism which basically does what Endeca
does.

The 'defaultRowPrefetch' in the solrconfig.xml does not seem to take effect; I'll
have a closer look.

Regarding the long run time, it appeared that one of the views I was reading was also
by far the biggest, with over 4 million entries. Other views should take much less
time.

With regards to the parallel processing, I have the 2 classes you mention
and packaged them. The documentation in the patch was not clear on how
exactly to do that. My assumption is that:
* for every entity you have to define a DIH in the solrconfig and refer to
a specific data-config-<entity>.xml
* define 1 import handler for the join in the solrconfig
* what isn't clear is how a data-config-<entity>.xml should look (for
example, I see no reference in the documentation to a cacheName)
* nor how the data-config-join.xml should look

