lucene-solr-user mailing list archives

From Tom Evans <tevans...@googlemail.com>
Subject Re: Using DIH FileListEntityProcessor with SolrCloud
Date Tue, 06 Dec 2016 12:18:54 GMT
On Fri, Dec 2, 2016 at 4:36 PM, Chris Rogers
<chris.rogers@bodleian.ox.ac.uk> wrote:
> Hi all,
>
> A question regarding using the DIH FileListEntityProcessor with SolrCloud
> (solr 6.3.0, zookeeper 3.4.8).
>
> I get that the config in SolrCloud lives on the Zookeeper node (a different
> server from the solr nodes in my setup).
>
> With this in mind, where is the baseDir attribute in the
> FileListEntityProcessor config relative to? I’m seeing the config in the
> Solr GUI, and I’ve tried setting it as an absolute path on my Zookeeper
> server, but this doesn’t seem to work… any ideas how this should be set up?
>
> My DIH config is below:
>
> <dataConfig>
>   <dataSource type="FileDataSource"/>
>   <document>
>     <!-- this outer processor generates a list of files satisfying the conditions
>          specified in the attributes -->
>     <entity name="f" processor="FileListEntityProcessor"
>             fileName=".*xml"
>             newerThan="'NOW-5YEARS'"
>             recursive="true"
>             rootEntity="false"
>             dataSource="null"
>             baseDir="/home/bodl-zoo-svc/files/">
>
>       <!-- this processor extracts content using Xpath from each file found -->
>
>       <entity name="tei" processor="XPathEntityProcessor"
>               forEach="/TEI" url="${f.fileAbsolutePath}" transformer="RegexTransformer">
>
>         <field column="manuscript_title" name="manuscript_title" xpath="/TEI/teiHeader/fileDesc/titleStmt/title"/>
>         <field column="repository" name="repository" xpath="/TEI/teiHeader/fileDesc/publicationStmt/publisher"/>
>         <field column="id" name="id" xpath="/TEI/teiHeader/fileDesc/sourceDesc/msDesc/msIdentifier/altIdentifier/idno"/>
>       </entity>
>
>     </entity>
>
>   </document>
> </dataConfig>
>
>
> This same script worked as expected on a single solr node (i.e. not in SolrCloud mode).
>
> Thanks,
> Chris
>

Hey Chris

We hit the same problem moving from non-cloud to cloud. We had a
collection that loaded its DIH config from various XML files listing
the DB queries to run, so we wrote a simple DataSource plugin that
loads those files from Zookeeper instead of local disk. That avoids
having to distribute the config files around the cluster.

https://issues.apache.org/jira/browse/SOLR-8557
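
Roughly, the wiring in the DIH config could look like the sketch below.
This is only an illustration under assumptions: the class name
com.example.solr.ZkConfigDataSource is hypothetical (it is not the name
used in the issue above), and queries.xml with its /queries/query layout
is made up for the example. The one part that is standard DIH is that
the dataSource "type" attribute can name a custom DataSource class.

```xml
<dataConfig>
  <!-- Hypothetical class: stands in for a custom DataSource plugin that
       reads resources from Zookeeper instead of local disk -->
  <dataSource name="zk" type="com.example.solr.ZkConfigDataSource"/>
  <document>
    <!-- queries.xml and its structure are assumptions for illustration;
         the file would be stored with the collection config in Zookeeper -->
    <entity name="q" processor="XPathEntityProcessor"
            dataSource="zk"
            url="queries.xml"
            forEach="/queries/query">
      <field column="sql" xpath="/queries/query/sql"/>
    </entity>
  </document>
</dataConfig>
```

On the baseDir question itself: as far as I know, FileListEntityProcessor
lists files through the local filesystem of whichever Solr node runs the
import, so the path would need to exist on that node rather than on the
Zookeeper server.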

Cheers

Tom
