lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: DIH issue with streaming xml file
Date Mon, 12 Jun 2017 18:25:40 GMT
The Solr 6.5.1 DIH setup includes a (somewhat broken) RSS example,
redone as an ATOM example in 6.6, that shows how to fetch content from
an https URL. You can see the ATOM example here:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.6.0/solr/example/example-DIH/solr/atom/conf/atom-data-config.xml


The main issue, however, is that you have not said what format the
list of files on the server is in. Is it a plain list? Is it XML with
file entries? Are you doing a directory listing?
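
If the list turns out to be XML, a rough, untested sketch of the nested
setup might look like the config below. The index URL
(https://example.com/hbk/files.xml), the <files><file url="..."/></files>
layout, and the entity and column names "filelist" and "fileUrl" are all
made up for illustration; only the inner forEach and field paths come from
your config. Note that XPathEntityProcessor needs a forEach on every entity
that uses it, which is likely why your earlier attempt complained.

```xml
<dataConfig>
  <!-- URLDataSource reads from http/https URLs; the name "remote" is arbitrary -->
  <dataSource name="remote" type="URLDataSource" encoding="UTF-8" />
  <document>
    <!-- Outer entity: ASSUMES the server publishes an XML index of files
         at this hypothetical URL, shaped like <files><file url="..."/></files> -->
    <entity name="filelist"
            processor="XPathEntityProcessor"
            dataSource="remote"
            rootEntity="false"
            url="https://example.com/hbk/files.xml"
            forEach="/files/file">
      <field column="fileUrl" xpath="/files/file/@url" />

      <!-- Inner entity: fetches and parses each file URL emitted above -->
      <entity name="xml"
              processor="XPathEntityProcessor"
              dataSource="remote"
              stream="true"
              url="${filelist.fileUrl}"
              forEach="/eflow/section | /eflow/section/item">
        <field column="sectionId" xpath="/eflow/section/@id" commonField="true" />
        <field column="itemId" xpath="/eflow/section/item/@id" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

A plain-text list or an HTML directory listing would need a different
outer entity, so treat this only as a starting point.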

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 12 June 2017 at 14:11, Miller, William K - Norman, OK - Contractor
<William.K.Miller@usps.gov.invalid> wrote:
> Thank you for your response.  That is the issue that I am having.  I
> cannot figure out how to get the list of files from the remote server.
> I have tried changing the parent Entity Processor to the
> XPathEntityProcessor and the baseDir to a url using https.  This did
> not work as it was looking for a "foreach" attribute.  Is there an
> Entity Processor that can be used to get the list of files from an
> https source or am I going to have to use solrj or create a custom
> entity processor?
>
>
>
>
> ~~~~~~~~~~~~~~~~~~~~~~~
> William Kevin Miller
>
> ECS Federal, Inc.
> USPS/MTSC
> (405) 573-2158
>
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafalov@gmail.com]
> Sent: Monday, June 12, 2017 12:57 PM
> To: solr-user
> Subject: Re: DIH issue with streaming xml file
>
> How do you get a list of URLs for the files on the remote server?
> That's probably the first issue. Once you have the URLs in an outside
> entity or two, you can feed them one by one into the inner entity.
>
> Regards,
>    Alex.
>
> ----
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
> On 12 June 2017 at 09:39, Miller, William K - Norman, OK - Contractor
> <William.K.Miller@usps.gov.invalid> wrote:
>
>> I am using Solr 6.5.1 and working on importing xml files using the
>> DataImportHandler.  I am wanting to get the files from a remote
>> server, but I am dealing with multiple xml files in multiple folders.
>> I am using a nested entity in my dataConfig.  Below is an example of
>> how I have my dataConfig set up.  I got most of this from an online
>> reference.  In this example I am getting the xml files from a folder
>> on the Solr server, but as I mentioned above I want to get the files
>> from a remote server.  I have looked at the different Entity
>> Processors for the DIH, but have not seen anything that seems to work.
>> Is there a way to configure the below code to let me do this?
>>
>>
>>
>>
>>
>> <dataConfig>
>>   <dataSource name="hbk" encoding="UTF-8" type="FileDataSource" />
>>   <document name="hbk">
>>     <!--
>>       Pickupdir fetches all files matching the filename regex in
>>       the supplied directory and passes them to other entities
>>       which parse the file contents.
>>     -->
>>     <entity
>>         name="pickupdir"
>>         processor="FileListEntityProcessor"
>>         rootEntity="false"
>>         dataSource="null"
>>         fileName="^[\w\d-]+\.xml$"
>>         baseDir="/var/solr/data/hbk/data/xml/"
>>         recursive="true"
>>     >
>>       <!--
>>         Pickupxmlfile parses standard Solr update XML.
>>       -->
>>       <entity
>>           name="xml"
>>           pk="itemId"
>>           processor="XPathEntityProcessor"
>>           transformer="RegexTransformer,TemplateTransformer"
>>           datasource="pickupdir"
>>           stream="true"
>>           xsl="/var/solr/data/hbk/data/xsl/solr_timdex.xsl"
>>           url="${pickupdir.fileAbsolutePath}"
>>           forEach="/eflow/section | /eflow/section/item"
>>       >
>>         <field column="sectionId" xpath="/eflow/section/@id" commonField="true" />
>>         <field column="sectionTitle" xpath="/eflow/section/@title" commonField="true" />
>>         <field column="sectionNo" xpath="/eflow/section/@secno" commonField="true" />
>>         <field column="hbkNo" xpath="/eflow/section/@hbkno" commonField="true" />
>>         <field column="volumeNo" xpath="/eflow/section/@volno" commonField="true" />
>>
>>         <field column="itemId" xpath="/eflow/section/item/@id" />
>>         <field column="itemTitle" xpath="/eflow/section/item/@title" />
>>         <field column="itemNo" xpath="/eflow/section/item/@mit" />
>>         <field column="itemFile" xpath="/eflow/section/item/@file" />
>>         <field column="itemType" xpath="/eflow/section/item/@type" />
>>       </entity>
>>     </entity>
>>   </document>
>> </dataConfig>
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~
>>
>> William Kevin Miller
>>
>> ECS Federal, Inc.
>>
>> USPS/MTSC
>>
>> (405) 573-2158
>>
>>
>>
