lucene-solr-user mailing list archives

From Derek Werthmuller <dwert...@ctg.albany.edu>
Subject RE: DataimportHandler development issue
Date Fri, 21 Jan 2011 20:10:05 GMT
It seems the proper XPath statement to select the href of the link child
with rel="self" is
/feed/link[@rel='self']/string(@href) for the root, and

/feed/entry/link[@rel='alternate']/string(@href) should get the children.

But neither works in the DIH, although both work in other XPath query processors.

Can the DIH handle compound XPath statements?


 

-----Original Message-----
From: Gora Mohanty [mailto:gora@mimirtech.com] 
Sent: Friday, January 14, 2011 3:08 AM
To: solr-user@lucene.apache.org
Subject: Re: DataimportHandler development issue

On Fri, Jan 14, 2011 at 12:17 AM, Derek Werthmuller
<dwerthmu@ctg.albany.edu> wrote:

> It's not clear why it's not working.  Advice?
> Also, is this the best way to load data?  We intend to load several
> thousand DocBook documents once we understand how this all works.  We
> stuck with the RSS/Atom example since we didn't want to deal with
> schema changes yet.
> Thanks
>        Derek
>
> example-DIH/solr/rss/conf/rss-data-config.xml  modified source:
> <dataConfig>
>   <dataSource type="URLDataSource" />
>   <document>
>     <entity name="slashdot"
>             pk="link"
>             url="http://twitter.com/statuses/user_timeline/existdb.rss"
>             processor="XPathEntityProcessor"
>             forEach="/rss/channel | /rss/channel/item"
>             transformer="DateFormatTransformer">
>
>       <field column="source" xpath="/rss/channel/title" commonField="true" />
>       <field column="source-link" xpath="/rss/channel/link" commonField="true" />
>       <field column="subject" xpath="/rss/channel/subject" commonField="true" />
>
>       <field column="title" xpath="/rss/channel/item/title" />
>       <field column="link" xpath="/rss/channel/item/link" />
>       <field column="description" xpath="/rss/channel/item/description" />
>       <field column="creator" xpath="/rss/channel/item/creator" />
>       <field column="item-subject" xpath="/rss/channel/item/subject" />
>       <field column="date" xpath="/rss/channel/item/date"
>              dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
>       <field column="slash-department" xpath="/rss/channel/item/department" />
>       <field column="slash-section" xpath="/rss/channel/item/section" />
>       <field column="slash-comments" xpath="/rss/channel/item/comments" />
>     </entity>
>
>     <entity name="twitter"
>             pk="link"
>             url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
>             processor="XPathEntityProcessor"
>             forEach="/feed | /feed/entry"
>             transformer="DateFormatTransformer">
>
>       <field column="source" xpath="/feed/title" commonField="true" />
>       <field column="source-link" xpath="/feed/link" commonField="true" />
>       <field column="subject" xpath="/feed/subtitle" commonField="true" />
>
>       <field column="title" xpath="/feed/entry/title" />
>       <field column="link" xpath="/feed/entry/link" />
>       <field column="description" xpath="/feed/entry/description" />
>       <field column="creator" xpath="/feed/entry/creator" />
>       <field column="item-subject" xpath="/feed/entry/subject" />
>       <field column="date" xpath="/rss/channel/item/date"
>              dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
>       <field column="slash-department" xpath="/feed/entry/department" />
>       <field column="slash-section" xpath="/feed/entry/section" />
>       <field column="slash-comments" xpath="/feed/entry/comments" />
>     </entity>
>   </document>
> </dataConfig>

Your problem is the second entity in the DIH configuration file. The Solr
schema defines the unique key to be the field "link". As noted in the
comments in schema.xml, this means that this field is required.
Solr is not able to populate the "link" field from the Atom feed. I have not
tracked down why this is so, but it is probably because there is more than
one link node under /feed/entry, and the "link" field is not multi-valued.
Change the xpath to, say, "/feed/entry/id", and the import works. Also,
while this is not necessarily an issue, please note that several other
fields have incorrect xpaths for this entity.
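
As a rough sketch of the change suggested above, the second entity could look
something like the following. Only the xpath of the "link" field differs from
the quoted configuration. The remaining fields are left out here for brevity
(an assumption of this sketch, not a recommendation to drop them), and several
of them would still need their xpaths corrected for the Atom feed.

```xml
<!-- Sketch only: the "twitter" entity with the unique-key field populated
     from /feed/entry/id instead of /feed/entry/link. -->
<entity name="twitter"
        pk="link"
        url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
        processor="XPathEntityProcessor"
        forEach="/feed | /feed/entry"
        transformer="DateFormatTransformer">

  <!-- An Atom entry carries exactly one id element, so it fits the
       single-valued "link" field that the schema uses as unique key. -->
  <field column="link" xpath="/feed/entry/id" />
  <field column="title" xpath="/feed/entry/title" />
</entity>
```

With this change the "link" column holds the entry id rather than a URL taken
from one of the entry's link elements.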

To answer your other question, this way of importing data should work fine.

Regards,
Gora
