lucene-solr-user mailing list archives

From Gora Mohanty <g...@mimirtech.com>
Subject Re: DataimportHandler development issue
Date Fri, 14 Jan 2011 08:08:01 GMT
On Fri, Jan 14, 2011 at 12:17 AM, Derek Werthmuller
<dwerthmu@ctg.albany.edu> wrote:

> It's not clear why it's not working.  Advice?
> Also, is this the best way to load data?  We intend to load several
> thousand DocBook documents once we understand how this all works.  We stuck
> with the rss/atom example since we didn't want to deal with schema changes
> yet.
> Thanks
>        Derek
>
> example-DIH/solr/rss/conf/rss-data-config.xml  modified source:
> <dataConfig>
> <dataSource type="URLDataSource" />
> <document>
> <entity name="slashdot"
> pk="link"
> url="http://twitter.com/statuses/user_timeline/existdb.rss"
> processor="XPathEntityProcessor"
> forEach="/rss/channel | /rss/channel/item"
> transformer="DateFormatTransformer">
>
> <field column="source" xpath="/rss/channel/title" commonField="true" />
> <field column="source-link" xpath="/rss/channel/link" commonField="true" />
> <field column="subject" xpath="/rss/channel/subject" commonField="true" />
>
> <field column="title" xpath="/rss/channel/item/title" />
> <field column="link" xpath="/rss/channel/item/link" />
> <field column="description" xpath="/rss/channel/item/description" />
> <field column="creator" xpath="/rss/channel/item/creator" />
> <field column="item-subject" xpath="/rss/channel/item/subject" />
> <field column="date" xpath="/rss/channel/item/date"
> dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
> <field column="slash-department" xpath="/rss/channel/item/department" />
> <field column="slash-section" xpath="/rss/channel/item/section" />
> <field column="slash-comments" xpath="/rss/channel/item/comments" />
> </entity>
>
> <entity name="twitter"
> pk="link"
> url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
> processor="XPathEntityProcessor"
> forEach="/feed | /feed/entry"
> transformer="DateFormatTransformer">
>
> <field column="source" xpath="/feed/title" commonField="true" />
> <field column="source-link" xpath="/feed/link" commonField="true" />
> <field column="subject" xpath="/feed/subtitle" commonField="true" />
>
> <field column="title" xpath="/feed/entry/title" />
> <field column="link" xpath="/feed/entry/link" />
> <field column="description" xpath="/feed/entry/description" />
> <field column="creator" xpath="/feed/entry/creator" />
> <field column="item-subject" xpath="/feed/entry/subject" />
> <field column="date" xpath="/rss/channel/item/date"
> dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
> <field column="slash-department" xpath="/feed/entry/department" />
> <field column="slash-section" xpath="/feed/entry/section" />
> <field column="slash-comments" xpath="/feed/entry/comments" />
> </entity>
> </document>
> </dataConfig>

Your problem is the second entity in the DIH configuration file. The
Solr schema defines the unique key to be the field "link". As noted in
the comments in schema.xml, this means that this field is required.
Solr is not able to populate the "link" field from the Atom feed. I have
not tracked down why this is so, but it is probably because there is
more than one link node under /feed/entry, and the "link" field is not
multi-valued. Change the xpath for the "link" field in that entity to,
say, "/feed/entry/id", and the import works. Also, while this is not
necessarily an issue, please note that several other fields have
incorrect xpaths for this entity.
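
As a rough sketch of that change, the second entity could be trimmed
down to something like the following. Only the "link" xpath is the fix
described above. Keeping just the "source" and "title" fields is my own
simplification for illustration, on the assumption that each of those
nodes occurs once in the feed. The remaining fields would need their
xpaths checked against the actual Atom feed before being added back.

```xml
<entity name="twitter"
        pk="link"
        url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
        processor="XPathEntityProcessor"
        forEach="/feed | /feed/entry">

  <!-- Feed-level title, copied onto every entry via commonField -->
  <field column="source" xpath="/feed/title" commonField="true" />

  <field column="title" xpath="/feed/entry/title" />

  <!-- Use the entry id as the unique key instead of /feed/entry/link,
       which does not populate the single-valued "link" field -->
  <field column="link" xpath="/feed/entry/id" />
</entity>
```

If the import then succeeds, the other fields can be reintroduced one
at a time to find out which xpaths do not match the feed.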

To answer your other question, this way of importing data should
work fine.

Regards,
Gora
