manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Field mapping for RSS feed
Date Wed, 03 Aug 2011 12:13:27 GMT
>>>>>>
I find it odd that I would be the first person to have this problem.
You'd think it would be very common.
<<<<<<

Actually, I've not encountered this before even though the RSS
connector is one of the most widely used connectors.  The only
situation this ever came up in before was when some MetaCarta clients
wanted to use the description field as primary content, which is why
it is an option for the "Dechromed Content" tab.  But new feature
requests are always welcome.

Also, as you might guess from the Derby and HSQLDB issue that you
encountered, most of our users use PostgreSQL.  The Derby and HSQLDB
database support was added to simplify setup and allow tests to be
written that did not involve installing another package first.
However, each of these databases has known problems, some minor and
some major.  Thus you might want to consider moving to PostgreSQL
in the future if you plan on doing any serious crawling.

Thanks again!
Karl

On Tue, Aug 2, 2011 at 2:56 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
> Hi Karl,
>
> Thank you for your quick response. I've opened a Jira ticket for this,
> though I don't really understand what sort of solution you had in mind so I
> didn't propose anything.
>
> I'm afraid I don't understand exactly what the Dechromed Content options do
> either. I read about them in the End User Documentation, but there wasn't
> much there yet.
>
> I find it odd that I would be the first person to have this problem. You'd
> think it would be very common.
>
>
> Kate
>
>
> On Tue, Aug 2, 2011 at 11:05 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> I just looked at the code.  It's not a bug so much as an oversight of
>> sorts.  The "description" or "content" fields are indexed as the
>> primary content of the document if the "chrome" mode is selected
>> accordingly.  If "None" is the "chrome" mode, then the item-level
>> description field is ignored even when present.
>>
>> So I recommend simply adding a new kind of "description" field for
>> when the "chrome" mode is set to "None".  "item/description" may be
>> its name, or maybe the full XPath, your choice.  Propose something in
>> the ticket and I'll respond.
>>
>> Thanks!
>> Karl
>>
>>
>> On Tue, Aug 2, 2011 at 11:47 AM, Karl Wright <daddywri@gmail.com> wrote:
>> > Hi Kate,
>> >
>> > The field mapping won't do the trick because the RSS connector is
>> > currently very selective about what fields it extracts - it by no
>> > means extracts all of them, so the ones that it *does* extract from
>> > the feed are "special".
>> >
>> > The behavior you describe sounds like a bug to me.  I'll go spelunking
>> > through the code at first opportunity.  In the meantime, could you
>> > create a Jira ticket describing the behavior you see vs. the behavior
>> > you want?
>> >
>> > Thanks!
>> > Karl
>> >
>> > On Tue, Aug 2, 2011 at 11:41 AM, K McGonigal <kmcgoniga@gmail.com> wrote:
>> >> Hi,
>> >>
>> >> I'm trying to use ManifoldCF to index an RSS feed into Solr.  It sort of
>> >> works, but my main problem at the moment is that the *channel* description
>> >> from the RSS feed is written to the "description" field in Solr when I would
>> >> really like the *item* description to be written instead.
>> >>
>> >> I have a typical RSS feed with the general structure:
>> >>
>> >> <rss>
>> >>     <channel>
>> >>         <title></title>
>> >>         <link></link>
>> >>         <description> *** the description I don't want *** </description>
>> >>         <item>
>> >>             <title></title>
>> >>             <link></link>
>> >>             <pubDate></pubDate>
>> >>             <description> *** the description I do want *** </description>
>> >>             <author></author>
>> >>             <category></category>
>> >>         </item>
>> >>     </channel>
>> >> </rss>
>> >>
>> >> I tried setting up the field mapping on the job with the XPath address of
>> >> the second description, i.e. "/rss/channel/item/description" as the source,
>> >> but that did not work.
>> >>
>> >> I suspect I'm overlooking something simple, but I've spent 2 days trying to
>> >> solve it.  I would be grateful for any help.
>> >>
>> >>
>> >> Kate McGonigal
>> >>
>> >>
>> >>
>> >
>
>
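
For readers following the thread, the difference between the channel-level
and item-level description can be shown outside ManifoldCF with a small
sketch. This is not the RSS connector's code; it only uses Python's standard
xml.etree.ElementTree module on a made-up feed shaped like the one in the
original question. The sample feed text, the example.com URLs, and the
function name split_descriptions are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed with the same shape as the one in the thread.
FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>channel description</description>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
      <description>first item description</description>
    </item>
    <item>
      <title>Second item</title>
      <link>http://example.com/2</link>
      <description>second item description</description>
    </item>
  </channel>
</rss>"""


def split_descriptions(feed_xml):
    """Return (channel_description, {item_link: item_description})."""
    root = ET.fromstring(feed_xml)  # root is the <rss> element
    # Paths are relative to <rss>. "channel/description" only matches the
    # <description> that is a direct child of <channel>, not the item ones.
    channel_description = root.findtext("channel/description")
    items = {}
    for item in root.findall("channel/item"):
        # Each <item> carries its own <description>, keyed here by its link.
        items[item.findtext("link")] = item.findtext("description")
    return channel_description, items


channel_description, item_descriptions = split_descriptions(FEED)
print("channel:", channel_description)
for link, description in item_descriptions.items():
    print(link, "->", description)
```

The point of the sketch is only that "channel/description" and
"channel/item/description" address different elements, so an indexer has to
choose explicitly which one feeds a "description" field.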
