manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Field mapping for RSS feed
Date Thu, 04 Aug 2011 10:11:02 GMT
Hi Kate,

I did two additional check-ins yesterday evening.  Would you be so
kind as to synch up from trunk and try again?  I apologize for the
confusion.

Karl

On Wed, Aug 3, 2011 at 8:13 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>
> I find it odd that I would be the first person to have this problem.
> You'd think it would be very common.
> <<<<<<
>
> Actually, I've not encountered this before even though the RSS
> connector is one of the most widely used connectors.  The only
> situation this ever came up in before was when some MetaCarta clients
> wanted to use the description field as primary content, which is why
> it is an option for the "Dechromed Content" tab.  But new feature
> requests are always welcome.
>
> Also, as you might guess by the Derby and HSQLDB issue that you
> encountered, most of our users use PostgreSQL.  The Derby and HSQLDB
> database support was added to simplify setup and allow tests to be
> written that did not involve installing another package first.
> However, each of these databases has known problems, some minor and
> some more major.  Thus you might want to consider going to PostgreSQL
> in the future if you plan on doing any serious crawling.
>
> Thanks again!
> Karl
>
> On Tue, Aug 2, 2011 at 2:56 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
>> Hi Karl,
>>
>> Thank you for your quick response. I've opened a Jira ticket for this,
>> though I don't really understand what sort of solution you had in mind so I
>> didn't propose anything.
>>
>> I'm afraid I don't understand exactly what the Dechromed Content options do
>> either. I read about them in the End User Documentation, but there wasn't
>> much there yet.
>>
>> I find it odd that I would be the first person to have this problem. You'd
>> think it would be very common.
>>
>>
>> Kate
>>
>>
>> On Tue, Aug 2, 2011 at 11:05 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>> I just looked at the code.  It's not a bug so much as an oversight of
>>> sorts.  The "description" or "content" fields are indexed as the
>>> primary content of the document if the "chrome" mode is selected
>>> accordingly.  If "None" is the "chrome" mode, then the item-level
>>> description field is ignored even when present.
>>>
>>> So I recommend simply adding a new kind of "description" field for
>>> when the "chrome" mode is set to "None".  "item/description" may be
>>> its name, or maybe the full XPath, your choice.  Propose something in
>>> the ticket and I'll respond.
>>>
>>> Thanks!
>>> Karl
>>>
>>>
>>> On Tue, Aug 2, 2011 at 11:47 AM, Karl Wright <daddywri@gmail.com> wrote:
>>> > Hi Kate,
>>> >
>>> > The field mapping won't do the trick because the RSS connector is
>>> > currently very selective about what fields it extracts - it by no
>>> > means extracts all of them, so the ones that it *does* extract from
>>> > the feed are "special".
>>> >
>>> > The behavior you describe sounds like a bug to me.  I'll go spelunking
>>> > through the code at first opportunity.  In the meantime, could you
>>> > create a Jira ticket describing the behavior you see vs. the behavior
>>> > you want?
>>> >
>>> > Thanks!
>>> > Karl
>>> >
>>> > On Tue, Aug 2, 2011 at 11:41 AM, K McGonigal <kmcgoniga@gmail.com> wrote:
>>> >> Hi,
>>> >>
>>> >> I'm trying to use ManifoldCF to index an RSS feed into Solr.  It sort of
>>> >> works, but my main problem at the moment is that the *channel* description
>>> >> from the RSS feed is written to the "description" field in Solr when I
>>> >> would really like the *item* description to be written instead.
>>> >>
>>> >> I have a typical RSS feed with the general structure:
>>> >>
>>> >> <rss>
>>> >>     <channel>
>>> >>         <title></title>
>>> >>         <link></link>
>>> >>         <description> *** the description I don't want *** </description>
>>> >>         <item>
>>> >>             <title></title>
>>> >>             <link></link>
>>> >>             <pubDate></pubDate>
>>> >>             <description> *** the description I do want *** </description>
>>> >>             <author></author>
>>> >>             <category></category>
>>> >>         </item>
>>> >>     </channel>
>>> >> </rss>
>>> >>
>>> >> I tried setting up the field mapping on the job with the XPath address of
>>> >> the second description, i.e. "/rss/channel/item/description" as the source,
>>> >> but that did not work.
>>> >>
>>> >> I suspect I'm overlooking something simple, but I've spent 2 days trying to
>>> >> solve it.  I would be grateful for any help.
>>> >>
>>> >>
>>> >> Kate McGonigal
>>> >>
>>> >>
>>> >>
>>> >
>>
>>
>
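
The difference between the channel-level and item-level description
discussed in this thread can be illustrated outside ManifoldCF.  Below is
a minimal sketch in Python using only the standard library.  The sample
feed text and the function name extract_descriptions are hypothetical,
and this is not how the RSS connector itself extracts fields; it only
shows that the two paths select different elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed mirroring the structure shown in the thread.
SAMPLE_FEED = """<rss>
  <channel>
    <title>Example channel</title>
    <link>http://example.com/</link>
    <description>channel description</description>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
      <description>item description</description>
    </item>
  </channel>
</rss>"""


def extract_descriptions(feed_xml):
    """Return (channel description, list of item descriptions).

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(feed_xml)  # root element is <rss>
    # Matches only the <description> that is a direct child of <channel>.
    channel_desc = root.findtext("channel/description")
    # Matches the <description> inside each <item>.
    item_descs = [
        item.findtext("description")
        for item in root.findall("channel/item")
    ]
    return channel_desc, item_descs


channel_desc, item_descs = extract_descriptions(SAMPLE_FEED)
print(channel_desc)
print(item_descs)
```

The paths here are relative to the <rss> root, so "channel/item" plays
the role of "/rss/channel/item" from the original question.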
