manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Field mapping for RSS feed
Date Thu, 04 Aug 2011 17:08:42 GMT
>>>>>>
I guess the only caveat is that, to use this, one has to know to add a
"summary" field to their Solr schema. Long-term, I wonder if the
"field mapping" feature could be used to let users map any RSS element
(based on its XPath "address") to any Solr field?
<<<<<<

The problem is that there are three different kinds of feeds that the
RSS connector supports, and they have different names for each kind of
item element.  The RSS connector attempts to normalize all that mess
into something more standard.
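
A rough sketch of that kind of normalization, not the connector's actual
code. It assumes the three kinds are RSS 2.0, RDF (RSS 1.0), and Atom,
which this thread does not name, and `item_summaries` is a hypothetical
helper. Note that only the item-level description is read, never the
channel-level one:

```python
import xml.etree.ElementTree as ET

# Namespaces for the assumed feed kinds (RSS 2.0 elements have no namespace).
ATOM = "{http://www.w3.org/2005/Atom}"
RSS1 = "{http://purl.org/rss/1.0/}"


def item_summaries(feed_xml):
    """Return (link, summary) pairs for each item, whatever the feed kind."""
    root = ET.fromstring(feed_xml)
    pairs = []
    # RSS 2.0: /rss/channel/item with plain <link> and <description>.
    for item in root.findall("./channel/item"):
        pairs.append((item.findtext("link"), item.findtext("description")))
    # RDF / RSS 1.0: namespaced <item> elements directly under the root.
    for item in root.findall(RSS1 + "item"):
        pairs.append((item.findtext(RSS1 + "link"),
                      item.findtext(RSS1 + "description")))
    # Atom: <entry> with <link href="..."> and <summary>.
    for entry in root.findall(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        href = link.get("href") if link is not None else None
        pairs.append((href, entry.findtext(ATOM + "summary")))
    return pairs
```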

>>>>>>
But I wonder if it is working properly for Dechromed Content =
"Dechromed content, if present, in 'description' field". When I use
that, nothing is sent to Solr, although the job terminates OK and
doesn't hang like it was doing before. Is that what is to be expected?
<<<<<<

I'll look at this.  The behavior you should see is one indexing
operation per document, with the content consisting of just the
description string.

>>>>>>
I'm actually still confused by all the dechromed options because I
thought that the item description was used as the dechromed content.
So does "Dechromed content, if present, in 'description' field" mean
that the contents of the item description element will be used for
indexing instead of the web page specified by the link?
<<<<<<

Your understanding is correct.  When I tried this last night I looked
at the Simple History and it looked like the description data was sent
to the Solr index (based on the reported size).  I'll have to see
whether it actually gets there though.
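
As a minimal sketch of the choice described above (not the connector's
code): `content_to_index`, the mode labels "none" and "description", and
the `fetch_page` callable are all hypothetical names. Falling back to the
linked page when no description is present is an assumption read from the
"if present" wording of the option:

```python
def content_to_index(mode, item, fetch_page):
    """Pick the content to index for one feed item.

    mode: "none" or "description" (hypothetical labels standing in for
        "No dechromed content" and "Dechromed content, if present, in
        'description' field").
    item: dict with a "link" key and, optionally, a "description" key.
    fetch_page: callable that returns the body of the linked web page.
    """
    if mode == "description" and item.get("description"):
        # Index the item description instead of the linked page.
        return item["description"]
    # Assumed fallback: no usable description, or mode "none".
    return fetch_page(item["link"])
```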

Karl

On Thu, Aug 4, 2011 at 1:00 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
> Works great now (with Dechromed Content = "No dechromed content").
> Thanks!!
>
> I guess the only caveat is that, to use this, one has to know to add a
> "summary" field to their Solr schema. Long-term, I wonder if the "field
> mapping" feature could be used to let users map any RSS element (based on
> its XPath "address") to any Solr field?
>
> But I wonder if it is working properly for Dechromed Content = "Dechromed
> content, if present, in 'description' field". When I use that, nothing is
> sent to Solr, although the job terminates OK and doesn't hang like it was
> doing before. Is that what is to be expected?
>
> I'm actually still confused by all the dechromed options because I thought
> that the item description was used as the dechromed content. So does
> "Dechromed content, if present, in 'description' field" mean that the
> contents of the item description element will be used for indexing instead
> of the web page specified by the link?
>
> Sorry for all these questions. I appreciate your patience.
>
>
> On Thu, Aug 4, 2011 at 5:11 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Hi Kate,
>>
>> I did two additional check-ins yesterday evening.  Would you be so
>> kind as to synch up from trunk and try again?  I apologize for the
>> confusion.
>>
>> Karl
>>
>> On Wed, Aug 3, 2011 at 8:13 AM, Karl Wright <daddywri@gmail.com> wrote:
>> > >>>>>>
>> > I find it odd that I would be the first person to have this problem.
>> > You'd think it would be very common.
>> > <<<<<<
>> >
>> > Actually, I've not encountered this before even though the RSS
>> > connector is one of the most widely used connectors.  The only
>> > situation this ever came up in before was when some MetaCarta clients
>> > wanted to use the description field as primary content, which is why
>> > it is an option for the "Dechromed Content" tab.  But new feature
>> > requests are always welcome.
>> >
>> > Also, as you might guess from the Derby and HSQLDB issue that you
>> > encountered, most of our users use PostgreSQL.  The Derby and HSQLDB
>> > database support was added to simplify setup and allow tests to be
>> > written that did not involve installing another package first.
>> > However, each of these databases has known problems, some minor and
>> > some more major.  Thus you might want to consider going to PostgreSQL
>> > in the future if you plan on doing any serious crawling.
>> >
>> > Thanks again!
>> > Karl
>> >
>> > On Tue, Aug 2, 2011 at 2:56 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
>> >> Hi Karl,
>> >>
>> >> Thank you for your quick response. I've opened a Jira ticket for this,
>> >> though I don't really understand what sort of solution you had in mind so I
>> >> didn't propose anything.
>> >>
>> >> I'm afraid I don't understand exactly what the Dechromed Content options do
>> >> either. I read about them in the End User Documentation, but there wasn't
>> >> much there yet.
>> >>
>> >> I find it odd that I would be the first person to have this problem. You'd
>> >> think it would be very common.
>> >>
>> >>
>> >> Kate
>> >>
>> >>
>> >> On Tue, Aug 2, 2011 at 11:05 AM, Karl Wright <daddywri@gmail.com> wrote:
>> >>>
>> >>> I just looked at the code.  It's not so much a bug as an oversight
>> >>> of sorts.  The "description" or "content" fields are indexed as the
>> >>> primary content of the document if the "chrome" mode is selected
>> >>> accordingly.  If "None" is the "chrome" mode, then the item-level
>> >>> description field is ignored even when present.
>> >>>
>> >>> So I recommend simply adding a new kind of "description" field for
>> >>> when the "chrome" mode is set to "None".  "item/description" may be
>> >>> its name, or maybe the full XPath, your choice.  Propose something
>> >>> in the ticket and I'll respond.
>> >>>
>> >>> Thanks!
>> >>> Karl
>> >>>
>> >>>
>> >>> On Tue, Aug 2, 2011 at 11:47 AM, Karl Wright <daddywri@gmail.com> wrote:
>> >>> > Hi Kate,
>> >>> >
>> >>> > The field mapping won't do the trick because the RSS connector is
>> >>> > currently very selective about what fields it extracts - it by no
>> >>> > means extracts all of them, so the ones that it *does* extract from
>> >>> > the feed are "special".
>> >>> >
>> >>> > The behavior you describe sounds like a bug to me.  I'll go
>> >>> > spelunking through the code at first opportunity.  In the meantime,
>> >>> > could you create a Jira ticket describing the behavior you see vs.
>> >>> > the behavior you want?
>> >>> >
>> >>> > Thanks!
>> >>> > Karl
>> >>> >
>> >>> > On Tue, Aug 2, 2011 at 11:41 AM, K McGonigal <kmcgoniga@gmail.com> wrote:
>> >>> >> Hi,
>> >>> >>
>> >>> >> I'm trying to use ManifoldCF to index an RSS feed into Solr. It sort of
>> >>> >> works, but my main problem at the moment is that the *channel* description
>> >>> >> from the RSS feed is written to the "description" field in Solr when I would
>> >>> >> really like the *item* description to be written instead.
>> >>> >>
>> >>> >> I have a typical RSS feed with the general structure:
>> >>> >>
>> >>> >> <rss>
>> >>> >>     <channel>
>> >>> >>         <title></title>
>> >>> >>         <link></link>
>> >>> >>         <description> *** the description I don't want *** </description>
>> >>> >>         <item>
>> >>> >>             <title></title>
>> >>> >>             <link></link>
>> >>> >>             <pubDate></pubDate>
>> >>> >>             <description> *** the description I do want *** </description>
>> >>> >>             <author></author>
>> >>> >>             <category></category>
>> >>> >>         </item>
>> >>> >>     </channel>
>> >>> >> </rss>
>> >>> >>
>> >>> >> I tried setting up the field mapping on the job with the XPath address of
>> >>> >> the second description, i.e. "/rss/channel/item/description" as the source,
>> >>> >> but that did not work.
>> >>> >>
>> >>> >> I suspect I'm overlooking something simple, but I've spent 2 days trying to
>> >>> >> solve it.  I would be grateful for any help.
>> >>> >>
>> >>> >>
>> >>> >> Kate McGonigal
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >
>> >>
>> >>
>> >
>
>
