manifoldcf-user mailing list archives

From K McGonigal <kmcgon...@gmail.com>
Subject Re: Field mapping for RSS feed
Date Thu, 04 Aug 2011 17:00:32 GMT
Works great now (with Dechromed Content = "No dechromed content").
Thanks!!

I guess the only caveat is that, to use this, one has to know to add a
"summary" field to their Solr schema. Long-term, I wonder if the "field
mapping" feature could be used to let users map any RSS element (based on
its XPath "address") to any Solr field?
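
To make the idea concrete, here is a rough sketch of what such a mapping
could look like. It is not ManifoldCF code and not how the RSS connector
works; it only uses Python's standard ElementTree module. The sample feed,
the FIELD_MAP table, the Solr field names ("url", "summary") and the
items_to_docs helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up feed with the same general structure as the one in this thread.
FEED = """<rss>
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>channel description</description>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
      <description>item description</description>
    </item>
  </channel>
</rss>"""

# Hypothetical mapping: path relative to each <item> -> Solr field name.
FIELD_MAP = {"title": "title", "link": "url", "description": "summary"}


def items_to_docs(feed_xml, field_map):
    """Build one dict per <item>, keyed by the mapped Solr field names."""
    root = ET.fromstring(feed_xml)  # root is the <rss> element
    docs = []
    # ElementTree only supports relative paths on an element, so
    # /rss/channel/item is written relative to the root here.
    for item in root.findall("./channel/item"):
        doc = {}
        for path, field in field_map.items():
            value = item.findtext(path)  # None if the element is absent
            if value is not None:
                doc[field] = value.strip()
        docs.append(doc)
    return docs


docs = items_to_docs(FEED, FIELD_MAP)
```

Because every path is resolved relative to an <item>, "description" here
picks up the item description and never the channel one.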

But I wonder if it is working properly for Dechromed Content = "Dechromed
content, if present, in 'description' field". When I use that, nothing is
sent to Solr, although the job terminates OK and doesn't hang like it was
doing before. Is that what is to be expected?

I'm actually still confused by all the dechromed options because I thought
that the item description was used as the dechromed content. So does
"Dechromed content, if present, in 'description' field" mean that the
contents of the item description element will be used for indexing instead
of the web page specified by the link?

Sorry for all these questions. I appreciate your patience.


On Thu, Aug 4, 2011 at 5:11 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Kate,
>
> I did two additional check-ins yesterday evening.  Would you be so
> kind as to synch up from trunk and try again?  I apologize for the
> confusion.
>
> Karl
>
> On Wed, Aug 3, 2011 at 8:13 AM, Karl Wright <daddywri@gmail.com> wrote:
> >>>>>>>
> > I find it odd that I would be the first person to have this problem.
> > You'd think it would be very common.
> > <<<<<<
> >
> > Actually, I've not encountered this before even though the RSS
> > connector is one of the most widely used connectors.  The only
> > situation this ever came up in before was when some MetaCarta clients
> > wanted to use the description field as primary content, which is why
> > it is an option for the "Dechromed Content" tab.  But new feature
> > requests are always welcome.
> >
> > Also, as you might guess by the Derby and HSQLDB issue that you
> > encountered, most of our users use PostgreSQL.  The Derby and HSQLDB
> > database support was added to simplify setup and allow tests to be
> > written that did not involve installing another package first.
> > However, each of these databases has known problems, some minor and
> > some more major.  Thus you might want to consider going to PostgreSQL
> > in the future if you plan on doing any serious crawling.
> >
> > Thanks again!
> > Karl
> >
> > On Tue, Aug 2, 2011 at 2:56 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
> >> Hi Karl,
> >>
> >> Thank you for your quick response. I've opened a Jira ticket for this,
> >> though I don't really understand what sort of solution you had in mind
> >> so I didn't propose anything.
> >>
> >> I'm afraid I don't understand exactly what the Dechromed Content options
> >> do either. I read about them in the End User Documentation, but there
> >> wasn't much there yet.
> >>
> >> I find it odd that I would be the first person to have this problem.
> >> You'd think it would be very common.
> >>
> >>
> >> Kate
> >>
> >>
> >> On Tue, Aug 2, 2011 at 11:05 AM, Karl Wright <daddywri@gmail.com> wrote:
> >>>
> >>> I just looked at the code.  It's not so much a bug as an oversight of
> >>> sorts.  The "description" or "content" fields are indexed as the
> >>> primary content of the document if the "chrome" mode is selected
> >>> accordingly.  If "None" is the "chrome" mode, then the item-level
> >>> description field is ignored even when present.
> >>>
> >>> So I recommend simply adding a new kind of "description" field for
> >>> when the "chrome" mode is set to "None".  "item/description" may be
> >>> its name, or maybe the full XPath, your choice.  Propose something in
> >>> the ticket and I'll respond.
> >>>
> >>> Thanks!
> >>> Karl
> >>>
> >>>
> >>> On Tue, Aug 2, 2011 at 11:47 AM, Karl Wright <daddywri@gmail.com> wrote:
> >>> > Hi Kate,
> >>> >
> >>> > The field mapping won't do the trick because the RSS connector is
> >>> > currently very selective about what fields it extracts - it by no
> >>> > means extracts all of them, so the ones that it *does* extract from
> >>> > the feed are "special".
> >>> >
> >>> > The behavior you describe sounds like a bug to me.  I'll go
> >>> > spelunking through the code at first opportunity.  In the meantime,
> >>> > could you create a Jira ticket describing the behavior you see vs.
> >>> > the behavior you want?
> >>> >
> >>> > Thanks!
> >>> > Karl
> >>> >
> >>> > On Tue, Aug 2, 2011 at 11:41 AM, K McGonigal <kmcgoniga@gmail.com> wrote:
> >>> >> Hi,
> >>> >>
> >>> >> I'm trying to use ManifoldCF to index an RSS feed into Solr.  It
> >>> >> sort of works, but my main problem at the moment is that the
> >>> >> *channel* description from the RSS feed is written to the
> >>> >> "description" field in Solr when I would really like the *item*
> >>> >> description to be written instead.
> >>> >>
> >>> >> I have a typical RSS feed with the general structure:
> >>> >>
> >>> >> <rss>
> >>> >>     <channel>
> >>> >>         <title></title>
> >>> >>         <link></link>
> >>> >>         <description> *** the description I don't want *** </description>
> >>> >>         <item>
> >>> >>             <title></title>
> >>> >>             <link></link>
> >>> >>             <pubDate></pubDate>
> >>> >>             <description> *** the description I do want *** </description>
> >>> >>             <author></author>
> >>> >>             <category></category>
> >>> >>         </item>
> >>> >>     </channel>
> >>> >> </rss>
> >>> >>
> >>> >> I tried setting up the field mapping on the job with the XPath
> >>> >> address of the second description, i.e.
> >>> >> "/rss/channel/item/description" as the source, but that did not work.
> >>> >>
> >>> >> I suspect I'm overlooking something simple, but I've spent 2 days
> >>> >> trying to solve it.  I would be grateful for any help.
> >>> >>
> >>> >>
> >>> >> Kate McGonigal
> >>> >>
> >>> >>
> >>> >>
> >>> >
> >>
> >>
> >
>
