manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Field mapping for RSS feed
Date Thu, 04 Aug 2011 21:26:42 GMT
I confirmed that the Solr requests actually do get through fine:

Aug 4, 2011 5:24:15 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.onemansjazz.ca/content/view/328/30/]} 0 463
Aug 4, 2011 5:24:15 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=Radio+-+Play+lists&literal.id=http://www.onemansjazz.ca/content/view/328/30/&literal.title=July+2,+2011+Playlist&literal.pubdate=1309523437000} status=0 QTime=463
Aug 4, 2011 5:24:15 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 464
Aug 4, 2011 5:24:15 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=News+-+General&literal.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+Survey&literal.pubdate=1310475289000} status=0 QTime=464
Aug 4, 2011 5:24:15 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.onemansjazz.ca/content/view/331/30/]} 0 466
Aug 4, 2011 5:24:15 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=Radio+-+Play+lists&literal.id=http://www.onemansjazz.ca/content/view/331/30/&literal.title=July+16,+2011+Playlist&literal.pubdate=1310718848000} status=0 QTime=466
Aug 4, 2011 5:24:15 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.onemansjazz.ca/content/view/329/30/]} 0 464
Aug 4, 2011 5:24:15 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=Radio+-+Play+lists&literal.id=http://www.onemansjazz.ca/content/view/329/30/&literal.title=July+9,+2011+Playlist&literal.pubdate=1310070625000} status=0 QTime=464

So I'm not sure what you are seeing.
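
For anyone who wants to check which metadata fields reached Solr, here is a minimal sketch that decodes the params string from the first /update/extract request logged above. It is plain Python, not part of ManifoldCF or Solr. Treating literal.pubdate as milliseconds since the Unix epoch is an assumption based on the size of the value, and the names PARAMS, fields and published are just illustrative.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

# Query string copied from the first /update/extract log entry above.
PARAMS = (
    "literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/"
    "&literal.category=Radio+-+Play+lists"
    "&literal.id=http://www.onemansjazz.ca/content/view/328/30/"
    "&literal.title=July+2,+2011+Playlist"
    "&literal.pubdate=1309523437000"
)

# parse_qs returns a list of values per key; each key occurs once here.
fields = {name: values[0] for name, values in parse_qs(PARAMS).items()}

# Assumption: pubdate is milliseconds since the Unix epoch.
published = datetime.fromtimestamp(int(fields["literal.pubdate"]) / 1000, tz=timezone.utc)

for name in sorted(fields):
    print(f"{name} = {fields[name]}")
print("pubdate as UTC:", published.isoformat())
```

Listing the decoded field names this way makes it easy to see exactly which literal.* values a given request carried.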

Karl

On Thu, Aug 4, 2011 at 1:08 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>
> I guess the only caveat is that, to use this, one has to know to add a
> "summary" field to their Solr schema. Long-term, I wonder if the
> "field mapping" feature could be used to let users map any RSS element
> (based on its XPath "address") to any Solr field?
> <<<<<<
>
> The problem is that there are three different kinds of feeds that the
> RSS connector supports, and they have different names for each kind of
> item element.  The RSS connector attempts to normalize all that mess
> into something more standard.
>
>>>>>>>
> But I wonder if it is working properly for Dechromed Content =
> "Dechromed content, if present, in 'description' field". When I use
> that, nothing is sent to Solr, although the job terminates OK and
> doesn't hang like it was doing before. Is that what is to be expected?
> <<<<<<
>
> I'll look at this.  The behavior you should see is an indexing
> operation per document but the content should just include the
> description string.
>
>>>>>>>
> I'm actually still confused by all the dechromed options because I
> thought that the item description was used as the dechromed content.
> So does "Dechromed content, if present, in 'description' field" mean
> that the contents of the item description element will be used for
> indexing instead of the web page specified by the link?
> <<<<<<
>
> Your understanding is correct.  When I tried this last night I looked
> at the Simple History and it looked like the description data was sent
> to the Solr index (based on the reported size).  I'll have to see
> whether it actually gets there though.
>
> Karl
>
> On Thu, Aug 4, 2011 at 1:00 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
>> Works great now (with Dechromed Content = "No dechromed content").
>> Thanks!!
>>
>> I guess the only caveat is that, to use this, one has to know to add a
>> "summary" field to their Solr schema. Long-term, I wonder if the "field
>> mapping" feature could be used to let users map any RSS element (based on
>> its XPath "address") to any Solr field?
>>
>> But I wonder if it is working properly for Dechromed Content = "Dechromed
>> content, if present, in 'description' field". When I use that, nothing is
>> sent to Solr, although the job terminates OK and doesn't hang like it was
>> doing before. Is that what is to be expected?
>>
>> I'm actually still confused by all the dechromed options because I thought
>> that the item description was used as the dechromed content. So does
>> "Dechromed content, if present, in 'description' field" mean that the
>> contents of the item description element will be used for indexing instead
>> of the web page specified by the link?
>>
>> Sorry for all these questions. I appreciate your patience.
>>
>>
>> On Thu, Aug 4, 2011 at 5:11 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>> Hi Kate,
>>>
>>> I did two additional check-ins yesterday evening.  Would you be so
>>> kind as to synch up from trunk and try again?  I apologize for the
>>> confusion.
>>>
>>> Karl
>>>
>>> On Wed, Aug 3, 2011 at 8:13 AM, Karl Wright <daddywri@gmail.com> wrote:
>>> >>>>>>>
>>> > I find it odd that I would be the first person to have this problem.
>>> > You'd think it would be very common.
>>> > <<<<<<
>>> >
>>> > Actually, I've not encountered this before even though the RSS
>>> > connector is one of the most widely used connectors.  The only
>>> > situation this ever came up in before was when some MetaCarta clients
>>> > wanted to use the description field as primary content, which is why
>>> > it is an option for the "Dechromed Content" tab.  But new feature
>>> > requests are always welcome.
>>> >
>>> > Also, as you might guess by the Derby and HSQLDB issue that you
>>> > encountered, most of our users use PostgreSQL.  The Derby and HSQLDB
>>> > database support was added to simplify setup and allow tests to be
>>> > written that did not involve installing another package first.
>>> > However, each of these databases has known problems, some minor and
>>> > some more major.  Thus you might want to consider going to PostgreSQL
>>> > in the future if you plan on doing any serious crawling.
>>> >
>>> > Thanks again!
>>> > Karl
>>> >
>>> > On Tue, Aug 2, 2011 at 2:56 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
>>> >> Hi Karl,
>>> >>
>>> >> Thank you for your quick response. I've opened a Jira ticket for this,
>>> >> though I don't really understand what sort of solution you had in mind
>>> >> so I didn't propose anything.
>>> >>
>>> >> I'm afraid I don't understand exactly what the Dechromed Content
>>> >> options do either. I read about them in the End User Documentation,
>>> >> but there wasn't much there yet.
>>> >>
>>> >> I find it odd that I would be the first person to have this problem.
>>> >> You'd think it would be very common.
>>> >>
>>> >>
>>> >> Kate
>>> >>
>>> >>
>>> >> On Tue, Aug 2, 2011 at 11:05 AM, Karl Wright <daddywri@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> I just looked at the code.  It's not so much a bug as an oversight
>>> >>> of sorts.  The "description" or "content" fields are indexed as the
>>> >>> primary content of the document if the "chrome" mode is selected
>>> >>> accordingly.  If "None" is the "chrome" mode, then the item-level
>>> >>> description field is ignored even when present.
>>> >>>
>>> >>> So I recommend simply adding a new kind of "description" field for
>>> >>> when the "chrome" mode is set to "None".  "item/description" may be
>>> >>> its name, or maybe the full XPath, your choice.  Propose something
>>> >>> in the ticket and I'll respond.
>>> >>>
>>> >>> Thanks!
>>> >>> Karl
>>> >>>
>>> >>>
>>> >>> On Tue, Aug 2, 2011 at 11:47 AM, Karl Wright <daddywri@gmail.com>
>>> >>> wrote:
>>> >>> > Hi Kate,
>>> >>> >
>>> >>> > The field mapping won't do the trick because the RSS connector is
>>> >>> > currently very selective about what fields it extracts - it by no
>>> >>> > means extracts all of them, so the ones that it *does* extract from
>>> >>> > the feed are "special".
>>> >>> >
>>> >>> > The behavior you describe sounds like a bug to me.  I'll go
>>> >>> > spelunking through the code at first opportunity.  In the meantime,
>>> >>> > could you create a Jira ticket describing the behavior you see vs.
>>> >>> > the behavior you want?
>>> >>> >
>>> >>> > Thanks!
>>> >>> > Karl
>>> >>> >
>>> >>> > On Tue, Aug 2, 2011 at 11:41 AM, K McGonigal <kmcgoniga@gmail.com>
>>> >>> > wrote:
>>> >>> >> Hi,
>>> >>> >>
>>> >>> >> I'm trying to use ManifoldCF to index an RSS feed into Solr.  It
>>> >>> >> sort of works, but my main problem at the moment is that the
>>> >>> >> *channel* description from the RSS feed is written to the
>>> >>> >> "description" field in Solr when I would really like the *item*
>>> >>> >> description to be written instead.
>>> >>> >>
>>> >>> >> I have a typical RSS feed with the general structure:
>>> >>> >>
>>> >>> >> <rss>
>>> >>> >>     <channel>
>>> >>> >>         <title></title>
>>> >>> >>         <link></link>
>>> >>> >>         <description> *** the description I don't want *** </description>
>>> >>> >>         <item>
>>> >>> >>             <title></title>
>>> >>> >>             <link></link>
>>> >>> >>             <pubDate></pubDate>
>>> >>> >>             <description> *** the description I do want *** </description>
>>> >>> >>             <author></author>
>>> >>> >>             <category></category>
>>> >>> >>         </item>
>>> >>> >>     </channel>
>>> >>> >> </rss>
>>> >>> >>
>>> >>> >> I tried setting up the field mapping on the job with the XPath
>>> >>> >> address of the second description, i.e.
>>> >>> >> "/rss/channel/item/description" as the source, but that did not
>>> >>> >> work.
>>> >>> >>
>>> >>> >> I suspect I'm overlooking something simple, but I've spent 2 days
>>> >>> >> trying to solve it.  I would be grateful for any help.
>>> >>> >>
>>> >>> >>
>>> >>> >> Kate McGonigal
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >
>>> >>
>>> >>
>>> >
>>
>>
>
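
The channel-level versus item-level description distinction from the original question at the bottom of this thread can be illustrated with a small standalone sketch. It uses Python's standard xml.etree.ElementTree and is not how the RSS connector (which is Java) is implemented. The feed text and the example.com URLs are made-up placeholders that follow the structure given in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed following the structure from the original question.
FEED = """<rss>
    <channel>
        <title>Example feed</title>
        <link>http://example.com/</link>
        <description>channel description</description>
        <item>
            <title>First item</title>
            <link>http://example.com/1</link>
            <description>item description</description>
        </item>
    </channel>
</rss>"""

root = ET.fromstring(FEED)  # root is the <rss> element

# "channel/description" matches only a direct child of <channel>,
# so the item-level descriptions are not picked up here.
channel_description = root.findtext("channel/description")

# One description per <item>, i.e. /rss/channel/item/description.
item_descriptions = [
    item.findtext("description") for item in root.iterfind("channel/item")
]

print("channel:", channel_description)
print("items:  ", item_descriptions)
```

The two paths select different elements even though both are named "description", which is the distinction the field-mapping question hinges on.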
