manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Trouble indexing a Twitter search in RSS format
Date Tue, 16 Aug 2011 09:49:56 GMT

Hi Kate,

Another point you should be aware of - this site has a robots
exclusion for crawlers, so unless you override that you will not be
able to fetch either feeds or content.  There are two ways to do the
override - you can set it to just allow the feed itself, or you can
set it to allow both feed and content.  If you select the former, then
any secondary (document) fetches will be disallowed.
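
To make the two override choices concrete, here is a minimal Python sketch of the decision being described. It is not ManifoldCF code or configuration: the `may_fetch` helper, the override names ("none", "feed_only", "all"), the "example-crawler" user agent, and the robots.txt text are all assumptions made up for illustration. Only the standard-library `urllib.robotparser` calls are real, and the two URLs are the feed and one status URL from this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes every crawler (not the site's real file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

FEED = "http://search.twitter.com/search.rss?q=Campylobacter"
DOC = "http://twitter.com/DraRositaperez/statuses/103103998965456896"


def robots_allows(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def may_fetch(url, is_feed, override, robots_txt=ROBOTS_TXT,
              user_agent="example-crawler"):
    """Model the override choices (names are hypothetical):
    "none"      - obey robots.txt for everything
    "feed_only" - ignore robots.txt for the feed, obey it for documents
    "all"       - ignore robots.txt for both feed and documents
    """
    if override == "all":
        return True
    if override == "feed_only" and is_feed:
        return True
    return robots_allows(robots_txt, user_agent, url)


if __name__ == "__main__":
    for override in ("none", "feed_only", "all"):
        print(override,
              "feed:", may_fetch(FEED, True, override),
              "document:", may_fetch(DOC, False, override))
```

Under these assumptions, the "feed_only" setting lets the feed through while the secondary document fetch is still refused, which is the behavior described above.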

Should you crawl repeatedly when the site owner says "no robots", you
can also wind up being blocked by the site owner.  In that case your
crawls will all cease to work suddenly and without warning.

Thanks,
Karl


On Tue, Aug 16, 2011 at 5:44 AM, Karl Wright <daddywri@gmail.com> wrote:
> Using your twitter RSS feed, dechromed mode="description", and chromed
> mode="skip", and turning off robots exclusion, I get a number of
> indexing operations. The following Solr log output corresponds to one
> such:
>
> INFO: {add=[http://twitter.com/DraRositaperez/statuses/103103998965456896]} 0 2
> Aug 16, 2011 5:28:52 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract
> params={literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter&literal.id=http://twitter.com/DraRositaperez/statuses/103103998965456896&literal.title=RT+@MicrobeWorld:+Campylobacter+bacteria:+Campylobacter+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+http://t.co/0Bk8mTm&literal.pubdate=1313416883000}
> status=0 QTime=2
>
> The document's source, title, and pubdate seem to all be set.  The
> feed's "description" field is the actual content that is being indexed
> into Solr, so that is not present in the Solr url but should be
> present in the post data.  So the only question, then, is the
> "summary" field.  Looking at the feed itself, I see <title> fields and
> <description> fields, but no <content> fields, so it makes sense that
> there would be no summary metadata.
>
> Hope this helps.  Does this agree with what you are seeing?
> Karl
>
>>
>> For the rest, I suspect that you have been running the same job over
>> and over again to get the results you describe.  However, you should
>> be aware that ManifoldCF is an incremental crawler.  It will NOT
>> reindex content that has not changed between job runs.
>>
>> So the only result that is definitely weird is:
>>
>>> case 4)  "Dechromed content, if present, in 'description' field" and "Never
>>> use chromed content"
>>>                      --> Ingests but both "description" and "summary" fields
>>> ARE EMPTY in Solr
>>
>