manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Trouble indexing a Twitter search in RSS format
Date Mon, 15 Aug 2011 22:21:33 GMT
Never mind on the ticket  - I created it.  CONNECTORS-239.

Karl


On Mon, Aug 15, 2011 at 6:05 PM, Karl Wright <daddywri@gmail.com> wrote:
>> Also, there appears to be a little bug in that if "Use chromed content if no
>> dechromed content found" is selected, when you go back to edit that job, it
>> is not selected (i.e. neither of the bottom two radio buttons are active).
>> Should I open a JIRA ticket for that?
>
> Yes please.
>
> For the rest, I suspect that you have been running the same job over
> and over again to get the results you describe.  However, you should
> be aware that ManifoldCF is an incremental crawler.  It will NOT
> reindex content that has not changed between job runs.
>
> So the only result that is definitely weird is:
>
>> case 4)  "Dechromed content, if present, in 'description' field" and "Never
>> use chromed content"
>>                      --> Ingests but both "description" and "summary"
>> fields ARE EMPTY in Solr
>
> I'd like to play with this one here, if you can give me the URL in
> question that you are using.
>
> Karl
>
> On Mon, Aug 15, 2011 at 4:07 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
>> That makes sense, but my  RSS feed DOES have a "description" field within
>> the "item" field.
>>
>> Upon further experimentation with the two sets of dechromed radio buttons, I
>> found the following.
>>
>> case 1)  "No dechromed content" and "Use chromed content if no dechromed
>> content found"
>>                      --> Ingests to both "description" and "summary"
>> fields in Solr
>> e.g.
>>>
>>> INFO: {add=[http://twitter.com/MicrobeWorld/statuses/103102842524545025]}
>>> 0 0
>>> 15-Aug-2011 2:51:26 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/update/extract params={literal.service=Twitter&literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter&literal.summary=<em>Campylobacter</em>+bacteria:+<em>Campylobacter</em>+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+<a+href%3D"http://t.co/0Bk8mTm">http://t.co/0Bk8mTm</a>&literal.id=http://twitter.com/MicrobeWorld/statuses/103102842524545025&literal.title=Campylobacter+bacteria:+Campylobacter+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+http://t.co/0Bk8mTm&literal.pubdate=1313416607000} status=0 QTime=0
>>> 15-Aug-2011 2:51:31 PM org.apache.solr.update.processor.LogUpdateProcessor
>>> finish
>>
>>
>> case 2)  "No dechromed content" and "Never use chromed content"
>>                      --> didn't ingest
>>
>> case 3)  "Dechromed content, if present, in 'description' field" and "Use
>> chromed content if no dechromed content found"
>>                      --> didn't ingest
>>
>> case 4)  "Dechromed content, if present, in 'description' field" and "Never
>> use chromed content"
>>                      --> Ingests but both "description" and "summary"
>> fields ARE EMPTY in Solr
>> e.g.
>>>
>>> INFO: {add=[http://twitter.com/MicrobeWorld/statuses/103102842524545025]}
>>> 0 0
>>> 15-Aug-2011 3:04:02 PM org.apache.solr.update.processor.LogUpdateProcessor
>>> finish
>>
>>
>> I hope that is all to be expected.
>>
>> Also, there appears to be a little bug in that if "Use chromed content if no
>> dechromed content found" is selected, when you go back to edit that job, it
>> is not selected (i.e. neither of the bottom two radio buttons are active).
>> Should I open a JIRA ticket for that?
>>
>>
>> On Mon, Aug 15, 2011 at 11:49 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>> The behavior depends on the setting of the other pair of radio buttons
>>> on that tab.  You can select "Use chromed content if not found" or
>>> "Never use chromed content".  So, if the feed has no "description"
>>> field for the document, and the dechromed content setting is
>>> "description field", and the other setting is "Never use chromed
>>> content", no document will be indexed.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Aug 15, 2011 at 12:44 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
>>> > I deleted my twitter RSS job and created another one and now it works!
>>> >
>>> > Doing some experimentation, I see that when Dechromed Content is set to
>>> > "No dechromed content" it ingests fine, but when set to "if present, in
>>> > 'description' field" it doesn't do the ingestion (nothing is added to
>>> > Solr).  Is that to be expected?
>>> >
>>> >
>>> > Kate
>>> >
>>> >
>>> > On Mon, Aug 15, 2011 at 10:48 AM, Karl Wright <daddywri@gmail.com>
>>> > wrote:
>>> >>
>>> >> Regardless of the twitter sign-in issue, I'd still expect the RSS
>>> >> connector to index whatever it finds at the redirected page, even if
>>> >> it's not very useful stuff.  Could you send me a screen shot of the
>>> >> view page for the RSS connection and for the RSS job?  Also, if you
>>> >> could delete the job that contains the twitter RSS feed and recreate
>>> >> it, then crawl, I'd like to see the simple history for that crawl.
>>> >>
>>> >> Thanks,
>>> >> Karl
>>> >>
>>> >> On Mon, Aug 15, 2011 at 11:38 AM, K McGonigal <kmcgoniga@gmail.com>
>>> >> wrote:
>>> >> > Hmm, that's odd the URLs didn't work for you.  I've asked other
>>> >> > people here to try them and they had no problems.
>>> >> >
>>> >> > After your suggestion I tried the web connector (but still with no
>>> >> > access credentials) and it did pretty well ingesting the RSS feed,
>>> >> > so I might be able to just use that.
>>> >> >
>>> >> > I'm still mystified as to why the RSS connector couldn't handle it
>>> >> > though. I turned on DEBUG logging in Manifold, but that did not show
>>> >> > anything unusual.
>>> >> >
>>> >> > Thanks,
>>> >> > Kate
>>> >> >
>>> >> > On Fri, Aug 12, 2011 at 3:58 PM, Karl Wright <daddywri@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> When I drop any of these URLs into my browser, I get redirected to a
>>> >> >> login screen.  Therefore it looks to me like Twitter does some kind
>>> >> >> of session-based login, tracked with cookies.  That would require
>>> >> >> maintenance of session cookies which the RSS connector simply does
>>> >> >> not do, and the coding of a login sequence as well.
>>> >> >>
>>> >> >> This is not a straightforward feature to add to the RSS connector,
>>> >> >> by any means.
>>> >> >>
>>> >> >> The web connector does have support for login sequencing and cookie
>>> >> >> session maintenance, and it does know how to chase RSS feeds, so
>>> >> >> that might be an option for you to try.  The problem is that most
>>> >> >> login sequences are non-trivial to set up and you will need a lot of
>>> >> >> patience and web spelunking skills to get it right.  The
>>> >> >> documentation is of some help but really could use a good example.
>>> >> >>
>>> >> >>
>>> >> >> Hope this helps.
>>> >> >> Karl
>>> >> >>
>>> >> >> On Fri, Aug 12, 2011 at 4:42 PM, K McGonigal <kmcgoniga@gmail.com>
>>> >> >> wrote:
>>> >> >> > Sorry to bother everyone again but I'm having trouble with an RSS
>>> >> >> > connector job on a Twitter search. When I try to run a job on
>>> >> >> > http://search.twitter.com/search.rss?q=Campylobacter the fetch
>>> >> >> > appears to work OK, but the document ingestion does not occur.
>>> >> >> >
>>> >> >> > I was wondering if it is just my setup, or could it be the
>>> >> >> > redirection that Twitter does on the links. For instance, a link
>>> >> >> > shown in the RSS feed as
>>> >> >> > http://twitter.com/VashinkaInuiel/statuses/101493222852923393
>>> >> >> > redirects to
>>> >> >> > http://twitter.com/#!/VashinkaInuiel/statuses/101493222852923393
>>> >> >> > when it is followed.
>>> >> >> >
>>> >> >> > Any help is very appreciated.
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >
>>> >> >
>>> >
>>> >
>>
>>
>
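Note on incremental crawling: Karl's point in the thread is that content
which has not changed between job runs is not reindexed. The sketch below
illustrates that idea only. It is not ManifoldCF code: the function name
crawl_once, the use of a SHA-256 content hash as the "version", and the
in-memory dictionary are all assumptions made for illustration, and the
thread does not say how ManifoldCF actually tracks document versions.

```python
import hashlib
from typing import Callable, Dict, List


def crawl_once(
    documents: Dict[str, str],
    seen_versions: Dict[str, str],
    index: Callable[[str, str], None],
) -> List[str]:
    """Index only documents whose content changed since the previous run.

    documents     -- hypothetical mapping of URL -> fetched content
    seen_versions -- URL -> version recorded by earlier runs (updated in place)
    index         -- callback standing in for the output connection (e.g. Solr)
    Returns the list of URLs that were actually sent to the index.
    """
    indexed = []
    for url, content in documents.items():
        # Assumption: a content hash stands in for the crawler's version check.
        version = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if seen_versions.get(url) == version:
            continue  # unchanged since the last run, so skip it
        index(url, content)
        seen_versions[url] = version
        indexed.append(url)
    return indexed
```

In this sketch a second run over unchanged content sends nothing to the
index, which is the kind of effect Karl suspects when the same job is run
repeatedly.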

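Note on the two pairs of radio buttons: Karl's explanation in the thread
(the dechromed content setting combined with the chromed content fallback
setting) can be read as a small decision rule. The sketch below encodes
that reading. It is an illustration, not the RSS connector's source: the
function name choose_content and the mode strings "none", "description",
"use" and "never" are hypothetical, and it does not model the empty-field
result reported for case 4.

```python
from typing import Optional


def choose_content(
    description: Optional[str],
    chromed: Optional[str],
    dechromed_mode: str,
    chromed_mode: str,
) -> Optional[str]:
    """Return the text to index, or None if the document would be skipped.

    description    -- the feed item's "description" field, if any
    chromed        -- the content fetched from the item's link, if any
    dechromed_mode -- "none" or "description" (hypothetical labels)
    chromed_mode   -- "use" (use chromed content if no dechromed content
                      found) or "never" (hypothetical labels)
    """
    # Dechromed content is only considered when that setting asks for it.
    if dechromed_mode == "description" and description:
        return description
    # Otherwise fall back to chromed content, if the second setting allows it.
    if chromed_mode == "use" and chromed:
        return chromed
    # No dechromed content and no permitted fallback: nothing is indexed.
    return None
```

Under this reading, "description field" plus "Never use chromed content"
with no description in the feed indexes nothing, as Karl describes, and
"No dechromed content" plus "Never use chromed content" also leaves
nothing to index.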