manifoldcf-user mailing list archives

From Stephane Gamard <steph...@gamard.net>
Subject Re: Web Connector and dates
Date Tue, 25 Jun 2013 13:14:05 GMT
Hi Karl, 

I hear you about the web date. I was hoping Manifold would give me access to the date it crawled
the document, and would update that date in case the page had been updated (in a later fetch).
Would that kind of information be available? 


On June 25, 2013 at 3:06:04 PM, Karl Wright (daddywri@gmail.com) wrote:

Hi Stephane,

Web connector content does not in general include a date - it is not in the content, and there
is no way to generate it out of nothing.  Thus the Web connector has no facility for processing
dates, and does not attempt to do anything with them even when the documents it is crawling
were referenced by an RSS feed.

The date for content indexed by the RSS connector comes, if present, from fields in the RSS
feed.  The dates are carried down from the feed to the referenced content.  This is one
specialization that makes the RSS connector different from the more general Web connector.

As for your observation that you are seeing no dates at all in Solr, as usual I must request
that you include the Solr log info output for a document that you think should have a date
attached but doesn't.  This info output shows all the arguments passed to Solr from ManifoldCF,
and their names.  It should be obvious what is going on if we can see one of those lines.

Thanks,
Karl



On Tue, Jun 25, 2013 at 8:55 AM, Stephane Gamard <stephane@gamard.net> wrote:
Hi All, 

I'm getting more and more confused about the dates of ingested content. Karl explained to me
the (not yet documented) pudateiso metadata for the RSS connector, and now I'm mixing it with
content from the Web connector as well. My ingested content from the Web connector has no date.
I did the following to make sure it would get something (I tried multiple configurations): 


on my solr-output:



And on my job:


The ingested content has none of the date fields (test and/or _date) populated. Does the Web
connector abide by the same rules as the file and other connectors, as described here?
https://issues.apache.org/jira/browse/CONNECTORS-657