manifoldcf-user mailing list archives

From Stephane Gamard <steph...@gamard.net>
Subject Re: Web Connector and dates
Date Tue, 25 Jun 2013 13:27:00 GMT
Hi Karl, 

As per your recommendation, I've filed https://issues.apache.org/jira/browse/CONNECTORS-735.
I did not know you were keeping track of versions. That would work for me (for the moment),
since it would let me have the latest content from a crawl in Solr. How can I add the content's
version to the ingestion?

Cheers, 


On June 25, 2013 at 3:19:15 PM, Karl Wright (daddywri@gmail.com) wrote:

Hi Stephane,

This is tricky, because the date is included in the index and yet the version of the document
better not include the date, or there can be no incremental behavior.  However, it is possible
to do this.  If you need such a feature, please create a ticket.  I'm very behind at the
moment so it is unlikely to be worked on promptly, but I will get to it as soon as I can.
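
To illustrate the tension described above, here is a minimal sketch in Python. It is not
ManifoldCF code: the function names and the idea of using a content hash as the version string
are assumptions made only for illustration. It shows why a version that embeds the fetch date
changes on every crawl, so an unchanged page would be re-indexed each time.

```python
import hashlib
from datetime import datetime, timezone


def version_without_date(content: bytes) -> str:
    # Hypothetical version string: depends only on the page content.
    return hashlib.sha256(content).hexdigest()


def version_with_date(content: bytes, fetched_at: datetime) -> str:
    # Hypothetical version string that also embeds the fetch time.
    return hashlib.sha256(content).hexdigest() + "+" + fetched_at.isoformat()


def needs_reindex(stored_version: str, new_version: str) -> bool:
    # Incremental crawling: re-index only when the version string differs.
    return stored_version != new_version


page = b"<html>unchanged page</html>"
first_fetch = datetime(2013, 6, 25, 9, 0, tzinfo=timezone.utc)
second_fetch = datetime(2013, 6, 26, 9, 0, tzinfo=timezone.utc)

# Same content, version without date: the second crawl can skip the document.
stable_skips = not needs_reindex(
    version_without_date(page), version_without_date(page)
)

# Same content, version with date: the second crawl always sees a "change".
dated_skips = not needs_reindex(
    version_with_date(page, first_fetch), version_with_date(page, second_fetch)
)

print("skipped without date in version:", stable_skips)
print("skipped with date in version:", dated_skips)
```

In this sketch, the crawl date would have to travel with the document as separate metadata
rather than as part of the version.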

Karl





On Tue, Jun 25, 2013 at 9:14 AM, Stephane Gamard <stephane@gamard.net> wrote:
Hi Karl, 

I hear you about the web date. I was hoping Manifold would give me access to the date it crawled
the document, and would update that date in case the page had been updated (in a later fetch).
Would that kind of information be available? 


On June 25, 2013 at 3:06:04 PM, Karl Wright (daddywri@gmail.com) wrote:

Hi Stephane,

Web connector content does not in general include a date - it is not in the content, and there
is no way to generate it out of nothing.  Thus the Web connector has no facility for processing
dates, and does not attempt to do anything with them even when the documents it is crawling
were referenced by an RSS feed.

The date for content indexed by the RSS connector comes, if present, from fields in the RSS
feed.  The dates are carried down from the feed to the referenced content.  This is one
specialization that makes the RSS connector different from the more general Web connector.
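
As a rough illustration of carrying a feed date down to an indexed document, the sketch below
converts an RSS-style pubDate string (RFC 822 format) into the ISO 8601 UTC form that Solr date
fields accept. The function name is hypothetical, and this is not the RSS connector's actual
code; the metadata name and exact format the connector sends are not shown in this thread.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def rss_pubdate_to_iso(pub_date: str) -> str:
    # Parse an RFC 822 style date such as "Tue, 25 Jun 2013 13:27:00 GMT".
    parsed = parsedate_to_datetime(pub_date)
    # Assumption: a date without zone information is treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Normalize to UTC and format as ISO 8601 with a trailing "Z".
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


iso = rss_pubdate_to_iso("Tue, 25 Jun 2013 13:27:00 GMT")
print(iso)
```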

As for your observation that you are seeing no dates at all in Solr, as usual I must request
that you include the Solr log info output for a document that you think should have a date
attached but doesn't.  This info output shows all the arguments passed to Solr from ManifoldCF,
and their names.  It should be obvious what is going on if we can see one of those lines.

Thanks,
Karl



On Tue, Jun 25, 2013 at 8:55 AM, Stephane Gamard <stephane@gamard.net> wrote:
Hi All, 

I'm getting more and more confused about the dates of ingested content. Karl explained to me
the (not yet documented) pudateiso metadata for the RSS connector, and now I'm mixing it with
content from the web connector as well. My ingested content from the web connector has no date.
I did the following to make sure it would get something (I tried multiple configurations):


on my solr-output:



And on my job:


The ingested content has none of the date fields (test and/or _date) populated. Does the web
connector abide by the same rules as the file and other connectors, as described here?
https://issues.apache.org/jira/browse/CONNECTORS-657