The ManifoldCF version is a packed string designed only to assist in incremental crawling.  You would not want to index it.


As per your recommendation I've filed https://issues.apache.org/jira/browse/CONNECTORS-735. I did not know you were keeping track of the version. This would work for me (for the moment), since it would let me have the latest content from a crawl in Solr. How can I add the content's version to the ingestion?


This is tricky, because the date is included in the index and yet the version of the document better not include the date, or there can be no incremental behavior.  However, it is possible to do this.  If you need such a feature, please create a ticket.  I'm very behind at the moment so it is unlikely to be worked on promptly, but I will get to it as soon as I can.
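
To illustrate the conflict described above, here is a minimal sketch of version-based incremental crawling. It is not ManifoldCF code: the function names, the hash value, and the timestamps are all made up for illustration, and it assumes only that a document is re-indexed when its stored version string differs from the newly computed one.

```python
def needs_reindex(stored_version, current_version):
    """Re-index only when the version string changed since the last crawl."""
    return stored_version != current_version


def version_without_date(content_hash, crawl_time):
    # Hypothetical: crawl_time is deliberately ignored, so the string
    # stays stable for as long as the content itself is unchanged.
    return content_hash


def version_with_date(content_hash, crawl_time):
    # Hypothetical: folding the crawl time in makes the string differ
    # on every crawl, even when the content is identical.
    return f"{content_hash}+{crawl_time}"


# The same unchanged page seen in two successive crawls.
first = version_without_date("abc123", "2013-07-01T10:00:00Z")
second = version_without_date("abc123", "2013-07-02T10:00:00Z")
unchanged_skipped = not needs_reindex(first, second)

first_dated = version_with_date("abc123", "2013-07-01T10:00:00Z")
second_dated = version_with_date("abc123", "2013-07-02T10:00:00Z")
dated_reindexed = needs_reindex(first_dated, second_dated)
```

With the date left out, the unchanged page is skipped on the second crawl; with the date folded in, it is re-indexed every time, which is the loss of incremental behavior mentioned above.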


I hear you about the web date. I was hoping Manifold would give me access to the date it crawled the document, and would update that date in case the page had been updated (in a later fetch). Would that kind of information be available? 

Web connector content does not in general include a date - it is not in the content, and there is no way to generate it out of nothing.  Thus the Web connector has no facility for processing dates, and does not attempt to do anything with them even when the documents it is crawling were referenced by an RSS feed.

The date for content indexed by the RSS connector comes, if present, from fields in the RSS feed.  The dates are carried down from the feed to the referenced content.  This is one specialization that makes the RSS connector different from the more general Web connector.
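
As a rough sketch of what "carried down from the feed" means, the snippet below maps each feed item's link to its publication date. It is not the RSS connector's implementation: the feed text, the URLs, and the `dates_by_link` helper are invented for illustration, and it assumes a plain RSS 2.0 feed with `link` and `pubDate` elements.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up feed: the first item has a pubDate, the second does not.
FEED = """<rss version="2.0"><channel>
<item><link>http://example.com/a</link><pubDate>Mon, 01 Jul 2013 10:00:00 GMT</pubDate></item>
<item><link>http://example.com/b</link></item>
</channel></rss>"""


def dates_by_link(feed_xml):
    """Map each item's link to its pubDate in ISO 8601 form, or None if absent."""
    result = {}
    for item in ET.fromstring(feed_xml).iter("item"):
        link = item.findtext("link")
        pub = item.findtext("pubDate")
        # The date exists only if the feed supplies it; nothing is invented.
        result[link] = parsedate_to_datetime(pub).isoformat() if pub else None
    return result


carried = dates_by_link(FEED)
```

A document reached only by following links from a web page has no such feed entry to draw on, which is why the same page can have a date when it is found through a feed and none when it is found by the Web connector.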

As for your observation that you are seeing no dates at all in Solr, as usual I must request that you include the Solr log info output for a document that you think should have a date attached but doesn't.  This info output shows all the arguments passed to Solr from ManifoldCF, and their names.  It should be obvious what is going on if we can see one of those lines.


I'm getting more and more confused about the date of ingested content. Karl explained to me the (not yet documented) pudateiso metadata for the RSS connector, and now I'm mixing it up with content from the web connector as well. My ingested content from the web connector has no date. I did the following to make sure it would get something (tried multiple configurations): 

The ingested content has none of the date fields (test and/or _date) populated. Does the web connector abide by the same rules as the file and other connectors, as described here: https://issues.apache.org/jira/browse/CONNECTORS-657