manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web Connector and dates
Date Mon, 01 Jul 2013 10:26:47 GMT
Hi Stephane,

This was committed last week.  Every document from every repository
connector now receives an ingestion date as an attribute of the document.
For Solr, there is a new field under the "Schema" tab where you can specify
which Solr field should receive the ingestion date.

Thanks,
Karl


On Tue, Jun 25, 2013 at 9:27 AM, Stephane Gamard <stephane@gamard.net> wrote:

> Hi Karl,
>
>
> As per your recommendation I've filed
> https://issues.apache.org/jira/browse/CONNECTORS-735. I did not know you
> were keeping track of versions. This would work for me (for the moment), as
> it would let me have the latest content from a crawl in Solr. How can I add
> the content's version to the ingestion?
>
>
> Cheers,
>
>
>
> On June 25, 2013 at 3:19:15 PM, Karl Wright (daddywri@gmail.com) wrote:
>
> Hi Stephane,
>
> This is tricky, because the date is included in the index and yet the
> version of the document better not include the date, or there can be no
> incremental behavior.  However, it is possible to do this.  If you need
> such a feature, please create a ticket.  I'm very behind at the moment so
> it is unlikely to be worked on promptly, but I will get to it as soon as I
> can.
>
> Karl
>
>
>
>
>
> On Tue, Jun 25, 2013 at 9:14 AM, Stephane Gamard <stephane@gamard.net> wrote:
>
>> Hi Karl,
>>
>>
>> I hear you about the web date. I was hoping Manifold would give me access
>> to the date it crawled the document, and would update that date in case the
>> page had been updated (in a later fetch). Would that kind of information be
>> available?
>>
>>
>>
>> On June 25, 2013 at 3:06:04 PM, Karl Wright (daddywri@gmail.com) wrote:
>>
>> Hi Stephane,
>>
>> Web connector content does not in general include a date - it is not in
>> the content, and there is no way to generate it out of nothing.  Thus the
>> Web connector has no facility for processing dates, and does not attempt to
>> do anything with them even when the documents it is crawling were
>> referenced by an RSS feed.
>>
>> The date for content indexed by the RSS connector comes, if present, from
>> fields in the RSS feed.  The dates are carried down from the feed to the
>> referenced content.  This is one specialization that makes the RSS
>> connector different from the more general Web connector.
>>
>> As for your observation that you are seeing no dates at all in Solr, as
>> usual I must request that you include the Solr log info output for a
>> document that you think should have a date attached but doesn't.  This info
>> output shows all the arguments passed to Solr from ManifoldCF, and their
>> names.  It should be obvious what is going on if we can see one of those
>> lines.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Tue, Jun 25, 2013 at 8:55 AM, Stephane Gamard <stephane@gamard.net> wrote:
>>
>>> Hi All,
>>>
>>>
>>> I'm getting more and more confused about the dates of ingested content.
>>> Karl explained to me the (not yet documented) pudateiso metadata for the RSS
>>> connector, and now I'm mixing it with content from the web connector as well.
>>> My ingested content from the web connector has no date. I did the
>>> following to make sure it would get something (tried multiple configs):
>>>
>>>
>>>
>>> on my solr-output:
>>>
>>>
>>> And on my job:
>>>
>>> The ingested content has none of the date fields (test and/or _date)
>>> populated. Does the web connector abide by the same rules as the file and
>>> other connectors, as described here:
>>> https://issues.apache.org/jira/browse/CONNECTORS-657
>>>
>>>
>>
>
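
The point made earlier in the thread, that the document version must not
include the crawl date or there can be no incremental behavior, can be
illustrated with a small sketch. This is not ManifoldCF code or its API:
`version_string`, `ToyIndex`, and the field names are hypothetical, and the
content hash is only an assumed stand-in for however a connector really
computes versions. The ingestion date is stored as a separate attribute and
is refreshed only when the version changes.

```python
import hashlib
from datetime import datetime, timezone


def version_string(content: bytes) -> str:
    """Hypothetical version: depends only on the content, never on crawl time."""
    return hashlib.sha256(content).hexdigest()


class ToyIndex:
    """Toy stand-in for an output index such as Solr (not a real client)."""

    def __init__(self):
        # url -> {"version": str, "ingested": datetime}
        self.docs = {}

    def crawl(self, url: str, content: bytes, now: datetime) -> bool:
        """Index the document if its version changed; return True if indexed."""
        version = version_string(content)
        existing = self.docs.get(url)
        if existing is not None and existing["version"] == version:
            # Unchanged document: skip it and keep the old ingestion date.
            return False
        # New or changed document: store it with a fresh ingestion date.
        self.docs[url] = {"version": version, "ingested": now}
        return True


index = ToyIndex()
t1 = datetime(2013, 6, 25, tzinfo=timezone.utc)
t2 = datetime(2013, 7, 1, tzinfo=timezone.utc)

first = index.crawl("http://example.com/", b"hello", t1)
second = index.crawl("http://example.com/", b"hello", t2)
date_after_second = index.docs["http://example.com/"]["ingested"]
third = index.crawl("http://example.com/", b"hello, updated", t2)
date_after_third = index.docs["http://example.com/"]["ingested"]
```

If the crawl time were folded into `version_string`, the second crawl would
see a different version every time and re-index an unchanged page, which is
the loss of incremental behavior described above.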
