manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Re: Re: Should a document with an empty version string always be reingested?
Date Fri, 04 Mar 2016 12:59:58 GMT
I have done enough research to confirm that at least one of the connectors
shipped with MCF also relies on the empty version string: the JDBC connector.

I've therefore opened CONNECTORS-1283 and attached a patch.

Karl
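
For readers following the thread, the version comparison quoted further down can be reduced to a small, self-contained sketch. It is illustrative only: the class and method names are hypothetical, only the document-version part of the check is modeled, and the second method shows one possible "empty means always reindex" reading. It is not the CONNECTORS-1283 patch and not ManifoldCF code.

```java
// Hypothetical sketch, not ManifoldCF code: models only the
// document-version part of the needToReindex check quoted in the thread.
final class ReindexSketch {

    // Mirrors the quoted comparison: an empty version string is compared
    // like any other value, so "" versus "" counts as "unchanged".
    static boolean needToReindex(String oldVersion, String newVersion) {
        if (oldVersion == null) {
            return true; // never indexed before
        }
        return !oldVersion.equals(newVersion);
    }

    // Assumed alternative reading: an empty (or null) new version means
    // "no meaningful version", so the document is always reindexed.
    static boolean needToReindexEmptyMeansAlways(String oldVersion, String newVersion) {
        if (oldVersion == null || newVersion == null || newVersion.isEmpty()) {
            return true;
        }
        return !oldVersion.equals(newVersion);
    }

    public static void main(String[] args) {
        // Second job run with the same empty version string.
        System.out.println(needToReindex("", ""));                 // false
        System.out.println(needToReindexEmptyMeansAlways("", "")); // true
    }
}
```

Under the first method a second run with the same empty version string is skipped, which matches the 2.3 behavior Markus reports; under the second it would be indexed again.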

On Fri, Mar 4, 2016 at 7:42 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Markus,
>
> I agree that is one key bit of code, and I agree with your analysis.
>
> There obviously needs to be a way to signal "I don't have a meaningful
> document version string", and an empty string is not unreasonable for this
> purpose.  However, there's more to it than that.
>
> Specifically, the pipeline code is designed to make intelligent decisions
> on an output connection by output connection basis whether to index the
> document in that connector.  There is also an API concern: specifically, we
> *expect* that the caller will have checked whether a document needs to be
> indexed or not at the root level.  So the whole clause you have mentioned
> is, theoretically, unnecessary, if the connector is written right.  But we
> can't count on that.
>
> I will look through other connectors to see if there is any problem with
> an empty string being used as a signal for "don't care".  I will get back
> to you.
>
> Karl
>
>
> On Fri, Mar 4, 2016 at 7:32 AM, Markus Schuch <markus_schuch@web.de>
> wrote:
>
>> Hi Karl,
>>
>> Yes, I am sure ingestDocumentWithException is called twice: the first call
>> in the first run, the second call in the second run. Both calls happen with
>> the same arguments.
>>
>> I think the interesting part is in the IncrementalIngester:
>> The old version and the new version are compared. And an empty string is
>> treated like any other version.
>>
>>   boolean needToReindex = (oldDocumentVersion == null);
>>   if (needToReindex == false)
>>   {
>>     needToReindex = (!oldDocumentVersion.equals(newDocumentVersion) ||
>>       !oldOutputVersion.equals(fullSpec.getStageDescriptionString(outputStage).getVersionString()) ||
>>       !oldAuthorityName.equals((newAuthorityNameString==null)?"":newAuthorityNameString));
>>   }
>>   if (needToReindex == false)
>>   {
>>     needToReindex = (!oldTransformationVersion.equals(newTransformationVersion));
>>   }
>>
>> In my case the old version and the new version are both "", so
>> needToReindex stays false.
>>
>> I think this comparison had the same result in 1.7, but due to
>> CONNECTORS-1153 needToReindex was true because the outputVersion check
>> was buggy.
>>
>> The question remains: shouldn't an empty version trigger reingestion?
>>
>> Regards
>> Markus
>>
>> *Sent:* Friday, 04 March 2016 at 13:21
>> *From:* "Karl Wright" <daddywri@gmail.com>
>> *To:* "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>> *Subject:* Re: Re: Should a document with an empty version string always
>> be reingested?
>> Hi Markus,
>>
>> If you called ingestDocumentWithVersions() more than once, you should
>> have seen two indexing attempts.
>>
>> Are you sure this is indeed getting called twice?
>>
>> I've looked briefly at the code and can find no reason why there would be
>> version-sensitive incremental behavior in this method call. I will go back
>> and look more carefully and get back to you.
>>
>> Karl
>>
>>
>> On Fri, Mar 4, 2016 at 6:40 AM, Markus Schuch <markus_schuch@web.de>
>> wrote:
>>>
>>>
>>> Hi Karl,
>>>
>>> thanks for the fast response.
>>>
>>> We have a simple connector (written before 1.7), that produces documents
>>> from an XML file and we use the empty version string to trigger ingestion
>>> on every job run. Meaning the empty version string is considered as
>>> "alwaysRefetch" and the created document is always sent down the pipeline
>>> along with this empty version string.
>>> (the connector was relying on the 1.x BaseRepositoryConnector)
>>>
>>> I noticed the backward compatibility code in the BaseRepositoryConnector
>>> in 1.7+ and I used this code to wire our custom connector code to the new
>>> 2.3 interface.
>>> I debugged the document processing and, as expected,
>>> ingestDocumentWithException is still called every time, as before, since an
>>> empty version string is still considered as alwaysRefetch. But the
>>> document is only ingested into the output repository the first time the job
>>> runs. On subsequent runs the output step stays inactive.
>>>
>>> I think we can boil my issue down to a specific question about one
>>> method of the IProcessActivity interface:
>>>
>>>   ingestDocumentWithException(String documentIdentifier, String version, String documentURI, RepositoryDocument data)
>>>
>>>
>>> Let's assume the following example flow (starting from an empty and
>>> clean MCF 2.3 system):
>>>
>>> (1) In a first run of my job
>>>
>>>       ingestDocumentWithException("identifierX", "", "documentUriX", repoDoc) // second param is empty version string
>>>
>>>     is called. This leads to ingestion of the document with the URI
>>> "documentUriX".
>>>
>>> (2) In a second run of my job
>>>
>>>       ingestDocumentWithException("identifierX", "", "documentUriX", repoDoc) // second param is empty version string
>>>
>>>     is called again (with the same arguments).
>>>
>>> What is the expected behavior here?
>>> Should the document be ingested again or not?
>>> And if not, how should I trigger ingestion? By always sending a null
>>> version down the pipeline?
>>>
>>> The actual behavior:
>>> - In 1.7 it is ingested again.
>>> - In 2.3 it is _not_ ingested again.
>>>
>>> Regards,
>>> Markus
>>>
>>>
>>>
>>>
>>>
>>> Sent: Friday, 04 March 2016 at 12:11
>>> From: "Karl Wright" <daddywri@gmail.com>
>>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>> Subject: Re: Should a document with an empty version string always be
>>> reingested?
>>>
>>> Hi Markus,
>>>
>>> The canonical way that a connector handles incrementality changed from
>>> 1.7 to 1.10.  We maintained backwards compatibility through the inclusion
>>> of legacy base connector methods.  CONNECTORS-1153 reported a problem in
>>> one of those base connector methods, which has been fixed by 1.10.  I can't
>>> tell whether this applies to your situation.
>>>
>>> On 2.x the base connector no longer has the legacy base connector
>>> methods at all, so if you have a custom connector you will need
>>> to rework your connector class to adhere to the newer model.  Specifically,
>>> there is no such method anymore as "getDocumentVersions()".  Instead, your
>>> connector must signal its disposition of any document using the
>>> IProcessActivity methods available for that purpose.
>>>
>>> Can you describe in more detail what you are doing here?
>>> (a) Is this a custom connector?
>>> (b) Was it developed on 1.7 or before?
>>> (c) Are you trying to run it on 1.10 or on 2.x?
>>>
>>> That will help me give you better responses.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Mar 4, 2016 at 5:28 AM, Markus Schuch <markus_schuch@web.de>
>>> wrote:
>>>
>>> Hi,
>>>
>>> We ran on MCF 1.7 for quite a while, and in this environment a document
>>> sent to the ingestion pipeline together with an empty version string was
>>> always reingested.
>>> On MCF 2.3 this is no longer the case.
>>>
>>> I found
>>> https://issues.apache.org/jira/browse/CONNECTORS-1153
>>> and maybe the 1.7 behavior we were relying on was always a bug.
>>>
>>> Question:
>>> Is the new 2.3 behavior the expected way for the ingestion pipeline to
>>> handle an empty version string?
>>> And how can "always reingest" be triggered?
>>>
>>> Thanks in advance,
>>> Markus
>>>
>>
>
