manifoldcf-user mailing list archives

From "Markus Schuch" <markus_sch...@web.de>
Subject Re: Re: Re: Should a document with an empty version string always be reingested?
Date Fri, 04 Mar 2016 12:32:15 GMT
<html><head></head><body><div style="font-family: Verdana;font-size:
12.0px;"><div>Hi Karl,</div>

<div>&nbsp;</div>

<div>Yes, I am sure&nbsp;ingestDocumentWithException is called twice: the first call happens
in the first run, the second call in the second run. Both calls are made with the same arguments.</div>

<div>&nbsp;</div>

<div>I think the interesting part is in the IncrementalIngester:</div>

<div>The old version and the new version are compared, and an empty string is treated
like any other version.</div>

<div>&nbsp;</div>

<div>&nbsp; boolean needToReindex = (oldDocumentVersion == null);<br/>
&nbsp; if (needToReindex == false)<br/>
&nbsp; {<br/>
&nbsp; &nbsp;&nbsp;needToReindex = (!oldDocumentVersion.equals(newDocumentVersion)
&#124;&#124;<br/>
&nbsp; &nbsp; !oldOutputVersion.equals(fullSpec.getStageDescriptionString(outputStage).getVersionString())
&#124;&#124;<br/>
&nbsp; &nbsp;&nbsp;!oldAuthorityName.equals((newAuthorityNameString==null)?&quot;&quot;:newAuthorityNameString));<br/>
&nbsp; }<br/>
&nbsp; if (needToReindex == false)<br/>
&nbsp; {<br/>
&nbsp; &nbsp; needToReindex = (!oldTransformationVersion.equals(newTransformationVersion));<br/>
&nbsp; }</div>

<div>
<div>&nbsp;</div>

<div>In my case the old version and the new version are both&nbsp;&quot;&quot;, so
needToReindex stays false.</div>
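
<div>A minimal stand-alone sketch of this comparison, to make the effect reproducible outside MCF. The class name ReindexSketch, the method signature and the sample values are made up for illustration; this is not the actual IncrementalIngester code, only the same boolean logic as quoted above:</div>

```java
// Hypothetical stand-alone sketch; NOT the real ManifoldCF IncrementalIngester.
// It only mirrors the boolean logic of the snippet quoted above.
final class ReindexSketch {

  // Returns true when the document would be sent to the output again.
  static boolean needToReindex(
      String oldDocumentVersion, String newDocumentVersion,
      String oldOutputVersion, String newOutputVersion,
      String oldAuthorityName, String newAuthorityNameString,
      String oldTransformationVersion, String newTransformationVersion) {
    // A missing (null) old version always forces reindexing.
    boolean needToReindex = (oldDocumentVersion == null);
    if (!needToReindex) {
      // Otherwise any difference in document version, output version
      // or authority name forces reindexing.
      needToReindex = !oldDocumentVersion.equals(newDocumentVersion)
          || !oldOutputVersion.equals(newOutputVersion)
          || !oldAuthorityName.equals(
              (newAuthorityNameString == null) ? "" : newAuthorityNameString);
    }
    if (!needToReindex) {
      // Finally, a changed transformation version forces reindexing.
      needToReindex = !oldTransformationVersion.equals(newTransformationVersion);
    }
    return needToReindex;
  }

  public static void main(String[] args) {
    // Empty old and new version, everything else unchanged: prints false
    System.out.println(needToReindex("", "", "out1", "out1", "", null, "t1", "t1"));
    // No old version recorded (null): prints true
    System.out.println(needToReindex(null, "", "out1", "out1", "", null, "t1", "t1"));
    // Version changed from "" to "v2": prints true
    System.out.println(needToReindex("", "v2", "out1", "out1", "", null, "t1", "t1"));
  }
}
```

<div>The first call corresponds to my case: &quot;&quot; equals &quot;&quot;, so nothing marks the document for reindexing. Only a null old version or a changed value does.</div>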

<div>&nbsp;</div>

<div>I think this comparison had the same result in 1.7, but due to&nbsp;CONNECTORS-1153
the outputVersion check that feeds needToReindex was buggy.</div>

<div>&nbsp;</div>

<div>The question remains: shouldn&#39;t an empty version trigger reingestion?</div>

<div>&nbsp;</div>

<div>Regards</div>

<div>Markus</div>

<div>&nbsp;</div>

<div name="quote" style="margin:10px 5px 5px 10px; padding: 10px 0 10px 10px; border-left:2px
solid #C3D9E5; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
<div style="margin:0 0 10px 0;"><b>Sent:</b>&nbsp;Friday, 04 March
2016 at 13:21<br/>
<b>From:</b>&nbsp;&quot;Karl Wright&quot; &lt;daddywri@gmail.com&gt;<br/>
<b>To:</b>&nbsp;&quot;user@manifoldcf.apache.org&quot; &lt;user@manifoldcf.apache.org&gt;<br/>
<b>Subject:</b>&nbsp;Re: Re: Should a document with an empty version string
always be reingested?</div>

<div name="quoted-content">
<div>Hi Markus,
<div>&nbsp;</div>

<div>If you called ingestDocumentWithVersions() more than once, you should have seen
two indexing attempts.</div>

<div>&nbsp;</div>

<div>Are you sure this is indeed getting called twice?</div>

<div>&nbsp;</div>

<div>I&#39;ve looked briefly at the code and can find no reason why there would
be version-sensitive incremental behavior in this method call. I will go back and look more
carefully and get back to you.</div>

<div>&nbsp;</div>

<div>Karl</div>

<div>&nbsp;</div>
</div>

<div class="gmail_extra">&nbsp;
<div class="gmail_quote">On Fri, Mar 4, 2016 at 6:40 AM, Markus Schuch <span>&lt;<a
href="markus_schuch@web.de" target="_parent">markus_schuch@web.de</a>&gt;</span>
wrote:

<blockquote class="gmail_quote" style="margin: 0 0 0 0.8ex;border-left: 1.0px rgb(204,204,204)
solid;padding-left: 1.0ex;"><br/>
Hi Karl,<br/>
<br/>
thanks for the fast response.<br/>
<br/>
We have a simple connector (written before 1.7) that produces documents from an XML file,
and we use the empty version string to trigger ingestion on every job run. That is, the empty
version string is treated as &quot;alwaysRefetch&quot;, and the created document
is always sent down the pipeline along with this empty version string.<br/>
(The connector relied on the 1.x BaseRepositoryConnector.)<br/>
&nbsp;<br/>
I noticed the backward compatibility code in the BaseRepositoryConnector in 1.7+ and I used
this code to wire our custom connector code to the new 2.3 interface.<br/>
I debugged the document processing and, as expected, ingestDocumentWithException is still
called every time, as before, since an empty version string is still treated as alwaysRefetch.
But the document is only ingested into the output repository the first time the job runs.
On subsequent runs the output step stays inactive.<br/>
&nbsp;<br/>
I think we can boil my issue down to a specific question about one method of the IProcessActivity
interface:<br/>
&nbsp;<br/>
&nbsp; ingestDocumentWithException(String documentIdentifier,&nbsp;String version,
String documentURI, RepositoryDocument data)<br/>
&nbsp;<br/>
<br/>
Let&#39;s assume the following example flow (starting from an empty and clean MCF 2.3
system):<br/>
&nbsp;<br/>
(1) In a first run of my job&nbsp;<br/>
<br/>
&nbsp; &nbsp;&nbsp; &nbsp;ingestDocumentWithException( &quot;identiferX&quot;,
&quot;&quot;, &quot;documentUriX&quot;, repoDoc)&nbsp;// second param
is empty version string<br/>
<br/>
&nbsp; &nbsp; is called. This leads to ingestion of the document with the URI &quot;documentUriX&quot;.<br/>
<br/>
(2) In a second run of my job<br/>
<br/>
&nbsp; &nbsp; &nbsp; ingestDocumentWithException( &quot;identiferX&quot;,
&quot;&quot;, &quot;documentUriX&quot;, repoDoc) // second param is empty
version string<br/>
<br/>
&nbsp; &nbsp; is called again (with the same arguments).<br/>
<br/>
What is the expected behavior here?<br/>
Should the document be ingested again or not?<br/>
And if not, how should I trigger ingestion? By always sending a null version down the pipeline?<br/>
<br/>
The actual behavior:<br/>
- In 1.7 it is ingested again.<br/>
- In 2.3 it is _not_ ingested again.<br/>
<br/>
Regards,<br/>
Markus<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
Sent:&nbsp;Friday, 04 March 2016 at 12:11<br/>
From:&nbsp;&quot;Karl Wright&quot; &lt;<a href="daddywri@gmail.com" target="_parent">daddywri@gmail.com</a>&gt;<br/>
To:&nbsp;&quot;<a href="user@manifoldcf.apache.org" target="_parent">user@manifoldcf.apache.org</a>&quot;
&lt;<a href="user@manifoldcf.apache.org" target="_parent">user@manifoldcf.apache.org</a>&gt;<br/>
Subject:&nbsp;Re: Should a document with an empty version string always be reingested?<br/>
<br/>
<span>Hi Markus,<br/>
&nbsp;<br/>
The canonical way that a connector handles incrementality changed from 1.7 to 1.10.&nbsp;
We maintained backwards compatibility through the inclusion of legacy base connector methods.&nbsp;
CONNECTORS-1153 reported a problem in one of those base connector methods, which has been
fixed by 1.10.&nbsp; I can&#39;t tell whether this applies to your situation.<br/>
&nbsp;<br/>
On 2.x the base connector no longer has the legacy base connector methods
at all, so if you have a custom connector you will need to rework your connector class to
adhere to the newer model.&nbsp; Specifically, there is no longer any such method as &quot;getDocumentVersions()&quot;.&nbsp;
Instead, your connector must signal its disposition of any document using the IProcessActivity
methods available for that purpose.<br/>
&nbsp;<br/>
Can you describe in more detail what you are doing here?<br/>
(a) Is this a custom connector?<br/>
(b) Was it developed on 1.7 or before?<br/>
(c) Are you trying to run it on 1.10 or on 2.x?<br/>
&nbsp;<br/>
That will help me give you better responses.<br/>
&nbsp;<br/>
Karl<br/>
&nbsp;<br/>
&nbsp;<br/>
On Fri, Mar 4, 2016 at 5:28 AM, Markus Schuch &lt;<a href="markus_schuch@web.de" target="_parent">markus_schuch@web.de</a>&gt;
wrote:<br/>
<br/>
Hi,<br/>
&nbsp;<br/>
we ran on MCF 1.7 for quite a while, and in this environment a document sent to the ingestion
pipeline together with an empty version string was always reingested.<br/>
On MCF 2.3 this is no longer the case.<br/>
&nbsp;</span><br/>
I found&nbsp;<a href="https://issues.apache.org/jira/browse/CONNECTORS-1153"
target="_blank">https://issues.apache.org/jira/browse/CONNECTORS-1153</a>
and maybe the 1.7 behavior we were relying on&nbsp;was always a bug.

<div class="HOEnZb">
<div class="h5">&nbsp;<br/>
Question:<br/>
Is the new 2.3 behavior the expected way for the ingestion&nbsp;pipeline to handle an empty
version string?<br/>
And how can reingestion on every run (&quot;always&nbsp;reingest&quot;) be triggered?<br/>
&nbsp;<br/>
Thanks in Advance,<br/>
Markus</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div></div></body></html>
