manifoldcf-user mailing list archives

From Olivier Tavard <olivier.tav...@francelabs.com>
Subject Re: Logging and Document filter transformation connector
Date Wed, 17 Oct 2018 14:35:09 GMT
Hi Karl,

I opened a ticket on JIRA; it will be simpler to discuss it there: https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1547

Thanks,

Olivier 


> On 11 Oct 2018, at 19:25, Karl Wright <daddywri@gmail.com> wrote:
> 
> The fact that the history is different for the two suggests that the mechanism is different. You can turn on connector logging and that should help figure out why the png is being rejected. Once we know that it should be possible to consider improvements to the history.
> 
> Karl
> 
> On Thu, Oct 11, 2018, 10:41 AM Olivier Tavard <olivier.tavard@francelabs.com> wrote:
> Hello Karl,
> 
> OK thanks for the detailed explanation.
> So I understand that we cannot add a distinct result code, because the repository connector has no knowledge of the pipeline.
> My problem is that sometimes we do not have any activity status for an excluded file.
> 
> To be more precise, I created a job that only keeps doc and docx extensions (web repository connector and Document filter transformation connector). If you look at the screenshot, you will see that the html and png files are excluded by the repository connector as expected, but only the html file has a specific activity log entry with an explicit result code (EXCLUDEURL):
> 
> The png file has only a "fetch" activity with a 200 result code. I had to enable debug mode to find a log line about the exclusion of the png file:
> "Removing url 'https://www.datafari.com/assets/img/img_feature_phone_list.png' because it had the wrong content type ('image/png')"
> The related code is at line 902 of WebcrawlerConnector, and it contains only:
> activityResultCode = null; 
> 
> On the other hand, for the html file, the relevant section is at line 1366, and it has explicit code to handle that:
> 
> errorCode = activities.EXCLUDED_URL;
> errorDesc = "Rejected due to URL ('"+documentIdentifier+"')";
> activities.noDocument(documentIdentifier,versionString);
> 
> I do not understand why the activity log entry with a specific result code is present for the html file but not, for example, for the png file. Would it be possible to have the same log entry for all excluded files?
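
The request above can be pictured as two rejection paths that both write a history row with an explicit result code. What follows is a minimal, self-contained sketch under stated assumptions, not the WebcrawlerConnector code: `History`, `HistoryEntry`, `rejectForUrl`, `rejectForMimeType`, and both result-code strings are hypothetical stand-ins introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ExclusionHistorySketch {
    // Hypothetical result codes; the real values live in ManifoldCF and may differ.
    static final String EXCLUDED_URL = "EXCLUDEDURL";
    static final String EXCLUDED_MIMETYPE = "EXCLUDEDMIMETYPE";

    /** One row of a simplified history table (hypothetical type). */
    static final class HistoryEntry {
        final String documentIdentifier;
        final String resultCode;
        final String description;

        HistoryEntry(String documentIdentifier, String resultCode, String description) {
            this.documentIdentifier = documentIdentifier;
            this.resultCode = resultCode;
            this.description = description;
        }
    }

    /** Hypothetical stand-in for the object that records history rows. */
    static final class History {
        final List<HistoryEntry> entries = new ArrayList<>();

        void record(String documentIdentifier, String resultCode, String description) {
            entries.add(new HistoryEntry(documentIdentifier, resultCode, description));
        }
    }

    // URL-based rejection: reuses the description text quoted in the message.
    static void rejectForUrl(History history, String documentIdentifier) {
        history.record(documentIdentifier, EXCLUDED_URL,
            "Rejected due to URL ('" + documentIdentifier + "')");
    }

    // Content-type rejection: records an explicit code instead of leaving it null.
    static void rejectForMimeType(History history, String documentIdentifier, String mimeType) {
        history.record(documentIdentifier, EXCLUDED_MIMETYPE,
            "Rejected due to content type ('" + mimeType + "')");
    }

    public static void main(String[] args) {
        History history = new History();
        rejectForUrl(history, "https://example.com/index.html");
        rejectForMimeType(history, "https://example.com/logo.png", "image/png");
        for (HistoryEntry e : history.entries) {
            System.out.println(e.resultCode + " " + e.documentIdentifier + " " + e.description);
        }
    }
}
```

With both paths going through the same `record` call, every excluded document ends up with one row whose result code says why it was dropped, which is the behaviour asked for here.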
> 
> Thanks,
> Best regards,
> 
> Olivier 
> 
>> On 11 Oct 2018, at 16:00, Karl Wright <daddywri@gmail.com> wrote:
>> 
>> Hi Olivier,
>> 
>> The Repository connector has no knowledge of what the pipeline looks like. It simply asks the framework whether the mime type, length, etc. is acceptable to the downstream pipeline. It's the connector's responsibility to note the reason for the rejection in the simple history, but it does not have any knowledge whatsoever of which connector rejected the document, and therefore cannot say which transformer or output rejected the document.
>> 
>> Transformation and output connectors which respond to document mime type or length checks likewise do not have any knowledge of the upstream connector that is doing the checking.
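
The check described above can be sketched as a yes/no question passed through the framework. This is a simplified illustration under stated assumptions, not ManifoldCF code: `Pipeline`, `isMimeTypeAcceptable`, `decide`, and the returned labels are hypothetical names, and each downstream stage is modelled as a plain `Predicate<String>` over the mime type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class PipelineCheckSketch {
    /** Hypothetical stand-in for the framework that sits between connectors. */
    static final class Pipeline {
        private final List<Predicate<String>> stages;

        Pipeline(List<Predicate<String>> stages) {
            this.stages = stages;
        }

        // Returns only true or false: the caller never learns which stage refused.
        boolean isMimeTypeAcceptable(String mimeType) {
            for (Predicate<String> stage : stages) {
                if (!stage.test(mimeType)) {
                    return false;
                }
            }
            return true;
        }
    }

    // Repository-connector side: it can note that a rejection happened and why
    // (mime type), but not which downstream connector caused it.
    static String decide(Pipeline pipeline, String mimeType) {
        return pipeline.isMimeTypeAcceptable(mimeType) ? "PROCESS" : "REJECTED_MIMETYPE";
    }

    public static void main(String[] args) {
        // One stage acting like a document filter, one like an output that accepts anything.
        Predicate<String> documentFilter = mime -> mime.equals("application/msword");
        Predicate<String> output = mime -> true;
        Pipeline pipeline = new Pipeline(Arrays.asList(documentFilter, output));

        System.out.println(decide(pipeline, "application/msword"));
        System.out.println(decide(pipeline, "image/png"));
    }
}
```

Because the answer is a bare boolean, the repository connector can log the reason category (mime type, length, and so on) but has nothing from which to name the rejecting transformer or output.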
>> 
>> Karl
>> 
>> 
>> 
>> On Thu, Oct 11, 2018 at 9:31 AM Olivier Tavard <olivier.tavard@francelabs.com> wrote:
>> Hello,
>> 
>> I have a question regarding the Document filter transformation connector and its logging.
>> I would like to see all the documents excluded by the rules configured in the Document filter transformation connector, either in the Simple History or in the MCF log, but so far this is not easy.
>> 
>> Let’s say that I want to crawl a website and index html pages only. So I configure a web repository connector with a Document filter transformation connector, and I create a rule with only one allowed mime type and one file extension. So far so good: the job works well. But if I want to see, in the MCF log or in the Simple History, all the files that were excluded by the transformation connector, it quickly gets complicated: I have to search manually for all the files that were fetched but not processed by the Tika transformation connector or ingested by the output connector.
>> 
>> From my understanding of the code, the Document filter transformation connector can communicate its exclusion rules directly to the web repository connector, so the documents that need to be excluded are not processed in the Document filter transformation connector but are excluded directly by the web repository connector.
>> So in the Simple History, I can see that a document that will be excluded has a "fetch" activity and that’s it; there is no additional information about it.
>> Would it be possible to add a log entry with an explicit result code, such as excluded by "document filter connector", or something like the entry logged when the document is excluded by the repository connector?
>>  
>> Thank you,
>> Best regards,
>> Olivier 
>> 
> 
> <simple_history_web_job_document_filter.jpg>

