manifoldcf-user mailing list archives

From Olivier Tavard <olivier.tav...@francelabs.com>
Subject Re: Logging and Document filter transformation connector
Date Thu, 11 Oct 2018 14:41:30 GMT
Hello Karl,

OK thanks for the detailed explanation.
So I understand that we cannot add a distinct result code if the repository connector has
no knowledge of the pipeline.
My problem is that sometimes we do not have any activity status about an excluded file.

To be more precise, I created a job that only keeps doc and docx extensions (web repository
connector and document filter transformation connector). If you look at the screenshot, you
will see that the html and the png files are excluded by the repository connector as expected
but only the html file has a specific activity log entry with an explicit result code (EXCLUDEURL):

The png file only has a "fetch" activity with a 200 result code. I had to activate debug
mode to find a log line about the exclusion of the png file:
"Removing url 'https://www.datafari.com/assets/img/img_feature_phone_list.png' because it
had the wrong content type ('image/png')"
The code related to this is at line 902 of the WebcrawlerConnector and contains only:
activityResultCode = null; 

On the other hand, for the html file, the relevant section is at line 1366 and has explicit
code to handle the exclusion:

errorCode = activities.EXCLUDED_URL;
errorDesc = "Rejected due to URL ('"+documentIdentifier+"')";
activities.noDocument(documentIdentifier,versionString);

I do not understand why the html file gets an activity log entry with a specific result
code while the png file, for example, does not. Would it be possible to have the same kind
of log entry for all excluded files?
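
To illustrate what I mean, here is a minimal standalone sketch of the behaviour I would expect. It is not the WebcrawlerConnector code: the class, the HistoryRecorder interface and the result code strings are all hypothetical placeholders. The only point is that the URL rejection and the content type rejection both end up writing one history entry with an explicit result code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class ExclusionLoggingSketch {
    // Hypothetical placeholder result codes, not the connector's real constants.
    static final String OK = "OK";
    static final String EXCLUDED_URL = "EXCLUDED_URL";
    static final String EXCLUDED_CONTENT_TYPE = "EXCLUDED_CONTENT_TYPE";

    /** Hypothetical stand-in for whatever writes a simple history entry. */
    interface HistoryRecorder {
        void record(String documentIdentifier, String resultCode, String description);
    }

    /**
     * Checks one fetched document and records exactly one history entry,
     * whichever rule rejects it (or OK if no rule does).
     */
    static String process(String url, String contentType,
                          Predicate<String> urlAccepted,
                          Predicate<String> contentTypeAccepted,
                          HistoryRecorder history) {
        String resultCode = OK;
        String description = null;
        if (!urlAccepted.test(url)) {
            resultCode = EXCLUDED_URL;
            description = "Rejected due to URL ('" + url + "')";
        } else if (!contentTypeAccepted.test(contentType)) {
            // Same treatment as the URL case: an explicit code instead of null.
            resultCode = EXCLUDED_CONTENT_TYPE;
            description = "Rejected due to content type ('" + contentType + "')";
        }
        history.record(url, resultCode, description);
        return resultCode;
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        HistoryRecorder history = (id, code, desc) -> entries.add(code + " " + id);
        // Illustrative rules only: reject .html URLs and image content types.
        Predicate<String> urlAccepted = u -> !u.endsWith(".html");
        Predicate<String> typeAccepted = t -> !t.startsWith("image/");

        process("https://example.com/page.html", "text/html", urlAccepted, typeAccepted, history);
        process("https://example.com/logo.png", "image/png", urlAccepted, typeAccepted, history);
        entries.forEach(System.out::println);
    }
}
```

With something along these lines, every excluded document would appear in the simple history with its own result code, without having to switch to debug logging.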

Thanks,
Best regards,

Olivier 

> On 11 Oct 2018, at 16:00, Karl Wright <daddywri@gmail.com> wrote:
> 
> Hi Olivier,
> 
> The Repository connector has no knowledge of what the pipeline looks like.  It simply
> asks the framework whether the mime type, length, etc. is acceptable to the downstream
> pipeline.  It's the connector's responsibility to note the reason for the rejection in the
> simple history, but it does not have any knowledge whatsoever of which connector rejected
> the document, and therefore cannot say which transformer or output rejected the document.
> 
> Transformation and output connectors which respond to checks for document mime type or
> length checks likewise do not have any knowledge of the upstream connector that is doing
> the checking.
> 
> Karl
> 
> 
> 
> On Thu, Oct 11, 2018 at 9:31 AM Olivier Tavard <olivier.tavard@francelabs.com> wrote:
> Hello,
> 
> I have a question regarding the Document filter transformation connector and its
> logging.
> I would like to see all the documents excluded by the rules configured in the Document
> filter transformation connector, either in the Simple history or in the MCF log, but this
> is not easy so far.
> 
> Let’s say that I want to crawl a website and index html pages only. So I configure a web
> repository connector with a Document filter transformation connector, and I create a rule
> with only one allowed mime type and one file extension. So far so good, the job works
> well. But if I want to see, in the MCF log or in the simple history, all the files that
> were excluded by the transformation connector, it quickly gets complicated: I have to
> search manually for all the files that were fetched but not processed by the Tika
> transformation connector or ingested by the output connector.
> 
> From my understanding of the code, the Document filter transformation connector can
> communicate its exclusion rules directly to the repository connector, so the documents
> that need to be excluded are not processed in the Document filter transformation connector
> but are excluded directly by the web repository connector.
> So in the simple history, I can see that a document that will be excluded only has a
> "fetch" activity and that’s it; there is no additional information about it.
> Would it be possible to add a log entry with an explicit result code such as excluded by
> "document filter connector", or something similar to what is logged when the document is
> excluded by the repository connector?
>  
> Thank you,
> Best regards,
> Olivier 
> 

