Hi Olivier,

The repository connector has no knowledge of what the pipeline looks like. It simply asks the framework whether the MIME type, length, etc. are acceptable to the downstream pipeline. It's the repository connector's responsibility to note the reason for the rejection in the simple history, but it has no knowledge whatsoever of which downstream connector rejected the document, and therefore cannot say which transformer or output rejected it.

Transformation and output connectors that respond to document MIME type or length checks likewise have no knowledge of the upstream connector that is doing the checking.
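
To make the separation concrete, here is a minimal sketch of the idea in Python. It is not ManifoldCF code: the class names, the method names, and the result code string are all hypothetical, chosen only to illustrate that the framework hands back a single yes/no answer and that neither side of the check can see the other.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """Hypothetical stand-in for a transformation or output connector.

    It only answers "is this MIME type acceptable to me?" and is never
    told which repository connector is asking.
    """
    name: str
    accepts_mime_type: Callable[[str], bool]


@dataclass
class Framework:
    """Hypothetical stand-in for the framework sitting between connectors."""
    stages: List[Stage]

    def check_mime_type_indexable(self, mime_type: str) -> bool:
        # Every stage's answer is collapsed into one boolean, so the
        # caller never learns which stage (if any) said no.
        return all(stage.accepts_mime_type(mime_type) for stage in self.stages)


@dataclass
class RepositoryConnector:
    """Hypothetical repository connector that records its own history."""
    framework: Framework
    history: List[Tuple[str, str, str, str]] = field(default_factory=list)

    def process(self, url: str, mime_type: str) -> bool:
        if not self.framework.check_mime_type_indexable(mime_type):
            # The connector can record *why* it skipped the document,
            # but not *who* downstream rejected it.
            self.history.append(
                ("fetch", url, "EXCLUDED_MIME_TYPE",  # illustrative code
                 f"MIME type {mime_type} rejected by downstream pipeline"))
            return False
        self.history.append(("fetch", url, "OK", ""))
        return True


pipeline = Framework([
    Stage("document filter", lambda m: m == "text/html"),
    Stage("output", lambda m: True),
])
repo = RepositoryConnector(pipeline)
repo.process("http://example.com/index.html", "text/html")
repo.process("http://example.com/report.pdf", "application/pdf")
for entry in repo.history:
    print(entry)
```

In this sketch the second history entry can state the reason for the rejection, but the name "document filter" never reaches the repository connector, so it cannot appear in the entry.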


On Thu, Oct 11, 2018 at 9:31 AM Olivier Tavard <olivier.tavard@francelabs.com> wrote:

I have a question regarding the Document filter transformation connector and its logging.
I would like to see all the documents excluded by the rules configured in the Document filter transformation connector, either in the Simple history or in the MCF log, but so far this is not easy.

Let's say that I want to crawl a website and index HTML pages only. I configure a web repository connector with a Document filter transformation connector, and I create a rule with only one allowed MIME type and one file extension. So far so good: the job works well. But if I want to see, in the MCF log or in the Simple history, all the files that were excluded by the transformation connector, it quickly gets complicated: I have to search manually for all the files that were fetched but not processed by the Tika transformation connector or ingested by the output connector.

From my understanding of the code, the Document filter transformation connector can communicate its exclusion rules directly to the repository connector, so the documents that need to be excluded are not processed by the Document filter transformation connector but are excluded directly by the web repository connector.
So in the Simple history, I can see that a document that will be excluded has a "fetch" activity and that's it; there is no additional information about it.
Would it be possible to add a log entry with an explicit result code, such as excluded by "document filter connector" or something similar, when the document is excluded by the repository connector?
Thank you,
Best regards,