manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Logging and Document filter transformation connector
Date Thu, 11 Oct 2018 14:00:47 GMT
Hi Olivier,

The Repository connector has no knowledge of what the pipeline looks like.
It simply asks the framework whether the mime type, length, etc. is
acceptable to the downstream pipeline.  It's the connector's responsibility
to note the reason for the rejection in the simple history, but it does not
have any knowledge whatsoever of which connector rejected the document, and
therefore cannot say which transformer or output rejected the document.

Transformation and output connectors that respond to checks on document
mime type or length likewise have no knowledge of the upstream connector
that is doing the checking.
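
A minimal sketch of that flow, as a simplified model rather than the real
ManifoldCF interfaces: every type, method name and result code below is
hypothetical and chosen only for illustration. It shows why the repository
side can record the reason for a rejection (mime type or length) but not
which downstream stage caused it, since the framework hands back a single
yes/no answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the check described above. Every type, method and
// result code here is hypothetical and is NOT the real ManifoldCF API.
class PipelineCheckSketch {

    // Stand-in for a transformation or output connector. It answers yes/no
    // and is never told which upstream connector is asking.
    interface DownstreamStage {
        boolean acceptsMimeType(String mimeType);
        boolean acceptsLength(long length);
    }

    // Stand-in for the framework. It folds the answers of all stages into
    // one boolean, so the caller cannot tell which stage refused.
    static final class Framework {
        private final List<DownstreamStage> stages;

        Framework(List<DownstreamStage> stages) {
            this.stages = stages;
        }

        boolean checkMimeType(String mimeType) {
            for (DownstreamStage stage : stages) {
                if (!stage.acceptsMimeType(mimeType)) {
                    return false;
                }
            }
            return true;
        }

        boolean checkLength(long length) {
            for (DownstreamStage stage : stages) {
                if (!stage.acceptsLength(length)) {
                    return false;
                }
            }
            return true;
        }
    }

    // Stage that allows a single mime type, similar to a document filter rule.
    static DownstreamStage mimeFilter(final String allowed) {
        return new DownstreamStage() {
            public boolean acceptsMimeType(String mimeType) {
                return allowed.equals(mimeType);
            }
            public boolean acceptsLength(long length) {
                return true;
            }
        };
    }

    // Stage that refuses documents longer than the given limit.
    static DownstreamStage maxLength(final long limit) {
        return new DownstreamStage() {
            public boolean acceptsMimeType(String mimeType) {
                return true;
            }
            public boolean acceptsLength(long length) {
                return length <= limit;
            }
        };
    }

    // Stand-in for the repository connector: it asks the framework, then
    // notes the reason (but not the rejecting stage) in the history.
    static String processDocument(Framework framework, String url,
                                  String mimeType, long length,
                                  List<String> history) {
        String resultCode;
        if (!framework.checkMimeType(mimeType)) {
            resultCode = "EXCLUDEDMIMETYPE";   // illustrative code
        } else if (!framework.checkLength(length)) {
            resultCode = "EXCLUDEDLENGTH";     // illustrative code
        } else {
            resultCode = "OK";
        }
        history.add(url + " -> " + resultCode);
        return resultCode;
    }

    public static void main(String[] args) {
        Framework framework = new Framework(
            Arrays.asList(mimeFilter("text/html"), maxLength(1000L)));
        List<String> history = new ArrayList<String>();
        processDocument(framework, "http://example.com/a.html", "text/html", 200L, history);
        processDocument(framework, "http://example.com/b.pdf", "application/pdf", 200L, history);
        processDocument(framework, "http://example.com/c.html", "text/html", 5000L, history);
        for (String entry : history) {
            System.out.println(entry);
        }
    }
}
```

In this model the only information that reaches the history is the kind of
check that failed, which matches the limitation described above.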

Karl



On Thu, Oct 11, 2018 at 9:31 AM Olivier Tavard <
olivier.tavard@francelabs.com> wrote:

> Hello,
>
> I have a question regarding the Document filter transformation connector
> and its logging.
> I would like to see all the documents excluded by the rules configured in
> the Document filter transformation connector, either in the Simple history
> or in the MCF log, but this is not easy so far.
>
> Let’s say that I want to crawl a website and index html pages only. So I
> configure a web repository connector with a Document filter transformation
> connector, and I create a rule with only one allowed mime type and one
> file extension. So far so good, the job works well, but if I want to see
> in the MCF log or in the simple history all the files that were excluded
> by the transformation connector, it quickly becomes complicated: I have to
> search manually for all the files that were fetched but not processed by
> the Tika transformation connector or ingested by the output connector.
>
> From my understanding of the code, the document filter transformation
> connector can communicate its exclusion rules to the repository connector,
> so the documents that need to be excluded are not processed in the
> Document filter transformation connector but are excluded directly by the
> web repository connector.
> So in the simple history, I can see that a document that will be excluded
> has a "fetch" activity and that’s it; there is no additional information
> about it.
> Would it be possible to add a log entry with an explicit result code such
> as excluded by "document filter connector", similar to what is logged when
> the document is excluded by the repository connector?
>
> Thank you,
> Best regards,
> Olivier
>
>
