manifoldcf-user mailing list archives

From Arcadius Ahouansou <arcad...@menelic.com>
Subject Re: Content filltering/exclusion with MCF
Date Wed, 29 Apr 2015 23:02:36 GMT
Hello Karl.

I agree, this would be slower than the usual filtering by URL or HTTP
header.

On the other hand, this would be a very useful feature: it could be used
to remove documents containing swear words from the index, to remove adult
content, to discard emails flagged as spam, etc.

Regarding the implementation:
So far in MCF, regexes have been used for pattern matching.
In the case of content filtering, the user would supply a kind of
"dictionary" that we would use to determine whether the document goes
through or not.
The dictionary can grow quite large.

An alternative to regex may be the Aho–Corasick string matching
algorithm.
A Java implementation can be found at
https://github.com/robert-bor/aho-corasick
Let's say my dictionary has two entries, "expired" and "not found".
The algorithm will return "expired", "not found" or both, depending
on what it found in the document.
This output could be used to decide whether to index the document or not.

In this specific case, where we only want to exclude content from the
index, we could exit on the first match, i.e. there is no need to match
the whole dictionary.
There is a pull request for dealing with that:
https://github.com/robert-bor/aho-corasick/pull/14
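
To make the idea concrete, here is a minimal, self-contained sketch of
dictionary matching with Aho-Corasick. It is neither MCF code nor the API of
the library linked above: the class DictionaryMatcher and its methods findAll
and containsAny are hypothetical names chosen for illustration. findAll
reports every dictionary entry present in the text; containsAny returns on
the first match, which is all an exclusion filter needs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical helper for illustration; not part of MCF or of any library. */
class DictionaryMatcher {
  // One automaton state: outgoing edges, failure link, keywords ending here.
  private static final class Node {
    final Map<Character, Node> next = new HashMap<>();
    Node fail;
    final List<String> hits = new ArrayList<>();
  }

  private final Node root = new Node();

  DictionaryMatcher(List<String> keywords) {
    // Phase 1: build a trie containing every non-empty keyword.
    for (String keyword : keywords) {
      if (keyword.isEmpty()) continue;
      Node node = root;
      for (char c : keyword.toCharArray()) {
        node = node.next.computeIfAbsent(c, k -> new Node());
      }
      node.hits.add(keyword);
    }
    // Phase 2: breadth-first pass that sets the failure links.
    root.fail = root;
    Queue<Node> queue = new ArrayDeque<>();
    for (Node child : root.next.values()) {
      child.fail = root;
      queue.add(child);
    }
    while (!queue.isEmpty()) {
      Node node = queue.remove();
      for (Map.Entry<Character, Node> edge : node.next.entrySet()) {
        Node child = edge.getValue();
        child.fail = step(node.fail, edge.getKey());
        // Inherit keywords that end at the failure state (proper suffixes).
        child.hits.addAll(child.fail.hits);
        queue.add(child);
      }
    }
  }

  // Follow failure links until a state with an edge for c is found.
  private Node step(Node state, char c) {
    while (state != root && !state.next.containsKey(c)) state = state.fail;
    Node target = state.next.get(c);
    return (target != null) ? target : root;
  }

  /** Returns every dictionary entry that occurs in the text. */
  Set<String> findAll(String text) {
    Set<String> found = new LinkedHashSet<>();
    Node state = root;
    for (int i = 0; i < text.length(); i++) {
      state = step(state, text.charAt(i));
      found.addAll(state.hits);
    }
    return found;
  }

  /** Returns true at the first match, without scanning the rest of the text. */
  boolean containsAny(String text) {
    Node state = root;
    for (int i = 0; i < text.length(); i++) {
      state = step(state, text.charAt(i));
      if (!state.hits.isEmpty()) return true;
    }
    return false;
  }

  public static void main(String[] args) {
    DictionaryMatcher matcher =
        new DictionaryMatcher(Arrays.asList("expired", "not found"));
    String page = "The document you are looking for has expired";
    System.out.println("exclude from index: " + matcher.containsAny(page));
    System.out.println("entries matched: " + matcher.findAll(page));
  }
}
```

Matching here is case-sensitive; lower-casing both the dictionary and the
document text before matching would be one way to relax that. The text is
scanned once, whatever the size of the dictionary.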

Thanks.

Arcadius.

On 29 April 2015 at 22:50, Karl Wright <daddywri@gmail.com> wrote:

> Hi Arcadius,
>
> A feature like this is possible but could be very slow, since there's no
> definite limit on the size of an html page.
>
> Karl
>
>
> On Wed, Apr 29, 2015 at 5:01 PM, Arcadius Ahouansou <arcadius@menelic.com>
> wrote:
>
>>
>> Hello Karl.
>>
>> I have checked the Simple History and I could see deletions.
>>
>> I have recently migrated my config to MCF 2.0.2 without migrating all
>> crawled data. That may be the reason why Solr still contains documents
>> that lead to a 404.
>>
>> Clearing my Solr index and resetting the crawler may help solve my
>> problem.
>>
>> On the other hand, some of the pages I am crawling display friendly
>> messages such as "The document you are looking for has expired" with a 200
>> HTTP status instead of a 404.
>> How feasible would it be to exclude documents from the index based on
>> their content?
>>
>> Thank you very much.
>>
>> Arcadius.
>>
>>
>>
>> On 28 April 2015 at 12:18, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Arcadius,
>>>
>>> So, to be clear, the repository connection you are using is a web
>>> connection type?
>>>
>>> The web connector has the following code, which should prevent indexing
>>> of any content that was received with a response code other than 200:
>>>
>>>       int responseCode = cache.getResponseCode(documentIdentifier);
>>>       if (responseCode != 200)
>>>       {
>>>         if (Logging.connectors.isDebugEnabled())
>>>           Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not indexing because response code not indexable: "+responseCode);
>>>         errorCode = "RESPONSECODENOTINDEXABLE";
>>>         errorDesc = "HTTP response code not indexable ("+responseCode+")";
>>>         activities.noDocument(documentIdentifier,versionString);
>>>         return;
>>>       }
>>>
>>>
>>> You should indeed see these cases logged in the simple history and no
>>> document sent to Solr.  Is this not what you are seeing?
>>>
>>> Karl
>>>
>>>
>>> On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <arcadius@menelic.com> wrote:
>>>
>>>>
>>>> Hello.
>>>>
>>>> I am using MCF 2.0.2 for crawling the web and ingesting data into Solr.
>>>>
>>>> MCF has ingested into Solr documents that returned an HTTP error, let's
>>>> say 401, 403 or 404, or that have certain content such as "this page has
>>>> expired and has been removed".
>>>>
>>>> The question is:
>>>> is there a way to tell MCF to ingest
>>>> - only documents not containing certain content such as "Not Found", or
>>>> - only documents excluding those with status 401, 403, 404, 500, ...
>>>>
>>>> Thank you very much.
>>>>
>>>> Arcadius.
>>>>
>>>
>>>
>>
>>
>> --
>> Arcadius Ahouansou
>> Menelic Ltd | Information is Power
>> M: 07908761999
>> W: www.menelic.com
>> ---
>>
>
>


-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---
