nutch-dev mailing list archives

From Sebastian Nagel <>
Subject Re: Outlinks in parse filter
Date Fri, 01 Feb 2013 23:01:54 GMT
Hi Markus,

> we should be fine right?
Yes, even better: FeedParser only holds URLNormalizers and URLFilters objects, which
themselves get references to the plugin instances via ObjectCache in their constructors.
By the way, that's also how the parse filter plugins are referenced,
e.g. TikaParser -> HtmlParseFilters -> ObjectCache.get(conf).getObject(MyCustomParseFilter).
That's efficient, but thread-safety is a requirement ;-)
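
To illustrate the point about sharing, here is a minimal sketch of a per-configuration
instance cache. It is not Nutch's ObjectCache: SharedInstanceCache, getOrCreate and
PrefixUrlFilter are hypothetical names made up for this example, and the configuration
is represented by a plain Object. It only shows why an instance that is loaded once and
handed to every thread has to be thread-safe.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical stand-in for a per-configuration cache; NOT Nutch's ObjectCache. */
final class SharedInstanceCache {
    // One cache per configuration object; weak keys let unused configurations be collected.
    private static final Map<Object, SharedInstanceCache> CACHES = new WeakHashMap<>();

    // Cached instances by key; ConcurrentHashMap makes lookups safe across threads.
    private final Map<String, Object> objects = new ConcurrentHashMap<>();

    private SharedInstanceCache() {}

    /** Returns the cache that belongs to the given configuration object. */
    static synchronized SharedInstanceCache get(Object conf) {
        return CACHES.computeIfAbsent(conf, c -> new SharedInstanceCache());
    }

    /** Runs the (possibly expensive) loader only the first time a key is requested. */
    @SuppressWarnings("unchecked")
    <T> T getOrCreate(String key, Supplier<T> loader) {
        return (T) objects.computeIfAbsent(key, k -> loader.get());
    }
}

/** Hypothetical filter: immutable after construction, so it is safe to share. */
final class PrefixUrlFilter {
    private final String allowedPrefix;

    PrefixUrlFilter(String allowedPrefix) {
        this.allowedPrefix = allowedPrefix;
    }

    /** Returns the URL if it is accepted, or null if it is rejected. */
    String filter(String url) {
        return url.startsWith(allowedPrefix) ? url : null;
    }
}
```

Every caller that asks for the same key under the same configuration gets the same
instance, so the loading cost is paid once. In exchange, the cached object must not keep
mutable per-call state in its fields.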
I also found Andrzej's post:


On 02/01/2013 03:41 PM, Markus Jelsma wrote:
> On second thought, if, like the feed parser, the instance is kept in the class and only
> loaded in setConf(), we should be fine, right?
> -----Original message-----
>> From:Markus Jelsma <>
>> Sent: Fri 01-Feb-2013 15:38
>> To:
>> Subject: RE: Outlinks in parse filter
>> Hi Sebastian,
>> Alright. What about the performance penalty if we create new instances of filters and
>> normalizers for each parse? Right now each thread has its own instances. Some filters
>> are very costly to load, so loading them too frequently would hurt.
>> Thanks,
>> Markus
>> -----Original message-----
>>> From:Sebastian Nagel <>
>>> Sent: Tue 29-Jan-2013 22:22
>>> To:
>>> Subject: Re: Outlinks in parse filter
>>> Hi Markus,
>>> this would mean that urlfilter and urlnormalizer plugins are accessed from parse
>>> At first glance, that sounds somewhat odd. But it's already the case for the
>>> feed parser.
>>> We would have to do it for all parse plugins. Since there are not so many, that's
>>> no argument against it.
>>> Provided you can still switch it off via the parse.(filter|normalize).urls properties,
>>> I see no serious reason why it can't be done.
>>> Sebastian
>>> On 01/29/2013 01:16 PM, Markus Jelsma wrote:
>>>> Hi,
>>>> Outlinks that reach the parse filters via ParseData are not normalized or
>>>> filtered, but I believe they should be. If you try to do something sensible with the
>>>> outlinks in a parse filter, you cannot rely on their accuracy. Should we not move the
>>>> calls to ParseOutputFormat.filterNormalize to the parse plugin?
>>>> Any thoughts?
>>>> Markus
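
The proposal in this thread, normalizing and filtering outlinks before the parse filters
see them, could be sketched roughly as follows. This is not the ParseOutputFormat code:
OutlinkCleaner and clean are hypothetical names, URLs are plain strings, and the two
boolean flags merely stand in for the parse.normalize.urls and parse.filter.urls
properties mentioned above. A function returning null is assumed to mean "drop this URL".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical helper that cleans outlinks before parse filters use them. */
final class OutlinkCleaner {
    private final UnaryOperator<String> normalizer; // may return null to drop a URL
    private final UnaryOperator<String> filter;     // returns null to reject a URL
    private final boolean normalizeUrls;            // stand-in for parse.normalize.urls
    private final boolean filterUrls;               // stand-in for parse.filter.urls

    OutlinkCleaner(UnaryOperator<String> normalizer, UnaryOperator<String> filter,
                   boolean normalizeUrls, boolean filterUrls) {
        this.normalizer = normalizer;
        this.filter = filter;
        this.normalizeUrls = normalizeUrls;
        this.filterUrls = filterUrls;
    }

    /** Returns the outlinks that survive the enabled steps, in their original order. */
    List<String> clean(List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String result = url;
            if (normalizeUrls) {
                result = normalizer.apply(result);
            }
            if (result != null && filterUrls) {
                result = filter.apply(result);
            }
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }
}
```

With both flags switched off the list passes through unchanged, which corresponds to
disabling the behaviour through the properties.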
