incubator-droids-dev mailing list archives

From Bertil Chapuis <cont...@bertil.ch>
Subject RE: Customizable Solr Handle
Date Fri, 11 Sep 2009 10:12:05 GMT
I just had a look at the Droids architecture and I wonder whether the Parser
could be considered a Handler, because in most handlers we will
have to parse the content again.

Doing this could lead to a simple generic filter mechanism. When
executed, the Worker receives an object (Link, FileTask, etc.) and, for
each Handler, tests with the Filter(s) whether the Handler should be
executed or not.

filter.shouldExecute(Link, Handler){...}
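
A minimal sketch of that idea, under stated assumptions: the Handler,
HandlerFilter, UriPatternFilter and FilteringWorker types below are
hypothetical stand-ins and not the actual Droids API, and a plain
java.net.URI takes the place of Link.

```java
import java.net.URI;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a Droids handler; not the real Droids interface.
interface Handler {
    void handle(URI uri);
}

// Hypothetical filter: decides whether a given handler should run for a URI.
interface HandlerFilter {
    boolean shouldExecute(URI uri, Handler handler);
}

// Accepts a URI only if the whole URI string matches the regex.
final class UriPatternFilter implements HandlerFilter {
    private final Pattern pattern;

    UriPatternFilter(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    @Override
    public boolean shouldExecute(URI uri, Handler handler) {
        return pattern.matcher(uri.toString()).matches();
    }
}

// Hypothetical worker: runs each handler only if every filter accepts it.
final class FilteringWorker {
    private final List<Handler> handlers;
    private final List<HandlerFilter> filters;

    FilteringWorker(List<Handler> handlers, List<HandlerFilter> filters) {
        this.handlers = handlers;
        this.filters = filters;
    }

    // Returns the number of handlers that were actually executed.
    int execute(URI uri) {
        int executed = 0;
        for (Handler handler : handlers) {
            boolean accepted = true;
            for (HandlerFilter filter : filters) {
                if (!filter.shouldExecute(uri, handler)) {
                    accepted = false;
                    break;
                }
            }
            if (accepted) {
                handler.handle(uri);
                executed++;
            }
        }
        return executed;
    }
}
```

With a UriPatternFilter built from the regex quoted further down (with the
dots escaped), a URI such as http://www.awebsite.com/document-42.htm would
reach the handler, while http://www.awebsite.com/index.htm would be skipped.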

What do you think about that? It could be a nice way to keep things
simple and modular.

Best regards,

Bertil



On Wed, 2009-09-09 at 11:15 +0100, Tony Dietrich wrote:
> Haven't got time atm to look at this myself, but there's a nice approach to this sort
> of problem (of what to do with pages that (need to be)|(have been crawled) ) in the old
> websphinx package.
> 
> If I remember rightly, the package uses predicate classes (which can be standardised
> or sub-classed) and which return true/false in certain conditions, and methods in the
> crawler class which determine what actions are taken under which circumstances.
> Ie 
> public boolean shouldVisit(..){..}
> public boolean shouldDownload(..){..}
> public boolean shouldProcess(..){..}
> with each method calling a declared predicate class (or chain of classes, depending on
> whether the implementation contains sub classed predicates.)
> 
> Perhaps a similar approach could be used for droids, since it very nicely provides a
> standards-acceptable, extensible approach to this sort of problem.
> 
> Perhaps overkill for Bertil's problem, but for future implementations .....
> 
> Tony
> 
> 
> -----Original Message-----
> From: Thorsten Scherler [mailto:thorsten.scherler.ext@juntadeandalucia.es] 
> Sent: 09 September 2009 11:11
> To: droids-dev@incubator.apache.org
> Subject: Re: Customizable Solr Handle
> 
> On Wed, 2009-09-09 at 10:38 +0200, Bertil Chapuis wrote:
> > Hello,
> > 
> > My name is Bertil Chapuis. I am using droids for a personal project and
> > I am trying to create a more customizable solr handler. 
> 
> Hi Bertil, nice to have you on this list.
> 
> > 
> > I posted a ticket with my code (DROIDS-62). However, I am looking for a
> > way to filter the handler's execution. I'd like to handle the documents
> > only if their URI or content matches specific conditions.
> 
> I will have a look at your patch, thanks in advance for your
> contribution. 
> 
> > 
> > For example, the document is handled only if its uri matches the
> > following regex:
> > 
> > http://www.awebsite.com/document-[0-9]*.htm
> > 
> > What's the best way to do that? Is it delegated to the handler's
> > implementation or is there a standard way?
> 
> Mingfai has this filter approach theoretically included in our next
> version. However right now we do not have a standard approach other than
> implementing the validation logic in e.g. the queue. The question is
> whether you want only to crawl the pages that are valid against your
> regex or the limitation is only for the handler. 
> 
> If it is only for the handler then it is maybe best to implement it in
> your worker. Something like:
> ...
> public void execute(Link link) throws DroidsException, IOException {
> 
> ...
> URI uri = link.getURI();
> Pattern pattern = Pattern.compile(PATTERN);
> // Pattern.matcher expects a CharSequence, so convert the URI to a String.
> Matcher matcher = pattern.matcher(uri.toString());
> if (matcher.find()) {
>   droid.getHandlerFactory().handle(link.getURI(), entity);
> }
> ...}
> 
> 
> HTH
> 
> salu2
> 
> > 
> > Best regards,
> > 
> > Bertil Chapuis
> > 
> > 

