incubator-droids-dev mailing list archives

From "Tony Dietrich" <t...@dietrich.org.uk>
Subject RE: Customizable Solr Handle
Date Fri, 11 Sep 2009 10:16:06 GMT
See my previous post.

Websphinx is a good reference package for this structure.  Whoever wrote it
was quite sensible.

Tony

-----Original Message-----
From: Bertil Chapuis [mailto:contact@bertil.ch] 
Sent: 11 September 2009 11:12
To: droids-dev@incubator.apache.org
Subject: RE: Customizable Solr Handle

I just had a look at the Droids architecture, and I wonder whether the
Parser could be considered a Handler, because in most handlers we will
have to parse the content again.

Doing this could lead to a simple generic filter mechanism. When
executed, the Worker receives an object (Link, FileTask, etc.) and, for
each Handler, checks with the Filter(s) whether the Handler should be
executed or not.

filter.shouldExecute(Link, Handler){...}
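
A minimal sketch of how such a filter could sit between the Worker and
its Handlers. Every type here (Task, Handler, Filter, Worker) is a
hypothetical stand-in written for illustration, not the actual Droids
API, and requiring that all filters agree is just one possible policy.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are the real Droids API.
public class FilterSketch {

    /** Stand-in for whatever the worker receives (Link, FileTask, ...). */
    interface Task {
        URI getURI();
    }

    /** Stand-in for a handler; a parser could be one of these too. */
    interface Handler {
        void handle(Task task);
    }

    /** Decides whether a given handler should run for a given task. */
    interface Filter {
        boolean shouldExecute(Task task, Handler handler);
    }

    /** Runs every handler that all registered filters accept. */
    static class Worker {
        private final List<Handler> handlers = new ArrayList<>();
        private final List<Filter> filters = new ArrayList<>();

        void addHandler(Handler handler) { handlers.add(handler); }
        void addFilter(Filter filter) { filters.add(filter); }

        /** Returns how many handlers were executed for this task. */
        int execute(Task task) {
            int executed = 0;
            for (Handler handler : handlers) {
                if (accepted(task, handler)) {
                    handler.handle(task);
                    executed++;
                }
            }
            return executed;
        }

        /** A handler runs only if no filter rejects it. */
        private boolean accepted(Task task, Handler handler) {
            for (Filter filter : filters) {
                if (!filter.shouldExecute(task, handler)) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Worker worker = new Worker();
        List<String> seen = new ArrayList<>();
        worker.addHandler(task -> seen.add(task.getURI().toString()));
        // Only let handlers run for URIs whose path ends in ".htm".
        worker.addFilter((task, handler) -> task.getURI().getPath().endsWith(".htm"));

        worker.execute(() -> URI.create("http://www.awebsite.com/document-1.htm"));
        worker.execute(() -> URI.create("http://www.awebsite.com/index.html"));
        System.out.println(seen);
    }
}
```

Because the filter sees both the task and the handler, the same
mechanism could skip one handler for a page while still running
another.
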

What do you think about that? It could be a nice way to keep things
simple and modular.

Best regards,

Bertil



On Wed, 2009-09-09 at 11:15 +0100, Tony Dietrich wrote:
> Haven't got time atm to look at this myself, but there's a nice approach
> to this sort of problem (of what to do with pages that (need to be)|(have
> been) crawled) in the old websphinx package.
> 
> If I remember rightly, the package uses predicate classes (which can be
> standardised or sub-classed) that return true/false in certain
> conditions, and methods in the crawler class that determine what actions
> are taken under which circumstances.
> Ie 
> public boolean shouldVisit(..){..}
> public boolean shouldDownload(..){..}
> public boolean shouldProcess(..){..}
> with each method calling a declared predicate class (or chain of classes,
> depending on whether the implementation contains sub-classed predicates).
> 
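
A minimal sketch of the predicate idea described above, assuming it
were built on java.util.function.Predicate. The class names
(HostPredicate, SuffixPredicate, Crawler) are hypothetical and do not
reproduce the websphinx or Droids API; only the method names
shouldVisit and shouldProcess come from the message.

```java
import java.net.URI;
import java.util.function.Predicate;

// Hypothetical sketch of the predicate idea; not the websphinx or Droids API.
public class PredicateSketch {

    /** A reusable predicate class: true when the URI is on the given host. */
    static class HostPredicate implements Predicate<URI> {
        private final String host;

        HostPredicate(String host) { this.host = host; }

        @Override
        public boolean test(URI uri) {
            // getHost() may be null; equalsIgnoreCase(null) is simply false.
            return host.equalsIgnoreCase(uri.getHost());
        }
    }

    /** Another predicate: true when the path ends with the given suffix. */
    static class SuffixPredicate implements Predicate<URI> {
        private final String suffix;

        SuffixPredicate(String suffix) { this.suffix = suffix; }

        @Override
        public boolean test(URI uri) {
            String path = uri.getPath();
            return path != null && path.endsWith(suffix);
        }
    }

    /** The crawler only asks questions; the predicates hold the rules. */
    static class Crawler {
        private final Predicate<URI> visit;
        private final Predicate<URI> process;

        Crawler(Predicate<URI> visit, Predicate<URI> process) {
            this.visit = visit;
            this.process = process;
        }

        public boolean shouldVisit(URI uri) {
            return visit.test(uri);
        }

        /** A page is processed only if it would also have been visited. */
        public boolean shouldProcess(URI uri) {
            return shouldVisit(uri) && process.test(uri);
        }
    }

    public static void main(String[] args) {
        Predicate<URI> onSite = new HostPredicate("www.awebsite.com");
        // Chain predicates: process only .htm pages on the site.
        Crawler crawler = new Crawler(onSite, onSite.and(new SuffixPredicate(".htm")));

        URI doc = URI.create("http://www.awebsite.com/document-1.htm");
        URI other = URI.create("http://example.org/document-1.htm");
        System.out.println(crawler.shouldVisit(doc) + " " + crawler.shouldProcess(doc));
        System.out.println(crawler.shouldVisit(other) + " " + crawler.shouldProcess(other));
    }
}
```
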
> Perhaps a similar approach could be used for droids, since it very nicely
> provides a standards-acceptable, extensible approach to this sort of
> problem.
> 
> Perhaps overkill for Bertil's problem, but for future implementations
> .....
> 
> Tony
> 
> 
> -----Original Message-----
> From: Thorsten Scherler [mailto:thorsten.scherler.ext@juntadeandalucia.es]
> Sent: 09 September 2009 11:11
> To: droids-dev@incubator.apache.org
> Subject: Re: Customizable Solr Handle
> 
> On Wed, 2009-09-09 at 10:38 +0200, Bertil Chapuis wrote:
> > Hello,
> > 
> > My name is Bertil Chapuis. I am using droids for a personal project and
> > I am trying to create a more customizable solr handler. 
> 
> Hi Bertil, nice to have you on this list.
> 
> > 
> > I posted a ticket with my code (DROIDS-62). However, I am looking for a
> > way to filter the handler's execution. I'd like to handle the documents
> > only if their URI or content matches specific conditions.
> 
> I will have a look at your patch, thanks in advance for your
> contribution. 
> 
> > 
> > For example, the document is handled only if its uri matches the
> > following regex:
> > 
> > http://www.awebsite.com/document-[0-9]*.htm
> > 
> > What's the best way to do that? Is it delegated to the handler's
> > implementation or is there a standard way?
> 
> Mingfai has this filter approach theoretically included in our next
> version. However, right now we do not have a standard approach other than
> implementing the validation logic in e.g. the queue. The question is
> whether you want only to crawl the pages that are valid against your
> regex or the limitation is only for the handler. 
> 
> If it is only for the handler then it is maybe best to implement it in
> your worker. Something like:
> ...
> public void execute(Link link) throws DroidsException, IOException {
> 
> ...
> URI uri = link.getURI();
> Pattern pattern = Pattern.compile(PATTERN);
> // Matcher needs a CharSequence, so pass the URI as a string.
> Matcher matcher = pattern.matcher(uri.toString());
> if (matcher.find()) {
>   droid.getHandlerFactory().handle(link.getURI(), entity);
> }
> ...}
> 
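
The URI check in the snippet above can be tried on its own. This is a
minimal sketch, not Droids code: the class and method names are
hypothetical, and the pattern is an assumed variant of the regex from
the original question with the dots escaped. It uses matches() rather
than find(), since find() also accepts a match anywhere inside the
string (for example a URI ending in ".html").

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical standalone sketch of the URI check; not part of Droids.
public class UriPatternSketch {

    // Assumption: the regex from the question, with the literal dots escaped.
    // Compiled once, because a Pattern is immutable and reusable.
    static final Pattern PATTERN =
            Pattern.compile("http://www\\.awebsite\\.com/document-[0-9]*\\.htm");

    /** True when the whole URI matches the pattern. */
    static boolean shouldHandle(URI uri) {
        // Matcher works on a CharSequence, so convert the URI first.
        return PATTERN.matcher(uri.toString()).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.awebsite.com/document-12.htm",
            "http://www.awebsite.com/document-12.html",
            "http://www.awebsite.com/index.htm"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + shouldHandle(URI.create(sample)));
        }
    }
}
```
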
> 
> HTH
> 
> salu2
> 
> > 
> > Best regards,
> > 
> > Bertil Chapuis
> > 
> > 

