incubator-droids-dev mailing list archives

From Thorsten Scherler <thorsten.scherler....@juntadeandalucia.es>
Subject RE: Customizable Solr Handle
Date Fri, 11 Sep 2009 12:58:30 GMT
On Fri, 2009-09-11 at 12:12 +0200, Bertil Chapuis wrote:
> I just had a look at the Droids architecture and I wonder whether the
> Parser could be considered a Handler, because in most handlers we will
> have to parse the content again.

https://issues.apache.org/jira/browse/DROIDS-57

salu2

> 
> Doing this could lead to a simple, generic filter mechanism. When
> executed, the Worker receives an object (Link, FileTask, etc.) and, for
> each Handler, checks with the Filter(s) whether the Handler should be
> executed or not.
> 
> filter.shouldExecute(Link, Handler){...}
> 
> What do you think about that? It could be a nice way to keep things
> simple and modular.
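
A minimal sketch of the filter idea quoted above, in case it helps the discussion. The `Task`, `Handler`, `HandlerFilter` and `FilteringWorker` types are hypothetical stand-ins invented for this example; they are not the actual Droids interfaces, and the real Worker may be wired quite differently.

```java
import java.net.URI;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT the real Droids types.
interface Task {
    URI getURI();
}

interface Handler {
    void handle(Task task);
}

/** Decides whether a given handler should run for a given task. */
interface HandlerFilter {
    boolean shouldExecute(Task task, Handler handler);
}

final class FilteringWorker {
    private final List<Handler> handlers;
    private final List<HandlerFilter> filters;

    FilteringWorker(List<Handler> handlers, List<HandlerFilter> filters) {
        this.handlers = handlers;
        this.filters = filters;
    }

    /**
     * Runs each handler only if every filter accepts the (task, handler) pair.
     * Returns the number of handlers that were actually executed.
     */
    int execute(Task task) {
        int executed = 0;
        for (Handler handler : handlers) {
            boolean accepted = true;
            for (HandlerFilter filter : filters) {
                if (!filter.shouldExecute(task, handler)) {
                    accepted = false;
                    break; // one rejecting filter is enough to skip this handler
                }
            }
            if (accepted) {
                handler.handle(task);
                executed++;
            }
        }
        return executed;
    }
}
```

With this shape, a regex or content check lives in a `HandlerFilter` implementation rather than in each handler, which is one way to keep the handlers themselves unaware of filtering.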
> 
> Best regards,
> 
> Bertil
> 
> 
> 
> On Wed, 2009-09-09 at 11:15 +0100, Tony Dietrich wrote:
> > Haven't got time atm to look at this myself, but there's a nice
> > approach to this sort of problem (of what to do with pages that
> > (need to be)|(have been crawled)) in the old websphinx package.
> > 
> > If I remember rightly, the package uses predicate classes (which can
> > be standardised or sub-classed) and which return true/false in certain
> > conditions, and methods in the crawler class which determine what
> > actions are taken under which circumstances.
> > Ie 
> > public boolean shouldVisit(..){..}
> > public boolean shouldDownload(..){..}
> > public boolean shouldProcess(..){..}
> > with each method calling a declared predicate class (or chain of
> > classes, depending on whether the implementation contains sub-classed
> > predicates.)
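
One possible reading of the predicate approach described above, sketched with `java.util.function.Predicate`. This is an assumption about the general shape only: `PredicateCrawler` and `UriPredicates` are hypothetical names, the three method names are taken from the message, and none of this is the websphinx or Droids API.

```java
import java.net.URI;
import java.util.function.Predicate;

// Hypothetical crawler skeleton for illustration only. The method names
// mirror the ones quoted above; this is not the websphinx or Droids API.
final class PredicateCrawler {
    private final Predicate<URI> visit;
    private final Predicate<URI> download;
    private final Predicate<URI> process;

    PredicateCrawler(Predicate<URI> visit,
                     Predicate<URI> download,
                     Predicate<URI> process) {
        this.visit = visit;
        this.download = download;
        this.process = process;
    }

    // Each decision method delegates to its declared predicate (or chain).
    public boolean shouldVisit(URI uri)    { return visit.test(uri); }
    public boolean shouldDownload(URI uri) { return download.test(uri); }
    public boolean shouldProcess(URI uri)  { return process.test(uri); }
}

// Small reusable predicates that can be chained with and(), or(), negate().
final class UriPredicates {
    private UriPredicates() {}

    static Predicate<URI> hostIs(String host) {
        return uri -> host.equals(uri.getHost());
    }

    static Predicate<URI> pathEndsWith(String suffix) {
        return uri -> uri.getPath() != null && uri.getPath().endsWith(suffix);
    }
}
```

A chain is then built by composition, for example `UriPredicates.hostIs("www.awebsite.com").and(UriPredicates.pathEndsWith(".htm"))` as the `process` predicate, while `visit` and `download` only check the host.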
> > 
> > Perhaps a similar approach could be used for droids, since it very
> > nicely provides a standards-acceptable, extensible approach to this
> > sort of problem.
> > 
> > Perhaps overkill for Bertil's problem, but for future implementations .....
> > 
> > Tony
> > 
> > 
> > -----Original Message-----
> > From: Thorsten Scherler [mailto:thorsten.scherler.ext@juntadeandalucia.es] 
> > Sent: 09 September 2009 11:11
> > To: droids-dev@incubator.apache.org
> > Subject: Re: Customizable Solr Handle
> > 
> > On Wed, 2009-09-09 at 10:38 +0200, Bertil Chapuis wrote:
> > > Hello,
> > > 
> > > My name is Bertil Chapuis. I am using droids for a personal project and
> > > I am trying to create a more customizable solr handler. 
> > 
> > Hi Bertil, nice to have you on this list.
> > 
> > > 
> > > I posted a ticket with my code (DROIDS-62). However, I am looking for a
> > > way to filter the handler's execution. I'd like to handle the documents
> > > only if their URI or content matches specific conditions.
> > 
> > I will have a look at your patch, thanks in advance for your
> > contribution. 
> > 
> > > 
> > > For example, the document is handled only if its uri matches the
> > > following regex:
> > > 
> > > http://www.awebsite.com/document-[0-9]*.htm
> > > 
> > > What's the best way to do that? Is it delegated to the handler's
> > > implementation or is there a standard way?
> > 
> > Mingfai has this filter approach theoretically included in our next
> > version. However, right now we do not have a standard approach other than
> > implementing the validation logic in e.g. the queue. The question is
> > whether you want to crawl only the pages that match your
> > regex or whether the limitation applies only to the handler.
> > 
> > If it is only for the handler then it is maybe best to implement it in
> > your worker. Something like:
> > ...
> > public void execute(Link link) throws DroidsException, IOException {
> > 
> > ...
> > URI uri = link.getURI();
> > Pattern pattern = Pattern.compile(PATTERN);
> > // Pattern.matcher() expects a CharSequence, so convert the URI first
> > Matcher matcher = pattern.matcher(uri.toString());
> > if (matcher.find()) {
> >   droid.getHandlerFactory().handle(link.getURI(), entity);
> > }
> > ...}
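
For reference, a self-contained sketch of the URI check on its own, assuming a hypothetical `UriPatternFilter` helper (not part of Droids). It compiles the pattern once instead of on every call and uses `matches()` rather than `find()`, so the whole URI has to fit the regex. The dots are escaped here so that `.` matches literally, which the regex as posted does not do.

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Droids code base.
final class UriPatternFilter {
    // Compiled once at construction time and reused for every URI.
    private final Pattern pattern;

    UriPatternFilter(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** Returns true only if the entire URI string matches the regex. */
    boolean accepts(URI uri) {
        // Pattern.matcher() takes a CharSequence, hence toString().
        return pattern.matcher(uri.toString()).matches();
    }
}

final class UriPatternFilterDemo {
    public static void main(String[] args) {
        UriPatternFilter filter = new UriPatternFilter(
                "http://www\\.awebsite\\.com/document-[0-9]*\\.htm");
        System.out.println(filter.accepts(
                URI.create("http://www.awebsite.com/document-42.htm")));
        System.out.println(filter.accepts(
                URI.create("http://www.awebsite.com/document-42.html")));
    }
}
```

Note that `[0-9]*` also accepts zero digits, so `document-.htm` would pass; `[0-9]+` would require at least one digit if that is what is wanted.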
> > 
> > 
> > HTH
> > 
> > salu2
> > 
> > > 
> > > Best regards,
> > > 
> > > Bertil Chapuis
> > > 
> > > 
> 
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)




