incubator-droids-dev mailing list archives

From "Tony Dietrich" <>
Subject RE: Customizable Solr Handle
Date Wed, 09 Sep 2009 10:15:42 GMT
Haven't got time atm to look at this myself, but there's a nice approach to this sort of problem
(what to do with pages that need to be, or have been, crawled) in the old websphinx package.

If I remember rightly, the package uses predicate classes (which can be used as-is or sub-classed)
that return true/false under certain conditions, and methods in the crawler class that
determine which actions are taken under which circumstances:
public boolean shouldVisit(..){..}
public boolean shouldDownload(..){..}
public boolean shouldProcess(..){..}
with each method calling a declared predicate class (or a chain of classes, depending on whether
the implementation contains sub-classed predicates).
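
As a rough sketch of that idea (written from the description above, not from the websphinx source), the crawler could delegate each decision to a pluggable predicate. The class name PredicateCrawler, its constructor, and the example predicates are all hypothetical; java.util.function.Predicate stands in for the package's own predicate classes, and and() plays the role of the chain.

```java
import java.net.URI;
import java.util.function.Predicate;

// Hypothetical sketch: these names are illustrative and come from neither
// websphinx nor Droids.
class PredicateCrawler {
    private final Predicate<URI> visit;
    private final Predicate<URI> download;
    private final Predicate<URI> process;

    PredicateCrawler(Predicate<URI> visit, Predicate<URI> download, Predicate<URI> process) {
        this.visit = visit;
        this.download = download;
        this.process = process;
    }

    // Each decision method simply asks its declared predicate (or chain).
    public boolean shouldVisit(URI uri)    { return visit.test(uri); }
    public boolean shouldDownload(URI uri) { return download.test(uri); }
    public boolean shouldProcess(URI uri)  { return process.test(uri); }

    public static void main(String[] args) {
        // Small reusable predicates; example.org is a placeholder host.
        Predicate<URI> isHttp = u -> "http".equals(u.getScheme()) || "https".equals(u.getScheme());
        Predicate<URI> sameHost = u -> "example.org".equals(u.getHost());
        Predicate<URI> isHtml = u -> u.getPath() != null && u.getPath().endsWith(".html");

        // Chains are built with Predicate.and(), so stricter stages reuse looser ones.
        PredicateCrawler crawler = new PredicateCrawler(
                isHttp,
                isHttp.and(sameHost),
                isHttp.and(sameHost).and(isHtml));

        URI uri = URI.create("http://example.org/docs/index.html");
        System.out.println(crawler.shouldVisit(uri));
        System.out.println(crawler.shouldDownload(uri));
        System.out.println(crawler.shouldProcess(uri));
    }
}
```

Keeping the decisions in separate methods means a page can be visited for its links without being downloaded or processed, which is the distinction the three method names suggest.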

Perhaps a similar approach could be used for droids, since it very nicely provides a standards-acceptable,
extensible approach to this sort of problem.

Perhaps overkill for Bertil's problem, but for future implementations .....


-----Original Message-----
From: Thorsten Scherler [] 
Sent: 09 September 2009 11:11
Subject: Re: Customizable Solr Handle

On Wed, 2009-09-09 at 10:38 +0200, Bertil Chapuis wrote:
> Hello,
> My name is Bertil Chapuis. I am using droids for a personal project and
> I am trying to create a more customizable solr handler. 

Hi Bertil, nice to have you on this list.

> I posted a ticket with my code (DROIDS-62). However, I am looking for a
> way to filter the handler's execution. I'd like to handle the documents
> only if their URI or content matches specific conditions.

I will have a look at your patch, thanks in advance for your

> For example, the document is handled only if its uri matches the
> following regex:
> What's the best way to do that? Is it delegated to the handler's
> implementation or is there a standard way?

Mingfai has this filter approach planned, at least in theory, for our next
version. However, right now we do not have a standard approach other than
implementing the validation logic in e.g. the queue. The question is
whether you want to crawl only the pages that match your regex, or
whether the limitation applies only to the handler.
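
For the first case (crawling only matching pages), a minimal sketch of the check sitting in front of a queue might look like the following. This is not the Droids queue API: FilteringQueue, its methods, and the example regex are hypothetical and only illustrate where the validation would live.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.regex.Pattern;

// Hypothetical sketch: a plain in-memory queue, not a Droids class.
class FilteringQueue {
    private final Pattern allowed;
    private final Queue<URI> pending = new ArrayDeque<>();

    FilteringQueue(String regex) {
        // Compile once; the same Pattern is reused for every offered URI.
        this.allowed = Pattern.compile(regex);
    }

    /** Enqueues the URI only if it matches the regex; returns whether it was accepted. */
    boolean offer(URI uri) {
        if (!allowed.matcher(uri.toString()).find()) {
            return false; // rejected links are never crawled at all
        }
        return pending.offer(uri);
    }

    /** Returns the next URI to crawl, or null if the queue is empty. */
    URI poll() {
        return pending.poll();
    }

    int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        // The regex and URLs are placeholders for illustration.
        FilteringQueue queue = new FilteringQueue("^https?://example\\.org/docs/");
        System.out.println(queue.offer(URI.create("http://example.org/docs/a.html")));
        System.out.println(queue.offer(URI.create("http://example.org/blog/b.html")));
        System.out.println(queue.size());
    }
}
```

Filtering here also stops the crawler from following links found on rejected pages, so it only suits the case where those pages are not needed for discovery.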

If it is only for the handler then it is maybe best to implement it in
your worker. Something like:
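public void execute(Link link) throws DroidsException, IOException {
  URI uri = link.getURI();
  // PATTERN is your regex string.
  Pattern pattern = Pattern.compile(PATTERN);
  // Matcher needs a CharSequence, so match against the URI as a string.
  Matcher matcher = pattern.matcher(uri.toString());
  if (matcher.find()) {
    // 'entity' is assumed to be the content already fetched for this link (not shown).
    droid.getHandlerFactory().handle(uri, entity);
  }
}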



> Best regards,
> Bertil Chapuis
Thorsten Scherler <>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)
