On Fri, 2009-09-11 at 11:16 +0100, Tony Dietrich wrote:
> See my previous post.
>
> Websphinx is a good reference package for this structure. Whoever wrote it
> was quite sensible.
https://issues.apache.org/jira/browse/DROIDS-58
Maybe you want to comment on this issue with the findings of websphinx.
salu2
>
> Tony
>
> -----Original Message-----
> From: Bertil Chapuis [mailto:contact@bertil.ch]
> Sent: 11 September 2009 11:12
> To: droids-dev@incubator.apache.org
> Subject: RE: Customizable Solr Handle
>
> I just had a look at the Droids architecture and I wonder whether the
> Parser could be considered a Handler, because in most handlers we will
> have to parse the content again.
>
> Doing this could lead to a simple generic filter mechanism. When
> executed, the Worker receives an object (Link, FileTask, etc.) and, for
> each Handler, checks with the Filter(s) whether the Handler should be
> executed or not.
>
> filter.shouldExecute(Link, Handler){...}
>
> What do you think about that? It could be a nice way to keep things
> simple and modular.
>
> Best regards,
>
> Bertil
>
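A minimal sketch of the filter idea Bertil describes, under loud assumptions: the Handler and Filter interfaces below are hypothetical stand-ins invented for illustration and are not the Droids API, and the task is reduced to a plain java.net.URI. It only shows a worker loop asking every filter before running a handler.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class FilteredDispatchSketch {

    /** Hypothetical stand-in for a handler; NOT the Droids Handler interface. */
    public interface Handler {
        String name();
    }

    /** Hypothetical filter: decides whether a handler should run for a task URI. */
    public interface Filter {
        boolean shouldExecute(URI task, Handler handler);
    }

    /** Returns the names of the handlers that every filter accepts for the task. */
    public static List<String> dispatch(URI task, List<Handler> handlers, List<Filter> filters) {
        List<String> executed = new ArrayList<>();
        for (Handler handler : handlers) {
            boolean accepted = true;
            for (Filter filter : filters) {
                if (!filter.shouldExecute(task, handler)) {
                    accepted = false; // one veto is enough to skip this handler
                    break;
                }
            }
            if (accepted) {
                executed.add(handler.name()); // a real worker would invoke the handler here
            }
        }
        return executed;
    }
}
```

With this shape, a handler stays unaware of the conditions under which it runs; the conditions live in the filters and can be swapped without touching the handler.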
>
>
> On Wed, 2009-09-09 at 11:15 +0100, Tony Dietrich wrote:
> > Haven't got time atm to look at this myself, but there's a nice approach
> > to this sort of problem (what to do with pages that need to be, or have
> > been, crawled) in the old websphinx package.
> >
> > If I remember rightly, the package uses predicate classes (which can be
> > standardised or sub-classed) and which return true/false in certain
> > conditions, and methods in the crawler class which determine what actions
> > are taken under which circumstances.
> > Ie
> > public boolean shouldVisit(..){..}
> > public boolean shouldDownload(..){..}
> > public boolean shouldProcess(..){..}
> > with each method calling a declared predicate class (or chain of classes,
> > depending on whether the implementation contains sub-classed predicates).
> >
> > Perhaps a similar approach could be used for droids, since it very nicely
> > provides a standards-acceptable, extensible approach to this sort of
> > problem.
> >
> > Perhaps overkill for Bertil's problem, but for future implementations
> > .....
> >
> > Tony
> >
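A rough sketch of the predicate pattern Tony recalls, written from the description in this thread and not from the websphinx source: the class name, the helper predicates and the use of java.util.function.Predicate are all assumptions made for illustration. Each shouldXxx method delegates to a predicate, and predicates can be chained.

```java
import java.net.URI;
import java.util.function.Predicate;

final class PredicateCrawlerSketch {
    // Hypothetical predicates injected into the crawler; not websphinx classes.
    private final Predicate<URI> visit;
    private final Predicate<URI> process;

    PredicateCrawlerSketch(Predicate<URI> visit, Predicate<URI> process) {
        this.visit = visit;
        this.process = process;
    }

    /** Should the crawler follow this link at all? */
    public boolean shouldVisit(URI uri) {
        return visit.test(uri);
    }

    /** Should the fetched page be handed to the handlers? */
    public boolean shouldProcess(URI uri) {
        return process.test(uri);
    }

    /** Accepts URIs on the given host (case-insensitive); rejects URIs without a host. */
    static Predicate<URI> sameHost(String host) {
        return uri -> host.equalsIgnoreCase(uri.getHost());
    }

    /** Accepts URIs whose path ends with the given suffix. */
    static Predicate<URI> pathEndsWith(String suffix) {
        return uri -> uri.getPath() != null && uri.getPath().endsWith(suffix);
    }
}
```

A chain is then just composition, for example sameHost("www.awebsite.com").and(pathEndsWith(".htm")) for the processing decision while visiting stays restricted by host only.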
> >
> > -----Original Message-----
> > From: Thorsten Scherler [mailto:thorsten.scherler.ext@juntadeandalucia.es]
> > Sent: 09 September 2009 11:11
> > To: droids-dev@incubator.apache.org
> > Subject: Re: Customizable Solr Handle
> >
> > On Wed, 2009-09-09 at 10:38 +0200, Bertil Chapuis wrote:
> > > Hello,
> > >
> > > My name is Bertil Chapuis. I am using droids for a personal project and
> > > I am trying to create a more customizable solr handler.
> >
> > Hi Bertil, nice to have you on this list.
> >
> > >
> > > I posted a ticket with my code (DROIDS-62). However, I am looking for a
> > > way to filter the handler's execution. I'd like to handle the documents
> > > only if their URI or content matches specific conditions.
> >
> > I will have a look at your patch, thanks in advance for your
> > contribution.
> >
> > >
> > > For example, the document is handled only if its uri matches the
> > > following regex:
> > >
> > > http://www.awebsite.com/document-[0-9]*.htm
> > >
> > > What's the best way to do that? Is it delegated to the handler's
> > > implementation or is there a standard way?
> >
> > Mingfai has this filter approach theoretically included in our next
> > version. However, right now we do not have a standard approach other than
> > implementing the validation logic in e.g. the queue. The question is
> > whether you want only to crawl the pages that are valid against your
> > regex or the limitation is only for the handler.
> >
> > If it is only for the handler then it is maybe best to implement it in
> > your worker. Something like:
> > ...
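> > public void execute(Link link) throws DroidsException, IOException {
> >
> >   ...
> >   URI uri = link.getURI();
> >   // PATTERN holds the regex string (needs java.util.regex.Pattern/Matcher)
> >   Pattern pattern = Pattern.compile(PATTERN);
> >   // matcher() takes a CharSequence, so pass the URI as a string
> >   Matcher matcher = pattern.matcher(uri.toString());
> >   if (matcher.find()) {
> >     // only matching URIs reach the handler
> >     droid.getHandlerFactory().handle(link.getURI(), entity);
> >   }
> >   ...
> > }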
> >
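For the regex itself, a small standalone sketch, assuming the intent is to accept only whole URIs of the form given in the example. The class and method names are hypothetical. Two choices here differ from a bare find(): the dots are escaped so they match literally, and matches() is used so the entire URI must fit the pattern. The pattern is compiled once in a static field.

```java
import java.net.URI;
import java.util.regex.Pattern;

final class UriPatternSketch {
    // Compiled once; dots are escaped so "." does not match arbitrary characters.
    private static final Pattern DOCUMENT_URI =
        Pattern.compile("http://www\\.awebsite\\.com/document-[0-9]*\\.htm");

    /** True only when the whole URI string fits the pattern. */
    static boolean shouldHandle(URI uri) {
        return DOCUMENT_URI.matcher(uri.toString()).matches();
    }
}
```

Note that [0-9]* also accepts zero digits, so document-.htm would pass; [0-9]+ would require at least one digit if that is what is wanted.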
> >
> > HTH
> >
> > salu2
> >
> > >
> > > Best regards,
> > >
> > > Bertil Chapuis
> > >
> > >
>
--
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>
Sociedad Andaluza para el Desarrollo de la Sociedad
de la Información, S.A.U. (SADESI)