incubator-droids-dev mailing list archives

From Thorsten Scherler <thorsten.scherler....@juntadeandalucia.es>
Subject Re: Need 1 :
Date Tue, 14 Jul 2009 07:56:33 GMT
On Mon, 2009-07-13 at 16:49 +0200, Florent André wrote:
> Hi Droids list !
> 
> After a talk with Senior Thorsten (Olé !:) ) during the Lenya meeting, I
> would like to have more information about droids.

Hello Mr. Florent, welcome to Droids. ;)

> I know that droids is not only a web crawler (and I would like to use it
> for other things), but my immediate need is about crawling...

I will try to restate what follows, which you wrote as an XML document,
in droids terms. Putting this on our wiki, http://cwiki.apache.org/DROIDS/,
would probably be helpful for future reference.

> So let's go : 
> 
> I would like to pass to droids an xml like (just an example) : 
> <article>
>   <droids:url>http://example.com/test.html</droids:url>

In droids crawling, the URL is the entry point of the processing. What
happens next is highly configurable, and Ming Fai has recently suggested
some changes for the future. I will describe the possibilities that
droids currently offers for your use case.

As I said, we start with the queue, into which you inject the starting
URLs. The queue then calls a worker (basically the part of the code
where the real work is done). This worker may call a linkExtractor
and/or a Parser to extract links and any other information from the
incoming page.
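
To make that flow concrete, here is a minimal sketch of the queue/worker
idea in plain Java. It is not the droids API: CrawlLoopSketch, Worker and
crawl are hypothetical names made up for illustration, and the "worker"
only returns the links it found instead of fetching real pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: none of these names are droids classes.
public class CrawlLoopSketch {

    /** Stand-in for the worker: handles one URL and returns the links found on it. */
    public interface Worker {
        List<String> process(String url);
    }

    /** Drains the queue, handing each URL to the worker; returns URLs in visiting order. */
    public static List<String> crawl(List<String> startUrls, Worker worker, int maxPages) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        for (String url : startUrls) {          // inject the starting URLs
            if (seen.add(url)) {
                queue.add(url);
            }
        }
        List<String> visited = new ArrayList<>();
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            visited.add(url);
            for (String link : worker.process(url)) {   // worker extracts links
                if (seen.add(link)) {                   // queue each link only once
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A fake in-memory "web" so the sketch runs without network access.
        Map<String, List<String>> fakeWeb = new HashMap<>();
        fakeWeb.put("http://example.com/test.html",
                Arrays.asList("http://example.com/a.html", "http://example.com/b.html"));
        Worker worker = url -> fakeWeb.getOrDefault(url, Collections.<String>emptyList());
        System.out.println(crawl(Arrays.asList("http://example.com/test.html"), worker, 10));
    }
}
```

In a real bot the worker would fetch the page and delegate to whatever
link extractor and parser you configure; the loop above only shows how
the starting URLs drive the rest of the processing.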

>   <title>
>     <droids:xpath>html/body/div[@id='content']/div[@id='title']/h1</droids:xpath>
>   </title>
>   <firstparagraph>
>     <droids:xpath>html/body/div[@id='content']/div[@id='article']/p[position()=1]</droids:xpath>
>   </firstparagraph>
>   <othertext>
>     <droids:xpath>html/body/div[@id='content']/div[@id='article']/p[position()>1]</droids:xpath>
>   </othertext>
> </article>
> 
> and that droids gives me something like : 
> <article>
>   <title> this is the article title </title>
>   <firstparagraph> This article is about the....</firstparagraph>
>   <othertext>bla bla bla bla bla...</othertext>
> </article>

You could use a simple XSL transformation for that. You can develop the
XSL stylesheet (basically the XPaths) that extracts the info in lenya as
usual: use a generator to get the source, then add the transformer, which
will return the above doc. You would then copy this stylesheet to your
droids plugin and use it to generate a result output stream. That stream
you would pass to the save handler of droids, which then saves it to the
location you want.
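
As a rough illustration of the transformation step only, the sketch below
applies such a stylesheet with the standard JAXP API (javax.xml.transform)
that ships with the JDK. The class name XslExtractSketch, the inline
stylesheet and the sample page are assumptions for the example; the
droids plugin wiring and the save handler are not shown. It also assumes
the page is well-formed XML without a default namespace, otherwise the
XPaths from your example would not match as written.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: plain JAXP, no droids or lenya classes involved.
public class XslExtractSketch {

    // Stylesheet built from the XPaths in the question.
    public static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<article>"
      + "<title><xsl:value-of select=\"html/body/div[@id='content']/div[@id='title']/h1\"/></title>"
      + "<firstparagraph><xsl:value-of select=\"html/body/div[@id='content']/div[@id='article']/p[position()=1]\"/></firstparagraph>"
      + "<othertext><xsl:for-each select=\"html/body/div[@id='content']/div[@id='article']/p[position()&gt;1]\">"
      + "<xsl:value-of select='.'/><xsl:text> </xsl:text></xsl:for-each></othertext>"
      + "</article>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Made-up sample page standing in for http://example.com/test.html.
    public static final String SAMPLE_PAGE =
        "<html><body><div id='content'>"
      + "<div id='title'><h1>this is the article title</h1></div>"
      + "<div id='article'><p>This article is about the....</p>"
      + "<p>bla bla</p><p>bla bla bla</p></div>"
      + "</div></body></html>";

    /** Applies STYLESHEET to the given well-formed page and returns the result as a string. */
    public static String extract(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE_PAGE));
    }
}
```

Real-world HTML is often not well-formed, so in practice you would
probably need to clean it up into XML before a transformer like this can
read it.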

> So my questions are : 
> 
> 1) It's possible ? 

Yes certainly. 

> 
> 2) If yes, I will have to (think that I'm not a java's SuperStar) :
>    a) install droids, type 2 commands lines, and let's go (1 hour work)

No, droids is a very loose framework and we do not have the specific use
case you ask for in our code base (maybe afterwards). ;)

>    b) install droids, really understand how droids works, code
> some classes (3 weeks work)

Hehe, that would be the most valuable, but it should not be necessary
for your use case.

>    c) install droids, create a class from an existing one, do some trial
> and error (4-5 days work)

Yeah, I guess that is realistic with testing and so on. 

>    d) ...
> 
> 3) It's difficult to plug droids into a Lenya (based on cocoon) app ?

Actually, not at all. I recommend that you first code your bot in droids,
then generate the jar and copy it to your lenya module. Do not forget
the dependencies that your droids bot may have; add them to the lib dir
of your module.

HTH, and that it gives you the general idea.

Regards

> 
> Thanks for your answer,
> 
> Regards
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)




