incubator-droids-dev mailing list archives

From Florent André <florent.andre-...@4sengines.com>
Subject Another proposed idea ? (was Re: Need 1 :)
Date Wed, 12 Aug 2009 02:07:25 GMT
Here I will try to explain my idea better:

- In my day-to-day work as a webmaster, I have many repetitive "click
actions" to do. Hmm, a little boring, so I started playing with:
-- ruby (http://en.wikipedia.org/wiki/Ruby_(programming_language))
-- mechanize (http://mechanize.rubyforge.org/mechanize/)
-- hpricot (HTML/XML parser)

A few lines of code later... and I'm a happy webmaster.

But not really, in fact. Now I would like to write less code and more "just
instructions". Passing the instructions as xml could be very nice.

Consider this use case:
- I have the "enterprise web yellow pages" (nearly an LDAP) and my
enterprise CMS (no "dev" solution possible, JUST clicking), and I have to
pass some information from the yellow pages to the CMS.

- So in a cool "droids world", I would like to do something like this:

- write a droid-configuration.xml: set which worker to use, configure the
link-following depth, set the DelayTimer in seconds, ...

- write a droids-job.xml: go to this page, fill in this form, select the
links in {xpath}, follow this link, extract the {xpath} and save it, go to
this page and fill in the form with the saved information.
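
To make the idea more concrete, here is a rough sketch of what such a
droids-job.xml could look like. It is only an illustration: every element
and attribute name below is invented for this example, the URLs and form
ids are placeholders, and nothing like this exists in droids today.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical job file: all element and attribute names are invented. -->
<job name="yellow-pages-to-cms">
  <!-- go to this page -->
  <goto url="http://yellow-pages.example.com/search"/>
  <!-- fill this form -->
  <fill-form xpath="//form[@id='search']">
    <field name="department" value="web"/>
  </fill-form>
  <!-- select links in {xpath}, then follow each of them -->
  <select-links xpath="//div[@id='results']//a"/>
  <follow-links>
    <!-- extract the {xpath} and save it under a name -->
    <extract xpath="//div[@id='contact']/h1" save-as="name"/>
    <extract xpath="//div[@id='contact']/p[@class='phone']" save-as="phone"/>
  </follow-links>
  <!-- go to this page and fill the form with the saved information -->
  <goto url="http://cms.example.com/contacts/new"/>
  <fill-form xpath="//form[@id='contact']">
    <field name="name" value="{name}"/>
    <field name="phone" value="{phone}"/>
  </fill-form>
</job>
```

The {name} and {phone} placeholders stand for the values saved by the
extract steps; how they would be substituted is left open.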

... With that, I would be a really happy webmaster! :)


What do you think about that?


Hasta luego

On Tue, 14 Jul 2009 09:56:33 +0200, Thorsten Scherler
<thorsten.scherler.ext@juntadeandalucia.es> wrote:
> On Mon, 2009-07-13 at 16:49 +0200, Florent André wrote:
>> Hi Droids list !
>> 
>> After a talk during the Lenya meeting with Señor Thorsten (Olé! :) ), I
>> would like to have more information about droids.
> 
> Bonjour Monsieur Florent, bienvenido a Droids. ;)
> 
>> I know that droids is not only a web crawler (and I would like to use it
>> for other things), but my immediate need is about crawling...
> 
> I will try to put what now follows as an xml document in terms of droids.
> I guess putting it in our wiki http://cwiki.apache.org/DROIDS/ will be
> helpful for future reference.
> 
>> So let's go : 
>> 
>> I would like to pass to droids an xml like (just an example) : 
>> <article>
>>   <droids:url>http://example.com/test.html</droids:url>
> 
> In droids crawling, the url is the entry point of the processing. What
> happens then is highly configurable, and Ming Fai has suggested some
> changes for the future. I will describe the possibilities that droids
> currently offers for the presented use case.
> 
> As said, we start with the queue, into which you inject the starting
> urls. This queue will then call a worker (which is basically the part of
> the code where the real work is done). This worker may call a
> linkExtractor and/or a Parser to extract links and any other information
> from the incoming page.
> 
>>   <title>
>>     <droids:xpath>html/body/div[@id='content']/div[@id='title']/h1</droids:xpath>
>>   </title>
>>   <firstparagraph>
>>     <droids:xpath>html/body/div[@id='content']/div[@id='article']/p[position()=1]</droids:xpath>
>>   </firstparagraph>
>>   <othertext>
>>     <droids:xpath>html/body/div[@id='content']/div[@id='article']/p[position()>1]</droids:xpath>
>>   </othertext>
>> </article>
>> 
>> and have droids give me something like:
>> <article>
>>   <title>this is the article title</title>
>>   <firstparagraph> This article is about the....</firstparagraph>
>>   <othertext>bla bla bla bla bla...</othertext>
>> </article>
> 
> You could use a simple xsl transformation for that. You can develop the
> xsl stylesheet (basically the xpaths) that extracts the info in lenya as
> usual: use a generator to get the source, then add the transformer, which
> will return the above doc. You would then copy this stylesheet to your
> droids plugin and use it to generate a result output stream. You pass
> this stream to a droids save handler, which saves it to the location you
> want.
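
A minimal sketch of such a stylesheet, reusing the xpaths and element names
from the example above. It assumes the fetched page is well-formed XML with
no namespace on its elements (a namespaced XHTML page would need prefixed
paths), and it does not show how the stylesheet is wired into droids or
lenya:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Build the <article> result document from the crawled page. -->
  <xsl:template match="/">
    <article>
      <title>
        <xsl:value-of
            select="html/body/div[@id='content']/div[@id='title']/h1"/>
      </title>
      <firstparagraph>
        <xsl:value-of
            select="html/body/div[@id='content']/div[@id='article']/p[position()=1]"/>
      </firstparagraph>
      <othertext>
        <!-- Concatenate every paragraph after the first, space-separated. -->
        <xsl:for-each
            select="html/body/div[@id='content']/div[@id='article']/p[position()&gt;1]">
          <xsl:value-of select="."/>
          <xsl:text> </xsl:text>
        </xsl:for-each>
      </othertext>
    </article>
  </xsl:template>
</xsl:stylesheet>
```
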
> 
>> So my questions are : 
>> 
>> 1) Is it possible?
> 
> Yes, certainly.
> 
>> 
>> 2) If yes, will I have to (bear in mind that I'm no java superstar):
>>    a) install droids, type 2 command lines, and off I go (1 hour of work)
> 
> No, droids is a very loose framework and we do not have the specific use
> case you ask for in our code base (maybe afterwards). ;)
> 
>>    b) install droids, really understand how droids works, code
>> some classes (3 weeks of work)
> 
> Hehe, that is the most valuable option, but for your use case it should
> not be necessary.
> 
>>    c) install droids, create a class from an existing one, with some
>> trial and error (4-5 days of work)
> 
> Yeah, I guess that is realistic with testing and so on. 
> 
>>    d) ...
>> 
>> 3) Is it difficult to plug droids into a Lenya (cocoon-based) app?
> 
> Actually, not at all. I recommend first coding your bot in droids, then
> generating the jar and copying it to your lenya module. Do not forget the
> dependencies that your droid may have; add them to the lib dir of your
> module.
> 
> HTH to get you the general idea.
> 
> salu2
> 
>> 
>> Thanks for your answer,
>> 
>> Regards
