incubator-droids-dev mailing list archives

From Mingfai <mingfai...@gmail.com>
Subject Re: some proposed ideas for Droids
Date Tue, 23 Jun 2009 06:15:26 GMT
hi,

On Tue, Jun 23, 2009 at 5:23 AM, Ryan McKinley <ryantxu@gmail.com> wrote:

>
>
> For background, I am not (yet) using droids for web crawling -- rather, I
> use it to manage a bunch of jobs that keep external processes running.  It
> is easy to equate droids with crawling, but I think that is one of many
> functions (though obviously the most generally relevant)


note that all of my proposed ideas refer to crawling (and I only use Droids
for web crawling), so some points may not apply in general. e.g. I suggested
removing Task and TaskMaster, but they may well be valid for the core as a
generic robot framework.

btw, the "what is it" description on the incubator page, "intelligent
standalone robot framework", is indeed vague to me. Other than "framework",
which clearly means it is not a complete application, the other three terms
are unclear. Does "standalone" mean it is not distributed/clustered? I
actually expect Droids to provide a clustered infrastructure for running
Droids / executing Tasks. For web crawling, say, one quickly hits a
bottleneck that adding more threads cannot fix, and a clustered environment
is needed.

p.s. the description indeed makes the project sound like an Artificial
Intelligence robot.



> Does the *core* really need access to the whole object graph?  I totally
> agree that most specific implementations will need broader access.
>
> I think droids power will come from its flexibility / simplicity.  Ideally
> the *core* will have as few dependencies as possible.
>
> I agree that sub-project/package that focuses on web crawling could depend
> on spring.


how about splitting all crawler functionality into a sub-module? And what
should stay in the core versus the crawler module? e.g. the fetcher and
parser: a generic robot may not do fetching and parsing at all.

Say, we could define a rule: if a piece of functionality is needed by more
than one module, it goes into the core.

Or we could just put everything in core first, keep the split in mind, and
do the splitting later. It's good to start simple.



>
>         4. Link-centric design
>>  - Link, extends HashMap, will act as a main arbitrary data container, and
>>     a vehicle that stores attributes and data throughout the whole lifecycle
>>     of fetching, parsing, and extracting.
>>
>
> I don't have any strong opinion here.... but I would rather see an API
> where we can rely on method calls than putting stuff into a Map -- perhaps
> years of dealing with request.getAttribute() has turned me sour on this
> model.



more elaboration on this point, mainly for the crawler use case:

   - for the crawling use case, I propose to give every component (e.g.
   fetcher, parser) a <? extends Link> signature, e.g.
   ParserFactory<T extends Link> with newParser(T link);
   Parser<T extends Link> with parse(T link, Entity entity);

   The raw Link is basically a HashMap, but users who extend Link do not
   need to touch the Map interface at all; they can implement everything as
   plain Java methods, e.g.
   public class EnhancedLink extends Link {
     protected Set<Link> outLinks;
     //.. getter and setter
   }

   And they could implement their own Extractor<? extends Link> that uses
   setOutLinks() to store outlinks on their own Link subclass.

   - referring to DROIDS-52 (https://issues.apache.org/jira/browse/DROIDS-52),
   we would want to store minimal data in a Link. Unless one implements a Queue
   that supports passivation, continually putting Link/LinkTask objects into
   the Queue/TaskQueue will consume a lot of memory.

   - The difficulty with defining a conventional interface is that it is hard
   to define one that works across different components, and it is not
   possible to foresee what data users will want to attach to the Link.
   Assume the following crawler flow: polling (a link from the queue) ->
   fetching -> parsing -> extracting

   Take an example from a real use case I've encountered before. In fetching,
   the fetcher has a Request and a Response. The response contains HTTP
   headers, including Cookie headers. During outlink extraction, we may want
   to create a new link that carries the cookie data. It is no good to pass
   the response object all the way to the extractor (and it might not even be
   possible when the response is not serializable); if Link were not a
   Map-like container, it might have to grow a dedicated List<HTTPHeader>
   just to handle my requirement.
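
To make the proposal above concrete, here is a minimal, self-contained
sketch (everything here is illustrative: EnhancedLink and the simplified
one-argument parse signature are mine, and the Entity parameter and the
real Droids API are omitted for brevity):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed Link-centric design; these are
// NOT the actual Droids classes, just an illustration of the idea.
public class LinkSketch {

    // Link itself is just a Map, so any component in the
    // poll -> fetch -> parse -> extract flow can stash arbitrary
    // data (e.g. cookie headers) under a key.
    public static class Link extends HashMap<String, Object> {
        private final String uri;
        public Link(String uri) { this.uri = uri; }
        public String getUri() { return uri; }
    }

    // Users extend Link and expose typed accessors, never needing
    // to touch the Map interface directly.
    public static class EnhancedLink extends Link {
        private final Set<Link> outLinks = new HashSet<>();
        public EnhancedLink(String uri) { super(uri); }
        public Set<Link> getOutLinks() { return outLinks; }
        public void addOutLink(Link link) { outLinks.add(link); }
    }

    // Components are generic in the Link subtype, as proposed above
    // (the real proposal also passes an Entity; omitted here).
    public interface Parser<T extends Link> {
        void parse(T link);
    }

    public static void main(String[] args) {
        Parser<EnhancedLink> parser = link -> {
            // an extractor storing an outlink it found while parsing...
            link.addOutLink(new Link("http://example.org/found"));
            // ...and cookie data carried forward from the fetch phase
            link.put("Set-Cookie", "session=abc");
        };
        EnhancedLink link = new EnhancedLink("http://example.org/");
        parser.parse(link);
        System.out.println(link.getOutLinks().size()); // prints 1
        System.out.println(link.get("Set-Cookie"));    // prints session=abc
    }
}
```

The point of the bounded type parameter is that a user-supplied
Parser<EnhancedLink> gets the typed accessors for free, while generic core
code can still treat every link as a plain Link / Map.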


regards,
mingfai
