incubator-droids-dev mailing list archives

From Tim Gee <...@trickl.com>
Subject Trickl-Crawler - Significant Fork and Extension of Droids Framework
Date Tue, 13 Dec 2011 18:29:35 GMT
Hi,
  I've just released a significant fork and extension of the Apache Droids
framework, which I've been using for my own purposes for a while.
  http://open.trickl.com/trickl-crawler/index.html
  I've released it under the ASL, with the intent that any useful code
might be integrated into the official Droids trunk in the future. I've
taken a rather brutal but pragmatic approach to using the framework:
where the design didn't meet my needs, I duplicated and revised code from
the framework. So, for example, you will see that I have copied and
changed significant chunks of the API, which are available under
com.trickl.crawler.api. Obviously, in a perfect world, I would have worked
with your development team to discuss changes and find sensible
workarounds, but sadly I didn't have the time for that, so I just rushed
ahead and made changes where I needed them in my modified implementation.
  So there will be conflicts in design, and perhaps philosophy, over some
of my core changes, many of which you might regard as unnecessary. However,
I hope a significant chunk of the code will still be useful and that some
of the design changes prove worthwhile.
  There's quite a lot of code to digest in one sitting, but broadly the
significant extensions to functionality are:

   - Many more handlers. I really wanted to cope with the variety of
   responses web servers can throw back at me and deal with them solely with
   Spring-configured beans. For example, a recent requirement I dealt with
   was a web server that required an HTTP POST request with specific
   headers and JSON-formatted post data. It returned JSON (although the
   server wrongly gave the content type as "text/html"). From that I needed
   a particular JSON parameter containing HTML, which I had to parse into
   XML, then convert using XSL, then bind using JAXB (phew!). All configured
   via Spring beans.
   - More flexibility in the parsers (so I've played with other HTML parser
   implementations).
   - Class loader selection for "classpath:/" resources. When using
   "classpath:/" to load a resource, I needed to specify the actual class
   loader, as I've been working in a web server environment where the
   services jar (with the required resource) was separate from the jar that
   contained the droids framework.
   - JSON and SOAP processing. This may be against the philosophy of the
   framework. The case where I needed this was scraping information about
   films from Wikipedia. I decided I also wanted the film ratings, which
   are available from org.cara.webcarasearch, but they require a SOAP
   request to get this information. While it's not appropriate for a web
   "crawler", which just follows links, it does seem appropriate for a web
   "walker" (which is really what my misnamed framework was designed for),
   where I have a set of known data sources I want to automatically collect
   data from.
   - A "delegating" droid. My requirement here is that I have a single
   queue of tasks, but some of those tasks need to go to different droids
   (e.g. some tasks are SOAP tasks, some just grab images, some render web
   pages). So all the tasks are sent to the delegating droid, which then
   delegates to another droid depending on some criteria (I use a custom
   field to decide).
   - A rendering droid, based on the Mozilla Gecko engine. I haven't
   touched this code in over a year. All I can say is that it once worked
   in a single-threaded environment, but had issues in multi-threaded
   environments. Since writing the code, I no longer have a requirement for
   rendering web pages, so I've not maintained it.
   - Support for timed tasks. Some web servers are very slow to respond and
   rather than allowing them to consume resources for too long, I needed the
   ability to kill some tasks after a time limit.
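
A minimal sketch of the timed-task idea in the last item above, using only
java.util.concurrent. This is not the Trickl-Crawler code; the class and
method names (TimedTaskSketch, runWithTimeout) are assumptions made up for
illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Droids or Trickl-Crawler.
public class TimedTaskSketch {

    /**
     * Runs the work on a separate thread and returns its result,
     * or null if it did not finish within limitMillis.
     */
    public static <T> T runWithTimeout(Callable<T> work, long limitMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(work);
        try {
            return future.get(limitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the worker to stop by interrupting its thread.
            future.cancel(true);
            return null;
        } finally {
            // Release the worker thread whether or not the task finished.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A quick task completes normally.
        System.out.println(runWithTimeout(() -> "done", 1000));
        // A slow task (simulated with sleep) is abandoned after 100 ms.
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(5000);
            return "late";
        }, 100));
    }
}
```

Note that cancel(true) only interrupts the worker thread. A task blocked in
I/O that ignores interrupts may keep running, so a real fetcher would
probably also want socket-level timeouts.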

There are some major design issues as well. Probably too many to list here,
and should you wish to discuss these, it might be worth going over each
individually. Many of them arise because my idea of a droid task is more
"general" than that assumed in the main branch of droids. So fields such as
"depth" do not make any sense for a SOAP task. My top level class "Task"
just requires an identifier.
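
To make the "general" task and the delegating droid described earlier a
little more concrete, here is a rough sketch. It is not the actual
com.trickl.crawler.api code: every name below (Task, LinkTask, SoapTask,
Droid, DelegatingDroid) is hypothetical, and the routing function stands in
for the custom field mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; names do not come from Droids or Trickl-Crawler.
public class DelegatingDroidSketch {

    /** The most general task: it only needs an identifier. */
    interface Task {
        String getId();
    }

    /** A link-following task adds crawl-specific state such as depth. */
    static final class LinkTask implements Task {
        private final String id;
        private final int depth;
        LinkTask(String id, int depth) { this.id = id; this.depth = depth; }
        public String getId() { return id; }
        int getDepth() { return depth; }
    }

    /** A SOAP task has no notion of depth at all. */
    static final class SoapTask implements Task {
        private final String id;
        SoapTask(String id) { this.id = id; }
        public String getId() { return id; }
    }

    interface Droid {
        void handle(Task task);
    }

    /** Receives every task and forwards it to the droid registered for its key. */
    static final class DelegatingDroid implements Droid {
        private final Map<String, Droid> delegates = new HashMap<>();
        private final Function<Task, String> routeKey;

        DelegatingDroid(Function<Task, String> routeKey) {
            this.routeKey = routeKey;
        }

        void register(String key, Droid droid) {
            delegates.put(key, droid);
        }

        public void handle(Task task) {
            Droid target = delegates.get(routeKey.apply(task));
            if (target == null) {
                throw new IllegalArgumentException("No droid for task " + task.getId());
            }
            target.handle(task);
        }
    }

    /** Sends one task of each kind through a single delegating droid. */
    public static List<String> demo() {
        List<String> log = new ArrayList<>();
        DelegatingDroid droid =
                new DelegatingDroid(t -> t instanceof SoapTask ? "soap" : "link");
        droid.register("soap", t -> log.add("soap:" + t.getId()));
        droid.register("link", t -> log.add("link:" + t.getId()));
        droid.handle(new SoapTask("rating-1"));
        droid.handle(new LinkTask("page-1", 0));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the split is that only LinkTask carries "depth"; the queue and
the delegating droid depend on nothing more than the identifier.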

   I hope the code is useful and presents ideas for refinement and
development of the main branch of Apache Droids.

Best regards,
Tim
