incubator-droids-dev mailing list archives

From Richard Frovarp <rfrov...@apache.org>
Subject Re: Trickl-Crawler - Significant Fork and Extension of Droids Framework
Date Thu, 22 Dec 2011 19:34:17 GMT
On 12/13/2011 12:29 PM, Tim Gee wrote:
> Hi,
>    I've just released a significant fork and extension of the Apache Droids
> framework, which I've been using for my own purposes for a while.
>    http://open.trickl.com/trickl-crawler/index.html
>    I've released it under the ASL and the intent is that any useful code
> might be integrated into the official trunk of droids in the future. I've
> taken a rather brutal, but pragmatic approach to using the framework -
> where the design hasn't met my needs I've duplicated and revised code from
> the framework. So, for example, you will see that significant chunks of the
> API I have copied and changed and are available under
> com.trickl.crawler.api. Obviously, in a perfect world, I would work with
> your development team to discuss changes and find sensible workarounds, but
> sadly I didn't have the time for that so I just rushed ahead and made
> changes where I needed them to my modified implementation.
>    So there will be conflicts in design and perhaps philosophy about some of
> my core changes, many of which you might regard as unnecessary. However,
> hopefully, there will still be a significant chunk of code that is useful
> and perhaps some design changes were indeed worthwhile.
>

Hello Tim,

I've been meaning to take some time to look through your release, but
sadly I have not yet had the opportunity. Thank you for using AL2,
and I hope we can incorporate some of your changes back into the core.

You make some interesting changes and additions. Being able to process
those different content types (JSON, for example) might not make sense
for a crawler, but it is quite useful. In my implementation, I'm doing
all sorts of status code handling and recording the results in a
database. I figure if I'm crawling my own material, I might as well know
what is broken and where the redirects are. So that kind of
functionality is certainly very useful when going over a specific set
of content.
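
A minimal sketch of the status code recording described above, in Java.
It is not taken from Droids or Trickl-Crawler: the class name
CrawlStatusLog, its methods, and the outcome categories are all
hypothetical, and an in-memory list stands in for the database.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a crawl status table. */
public class CrawlStatusLog {

    /** Coarse outcome of one fetch; the category names are illustrative. */
    public enum Outcome { OK, REDIRECT, BROKEN, SERVER_ERROR, OTHER }

    /** One row as it might be written to a database. */
    public static final class Entry {
        public final String url;
        public final int statusCode;
        public final String redirectTarget; // null unless outcome is REDIRECT
        public final Outcome outcome;

        Entry(String url, int statusCode, String redirectTarget, Outcome outcome) {
            this.url = url;
            this.statusCode = statusCode;
            this.redirectTarget = redirectTarget;
            this.outcome = outcome;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /**
     * Maps an HTTP status code to a coarse outcome. Every 3xx code is
     * treated as a redirect here; a real crawler would probably want to
     * handle 304 (Not Modified) separately.
     */
    public static Outcome classify(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) return Outcome.OK;
        if (statusCode >= 300 && statusCode < 400) return Outcome.REDIRECT;
        if (statusCode >= 400 && statusCode < 500) return Outcome.BROKEN;
        if (statusCode >= 500 && statusCode < 600) return Outcome.SERVER_ERROR;
        return Outcome.OTHER;
    }

    /** Records one fetch result; the Location header is kept only for redirects. */
    public Entry record(String url, int statusCode, String locationHeader) {
        Outcome outcome = classify(statusCode);
        String target = (outcome == Outcome.REDIRECT) ? locationHeader : null;
        Entry entry = new Entry(url, statusCode, target, outcome);
        entries.add(entry);
        return entry;
    }

    /** Returns all recorded entries with the given outcome, e.g. every broken link. */
    public List<Entry> withOutcome(Outcome wanted) {
        List<Entry> matches = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.outcome == wanted) {
                matches.add(entry);
            }
        }
        return matches;
    }
}
```

After a crawl, withOutcome(Outcome.BROKEN) would list the broken pages
and withOutcome(Outcome.REDIRECT) the redirects with their targets.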

You mentioned that you've tried to use other HTML parsers. How does
HtmlCleaner differ from JTidy? Do you have any feel for how those and/or
Neko compare to the ones from Tika? I've got a few pages that Tika blows
up on.

How did you handle Spring? I see you aren't using the droids-spring
module. I know so little about Spring that I don't know where any
of the deficiencies are.

It would be nice to merge some of your functionality in. I hope I have
time to look at it soon. Obviously patches are always welcome, and may
work for the low-hanging fruit. The more significant changes would
require more effort.
