incubator-droids-dev mailing list archives

From Thorsten Scherler <>
Subject Re: HTML outlink extraction
Date Thu, 02 Apr 2009 11:24:38 GMT
On Thu, 2009-04-02 at 18:53 +0800, Mingfai wrote:
> hi,
> The default LinkExtractor seems to be quite simple (too simple). It mainly
> uses URI.resolve and only caters for the # and javascript scenarios (from
> getURI). Even a simple case like resolving an <a
> href="test.html"> against new URI("") will be wrong, as it
> will return a".

Well, the link extraction has always worked well. The case you just pointed
out looks like a bug, BUT if you mean new URL
("","test.html)) then have a look at, java.lang.String)
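
For illustration, here is a minimal sketch of one way to guard the resolve step. The URLs in the original messages were lost, so it assumes the problem case is a base URI with an empty path; LinkResolveSketch, withRootPath and example.com are hypothetical names, not Droids code. It also skips the # and javascript cases mentioned above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not part of Droids.
class LinkResolveSketch {

    /**
     * Resolves href against base. Returns null for empty, fragment-only
     * and javascript: links, which a crawler would normally skip.
     */
    static URI resolve(URI base, String href) {
        String target = href.trim();
        String js = "javascript:";
        if (target.isEmpty() || target.startsWith("#")
                || target.regionMatches(true, 0, js, 0, js.length())) {
            return null;
        }
        return withRootPath(base).resolve(target);
    }

    /** Gives a hierarchical base URI that has an empty path the path "/". */
    static URI withRootPath(URI base) {
        String path = base.getPath();
        if (base.isOpaque() || (path != null && !path.isEmpty())) {
            return base; // nothing to normalise
        }
        try {
            return new URI(base.getScheme(), base.getAuthority(), "/",
                    base.getQuery(), base.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With this, LinkResolveSketch.resolve(URI.create("http://example.com"), "test.html") resolves against "http://example.com/" rather than against a base with no path at all.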

> And there are many cases that
> URI.resolve doesn't cater for. It seems to me we need to do some work in
> this area to make Droids more usable. Does anyone have any experience in
> outlink extraction?

Enhancements are always welcome; however, the link extraction should work
fine. At least it was fine when I last looked at it. The limitation ATM is
the extraction of JavaScript-generated links.

> I'm trying to see how other frameworks handle outlink extraction and looked
> at:

Funnily enough, that was the basis of Droids' outlink extraction in the
first version I hacked.

> (Heritrix's JavaDoc shows they have given some good thought to handling
> different tags and attributes)
> What do you think if I add a wiki page that lists some scenarios of
> outlink handling (i.e. the requirements)? Or does anyone know if any of the
> many Java crawler projects have documentation in this area?

If you do not look into JavaScript/AJAX link extraction then there is no
secret to it. Either go with an XPath expression or, e.g. for plain text,
with a regexp. Please feel free to open a wiki page around the issue.
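
To make the two options concrete, here is a rough sketch using only the JDK. OutlinkExtractSketch and both method names are hypothetical, not the Droids LinkExtractor. The XPath variant assumes the markup is already well-formed XML (real-world HTML usually needs cleaning up first), and the regexp is a deliberately simple pattern for absolute http(s) URLs in plain text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Droids.
class OutlinkExtractSketch {

    // Simplistic pattern: absolute http(s) URLs up to whitespace, quotes or angle brackets.
    private static final Pattern URL_IN_TEXT =
            Pattern.compile("https?://[^\\s<>\"']+");

    /** Collects the href values of all <a> elements in well-formed XML markup. */
    static List<String> fromXml(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate("//a/@href",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate XPath on input", e);
        }
    }

    /** Collects absolute http(s) URLs that appear in plain text. */
    static List<String> fromPlainText(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_IN_TEXT.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }
}
```

The hrefs returned by fromXml are still relative and would need resolving against the page URI before being queued.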


> regards,
> mingfai
Thorsten Scherler <>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)
