incubator-droids-dev mailing list archives

From Thorsten Scherler <thorsten.scherler....@juntadeandalucia.es>
Subject Re: HTML outlink extraction
Date Thu, 02 Apr 2009 11:24:38 GMT
On Thu, 2009-04-02 at 18:53 +0800, Mingfai wrote:
> hi,
> 
> The default LinkExtractor seems to be quite simple (too simple). It mainly
> uses URI.resolve and only caters for the # and javascript cases (see
> getURI in LinkExtractor.java). Even a simple case, such as resolving <a
> href="test.html"> against new URI("http://www.google.com"), goes wrong: it
> returns "http://www.google.comtest.html".

Well, the link extraction has always worked well. The case you just
pointed out looks like a bug, BUT if you mean
new URL(new URL("http://testServer.com"), "test.html") then have a look at
http://java.sun.com/j2se/1.4.2/docs/api/java/net/URL.html#URL(java.net.URL, java.lang.String)
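
To make the difference concrete, here is a rough sketch, not Droids code.
The class and method names are made up for illustration. It assumes the
problem is the empty path of the base: depending on the JDK version,
URI.resolve may glue the host and the relative path together without a
slash, so the helper first gives the base the root path "/". The second
method uses the two-argument URL constructor from the link above, which
inserts the slash itself (that constructor is deprecated in recent JDKs,
hence the annotation).

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper for illustration only; not part of Droids.
final class LinkResolveSketch {

    // Resolve href against base, guarding against a base with an empty path.
    static URI resolveLink(String base, String href) {
        URI baseUri = URI.create(base);
        String path = baseUri.getRawPath();
        // "http://www.google.com" has an authority but no path. Resolving "/"
        // first yields "http://www.google.com/", which has a slash to build on.
        if (baseUri.getRawAuthority() != null && (path == null || path.isEmpty())) {
            baseUri = baseUri.resolve("/");
        }
        return baseUri.resolve(href);
    }

    // Same idea with java.net.URL(URL context, String spec).
    @SuppressWarnings("deprecation")
    static String resolveWithUrl(String base, String href) {
        try {
            return new URL(new URL(base), href).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveLink("http://www.google.com", "test.html"));
        System.out.println(resolveWithUrl("http://testserver.com", "test.html"));
    }
}
```

With this guard, resolveLink("http://www.google.com", "test.html") gives
http://www.google.com/test.html. An href that is not a legal URI (for
example one containing spaces) makes URI.create throw, so a real extractor
would need to clean or skip such values.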

> And there are many cases that
> URI.resolve doesn't cater for. It seems to me we need to do some work in
> this area to make Droids more usable. Does anyone have any experience with
> outlink extraction?

Enhancements are always welcome; however, the link extraction should work
fine. At least when I last looked at it, it was fine. The limitation ATM is
the extraction of JavaScript-generated links.

> 
> I'm trying to see how other frameworks handle out link extraction and looked
> at:
> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java?view=log

Funnily enough, that was the basis of the Droids outlink extraction in the
first version I hacked.

> https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/heritrix2/engine/src/main/java/org/archive/extractor/RegexpHTMLLinkExtractor.java
> (Heritrix's JavaDoc shows they have given some good thought to handling
> different tags and attributes)
> 
> What do you think if I add a wiki page that lists some scenarios of
> outlink handling (i.e. the requirements)? Or does anyone know if any of the
> many Java crawler projects have documentation in this area?

If you do not look into JavaScript/AJAX link extraction then there is no
secret to it. Either go with an XPath expression or, e.g. for plain text,
with a regexp. Please feel free to open a wiki page around the issue.

salu2

> 
> regards,
> mingfai
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)