incubator-droids-dev mailing list archives

From Mingfai <mingfai...@gmail.com>
Subject HTML outlink extraction
Date Thu, 02 Apr 2009 10:53:33 GMT
hi,

The default LinkExtractor seems too simple. It mainly relies on URI.resolve
and only handles the # and javascript cases (see getURI in
LinkExtractor.java). Even a simple case goes wrong: resolving <a
href="test.html"> against new URI("http://www.google.com") returns
"http://www.google.comtest.html". There are many other cases that
URI.resolve does not handle. It seems to me we need to do some work in this
area to make Droids more usable. Does anyone have experience with outlink
extraction?
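
To make the test.html case concrete, here is a minimal sketch of one possible workaround. It is not the Droids code: the class name OutlinkResolver and its resolve method are made up for illustration, and the rules for skipping # and javascript links are my assumption about what an extractor would want. The idea is to give a base URI that has a host but no path an explicit "/" before calling URI.resolve.

```java
import java.net.URI;

// Hypothetical helper, not part of Droids.
public class OutlinkResolver {

    /**
     * Resolves href against base and returns the absolute URI as a string,
     * or null when the link should be skipped (empty, fragment-only, or
     * javascript: pseudo-link). Throws IllegalArgumentException if either
     * argument is not a syntactically valid URI.
     */
    public static String resolve(String base, String href) {
        if (href == null) {
            return null;
        }
        String h = href.trim();
        if (h.isEmpty() || h.startsWith("#")
                || h.regionMatches(true, 0, "javascript:", 0, "javascript:".length())) {
            return null;
        }

        URI baseUri = URI.create(base);
        // A base such as http://www.google.com has an empty path. Give it an
        // explicit root path so relative references resolve under "/".
        // Assumption: any query or fragment on such a base can be dropped.
        if (baseUri.getRawAuthority() != null && baseUri.getRawPath().isEmpty()) {
            baseUri = URI.create(baseUri.getScheme() + "://" + baseUri.getRawAuthority() + "/");
        }

        return baseUri.resolve(h).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.google.com", "test.html"));
        System.out.println(resolve("http://www.google.com/a/b.html", "c.html"));
        System.out.println(resolve("http://www.google.com", "#top"));
    }
}
```

This only covers the empty-path case. Malformed hrefs (for example ones containing spaces) still make URI.resolve throw, so a real extractor would need to clean or escape them first.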

I'm trying to see how other frameworks handle outlink extraction and looked
at:
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java?view=log
https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/heritrix2/engine/src/main/java/org/archive/extractor/RegexpHTMLLinkExtractor.java
(Heritrix's JavaDoc shows they have given some good thought to handling
different tags and attributes)

What do you think if I add a wiki page that lists some scenarios of
outlink handling (i.e. the requirements)? Or does anyone know whether any of
the many Java crawler projects has documentation in this area?

regards,
mingfai
