incubator-droids-dev mailing list archives

From Thorsten Scherler <thorsten.scherler....@juntadeandalucia.es>
Subject Re: HTML outlink extraction
Date Thu, 02 Apr 2009 12:19:09 GMT
On Thu, 2009-04-02 at 19:38 +0800, Mingfai wrote:
> let's just look at the specific case first. Maybe I have jumped to the
> conclusion that the Link Extraction feature is too simple too soon.
> 
> At line 139 of LinkExtractor.java, it uses URI.resolve(String) to resolve a
> URI.
>       if (!target.toLowerCase().startsWith("javascript")
>           && !target.contains(":/")) {
> 139:        return base.getURI().resolve(target.split("#")[0]);
>       }
>       else if (!target.toLowerCase().startsWith("javascript")) {
>         return new URI(target.split("#")[0]);
>       }
> 
> When I test the URI API with:
>   new URI("http://www.google.com").resolve("index.php")
> it resolves the url to "http://www.google.comindex.php"
> 
> Unless you mean that this is a bug in my JDK, we need to special-case this
> and prepend a "/"

Hmm, 
http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html#resolve(java.net.URI)
"...
3.Otherwise the new URI's authority component is copied from this URI,
and its path is computed as follows: 

     A. If the given URI's path is absolute then the new URI's path is
        taken from the given URI. 
        
     B. Otherwise the given URI's path is relative, and so the new URI's
        path is computed by resolving the path of the given URI against
        the path of this URI. This is done by concatenating all but the
        last segment of this URI's path, if any, with the given URI's
        path and then normalizing the result as if by invoking the
        normalize method. 
..."

That sounds as if new URI("http://www.apache.org").resolve("index.html")
should return http://www.apache.org/index.html, since it reads: "the
result as if by invoking the normalize method".

http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html#normalize()
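
Rule B above concatenates "all but the last segment" of the base path with
the relative path. If the base has an empty path, there is no segment that
could supply the separating "/", which would be consistent with the result
Mingfai reports. A minimal sketch of a workaround, assuming we give such a
base the root path "/" before resolving. The class and method names
(BaseUriResolver, resolve) are only illustrative and are not part of
LinkExtractor:

```java
import java.net.URI;

// Hypothetical helper, not part of Droids: it only illustrates the idea.
class BaseUriResolver {

    // Resolves target against base, dropping any "#fragment" as the
    // quoted LinkExtractor code does.
    static URI resolve(URI base, String target) {
        int hash = target.indexOf('#');
        String ref = hash >= 0 ? target.substring(0, hash) : target;

        // A hierarchical base with an authority but an empty path
        // (e.g. http://www.google.com) gets the root path "/" first.
        String path = base.getRawPath();
        if (!base.isOpaque() && base.getRawAuthority() != null
                && (path == null || path.isEmpty())) {
            base = base.resolve("/");
        }
        return base.resolve(ref);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.google.com");
        // What the plain call prints may depend on the JDK in use.
        System.out.println(base.resolve("index.php"));
        // The helper resolves against http://www.google.com/ instead.
        System.out.println(resolve(base, "index.php"));
    }
}
```

With this, a base that already has a path is resolved exactly as before;
only the empty-path case is touched.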

        
> 
> Previously, I also found another scenario that doesn't work: when there is a
> link <a href="?test=true">test</a> on www.google.com/index.php, it
> resolves to www.google.com/?test=true rather than
> www.google.com/index.php?test=true as a web browser would.
> 
> This makes me feel there are many special scenarios that a crawler needs to
> cater for. What do you think? Is it really so simple? My suggestion to add a
> page is for listing those special scenarios, some of which may just be caused
> by non-standard usage.

Actually that should normally be handled by the above linked methods.
Please comment on issue DROIDS-8/DROIDS-11 if you find that the link
extraction is not working as expected. 
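
If the query-only case does turn out to need special handling, it could
look roughly like the sketch below. It assumes a hierarchical base URI and
follows what browsers do for a reference such as "?test=true": keep the
base path and replace only the query. QueryAwareResolver is a made-up
name for illustration, not an existing Droids class:

```java
import java.net.URI;

// Hypothetical helper, not part of Droids: it only illustrates the idea.
class QueryAwareResolver {

    static URI resolve(URI base, String target) {
        if (target.startsWith("?")) {
            // Cut the base at its own query or fragment, whichever
            // comes first, so that only scheme, authority and path remain.
            String b = base.toString();
            int cut = b.length();
            int q = b.indexOf('?');
            int h = b.indexOf('#');
            if (q >= 0) cut = Math.min(cut, q);
            if (h >= 0) cut = Math.min(cut, h);

            // Drop the fragment of the link, as LinkExtractor does.
            int frag = target.indexOf('#');
            String query = frag >= 0 ? target.substring(0, frag) : target;
            return URI.create(b.substring(0, cut) + query);
        }
        // Everything else is left to URI.resolve.
        return base.resolve(target);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.google.com/index.php");
        System.out.println(resolve(base, "?test=true"));
    }
}
```

All other links still go through URI.resolve unchanged, so this only
covers the one scenario described above.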

salu2

> 
> regards,
> mingfai
> 
> 
> 
> 
> On Thu, Apr 2, 2009 at 7:24 PM, Thorsten Scherler <
> thorsten.scherler.ext@juntadeandalucia.es> wrote:
> 
> > On Thu, 2009-04-02 at 18:53 +0800, Mingfai wrote:
> > > hi,
> > >
> > > The default LinkExtractor seems to be quite simple (too simple). It mainly
> > > uses URI.resolve and only caters for the # and javascript scenarios (see
> > > getURI in LinkExtractor.java). A simple case like resolving <a
> > > href="test.html"> against new URI("http://www.google.com") will be wrong,
> > > as it will return "http://www.google.comtest.html".
> >
> > Well, the link extraction has always worked well. The case you just pointed
> > out looks like a bug, BUT if you mean new URL(new
> > URL("http://testServer.com"), "test.html") then have a look at
> > http://java.sun.com/j2se/1.4.2/docs/api/java/net/URL.html#URL(java.net.URL,
> > java.lang.String)
> >
> > > And there are many cases that URI.resolve doesn't cater for. It seems to
> > > me we need to do some work in this area to make Droids more usable. Does
> > > anyone have any experience with outlink extraction?
> >
> > Enhancements are always welcome; however, the link extraction should work
> > fine. At least when I last looked at it, it was fine. The limitation ATM is
> > the extraction of jscript generated links.
> >
> > >
> > > I'm trying to see how other frameworks handle outlink extraction and
> > > looked at:
> > >
> > http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java?view=log
> >
> > Funnily enough, that was the basis of the Droids outlink extraction in the
> > first version I hacked.
> >
> > >
> > https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/heritrix2/engine/src/main/java/org/archive/extractor/RegexpHTMLLinkExtractor.java
> > > (Heritrix's JavaDoc shows they have put some good thought into handling
> > > different tags and attributes)
> > >
> > > What do you think if I add a wiki page that lists some scenarios of
> > > outlink handling (i.e. the requirements)? Or does anyone know if any of
> > > the many Java crawler projects have documentation in this area?
> >
> > If you do not look into jscript/ajax link extraction then there is no
> > secret to it. Either go with xpath expressions or, e.g. for plain text,
> > with regexp. Please feel free to open a wiki page around the issue.
> >
> > salu2
> >
> > >
> > > regards,
> > > mingfai
> > --
> > Thorsten Scherler <thorsten.at.apache.org>
> > Open Source Java <consulting, training and solutions>
> >
> > Sociedad Andaluza para el Desarrollo de la Sociedad
> > de la Información, S.A.U. (SADESI)
> >
> >
> >
> >
> >
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)




