incubator-droids-dev mailing list archives

From Thorsten Scherler <>
Subject Re: [jira] Commented: (DROIDS-45) Fail to resovle outlink correctly
Date Fri, 03 Apr 2009 08:11:11 GMT
On Thu, 2009-04-02 at 06:10 -0700, Mingfai Ma (JIRA) wrote:
> Mingfai Ma commented on DROIDS-45:
> ----------------------------------
> the LinkExtractor doesn't append '/' automatically. 

Hmm, I just asked Javier to have a look into this since he was the
last one who worked on the code. I will try to find some time to debug it
this weekend, since I need to finish a project ATM.

One small thing I found in your class is the constructor. It is not
optimal, since we would be forced to create a new instance for each
base/link pair; that needs rethinking so the class can be reused.
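
To illustrate the reuse point, here is a minimal sketch, not the actual Droids code: a stateless helper that takes the base and the link as method arguments, so no instance is needed per base/link pair. The class name `LinkResolver`, the method `resolve`, and the example.com URLs are hypothetical. It treats an empty base path as "/" only for a base with no path at all, and leaves deeper URLs untouched.

```java
import java.net.URI;

/** Hypothetical stateless helper; not part of Droids. */
final class LinkResolver {

    private LinkResolver() {
        // Static methods only, so no instance is created per base/link pair.
    }

    /**
     * Resolves href against base. A base that has an authority but an empty
     * path (for example "http://www.example.com") is treated as if its path
     * were "/". Deeper paths are passed to URI.resolve() unchanged.
     */
    static URI resolve(URI base, String href) {
        URI effectiveBase = base;
        String authority = base.getRawAuthority();
        String path = base.getRawPath();
        if (authority != null && path != null && path.isEmpty()) {
            String query = base.getRawQuery();
            effectiveBase = URI.create(base.getScheme() + "://" + authority + "/"
                    + (query != null ? "?" + query : ""));
        }
        // Throws IllegalArgumentException if href is not a valid URI reference.
        return effectiveBase.resolve(href);
    }
}
```

With this, `LinkResolver.resolve(URI.create("http://www.example.com"), "test.html")` yields `http://www.example.com/test.html`, and the same static method can be called for every link on a page.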


> and I think it shouldn't, as it is possible for a server to handle URLs with and
> without '/' differently. For a root domain URL it may be OK, but for a deeper URL we
> can't just assume the last segment of the request path is a directory.
> Apache mod_dir should append a trailing slash but, unfortunately, not all web servers
> on the internet have this feature enabled :-)
> > Fail to resovle outlink correctly
> > ---------------------------------
> >
> >                 Key: DROIDS-45
> >                 URL:
> >             Project: Droids
> >          Issue Type: Bug
> >          Components: core
> >    Affects Versions: 0.01
> >            Reporter: Mingfai Ma
> >
> > I've encountered several cases where outlinks are not extracted correctly. Most are
> > caused by the use of URI.resolve().
> > 1. For a base URI of new URI(""), <a href="test.html">test.html</a>
> > will be resolved to http://www.domain.comtest.html
> > 2. For a base URI of new URI(""), <a href="?test=true">test with param</a>
> > will be resolved to
> > 3. For <a href="\n">line break!</a>, URL.resolve will throw an exception, whereas
> > a browser can resolve the URI. (Remark: I didn't check whether this scenario affects
> > the default Tika/NekoHTML parsing.)
> > I suspect there are many different scenarios, many of them probably caused by
> > non-standard usage (but a crawler has to handle non-standard usage in order to
> > function). Obviously, we cannot cater for every case, and I suggest treating a resolve
> > failure as a bug if a link works in a Mozilla browser but not in the Droids LinkExtractor.
> > This issue is related to the LinkExtractor created in DROIDS-8.
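
For the query-only and line-break cases listed above, a lenient pre-processing step might look like the following. This is only a sketch under assumptions: the class `LenientHref`, its method, and the example.com URLs are hypothetical, and the behavior follows the RFC 3986 rules for empty and query-only references rather than anything verified against Droids or a particular browser.

```java
import java.net.URI;
import java.util.Optional;

/** Hypothetical helper showing lenient handling of odd href values. */
final class LenientHref {

    private LenientHref() {
    }

    /** Returns the resolved link, or empty if href cannot be parsed at all. */
    static Optional<URI> resolve(URI base, String href) {
        // Drop leading and trailing whitespace, including stray line breaks.
        String cleaned = href == null ? "" : href.trim();
        String baseText = base.toString();
        if (cleaned.isEmpty()) {
            // An empty reference points at the base document, minus its fragment.
            return Optional.of(URI.create(cutAt(baseText, '#')));
        }
        try {
            if (cleaned.startsWith("?")) {
                // A query-only reference keeps the base path and replaces the query.
                String withoutQuery = cutAt(cutAt(baseText, '#'), '?');
                return Optional.of(URI.create(withoutQuery + cleaned));
            }
            return Optional.of(base.resolve(cleaned));
        } catch (IllegalArgumentException e) {
            // Not parseable even after cleaning; the caller decides what to do.
            return Optional.empty();
        }
    }

    /** Returns text up to, but not including, the first occurrence of marker. */
    private static String cutAt(String text, char marker) {
        int index = text.indexOf(marker);
        return index >= 0 ? text.substring(0, index) : text;
    }
}
```

Returning an empty `Optional` instead of throwing lets a crawler skip a single bad link without aborting the whole page.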
Thorsten Scherler <>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)
