incubator-droids-dev mailing list archives

From Thorsten Scherler <thorsten.scherler....@juntadeandalucia.es>
Subject Re: [jira] Commented: (DROIDS-45) Fail to resovle outlink correctly
Date Fri, 03 Apr 2009 08:11:11 GMT
On Thu, 2009-04-02 at 06:10 -0700, Mingfai Ma (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694995#action_12694995 ]
> 
> Mingfai Ma commented on DROIDS-45:
> ----------------------------------
> 
> the LinkExtractor doesn't append '/' automatically. 

Hmm, I just asked Javier to have a look at this, since he was the last
one to work on the code. I will try to find some time to debug it this
weekend, since I need to finish a project ATM.

One small thing I found in your class is the constructor: it is not
optimal, since we would be forced to create a lot of instances (one for
each base/link). That needs rethinking so the class can be reused.

salu2


> and I think it shouldn't, as it is possible for a server to handle URLs with and without
> '/' differently. For a root domain URL it may be OK, but for a deeper URL we can't just
> assume the last segment of the request path is a directory.
> 
> Apache mod_dir should append a trailing slash, but unfortunately not all web servers on
> the internet have this feature enabled :-)
> http://httpd.apache.org/docs/2.2/mod/mod_dir.html
> 
> > Fail to resovle outlink correctly
> > ---------------------------------
> >
> >                 Key: DROIDS-45
> >                 URL: https://issues.apache.org/jira/browse/DROIDS-45
> >             Project: Droids
> >          Issue Type: Bug
> >          Components: core
> >    Affects Versions: 0.01
> >            Reporter: Mingfai Ma
> >
> > I've encountered several cases where outlinks are not extracted correctly. Most are
> > caused by the use of URI.resolve().
> > 1. For a base URI of new URI("http://www.domain.com"), <a href="test.html">test.html</a>
> > will be resolved to http://www.domain.comtest.html
> > 2. For a base URI of new URI("http://www.domain.com/index.php"),
> > <a href="?test=true">test with param</a> will be resolved to http://www.domain.com/?test=true
> > 3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve
> > will throw an exception, whereas a browser can resolve the URI. (Remark: I didn't check
> > whether this scenario affects the default Tika/NekoHTML parsing.)
> > I suspect there are many different scenarios, many of them probably caused by
> > non-standard usage (but a crawler has to handle non-standard usage in order to function).
> > Obviously, we cannot cater for every case, and I suggest we consider a resolve failure a bug
> > if a link works in a Mozilla browser but not in the Droids LinkExtractor.
> > this issue is related to the LinkExtractor created in DROIDS-8
> 
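
For what it's worth, here is a minimal sketch of how the three cases from the
issue could be normalized before handing the link to URI.resolve(). The class
OutlinkResolver and its resolve() method are hypothetical names used only for
illustration; they are not part of Droids. The sketch assumes the base URI is
absolute and hierarchical (it has a scheme and an authority), and it treats a
query-only href the way browsers do, keeping the base path.

```java
import java.net.URI;

/** Hypothetical helper for illustration only; not part of the Droids code base. */
final class OutlinkResolver {

    private OutlinkResolver() {
        // Static use only, so no instance is needed per base/link pair.
    }

    /**
     * Resolves href against an absolute, hierarchical base URI (assumed to
     * have a scheme and an authority). Throws IllegalArgumentException if
     * the cleaned-up href is still not a valid URI.
     */
    static URI resolve(URI base, String href) {
        // Case 3: strip surrounding whitespace such as a trailing line break.
        String link = href.trim();

        String prefix = base.getScheme() + "://" + base.getRawAuthority();
        String path = base.getRawPath();
        // Case 1: treat an empty base path as the root path "/".
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        // Case 2: a query-only reference keeps the base path and replaces
        // only the query part.
        if (link.startsWith("?")) {
            return URI.create(prefix + path + link);
        }

        // Everything else goes through the normal resolution, but against
        // the base with the normalized path.
        String query = base.getRawQuery();
        URI normalizedBase = URI.create(prefix + path + (query == null ? "" : "?" + query));
        return normalizedBase.resolve(URI.create(link));
    }
}
```

Called as OutlinkResolver.resolve(new URI("http://www.domain.com"), "test.html"),
this should give http://www.domain.com/test.html. Because the method is static
and keeps no state, it also avoids creating one instance per base/link. It only
covers the three reported cases, not every non-standard link a crawler may meet.
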
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)




