incubator-droids-dev mailing list archives

From "Richard Frovarp (Resolved) (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (DROIDS-45) Fail to resolve outlink correctly
Date Sat, 03 Dec 2011 01:39:40 GMT

     [ https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Frovarp resolved DROIDS-45.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.0.2

Our code does this relatively well.
However, using the droids-tika module for parsing seems to handle everything very well. Let's
let the Tika people worry about those problems.
                
> Fail to resolve outlink correctly
> ---------------------------------
>
>                 Key: DROIDS-45
>                 URL: https://issues.apache.org/jira/browse/DROIDS-45
>             Project: Droids
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.0.1
>            Reporter: Mingfai Ma
>             Fix For: 0.0.2
>
>         Attachments: DROIDS-45b.patch, DROIDS-45c.patch
>
>
> I've encountered several cases where outlinks are not extracted correctly. Most are caused
> by the use of URI.resolve().
> 1. For a base URI of new URI("http://www.domain.com"), <a href="test.html">test.html</a>
> is resolved to http://www.domain.comtest.html
> 2. For a base URI of new URI("http://www.domain.com/index.php"), <a href="?test=true">test
> with param</a> is resolved to http://www.domain.com/?test=true
> 3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve() throws
> an exception, whereas a browser resolves the URI. (Remark: I didn't check whether this
> scenario affects the default Tika/NekoHTML parsing.)
> I suspect there are many different scenarios, many of them probably caused by non-standard
> usage (but a crawler has to handle non-standard usage in order to function). Obviously, we
> cannot cater for every case, and I suggest treating a resolve failure as a bug if a link works
> in a Mozilla browser but not in the Droids LinkExtractor.
> This issue is related to the LinkExtractor created in DROIDS-8.
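
For illustration, the three cases in the quoted description could be handled roughly as in the sketch below. It rests on assumptions: OutlinkResolver, resolve() and withRootPath() are hypothetical names, not part of Droids or Tika, and this is not the change that resolved the issue (the resolution above delegates parsing to droids-tika). The sketch only uses java.net.URI and patches each case before calling URI.resolve().

```java
import java.net.URI;

/** Hypothetical helper for illustration; not part of Droids or Tika. */
class OutlinkResolver {

    /** Resolves an href against a base URI, patching the three reported cases. */
    static URI resolve(String base, String href) {
        URI baseUri = withRootPath(URI.create(base.trim()));

        // Case 3: drop leading/trailing whitespace, including line breaks,
        // so that URI parsing does not reject the reference.
        String cleaned = href.trim();

        // Case 2: a query-only reference keeps the base path and replaces
        // only the query (and fragment) of the base.
        if (cleaned.startsWith("?")) {
            String b = baseUri.toString();
            int cut = b.length();
            int q = b.indexOf('?');
            int f = b.indexOf('#');
            if (q >= 0) cut = q;
            if (f >= 0 && f < cut) cut = f;
            return URI.create(b.substring(0, cut) + cleaned);
        }

        return baseUri.resolve(URI.create(cleaned));
    }

    /** Case 1: gives a base like "http://host" an explicit "/" path. */
    static URI withRootPath(URI base) {
        if (!base.isAbsolute() || base.isOpaque()
                || base.getRawAuthority() == null
                || !base.getRawPath().isEmpty()) {
            return base;
        }
        StringBuilder sb = new StringBuilder();
        sb.append(base.getScheme()).append("://")
          .append(base.getRawAuthority()).append('/');
        if (base.getRawQuery() != null) sb.append('?').append(base.getRawQuery());
        if (base.getRawFragment() != null) sb.append('#').append(base.getRawFragment());
        return URI.create(sb.toString());
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.domain.com", "test.html"));
        System.out.println(resolve("http://www.domain.com/index.php", "?test=true"));
        System.out.println(resolve("http://www.domain.com/", "http://www.yahoo.com\n"));
    }
}
```

Trimming only covers whitespace at the ends of the href. Other non-standard input (spaces or illegal characters inside the reference, for example) would still make URI.create() throw an IllegalArgumentException, so a crawler would probably want to catch that and skip or log the link.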

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
