incubator-droids-dev mailing list archives

From "Mingfai Ma (JIRA)" <j...@apache.org>
Subject [jira] Updated: (DROIDS-45) Fail to resolve outlink correctly
Date Sat, 04 Apr 2009 16:37:13 GMT

     [ https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-45:
-----------------------------

    Attachment: LinkResolverTests.java
                LinkResolver.java

Changed the API based on Thorsten's comment.

Note that these two classes need further work before they can go into Droids: they are
not in a Droids package, they include no license terms, and their style does not match the
original LinkExtractor. They are attached as a base for a Droids implementation.

> Fail to resolve outlink correctly
> ---------------------------------
>
>                 Key: DROIDS-45
>                 URL: https://issues.apache.org/jira/browse/DROIDS-45
>             Project: Droids
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: LinkResolver.java, LinkResolver.java, LinkResolverTests.java, LinkResolverTests.java
>
>
> I've encountered several cases where outlinks are not extracted correctly. Most are caused
> by the use of URI.resolve().
> 1. For a base URI of new URI("http://www.domain.com"), <a href="test.html">test.html</a>
> is resolved to http://www.domain.comtest.html
> 2. For a base URI of new URI("http://www.domain.com/index.php"), <a href="?test=true">test
> with param</a> is resolved to http://www.domain.com/?test=true, which drops index.php
> 3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve() throws
> an exception, whereas a browser resolves the link. (Remark: I didn't check whether this
> scenario affects the default Tika/NekoHTML parsing.)
> I suspect there are many different scenarios, many of them probably caused by non-standard
> usage (but a crawler has to handle non-standard usage in order to function). Obviously, we
> cannot cater for every case, so I suggest treating a resolve failure as a bug if a link works
> in a Mozilla browser but not in the Droids LinkExtractor.
> This issue is related to the LinkExtractor created in DROIDS-8.
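
For illustration only, the three cases in the description above could be handled by a
small wrapper around java.net.URI along the following lines. This is a minimal sketch and
not the attached LinkResolver.java: the class name LenientResolveSketch and its resolve()
method are hypothetical, and the rules it applies (trim whitespace, give an empty base
path a "/", keep the base path for query-only links) are assumptions about what
browser-like behaviour would require.

```java
import java.net.URI;

// Hypothetical sketch, not the attached LinkResolver.java.
final class LenientResolveSketch {

    // Resolves href against base, working around the three reported cases.
    // Throws IllegalArgumentException if the cleaned href is still not a valid URI.
    static URI resolve(URI base, String href) {
        // Case 3: strip leading/trailing whitespace such as a trailing line break.
        String cleaned = href.trim();

        // Case 1: a base with an authority but an empty path gets "/" as its path,
        // so a relative path is not glued directly onto the host name.
        if (base.getRawAuthority() != null && base.getRawPath().isEmpty()) {
            String query = base.getRawQuery();
            base = URI.create(base.getScheme() + "://" + base.getRawAuthority()
                    + "/" + (query == null ? "" : "?" + query));
        }

        // Case 2: a query-only link keeps the base path and replaces only the
        // query (and fragment) of the base.
        if (cleaned.startsWith("?")) {
            String prefix = base.toString();
            int hash = prefix.indexOf('#');
            if (hash >= 0) {
                prefix = prefix.substring(0, hash);
            }
            int mark = prefix.indexOf('?');
            if (mark >= 0) {
                prefix = prefix.substring(0, mark);
            }
            return URI.create(prefix + cleaned);
        }

        // Everything else is delegated to the standard resolver.
        return base.resolve(cleaned);
    }

    public static void main(String[] args) {
        // prints http://www.domain.com/test.html
        System.out.println(resolve(URI.create("http://www.domain.com"), "test.html"));
        // prints http://www.domain.com/index.php?test=true
        System.out.println(resolve(URI.create("http://www.domain.com/index.php"), "?test=true"));
        // prints http://www.yahoo.com
        System.out.println(resolve(URI.create("http://www.domain.com/"), "http://www.yahoo.com\n"));
    }
}
```

The sketch deliberately covers only the three reported cases; other non-standard links
(for example, spaces inside the href) would still be rejected by URI.create().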

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

