incubator-droids-dev mailing list archives

From "Mingfai Ma (JIRA)" <>
Subject [jira] Commented: (DROIDS-45) Fail to resolve outlink correctly
Date Sun, 07 Jun 2009 19:09:07 GMT


Mingfai Ma commented on DROIDS-45:

I'm not sure whether a null path should be normalized to "/":
assertEquals("", normalizer.normalize(""));

If a website behaves differently for a null path and a "/" path, then there might be a problem.

        // apply patterns; an empty or null path falls through to "/"
        if (path != null && !"".equals(path)) {
            for (Pattern pattern : PATH_REPLACEMENTS.keySet()) {
                path = pattern.matcher(path).replaceAll(PATH_REPLACEMENTS.get(pattern));
            }
        } else {
            path = "/";
        }

Changing "/" to a null path is odd but may cause fewer problems. e.g. for,
it just redirects the request to "", and the fetching operation won't
be affected. I tested a couple of popular websites, and they either redirect a null-path
request to another URL or to the "/" path. One of the main functions of this normalization
is to avoid duplicate links as much as possible.
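
For illustration, here is a minimal self-contained sketch of the else-branch variant above, where an empty or null path becomes "/". The class name, the method name normalizePath, and the single replacement pattern are hypothetical and only there to make the fragment runnable; the real PATH_REPLACEMENTS table in Droids may contain different patterns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PathNormalizerSketch {
    // Hypothetical replacement table (assumption, not the Droids one).
    private static final Map<Pattern, String> PATH_REPLACEMENTS = new LinkedHashMap<>();
    static {
        // Collapse runs of slashes into a single slash.
        PATH_REPLACEMENTS.put(Pattern.compile("/{2,}"), "/");
    }

    // Normalizes a URL path; a null or empty path is mapped to "/".
    public static String normalizePath(String path) {
        if (path == null || path.isEmpty()) {
            return "/";
        }
        for (Map.Entry<Pattern, String> entry : PATH_REPLACEMENTS.entrySet()) {
            path = entry.getKey().matcher(path).replaceAll(entry.getValue());
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(normalizePath(null));    // prints "/"
        System.out.println(normalizePath(""));      // prints "/"
        System.out.println(normalizePath("/a//b")); // prints "/a/b"
    }
}
```

With this variant, a link with a null path and a link with a "/" path normalize to the same string, so they are treated as one link when checking for duplicates.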

> Fail to resolve outlink correctly
> ---------------------------------
>                 Key: DROIDS-45
>                 URL:
>             Project: Droids
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-45b.patch, DROIDS-45c.patch
> I've encountered several cases where outlinks are not extracted correctly. Most are caused by the use of URI.resolve().
> 1. For a base URI of new URI(""), <a href="test.html">test.html</a> will be resolved to http://www.domain.comtest.html
> 2. For a base URI of new URI(""), <a href="?test=true">test with param</a> will be resolved to
> 3. For <a href="\n">line break!</a>, URI.resolve() will throw an exception, whereas a browser can resolve the URI. (Remark: I didn't check whether this scenario affects the default Tika/NekoHTML parsing.)
> I suspect there are many different scenarios, many of them probably caused by non-standard usage (but a crawler has to handle non-standard usage in order to function). Obviously, we cannot cater for every case, and I suggest considering a resolve failure a bug if a link works in a Mozilla browser but not in the Droids LinkExtractor.
> this issue is related to the LinkExtractor created in DROIDS-8
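
As a rough sketch of how the cases in the issue description could be handled around java.net.URI.resolve(): give a base URI that has an authority but an empty path the path "/" before resolving, and trim whitespace from the href. The class name, the method name resolveLenient, and the example.com base are hypothetical; this is not the code in the attached patches.

```java
import java.net.URI;
import java.net.URISyntaxException;

class LenientResolveSketch {
    // Resolves href against base, working around an empty base path
    // and whitespace-only hrefs. Hypothetical helper, not Droids code.
    public static URI resolveLenient(URI base, String href) {
        // trim() drops leading and trailing whitespace, including "\n".
        String cleaned = href.trim();
        URI effectiveBase = base;
        String path = base.getPath();
        if (base.getAuthority() != null && (path == null || path.isEmpty())) {
            try {
                // Rebuild the base with "/" as its path.
                effectiveBase = new URI(base.getScheme(), base.getAuthority(),
                        "/", base.getQuery(), base.getFragment());
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException(e);
            }
        }
        if (cleaned.isEmpty()) {
            // A whitespace-only href refers to the base document itself.
            return effectiveBase;
        }
        return effectiveBase.resolve(cleaned);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.example.com"); // assumed example base
        System.out.println(resolveLenient(base, "test.html"));
        System.out.println(resolveLenient(base, "?test=true"));
        System.out.println(resolveLenient(base, "\n"));
    }
}
```

Hrefs that contain illegal characters in the middle would still make resolve() throw an IllegalArgumentException, so a crawler would probably need additional escaping beyond this sketch.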

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
