nutch-dev mailing list archives

From Doğacan Güney (JIRA) <>
Subject [jira] Resolved: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
Date Mon, 25 Feb 2008 09:40:51 GMT


Doğacan Güney resolved NUTCH-567.

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Doğacan Güney

Fixed in rev. 630779.

I added tagsoup-1.2 under parse-html and modified plugin.xml accordingly.

I also modified tagsoup-LICENSE.txt to reflect the license change. However, since TagSoup now
uses Apache 2.0 anyway, I am not sure whether a separate LICENSE file is still necessary. If it
is not, please give me a heads-up and I will remove the license file.

> Proper (?) handling of URIs in TagSoup.
> ---------------------------------------
>                 Key: NUTCH-567
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Doğacan Güney
>            Priority: Minor
>             Fix For: 1.0.0
>         Attachments: README-tagsoup-patched.txt, tagsoup-1.1.3-uripatched.jar
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. More discussion
> on the list and at TagSoup's mailing list.
> I looked at the sources of TagSoup because I'm using it myself (although the URIs are
> not relevant for me). It seems like you can implement a naive workaround by remembering the
> parsing state and just avoiding entity resolution. Attached is the patch that does this.
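
For readers unfamiliar with the problem, the workaround described in the issue can be
illustrated roughly as follows. This is only a minimal sketch of the idea, not the attached
patch and not TagSoup's actual code: the class name, the tiny entity table, and the list of
URI-valued attributes are all assumptions made for illustration. The point is that a query
string such as "page?a=1&copy=2" should not have "&copy" turned into a copyright sign.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; nothing here is taken from TagSoup or Nutch.
class UriAwareEntityDecoder {
    // Assumed, deliberately tiny entity table; a real parser knows many more.
    private static final Map<String, String> ENTITIES =
        Map.of("amp", "&", "lt", "<", "gt", ">", "quot", "\"", "copy", "\u00A9");

    // Assumed set of attributes whose values are treated as URIs.
    private static final Set<String> URI_ATTRS = Set.of("href", "src", "action");

    /** The "parsing state" is simply whether we are inside a URI-valued attribute. */
    static String decodeAttribute(String attrName, String rawValue) {
        boolean inUri = URI_ATTRS.contains(attrName.toLowerCase(Locale.ROOT));
        // Naive workaround: skip entity resolution entirely for URI values.
        return inUri ? rawValue : decodeEntities(rawValue);
    }

    /** Resolves known entities, with or without a trailing semicolon. */
    static String decodeEntities(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // Read the candidate entity name that follows the ampersand.
            int j = i + 1;
            while (j < s.length() && Character.isLetterOrDigit(s.charAt(j))) {
                j++;
            }
            String replacement = ENTITIES.get(s.substring(i + 1, j));
            if (replacement == null) {
                // Unknown name: keep the ampersand literally and move on.
                out.append(c);
                i++;
                continue;
            }
            out.append(replacement);
            // Consume the optional terminating semicolon.
            i = (j < s.length() && s.charAt(j) == ';') ? j + 1 : j;
        }
        return out.toString();
    }
}
```

With this sketch, decodeAttribute("href", "page?a=1&copy=2") returns the value unchanged,
while decodeAttribute("title", "a &copy; b") resolves the entity. The approach is naive in
the sense the issue describes: a properly escaped "&amp;" inside a URI is left undecoded too.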

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
