nutch-dev mailing list archives

From "Dawid Weiss (JIRA)" <>
Subject [jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
Date Sat, 05 Jan 2008 17:38:33 GMT


Dawid Weiss commented on NUTCH-567:

John Cowan apparently released a fixed version of TagSoup (1.2). This is good news for several
reasons (quoting):

- As noted above, I have changed the license to Apache 2.0.

- The processing of entity references in attribute values has finally been fixed to do what
browsers do. That is, a reference is only recognized if it is properly terminated by a
semicolon; otherwise it is treated as plain text. This means that URIs like
"foo?cdown=32&cup=42" are no longer seen as containing an instance of the cup character.
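
To make the rule quoted above concrete, here is a minimal sketch in Java. It is not TagSoup's code: the class name AttributeEntities, the decode method, and the five-entry entity table are hypothetical and exist only for illustration. It assumes Java 9 or later (for Map.of) and handles named references only.

```java
import java.util.Map;

public class AttributeEntities {
    // Tiny illustrative table (an assumption); a real parser carries the full HTML entity set.
    private static final Map<String, String> ENTITIES = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "cup", "\u222A");

    // Decodes named entity references in an attribute value, but only when
    // the reference is terminated by a semicolon. Anything else is copied as-is.
    public static String decode(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                // Scan the candidate entity name that follows the ampersand.
                int j = i + 1;
                while (j < value.length() && Character.isLetterOrDigit(value.charAt(j))) {
                    j++;
                }
                boolean terminated = j < value.length() && value.charAt(j) == ';';
                String replacement = terminated ? ENTITIES.get(value.substring(i + 1, j)) : null;
                if (replacement != null) {
                    out.append(replacement);
                    i = j + 1; // skip past the semicolon
                    continue;
                }
            }
            // Plain character, or an ampersand that does not start a terminated reference.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "&cup" is followed by '=', not ';', so the URI is left untouched.
        System.out.println(decode("foo?cdown=32&cup=42"));
        // "&amp;" is properly terminated, so it becomes a literal ampersand.
        System.out.println(decode("foo?cdown=32&amp;cup=42"));
    }
}
```

With this rule the URI from the quote passes through unchanged, while a properly escaped "&amp;" still decodes to a single ampersand.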

I guess this issue is no longer applicable and an upgrade to the newer TagSoup would be appropriate.

> Proper (?) handling of URIs in TagSoup.
> ---------------------------------------
>                 Key: NUTCH-567
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Priority: Minor
>         Attachments: README-tagsoup-patched.txt, tagsoup-1.1.3-uripatched.jar
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. There is more
> discussion on the list and on TagSoup's mailing list.
> I looked at the sources of TagSoup because I'm using it myself (although the URIs are
> not relevant for me). It seems you can implement a naive workaround by remembering the
> parsing state and simply skipping entity resolution. Attached is a patch that does this.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
