nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Created] (NUTCH-1767) remove special treatment of "params" in relative links
Date Sun, 27 Apr 2014 22:20:15 GMT
Sebastian Nagel created NUTCH-1767:

             Summary: remove special treatment of "params" in relative links
                 Key: NUTCH-1767
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.2.1, 1.8
            Reporter: Sebastian Nagel
            Priority: Minor
             Fix For: 2.3, 1.9

[RFC 1808|] specified that path elements of URLs may contain
so-called params introduced by ";", e.g. ";type=a". If the base URL contains a path param while
the link target does not, the params are transferred to the target:
Step 5: 
 a) if the embedded URL's <params> is non-empty, we skip to
     step 7; otherwise, it inherits the <params> of the base URL (if any)
This behaviour was implemented in NUTCH-436. Later (NUTCH-1115) it was made optional
and configurable via the property {{parser.fix.embeddedparams}}. NUTCH-797 deactivated the changes of
both issues for 1.x (it was not applied to 2.x), with reference to RFC 3986.

[RFC 3986|], which obsoletes RFC 1808, does not mention params,
and the examples given in sect. 5.4 "Reference Resolution Examples" contradict RFC 1808. Also,
[Wikipedia|] states:
Historically, each segment was specified to contain parameters separated from it using a semicolon
(";"), though this was rarely used in practice and current specifications allow but no longer
specify such semantics.
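The RFC 3986 behaviour can be illustrated with a minimal sketch. It uses {{java.net.URI}} from the JDK purely for illustration; it is not Nutch's own link resolution code, and the class name {{ParamsResolutionSketch}} is hypothetical. The base URL and the three references are taken from the example list in RFC 3986, sect. 5.4. Only references with a non-empty path are shown, because {{java.net.URI}} follows the older RFC 2396 and may treat other cases (such as query-only references) differently.

```java
import java.net.URI;

// Hypothetical demo class for illustration; not part of Nutch.
public class ParamsResolutionSketch {

    // Resolves a relative reference against a base URL using java.net.URI.
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Base URL used by the examples in RFC 3986, sect. 5.4.
        String base = "http://a/b/c/d;p?q";
        // The base's ";p" is treated as part of the last path segment,
        // which is replaced as a whole by the reference's path.
        for (String ref : new String[] {"g", "g;x", ";x"}) {
            System.out.println(ref + " -> " + resolve(base, ref));
        }
    }
}
```

For these references RFC 3986 lists http://a/b/c/g, http://a/b/c/g;x and http://a/b/c/;x as the expected targets. The last case is where the two specifications diverge: the example list in RFC 1808 gives http://a/b/c/d;x for ";x".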

Accordingly, any special treatment of "params" in relative links should be removed from Nutch.
At first glance, this would include:
* 2.x parse-html and parse-tika
** remove fixEmbeddedParams(...)
** change unit tests to follow examples from RFC 3986
* 1.x
** remove unused fixEmbeddedParams(...) from parse-html
** remove property {{parser.fix.embeddedparams}} from nutch-default.xml

