nutch-dev mailing list archives

From "Hudson (Jira)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2770) Subcollection logic allows empty string as a whitelist value, thus matching every incoming document.
Date Fri, 13 Mar 2020 09:09:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058534#comment-17058534 ]

Hudson commented on NUTCH-2770:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3666 (See [https://builds.apache.org/job/Nutch-trunk/3666/])
NUTCH-2770 Subcollection logic allows empty string as a whitelist value, (snagel: [https://github.com/apache/nutch/commit/4443cc1edd536321a0774ae050ff747c1bcfa706])
* (edit) src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java


> Subcollection logic allows empty string as a whitelist value, thus matching every incoming
document.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2770
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2770
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer, plugin
>    Affects Versions: 1.16
>            Reporter: Jason Grey
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.17
>
>         Attachments: NUTCH-2770.patch
>
>
> If the whitelist element in subcollections.xml contains empty lines at the end (e.g., because
the XML was formatted nicely), those lines can become an empty string in the string matching
logic. That logic uses String.contains, which returns true when given an empty string
as input.
> This then causes that subcollection to be tagged on EVERY incoming document.
> Here is a POC to show the issue in isolation, since I do not yet have a dev environment
set up for Nutch.
> {code:java}
> /**
> This is a snippet that does the same logic as Subcollection.java in nutch.
> https://github.com/apache/nutch/blob/fdee94d8e0894384f1fca7c9f16c7593a5bc928c/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
> **/
> import java.util.StringTokenizer;
> public class HelloWorld
> {
>   public static void main(String[] args)
>   {
>     String urlToTest = "https://www.example.com/test/url/here";
>     String text = "\r\n\t//research.xyz.com/\r\n\t/research/\r\n\t";
>     StringTokenizer st = new StringTokenizer(text, "\n\r");
>     while (st.hasMoreElements()) {
>       String line = ((String) st.nextElement()).trim();
>       boolean matched = urlToTest.contains(line);
>       System.out.println("line: [" + line + "] = " + matched);
>     }
>   }
> }
> /**
> output:
> line: [//research.xyz.com/] = false
> line: [/research/] = false
> line: [] = true
> as we can see, for the text in our XML config, it's outputting an extra line which is
matching on EVERYTHING...
> **/	
> {code}
> As a workaround, you can collapse the whitespace in the XML file, but I think we
should fix this anyway. I will try to submit a patch soon that filters out
empty strings.
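
The filtering proposed in the description could look roughly like the sketch below. It is not the committed patch and does not reproduce Subcollection.java; the class name WhitelistFilterSketch and the methods parsePatterns and matches are hypothetical, chosen only to illustrate skipping empty lines before the String.contains check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration only; these names are not part of Nutch.
public class WhitelistFilterSketch {

  /** Splits the raw element text into trimmed patterns, dropping empty ones. */
  public static List<String> parsePatterns(String text) {
    List<String> patterns = new ArrayList<>();
    StringTokenizer st = new StringTokenizer(text, "\n\r");
    while (st.hasMoreTokens()) {
      String line = st.nextToken().trim();
      // url.contains("") is always true, so an empty entry must be skipped.
      if (!line.isEmpty()) {
        patterns.add(line);
      }
    }
    return patterns;
  }

  /** Returns true if the URL contains at least one of the patterns. */
  public static boolean matches(String url, List<String> patterns) {
    for (String pattern : patterns) {
      if (url.contains(pattern)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Same whitelist text as in the POC, including the trailing "\r\n\t".
    String text = "\r\n\t//research.xyz.com/\r\n\t/research/\r\n\t";
    List<String> patterns = parsePatterns(text);
    System.out.println("patterns: " + patterns);
    System.out.println(matches("https://www.example.com/test/url/here", patterns));
    System.out.println(matches("https://www.example.com/research/paper", patterns));
  }
}
```

With the POC's input, only the two non-empty patterns remain, so the first URL no longer matches and the second matches through /research/.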



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
