nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2585) NPE in TrieStringMatcher
Date Fri, 03 May 2019 15:19:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832568#comment-16832568 ]

Sebastian Nagel commented on NUTCH-2585:
----------------------------------------

PR including fix is open: [#452|https://github.com/apache/nutch/pull/452]

I've decided to move the unsafe code block into a synchronized method. Because the TrieStringMatcher
allows matching and adding strings to be interleaved, the lazy conversion of nodes cannot be
dropped. The impact on matching performance should be negligible because the synchronized method
is only called on demand, when a node hasn't yet been prepared for matching.
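
To illustrate the idea, here is a minimal sketch of lazy node conversion guarded by a synchronized
method. It is not the code from the PR: the class and method names (LazyTrieNode, PrefixMatcherSketch,
prepareForMatching, getOrAddChild) are hypothetical, and the volatile array with a fast-path check is
an assumption of this sketch rather than something taken from TrieStringMatcher.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified trie node; not the actual Nutch implementation. */
class LazyTrieNode implements Comparable<LazyTrieNode> {
  final char key;
  volatile boolean terminal;
  // Children are collected in a list while strings are added ...
  private final List<LazyTrieNode> childrenList = new ArrayList<>();
  // ... and converted lazily into a sorted array used for binary search.
  private volatile LazyTrieNode[] children;

  LazyTrieNode(char key) {
    this.key = key;
  }

  @Override
  public int compareTo(LazyTrieNode other) {
    return Character.compare(key, other.key);
  }

  /** Adding a child invalidates the prepared array (assumption of this sketch). */
  synchronized LazyTrieNode getOrAddChild(char c) {
    for (LazyTrieNode n : childrenList) {
      if (n.key == c) {
        return n;
      }
    }
    LazyTrieNode n = new LazyTrieNode(c);
    childrenList.add(n);
    children = null;
    return n;
  }

  /** Slow path: only one thread at a time converts the list into the sorted array. */
  private synchronized LazyTrieNode[] prepareForMatching() {
    LazyTrieNode[] prepared = children;
    if (prepared == null) {
      prepared = childrenList.toArray(new LazyTrieNode[0]);
      Arrays.sort(prepared);
      children = prepared;
    }
    return prepared;
  }

  /** Fast path: no locking once the node has been prepared for matching. */
  LazyTrieNode getChild(char c) {
    LazyTrieNode[] prepared = children;
    if (prepared == null) {
      prepared = prepareForMatching();
    }
    int lo = 0;
    int hi = prepared.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (prepared[mid].key == c) {
        return prepared[mid];
      } else if (prepared[mid].key < c) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return null;
  }
}

/** Hypothetical matcher: true if any added string is a prefix of the input. */
class PrefixMatcherSketch {
  private final LazyTrieNode root = new LazyTrieNode('\0');

  void add(String s) {
    LazyTrieNode node = root;
    for (int i = 0; i < s.length(); i++) {
      node = node.getOrAddChild(s.charAt(i));
    }
    node.terminal = true;
  }

  boolean matches(String input) {
    LazyTrieNode node = root;
    for (int i = 0; i < input.length(); i++) {
      node = node.getChild(input.charAt(i));
      if (node == null) {
        return false;
      }
      if (node.terminal) {
        return true;
      }
    }
    return false;
  }
}
```

In this sketch, matching threads work on a local snapshot of the array, so they never observe a
half-converted node, and the lock is taken only when a node has not been prepared yet or has just
been modified by an add.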

> NPE in TrieStringMatcher
> ------------------------
>
>                 Key: NUTCH-2585
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2585
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.16
>
>
> Stumbled on this one just now:
> {code}
> 2018-05-25 14:29:31,844 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: FetcherThread 42 fetch of http://www.ndcmediagroep.nl/wp-content/uploads/2017/03/Leaflet-Noflik-Wenje.pdf failed with: java.lang.NullPointerException
> 	at org.apache.nutch.util.TrieStringMatcher$TrieNode.getChild(TrieStringMatcher.java:107)
> 	at org.apache.nutch.util.SuffixStringMatcher.shortestMatch(SuffixStringMatcher.java:74)
> 	at org.apache.nutch.urlfilter.suffix.SuffixURLFilter.filter(SuffixURLFilter.java:164)
> 	at org.apache.nutch.net.URLFilters.filter(URLFilters.java:43)
> 	at org.apache.nutch.fetcher.FetcherThread.handleRedirect(FetcherThread.java:487)
> 	at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:404)
> {code}
> Edit (added on 1 May 2019): I got a slightly different stack trace, this time using PrefixURLFilter:
> {code}
> 2019-05-01 08:50:07,282 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: FetcherThread 38 fetch of https://kanaalstreek.nl/fzh/2018/06/04/vijf-maal-goud-voor-pegasus-op-nk failed with: java.lang.NullPointerException
> 	at org.apache.nutch.util.TrieStringMatcher$TrieNode.getChild(TrieStringMatcher.java:107)
> 	at org.apache.nutch.util.PrefixStringMatcher.shortestMatch(PrefixStringMatcher.java:79)
> 	at org.apache.nutch.urlfilter.prefix.PrefixURLFilter.filter(PrefixURLFilter.java:73)
> 	at org.apache.nutch.net.URLFilters.filter(URLFilters.java:43)
> 	at org.apache.nutch.fetcher.FetcherThread.handleRedirect(FetcherThread.java:487)
> 	at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:404)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
