nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2585) NPE in TrieStringMatcher
Date Fri, 03 May 2019 14:36:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832542#comment-16832542 ]

Sebastian Nagel commented on NUTCH-2585:
----------------------------------------

Ok, I was able to reproduce this using parallel streams (see [615b20e|https://github.com/sebastian-nagel/nutch/commit/615b20eafe947bf75abee836ddd3a9b67706c49f]):
{noformat}
Testing thread-safety (NUTCH-2585) with 1000 iterations:
Cycle  908 : 7 matches
Exception in thread "main" java.lang.NullPointerException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
        at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
        at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
        at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
        at java.util.stream.LongPipeline.reduce(LongPipeline.java:438)
        at java.util.stream.LongPipeline.sum(LongPipeline.java:396)
        at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:526)
        at org.apache.nutch.util.PrefixStringMatcher.main(PrefixStringMatcher.java:130)
Caused by: java.lang.NullPointerException
        at org.apache.nutch.util.TrieStringMatcher$TrieNode.getChild(TrieStringMatcher.java:106)
        at org.apache.nutch.util.PrefixStringMatcher.matches(PrefixStringMatcher.java:65)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
        at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
        at java.util.stream.AbstractTask.compute(AbstractTask.java:316)
        at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.execLocalTasks(ForkJoinPool.java:1040)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1058)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
{noformat}

As a solution, I propose adding a {{compile()}} method and calling it in the constructors
of PrefixStringMatcher and SuffixStringMatcher. That would avoid the lazy conversion and sorting
of the trie nodes, which currently happens during matching. Making the {{getChild(char nextChar)}}
method synchronized would also fix the problem, but could make matching slower.
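
To illustrate the idea, here is a minimal sketch of a trie that is compiled eagerly in its constructor. It is not the Nutch code: {{EagerTrie}}, {{Node}} and {{getOrAddChild}} are hypothetical names, and the sketch assumes non-empty prefixes and that the finished matcher is safely published to the worker threads (for example through a final field, or by constructing it before the threads start).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of eager trie compilation; not the Nutch implementation. */
final class EagerTrie {

  /** Trie node whose children are frozen into a sorted array by compile(). */
  static final class Node implements Comparable<Node> {
    final char ch;
    boolean terminal;
    private List<Node> building = new ArrayList<>(); // used only while adding prefixes
    private Node[] children;                         // assigned once, in compile()

    Node(char ch) {
      this.ch = ch;
    }

    /** Build-time only: returns the child for c, creating it if missing. */
    Node getOrAddChild(char c) {
      for (Node n : building) {
        if (n.ch == c) {
          return n;
        }
      }
      Node n = new Node(c);
      building.add(n);
      return n;
    }

    /** Converts and sorts the children of this node and of all descendants. */
    void compile() {
      children = building.toArray(new Node[0]);
      Arrays.sort(children);
      building = null;
      for (Node n : children) {
        n.compile();
      }
    }

    /** Read-only binary search: matching never modifies the node. */
    Node getChild(char c) {
      int lo = 0;
      int hi = children.length - 1;
      while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        if (children[mid].ch == c) {
          return children[mid];
        }
        if (children[mid].ch < c) {
          lo = mid + 1;
        } else {
          hi = mid - 1;
        }
      }
      return null;
    }

    @Override
    public int compareTo(Node other) {
      return Character.compare(ch, other.ch);
    }
  }

  private final Node root = new Node('\0');

  EagerTrie(String... prefixes) {
    for (String p : prefixes) {
      Node node = root;
      for (int i = 0; i < p.length(); i++) {
        node = node.getOrAddChild(p.charAt(i));
      }
      node.terminal = true;
    }
    root.compile(); // eager: finished before the matcher is shared with other threads
  }

  /** Returns true if any of the prefixes is a prefix of input. */
  boolean matches(String input) {
    Node node = root;
    for (int i = 0; i < input.length(); i++) {
      node = node.getChild(input.charAt(i));
      if (node == null) {
        return false;
      }
      if (node.terminal) {
        return true;
      }
    }
    return false;
  }
}
```

Because all conversion and sorting happens in the constructor, {{getChild}} only reads the node during matching, so concurrent callers have no lazily initialized state to race on and no synchronization is needed on the matching path.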



> NPE in TrieStringMatcher
> ------------------------
>
>                 Key: NUTCH-2585
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2585
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.16
>
>
> Stumbled on this one just now:
> {code}
> 2018-05-25 14:29:31,844 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: FetcherThread 42 fetch of http://www.ndcmediagroep.nl/wp-content/uploads/2017/03/Leaflet-Noflik-Wenje.pdf failed with: java.lang.NullPointerException
> 	at org.apache.nutch.util.TrieStringMatcher$TrieNode.getChild(TrieStringMatcher.java:107)
> 	at org.apache.nutch.util.SuffixStringMatcher.shortestMatch(SuffixStringMatcher.java:74)
> 	at org.apache.nutch.urlfilter.suffix.SuffixURLFilter.filter(SuffixURLFilter.java:164)
> 	at org.apache.nutch.net.URLFilters.filter(URLFilters.java:43)
> 	at org.apache.nutch.fetcher.FetcherThread.handleRedirect(FetcherThread.java:487)
> 	at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:404)
> {code}
> Edit (added on 1 May 2019): I got a slightly different stack trace, this time using PrefixURLFilter:
> {code}
> 2019-05-01 08:50:07,282 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: FetcherThread 38 fetch of https://kanaalstreek.nl/fzh/2018/06/04/vijf-maal-goud-voor-pegasus-op-nk failed with: java.lang.NullPointerException
> 	at org.apache.nutch.util.TrieStringMatcher$TrieNode.getChild(TrieStringMatcher.java:107)
> 	at org.apache.nutch.util.PrefixStringMatcher.shortestMatch(PrefixStringMatcher.java:79)
> 	at org.apache.nutch.urlfilter.prefix.PrefixURLFilter.filter(PrefixURLFilter.java:73)
> 	at org.apache.nutch.net.URLFilters.filter(URLFilters.java:43)
> 	at org.apache.nutch.fetcher.FetcherThread.handleRedirect(FetcherThread.java:487)
> 	at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:404)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
