nutch-dev mailing list archives

From MengYing Wang <mengyingwa...@gmail.com>
Subject [Problem solved] Can't crawl filesystem with protocol-file plugin - java.lang.NullPointerException
Date Thu, 30 Oct 2014 05:15:53 GMT
Dear Sebastian,

1: Actually, /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
is not a "real" path: cas-curator is a symbolic link to the real folder
cas-curator-0.6.

$ greadlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml

Oh my god! When I changed my original symbolic-link path to the "real"
path, Nutch could crawl the local files into my Solr. Many thanks!
Sebastian, you helped a lot! Thank you!
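
For anyone hitting the same problem, here is a minimal sketch of resolving the
symbolic links before handing the path to Nutch. It uses only POSIX shell
features, so it does not depend on readlink -f or greadlink. The helper names
real_dir and file_url are made up for this illustration and are not part of
Nutch.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Nutch): resolve symlinks in a directory
# path and turn the result into a file: URL with a single slash.

# Print the physical path of a directory, with all symbolic links resolved.
# The subshell keeps the caller's working directory unchanged.
real_dir() {
    (cd -P -- "$1" 2>/dev/null && pwd -P)
}

# Print "file:<physical path>/" for a directory, or fail if it cannot be entered.
file_url() {
    dir=$(real_dir "$1") || return 1
    case $dir in
        /) printf 'file:/\n' ;;          # root already ends in a slash
        *) printf 'file:%s/\n' "$dir" ;;
    esac
}
```

With these helpers, the check from this thread could be run as
./nutch parsechecker "$(file_url /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml)"
so that the URL always uses the "real" path.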

2: I have also applied your new patch file, and the
java.lang.NullPointerException error has disappeared completely. Amazing! Thank you!

$ ./nutch parsechecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
contentType: text/html
signature: 17bdb44990391c96bb8d48d1802ff11c
---------
Url
---------------

file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
Outlinks: 2
  outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/ anchor: ../
  outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml anchor: monitor.xml
Content Metadata: Content-Length=352 nutch.crawl.score=0.0 Last-Modified=Tue, 14 Oct 2014 20:05:50 GMT Content-Type=text/html
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252


$ ./nutch indexchecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
contentType: text/html
content : Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
Index of /Us
id : file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
title : Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
host :
digest : 17bdb44990391c96bb8d48d1802ff11c
tstamp : Wed Oct 29 21:54:00 PDT 2014
url : file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/

3: Something is missing from this tutorial:
https://wiki.apache.org/nutch/IntranetDocumentSearch. To index local
files into Solr, we also need to enable the "indexer-solr" plugin
in conf/nutch-site.xml, which is not mentioned there. Please add it,
so that future users can follow the tutorial step by step.
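
As a quick sanity check, something like the sketch below could confirm that the
plugin is enabled. It assumes the plugin list lives in the plugin.includes
property of conf/nutch-site.xml and that the file is formatted with one XML
element per line; the function name has_indexer_solr is made up for this
illustration. It is a plain text match, not a real XML parse, so commented-out
properties would also match.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): succeed if the plugin.includes
# property in the given nutch-site.xml mentions the indexer-solr plugin.
# Assumes <name>, <value> and </property> each sit on their own line.
has_indexer_solr() {
    # Print the lines from the plugin.includes <name> element up to the
    # closing </property>, then search that range for the plugin id.
    sed -n '/<name>plugin\.includes<\/name>/,/<\/property>/p' "$1" |
        grep 'indexer-solr' >/dev/null
}
```

Usage from the Nutch runtime directory:
has_indexer_solr conf/nutch-site.xml || echo "indexer-solr is not enabled in plugin.includes"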


Best,

Mengying (Angela) Wang








On Mon, Oct 27, 2014 at 4:29 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

> Hi,
>
> thanks for testing!
>
> 1. is /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
>    the "real" path? I.e., are there no symbolic links in the path?
>    The command
>      readlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
>    should show you whether this is the case or not.
>    Because Parse objects are stored by "real" path in the ParseResult,
>    this may cause an NPE when nothing is stored under the original path.
>
> 2. unfortunately, the log output is ambiguous: there are two places in ParserChecker
>    where exceptions are caught with the same log message.
>    Can you apply the attached patch and test again? Just to get more verbose log messages.
>    If you have time, please open a Jira issue to improve the logging in this case.
>
> Thanks,
> Sebastian
>
> On 10/26/2014 02:24 AM, Mengying Wang wrote:
> > Hi Sebastian,
> >
> > I have downloaded the Nutch source code from GitHub (https://github.com/apache/nutch), applied the
> > patches (NUTCH-1879 and NUTCH-1880), and then reinstalled Nutch. Now the good news is that all
> > URLs contain only 1 slash. But unfortunately, a java.lang.NullPointerException warning/error occurs
> > for both the parsechecker and indexchecker commands.
> >
> > Below is the running log:
> >
> > $ ./nutch parsechecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
> > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > contentType: text/html
> > signature: 17bdb44990391c96bb8d48d1802ff11c
> > Couldn't pass score, url file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/ (java.lang.NullPointerException)
> > ---------
> > Url
> > ---------------
> >
> > file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > ---------
> > ParseData
> > ---------
> >
> > Version: 5
> > Status: success(1,0)
> > Title: Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > Outlinks: 2
> >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/ anchor: ../
> >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml anchor: monitor.xml
> > Content Metadata: Content-Length=352 nutch.crawl.score=0.0 Last-Modified=Tue, 14 Oct 2014 20:05:50 GMT Content-Type=text/html
> > Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
> >
> >
> > $ ./nutch indexchecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
> > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > contentType: text/html
> > Exception in thread "main" java.lang.NullPointerException
> > at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:139)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:177)
> >
> > Thanks.
> > Mengying (Angela) Wang
>
>


-- 
Best,
Mengying (Angela) Wang
