nutch-dev mailing list archives

From MengYing Wang <mengyingwa...@gmail.com>
Subject Re: [Problem solved] Can't crawl filesystem with protocol-file plugin - java.lang.NullPointerException
Date Fri, 31 Oct 2014 06:50:01 GMT
Dear Sebastian,

Done. I added a comment to the Jira
https://issues.apache.org/jira/browse/NUTCH-1483 explaining why I could not
crawl the filesystem using protocol-file, and how to solve it. I also
referenced the new Jira https://issues.apache.org/jira/browse/NUTCH-1884,
which is actually not a "real" bug. Thanks. :)
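
For reference, here is a minimal sketch of the workaround from this thread: resolve the symbolic links first and give Nutch the "real" path. It is written in Python purely for illustration. The helper name real_file_url is hypothetical and not part of Nutch, and the single-slash "file:" prefix is an assumption that simply mirrors the URLs in the parsechecker output quoted below.

```python
import os
from urllib.parse import quote


def real_file_url(path):
    """Return a file: URL for `path` with all symbolic links resolved.

    Hypothetical helper, not part of Nutch. It does what `readlink -f`
    does for the path and then prepends "file:".
    """
    resolved = os.path.realpath(path)
    # Keep a trailing slash on directories, as in the URLs used in this thread.
    if os.path.isdir(resolved) and not resolved.endswith("/"):
        resolved += "/"
    # Percent-encode characters that are not valid in a URL (for example spaces).
    return "file:" + quote(resolved)


if __name__ == "__main__":
    import sys

    # Print one resolved file: URL per path given on the command line.
    for arg in sys.argv[1:]:
        print(real_file_url(arg))
```

The printed URL could then be passed to parsechecker or indexchecker, or used as the seed, in place of the symlinked path.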

Best,
Mengying (Angela) Wang

On Thu, Oct 30, 2014 at 11:32 AM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

> Hi Mengying,
>
> great!
>
> > When I change my original "symbolic virtual" path to the "real" path, the Nutch could
> > crawl the local files
>
> In fact, path normalization is good here, otherwise you could end up with many
> duplicates. But the protocol-file plugin could make this more explicit.
> We could think about treating such paths as redirects: that's conceptually close.
>
> > 2: Also I have applied your new patch file, and the java.lang.NullPointerException error totally
> > disappears. Amazing! Thank you!
>
> Perfect!
>
> If you have the time, please, open Jiras for the two problems.
> If not, let me know, and I'll do this.
>
> Thanks for testing!
>
> Best,
> Sebastian
>
> On 10/30/2014 06:15 AM, MengYing Wang wrote:
> > Dear Sebastian,
> >
> > 1: Actually, /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > is not a "real" path: cas-curator is a symbolic link to the real folder cas-curator-0.6.
> >
> > $ greadlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> >
> > Oh my god! When I changed my original "symbolic virtual" path to the "real" path, Nutch could
> > crawl the local files into my Solr. Many thanks! Sebastian, you helped a lot! Thank you!
> >
> > 2: Also I have applied your new patch file, and the java.lang.NullPointerException error totally
> > disappears. Amazing! Thank you!
> >
> > $ ./nutch parsechecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
> > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > contentType: text/html
> > signature: 17bdb44990391c96bb8d48d1802ff11c
> > ---------
> > Url
> > ---------------
> >
> > file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > ---------
> > ParseData
> > ---------
> >
> > Version: 5
> > Status: success(1,0)
> > Title: Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > Outlinks: 2
> >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/ anchor: ../
> >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml anchor: monitor.xml
> > Content Metadata: Content-Length=352 nutch.crawl.score=0.0 Last-Modified=Tue, 14 Oct 2014 20:05:50 GMT Content-Type=text/html
> > Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
> >
> >
> > $ ./nutch indexchecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
> > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > contentType: text/html
> > content :Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > Index of /Us
> > id :file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > title :Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > host :
> > digest :17bdb44990391c96bb8d48d1802ff11c
> > tstamp :Wed Oct 29 21:54:00 PDT 2014
> > url :file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> >
> > 3: The tutorial at https://wiki.apache.org/nutch/IntranetDocumentSearch is missing a step.
> > To index local files into Solr, we also need to enable the "indexer-solr" plugin in
> > conf/nutch-site.xml, which is not mentioned there. Please add it, so that future users
> > can follow the tutorial step by step.
> >
> >
> > Best,
> >
> > Mengying (Angela) Wang
> >
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Oct 27, 2014 at 4:29 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:
> >
> >     Hi,
> >
> >     thanks for testing!
> >
> >     1. Is /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> >        the "real" path, i.e., are there no symbolic links in the path?
> >        The command
> >          readlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> >        should show you whether this is the case or not.
> >        Because Parse objects are stored by "real" path in the ParseResult,
> >        this may cause an NPE when there is no result available for the original path.
> >
> >     2. Unfortunately, the log output is ambiguous: there are two places in ParserChecker
> >        where exceptions are caught with the same log message.
> >        Can you apply the attached patch and test again? Just to get more verbose log messages.
> >        If you have time, please open a Jira to improve the logging in this case.
> >
> >     Thanks,
> >     Sebastian
> >
> >     On 10/26/2014 02:24 AM, Mengying Wang wrote:
> >     > Hi Sebastian,
> >     >
> >     > I have downloaded the Nutch source code from GitHub (https://github.com/apache/nutch),
> >     > applied the patches (NUTCH-1879 and NUTCH-1880), and then reinstalled Nutch. The good
> >     > news is that all URLs now contain only one slash. Unfortunately, a
> >     > java.lang.NullPointerException warning/error occurs for both the parsechecker and
> >     > indexchecker commands.
> >     >
> >     > Below is the running log:
> >     >
> >     > $ ./nutch parsechecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
> >     > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> >     > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> >     > contentType: text/html
> >     > signature: 17bdb44990391c96bb8d48d1802ff11c
> >     > Couldn't pass score, url file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/ (java.lang.NullPointerException)
> >     > ---------
> >     > Url
> >     > ---------------
> >     >
> >     > file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> >     > ---------
> >     > ParseData
> >     > ---------
> >     >
> >     > Version: 5
> >     > Status: success(1,0)
> >     > Title: Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> >     > Outlinks: 2
> >     >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/ anchor: ../
> >     >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml anchor: monitor.xml
> >     > Content Metadata: Content-Length=352 nutch.crawl.score=0.0 Last-Modified=Tue, 14 Oct 2014 20:05:50 GMT Content-Type=text/html
> >     > Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
> >     >
> >     >
> >     > $ ./nutch indexchecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
> >     > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> >     > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> >     > contentType: text/html
> >     > Exception in thread "main" java.lang.NullPointerException
> >     > at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:139)
> >     > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     > at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:177)
> >     >
> >     > Thanks.
> >     > Mengying (Angela) Wang
> >
> >
> >
> >
> > --
> > Best,
> > Mengying (Angela) Wang
>
>


-- 
Best,
Mengying (Angela) Wang
