Dear Sebastian,
Done. I added a comment to the Jira
https://issues.apache.org/jira/browse/NUTCH-1483 to explain why I could not
crawl the filesystem using protocol-file, and how to solve it. I also
mentioned the new Jira https://issues.apache.org/jira/browse/NUTCH-1884,
which is actually not a "real" bug. Thanks. :)
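
In case it helps others who hit the same problem, here is a minimal sketch of
the check that readlink -f did for me, written in Java since Nutch is Java.
It is not Nutch code: the class RealPathCheck and its two methods are
hypothetical names of mine, and it only assumes the standard java.nio.file
API.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Nutch.
class RealPathCheck {

    // Resolve all symbolic links in the path, similar to `readlink -f`.
    // Throws IOException if the path does not exist.
    public static Path realPath(Path p) throws IOException {
        return p.toRealPath();
    }

    // True if the normalized absolute path differs from the resolved one,
    // which usually means some component of the path is a symbolic link.
    public static boolean hasSymlink(Path p) throws IOException {
        return !p.toAbsolutePath().normalize().equals(p.toRealPath());
    }

    // Print the resolved form of every path given on the command line.
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println(p + " -> " + realPath(p)
                + (hasSymlink(p) ? " (contains a symbolic link)" : ""));
        }
    }
}
```

Running it on a seed directory before building the file: URL shows whether
the "real" path should be used instead.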
Best,
Mengying (Angela) Wang
On Thu, Oct 30, 2014 at 11:32 AM, Sebastian Nagel <
wastl.nagel@googlemail.com> wrote:
> Hi Mengying,
>
> great!
>
> > When I changed my original "symbolic virtual" path to the "real" path,
> > Nutch could crawl the local files
>
> In fact, path normalization is good here; otherwise you could end up with
> many duplicates. But the protocol-file plugin could make this more explicit.
> We could think about treating such paths as redirects: that's conceptually
> close.
>
> > 2: Also, I have applied your new patch file, and the
> > java.lang.NullPointerException error has disappeared completely.
> > Amazing! Thank you!
>
> Perfect!
>
> If you have the time, please, open Jiras for the two problems.
> If not, let me know, and I'll do this.
>
> Thanks for testing!
>
> Best,
> Sebastian
>
> On 10/30/2014 06:15 AM, MengYing Wang wrote:
> > Dear Sebastian,
> >
> > 1: Actually,
> > /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > is not a "real" path; cas-curator is a symbolic link to the real folder
> > cas-curator-0.6.
> >
> > $ greadlink -f
> /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> >
> >
> /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> >
> > Oh my god! When I changed my original "symbolic virtual" path to the
> > "real" path, Nutch could crawl the local files into my Solr. Many
> > thanks! Sebastian, you helped a lot! Thank you!
> >
> > 2: Also, I have applied your new patch file, and the
> > java.lang.NullPointerException error has disappeared completely.
> > Amazing! Thank you!
> >
> > $ ./nutch parsechecker
> >
> "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
> >
> > fetching:
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> >
> > parsing:
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> >
> > contentType: text/html
> >
> > signature: 17bdb44990391c96bb8d48d1802ff11c
> >
> > ---------
> >
> > Url
> >
> > ---------------
> >
> >
> >
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> >
> > ---------
> >
> > ParseData
> >
> > ---------
> >
> >
> > Version: 5
> >
> > Status: success(1,0)
> >
> > Title: Index of
> /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> >
> > Outlinks: 2
> >
> > outlink: toUrl:
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/
> > anchor: ../
> >
> > outlink: toUrl:
> >
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml
> > anchor: monitor.xml
> >
> > Content Metadata: Content-Length=352 nutch.crawl.score=0.0
> Last-Modified=Tue, 14 Oct 2014 20:05:50
> > GMT Content-Type=text/html
> >
> > Parse Metadata: CharEncodingForConversion=windows-1252
> OriginalCharEncoding=windows-1252
> >
> >
> > $ ./nutch indexchecker
> >
> "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
> >
> > fetching:
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> >
> > parsing:
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> >
> > contentType: text/html
> >
> > content :Index of
> /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > Index of /Us
> >
> > id
> :file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> >
> > title :Index of
> /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> >
> > host :
> >
> > digest :17bdb44990391c96bb8d48d1802ff11c
> >
> > tstamp :Wed Oct 29 21:54:00 PDT 2014
> >
> > url
> :file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> >
> > 3: Something is wrong with this tutorial:
> > https://wiki.apache.org/nutch/IntranetDocumentSearch. To index
> > the local files in Solr, we also need to enable the "indexer-solr"
> > plugin in the file conf/nutch-site.xml, which is not mentioned there.
> > Please add it too, so future users can easily follow it step by step.
> >
> >
> > Best,
> >
> > Mengying (Angela) Wang
> >
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Oct 27, 2014 at 4:29 PM, Sebastian Nagel <
> wastl.nagel@googlemail.com
> > <mailto:wastl.nagel@googlemail.com>> wrote:
> >
> > Hi,
> >
> > thanks for testing!
> >
> > 1. Is
> > /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > the "real" path? I.e., are there no symbolic links in the path?
> > The command
> > readlink -f
> > /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > should show you whether this is the case or not.
> > Because Parse objects are stored by "real" path in the ParseResult,
> > this may cause an NPE when nothing is stored under the original path.
> >
> > 2. Unfortunately, the log output is ambiguous: there are two places in
> > ParserChecker where exceptions are caught with the same log message.
> > Can you apply the attached patch and test again? Just to get more
> > verbose log messages.
> > If you have time, please open a Jira to improve the logging in
> > this case.
> >
> > Thanks,
> > Sebastian
> >
> > On 10/26/2014 02:24 AM, Mengying Wang wrote:
> > > Hi Sebastian,
> > >
> > > I have downloaded the Nutch source code from GitHub
> > > (https://github.com/apache/nutch), applied the
> > > patches (NUTCH-1879 and NUTCH-1880), and then reinstalled
> > > Nutch. Now the good news is that all
> > > URLs contain only 1 slash. But unfortunately, a
> > > java.lang.NullPointerException warning/error occurs
> > > for both the parsechecker and indexchecker commands.
> > >
> > > Below is the running log:
> > >
> > > $ ./nutch parsechecker
> > >
> "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
> > > fetching:
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > parsing:
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > contentType: text/html
> > > signature: 17bdb44990391c96bb8d48d1802ff11c
> > > Couldn't pass score, url
> > >
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > (java.lang.NullPointerException)
> > > ---------
> > > Url
> > > ---------------
> > >
> > >
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > > ---------
> > > ParseData
> > > ---------
> > >
> > > Version: 5
> > > Status: success(1,0)
> > > Title: Index of
> /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > > Outlinks: 2
> > > outlink: toUrl:
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/
> > > anchor: ../
> > > outlink: toUrl:
> > >
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml
> > > anchor: monitor.xml
> > > Content Metadata: Content-Length=352 nutch.crawl.score=0.0
> Last-Modified=Tue, 14 Oct 2014 20:05:50
> > > GMT Content-Type=text/html
> > > Parse Metadata: CharEncodingForConversion=windows-1252
> OriginalCharEncoding=windows-1252
> > >
> > >
> > > $ ./nutch indexchecker
> > >
> "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
> > > fetching:
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > parsing:
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > contentType: text/html
> > > Exception in thread "main" java.lang.NullPointerException
> > > at
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:139)
> > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > at
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:177)
> > >
> > > Thanks.
> > > Mengying (Angela) Wang
> >
> >
> >
> >
> > --
> > Best,
> > Mengying (Angela) Wang
>
>
--
Best,
Mengying (Angela) Wang