nutch-dev mailing list archives

From "Chris Mattmann" <mattm...@apache.org>
Subject Re: Review Request 9119: Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs
Date Sat, 20 Sep 2014 16:49:16 GMT


> On Sept. 9, 2014, 11:40 p.m., Lewis McGibbney wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 101
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line101>
> >
> >     When I change the Text() class to use the UTF8() class, I get the following
> >     
> >     lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
> >     2014-09-09 16:02:21.339 java[3942:1903] Unable to load realm info from SCDynamicStore
> >     Sep 09, 2014 4:02:21 PM org.apache.nutch.tools.FileDumper main
> >     INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
> >     Exception in thread "main" java.io.EOFException
> >     	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> >     	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> >     	at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:99)
> >     	at org.apache.nutch.protocol.Content.readFields(Content.java:154)
> >     	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
> >     	at org.apache.nutch.tools.FileDumper.main(FileDumper.java:101)
> >         
> >     UTF8 is of course deprecated now so we need to stick with Text and implement the correct code.

Hey @Lewis, I'm not sure this is really an error. I grepped through all the Nutch code
and also ran find -name for anything named testcrawl. No Nutch code in src/test or
src/java references it, so I'm not sure we should be using old UTF8 (instead of Text)
crawl dirs here. I will go ahead and add some exception handling anyway to make it
more robust.

[chipotle:~/src/nutch/src] mattmann% grep -R "testcrawl" test
[chipotle:~/src/nutch/src] mattmann% grep -R "testcrawl" *
[chipotle:~/src/nutch/src] mattmann% find . -name "testcrawl" -print
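
As a rough illustration of the kind of exception handling meant here, the sketch below
reads length-prefixed records and stops cleanly when a record is truncated, instead of
letting the EOFException escape from main. It is not the FileDumper patch: the class
name RobustRecordReader, the method readRecords, and the record layout (a 4-byte length
followed by the payload) are assumptions made only for this example, and it uses plain
java.io rather than Hadoop's SequenceFile.Reader so that it stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: shows EOFException handling only.
class RobustRecordReader {

  /** Reads [int length][payload] records until the stream ends or a record is truncated. */
  static List<byte[]> readRecords(DataInputStream in) throws IOException {
    List<byte[]> records = new ArrayList<>();
    while (true) {
      int length;
      try {
        length = in.readInt();
      } catch (EOFException e) {
        break; // no further record header: normal end of input
      }
      if (length < 0) {
        System.err.println("Invalid record length " + length + ", stopping");
        break;
      }
      byte[] payload = new byte[length];
      try {
        in.readFully(payload); // the same call that fails in the stack trace above
      } catch (EOFException e) {
        System.err.println("Truncated record (expected " + length + " bytes), stopping");
        break; // keep the records read so far instead of crashing
      }
      records.add(payload);
    }
    return records;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(3);
    out.write(new byte[] {1, 2, 3}); // complete record
    out.writeInt(10);
    out.write(new byte[] {9});       // claims 10 bytes, supplies 1
    out.flush();

    List<byte[]> records = readRecords(
        new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    System.out.println("Recovered records: " + records.size());
  }
}
```

Whether to stop at the first bad record or try to skip ahead to the next one is a design
choice; stopping is the simpler option and is what the sketch does.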


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52796
-----------------------------------------------------------


On Sept. 10, 2014, 3:15 a.m., Chris Mattmann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9119/
> -----------------------------------------------------------
> 
> (Updated Sept. 10, 2014, 3:15 a.m.)
> 
> 
> Review request for nutch.
> 
> 
> Bugs: NUTCH-1526
>     https://issues.apache.org/jira/browse/NUTCH-1526
> 
> 
> Repository: nutch
> 
> 
> Description
> -------
> 
> Will contain the patch for the SegmentContentDumperTool described in NUTCH-1526:
> 
> ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
>    -segmentRootDir full file path to the root segment directory, e.g., crawl/segments
>    -regexUrlPattern a regex URL pattern to select URL keys to dump from the content DB in each segment
>    -outputDir The output directory to write file names to.
>    -metadata --key=value where key is a Content Metadata key and value is a value to check.
> 
> 
> Diffs
> -----
> 
>   ./trunk/src/java/org/apache/nutch/tools/FileDumper.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/9119/diff/
> 
> 
> Testing
> -------
> 
> Testing it on DARPA XDATA XNET.
> 
> 
> Thanks,
> 
> Chris Mattmann
> 
>
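
The -regexUrlPattern option in the quoted description selects URL keys by regular
expression. A minimal sketch of that selection step is below. The class UrlKeyFilter
and the method selectUrls are hypothetical names, not taken from the patch, and the
sketch assumes the pattern may match anywhere in the URL (Matcher.find()); the actual
tool may require a full match instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: filters URL keys by a regex.
class UrlKeyFilter {

  /** Returns the URL keys in which the pattern is found, in input order. */
  static List<String> selectUrls(Iterable<String> urlKeys, String regexUrlPattern) {
    Pattern pattern = Pattern.compile(regexUrlPattern); // compile once, reuse per key
    List<String> selected = new ArrayList<>();
    for (String url : urlKeys) {
      // Assumption: a partial match is enough to select the key.
      if (pattern.matcher(url).find()) {
        selected.add(url);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList(
        "http://example.com/report.pdf",
        "http://example.com/index.html");
    // Select only the keys that end in .pdf.
    System.out.println(selectUrls(keys, "\\.pdf$"));
  }
}
```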

