nutch-dev mailing list archives

From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1863) Add JSON format dump output to readdb command
Date Mon, 23 Dec 2019 10:13:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002182#comment-17002182 ]

ASF GitHub Bot commented on NUTCH-1863:
---------------------------------------

sebastian-nagel commented on pull request #490: Fix for NUTCH-1863: Add JSON format dump output to readdb command
URL: https://github.com/apache/nutch/pull/490#discussion_r360833719
 
 

 ##########
 File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
 ##########
 @@ -879,35 +988,37 @@ public void processTopNJob(String crawlDb, long topN, float min,
 
   }
 
-
-  public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException, Exception {
+  public int run(String[] args) throws IOException, InterruptedException,
+      ClassNotFoundException, Exception {
     @SuppressWarnings("resource")
     CrawlDbReader dbr = new CrawlDbReader();
 
     if (args.length < 2) {
-      System.err
-          .println("Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)");
+      System.err.println(
+          "Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)");
       System.err
           .println("\t<crawldb>\tdirectory name where crawldb is located");
       System.err
           .println("\t-stats [-sort] \tprint overall statistics to System.out");
       System.err.println("\t\t[-sort]\tlist status sorted by host");
-      System.err
-          .println("\t-dump <out_dir> [-format normal|csv|crawldb]\tdump the whole db to a text file in <out_dir>");
+      System.err.println(
+          "\t-dump <out_dir> [-format normal|csv|crawldb|json]\tdump the whole db to a text file in <out_dir>");
       System.err.println("\t\t[-format csv]\tdump in Csv format");
-      System.err
 
 Review comment:
   Should also add a line describing `-format json`, maybe clarify that it's "JSON lines".
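
   A minimal sketch of what such a usage line could look like, following the pattern of the
   existing `-format csv` line. The class name `UsageSketch`, the method `jsonFormatUsage`, and
   the exact wording of the message are assumptions for illustration, not the text that was
   committed to the pull request.

```java
public class UsageSketch {

  // Hypothetical wording for the extra usage line; the final text in
  // CrawlDbReader may differ.
  static String jsonFormatUsage() {
    return "\t\t[-format json]\tdump in JSON Lines format (one JSON object per line)";
  }

  public static void main(String[] args) {
    // Existing line from the diff, shown for context.
    System.err.println("\t\t[-format csv]\tdump in Csv format");
    // Suggested additional line describing the new format.
    System.err.println(jsonFormatUsage());
  }
}
```

   Spelling out "JSON Lines" tells users that the dump is a sequence of independent JSON
   objects, one per line, rather than a single JSON array.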
 


> Add JSON format dump output to readdb command
> ---------------------------------------------
>
>                 Key: NUTCH-1863
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1863
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb
>    Affects Versions: 2.3, 1.10
>            Reporter: Lewis John McGibbney
>            Assignee: Shashanka Balakuntala Srinivasa
>            Priority: Major
>             Fix For: 1.17
>
>
> Opening up the ability for third parties to consume Nutch crawldb data as JSON would be a positive thing IMHO.
> This issue should improve the readdb functionality of both 1.X to enable JSON dumps of crawldb data.



