nutch-dev mailing list archives

From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1863) Add JSON format dump output to readdb command
Date Sun, 22 Dec 2019 15:55:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001952#comment-17001952 ]

ASF GitHub Bot commented on NUTCH-1863:
---------------------------------------

sebastian-nagel commented on pull request #490: Fix for NUTCH-1863: Add JSON format dump output to readdb command
URL: https://github.com/apache/nutch/pull/490#discussion_r360713673
 
 

 ##########
 File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
 ##########
 @@ -185,13 +190,83 @@ public synchronized void write(Text key, CrawlDatum value)
         out.writeByte('\n');
       }
 
-      public synchronized void close(TaskAttemptContext context) throws IOException {
+      public synchronized void close(TaskAttemptContext context)
+          throws IOException {
+        out.close();
+      }
+    }
+
+    public RecordWriter<Text, CrawlDatum> getRecordWriter(
+        TaskAttemptContext context) throws IOException {
+      String name = getUniqueFile(context, "part", "");
+      Path dir = FileOutputFormat.getOutputPath(context);
+      FileSystem fs = dir.getFileSystem(context.getConfiguration());
+      DataOutputStream fileOut = fs.create(new Path(dir, name), context);
+      return new LineRecordWriter(fileOut);
+    }
+  }
+
+  public static class CrawlDatumJsonOutputFormat
+      extends FileOutputFormat<Text, CrawlDatum> {
+    protected static class LineRecordWriter
+        extends RecordWriter<Text, CrawlDatum> {
+      private DataOutputStream out;
+      private ArrayList<String> jsonString = new ArrayList<String>();
+      public LineRecordWriter(DataOutputStream out) {
+        this.out = out;
+        try {
+          out.writeBytes("[");
+        } catch (IOException e) {
+        }
+      }
+
+      public synchronized void write(Text key, CrawlDatum value)
+          throws IOException {
+        String fetchTime = new Date(value.getFetchTime()).toString();
+        String modifiedTime = new Date(value.getModifiedTime()).toString();
+        String recordString = "";
+        recordString += "{\n" + "\t\"url\":\"" + key.toString()
 
 Review comment:
   I would strongly recommend using a JSON library to do the escaping. There are many pitfalls,
e.g., what if a URL contains a `"`? [Jackson](https://github.com/FasterXML/jackson) is already
on board (it's a dependency) and it's pretty handy for writing JSON:
   ```java
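   import java.io.IOException;
   import java.util.LinkedHashMap;
   import java.util.Map;

   import com.fasterxml.jackson.core.JsonGenerationException;
   import com.fasterxml.jackson.core.JsonGenerator;
   import com.fasterxml.jackson.core.util.MinimalPrettyPrinter;
   import com.fasterxml.jackson.databind.ObjectMapper;
   import com.fasterxml.jackson.databind.ObjectWriter;
   ...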
   // pretty printer that adds a space after ':' and ',' within a record
   public static class JsonIndenter extends MinimalPrettyPrinter {
   
       @Override
       public void writeObjectFieldValueSeparator(JsonGenerator jg)
           throws IOException, JsonGenerationException {
         jg.writeRaw(": ");
       }
   
       @Override
       public void writeObjectEntrySeparator(JsonGenerator jg)
           throws IOException, JsonGenerationException {
         jg.writeRaw(", ");
       }
   }
   ...
   // inside write(Text key, CrawlDatum value)
   ObjectMapper jsonMapper = new ObjectMapper();
   jsonMapper.getFactory().configure(JsonGenerator.Feature.ESCAPE_NON_ASCII,
           true);
   ObjectWriter jsonWriter = jsonMapper.writer(new JsonIndenter());
   
   ...
   // a linked hash map preserves the ordering
   Map<String, String> data = new LinkedHashMap<String, String>();
   data.put("url", key.toString());
   // put all other fields into the map
   ...
   // and write the serialized record
   out.write(jsonWriter.writeValueAsBytes(data));
   ```
   
   There are many options to control the output format...
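
To make the quoting pitfall concrete, here is a minimal, self-contained sketch. It assumes Jackson 2.x (`jackson-databind`) is on the classpath; the class name `JsonEscapeDemo`, its helper methods and the example URL are hypothetical and not part of the patch or of Nutch. It compares hand-built concatenation with serialization through `ObjectMapper` for a URL that contains a `"`.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, not part of Nutch.
public class JsonEscapeDemo {

  // An ObjectMapper can be shared once it is configured.
  private static final ObjectMapper MAPPER = new ObjectMapper();

  /** Naive concatenation: a quote inside the URL ends the JSON string early. */
  static String naive(String url) {
    return "{\"url\":\"" + url + "\"}";
  }

  /** Serialization via Jackson: the library escapes the value. */
  static String viaJackson(String url) throws IOException {
    Map<String, String> data = new LinkedHashMap<>();
    data.put("url", url);
    return MAPPER.writeValueAsString(data);
  }

  /** Returns true if Jackson can parse the text as a JSON object. */
  static boolean parses(String json) {
    try {
      MAPPER.readValue(json, Map.class);
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    String url = "http://example.com/?q=\"x\""; // made-up URL with quotes
    System.out.println(naive(url) + " parses: " + parses(naive(url)));
    System.out.println(viaJackson(url) + " parses: " + parses(viaJackson(url)));
  }
}
```

In this sketch only the Jackson output can be read back as JSON; the concatenated string breaks at the first embedded quote.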
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Add JSON format dump output to readdb command
> ---------------------------------------------
>
>                 Key: NUTCH-1863
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1863
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb
>    Affects Versions: 2.3, 1.10
>            Reporter: Lewis John McGibbney
>            Assignee: Shashanka Balakuntala Srinivasa
>            Priority: Major
>             Fix For: 1.17
>
>
> Opening up the ability for third parties to consume Nutch crawldb data as JSON would be a positive thing IMHO.
> This issue should improve the readdb functionality of both 1.X to enable JSON dumps of crawldb data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
