samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hai Lu <lhai...@gmail.com>
Subject Re: Review Request 51142: SAMZA-967: HDFS System Consumer
Date Wed, 05 Oct 2016 18:16:41 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
-----------------------------------------------------------

(Updated Oct. 5, 2016, 6:16 p.m.)


Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.


Changes
-------

fix gradle warning


Bugs: SAMZA-967
    https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
-------

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

The system factory is used by Samza to instantiate SystemConsumer, SystemProducer, and SystemAdmin
for a specific system. The FileDataSystemFactory can be reused for other file system like
sources. 

HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of HDFS files
need to be consumed for this job. The DirectoryPartitioner also uses “GroupingPattern”
to group files into partitions if advanced partitioning is required. HDFSSystemAdmin will
then persist the “PartitionDescriptor” to HDFS.

The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. Based on
this information as well as the actual assignment of partitions, it would then know which
files to read from.

The initial implementation of the HDFS system consumer supports only avro data files. It’s
very easy to extend it to a variety of file format by implementing the FileReader interface.

                                                                                         
                            
                             +------------------------------------------------------------------------------+
        
                             |                                                           
                  |         
           +-----------------+                                     HDFS                  
                  |         
           |   Obtain        |                                                           
                  |         
           |  Partition      +------+----------------------^------+---------------------------------^-------+
        
           | Description            |                      |      |                      
          |                 
           |                        |                      |      |                      
          |                 
           |          +-------------v-------+              |      |       Filtering/     
          |                 
           |          |                     |              |      +---+    Grouping      
          +-----+           
           |          | HDFSAvroFileReader  |              |          |                  
                |           
           |          |                     |    Persist   |          |                  
                |           
           |          +---------+-----------+   Partition  |          |                  
                |           
           |                    |              Description |   +------v--------------+   
     +----------+----------+
           |                    |                          |   |                     |   
     |                     |
           |          +---------+-----------+              |   |Directory Partitioner|   
     |   HDFSAvroWriter    |
           |          |     IFileReader     |              |   |                     |   
     |                     |
           |          |                     |              |   +------+--------------+   
     +----------+----------+
           |          +---------+-----------+              |          |                  
                |           
           |                    |                          |          |                  
                |           
           |                    |                          |          |                  
                |           
           |          +---------+-----------+            +-+----------+--------+         
     +----------+----------+
           |          |                     |            |                     |         
     |                     |
           |          | HDFSSystemConsumer  |            |   HDFSSystemAdmin   |         
     | HDFSSystemProducer  |
           +---------->                     |            |                     |      
        |                     |
                      +---------+-----------+            +-----------+---------+         
     +----------+----------+
                                |                                    |                   
                |           
                                +------------------------------------+------------------------------------+
          
                                                                     |                   
                            
                             +---------------------------------------+--------------------------------------+
        
                             |                                                           
                  |         
                             |                              HDFSSystemFactory            
                  |         
                             |                                                           
                  |         
                             +------------------------------------------------------------------------------+


Diffs (updated)
-----

  build.gradle 2bea27b75288d3103178bc3762b9556f6e69cdd1 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java PRE-CREATION

  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java PRE-CREATION

  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptorUtil.java PRE-CREATION

  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java PRE-CREATION

  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java PRE-CREATION

  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java PRE-CREATION

  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java PRE-CREATION

  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 61b7570afae3219b618c8830905035063941bdd7

  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 92eb4472533db67dca01f075cb460581b4bdac0d

  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7

  samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java PRE-CREATION

  samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptorUtil.java PRE-CREATION

  samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
PRE-CREATION 
  samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
PRE-CREATION 
  samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
PRE-CREATION 
  samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/YarnContainerRunner.java dacc52de0a34498a715a299bc69c95aebd1b08ba

  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 4e328a5f8c2b496a71e36c106339b7af263c96c7


Diff: https://reviews.apache.org/r/51142/diff/


Testing
-------

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn cluster.


Thanks,

Hai Lu


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message