samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yi Pan \(Data Infrastructure\)" <yi...@linkedin.com>
Subject Re: Review Request 51142: SAMZA-967: HDFS System Consumer
Date Thu, 29 Sep 2016 18:57:14 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150895
-----------------------------------------------------------



Still in the middle (MultiFileHdfsReader). Will continue after lunch. BTW, overall looks very
good! Thanks!


build.gradle (line 465)
<https://reviews.apache.org/r/51142/#comment218968>

    I would strongly recommend to remove the dependency on samza-kafka and samza-yarn here.
There are only very small portion of code/configuration that we needed in samza-hdfs and I
would vote for less dependency on other modules here.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java (line 55)
<https://reviews.apache.org/r/51142/#comment218969>

    Nice doc! Curious whether you use some tools for the ASCII graphs here?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java (line 199)
<https://reviews.apache.org/r/51142/#comment218971>

    So, we are not handling HdfsFileReader exceptions w/ retries yet, right? I think that
it is fine for the first implementation. We should definitely log a JIRA for improvement here
though, since I heard from Venice team that Hadoop actually is less reliable than Kafka.


- Yi Pan (Data Infrastructure)


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> -----------------------------------------------------------
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
>     https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, SystemProducer, and
SystemAdmin for a specific system. The FileDataSystemFactory can be reused for other file
system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of HDFS
files need to be consumed for this job. The DirectoryPartitioner also uses “GroupingPattern”
to group files into partitions if advanced partitioning is required. HDFSSystemAdmin will
then persist the “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. Based
on this information as well as the actual assignment of partitions, it would then know which
files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro data files.
It’s very easy to extend it to a variety of file format by implementing the FileReader interface.
> 
>                                                                                     
                                 
>                              +------------------------------------------------------------------------------+
        
>                              |                                                      
                       |         
>            +-----------------+                                     HDFS             
                       |         
>            |   Obtain        |                                                      
                       |         
>            |  Partition      +------+----------------------^------+---------------------------------^-------+
        
>            | Description            |                      |      |                 
               |                 
>            |                        |                      |      |                 
               |                 
>            |          +-------------v-------+              |      |       Filtering/
               |                 
>            |          |                     |              |      +---+    Grouping 
               +-----+           
>            |          | HDFSAvroFileReader  |              |          |             
                     |           
>            |          |                     |    Persist   |          |             
                     |           
>            |          +---------+-----------+   Partition  |          |             
                     |           
>            |                    |              Description |   +------v--------------+
        +----------+----------+
>            |                    |                          |   |                    
|         |                     |
>            |          +---------+-----------+              |   |Directory Partitioner|
        |   HDFSAvroWriter    |
>            |          |     IFileReader     |              |   |                    
|         |                     |
>            |          |                     |              |   +------+--------------+
        +----------+----------+
>            |          +---------+-----------+              |          |             
                     |           
>            |                    |                          |          |             
                     |           
>            |                    |                          |          |             
                     |           
>            |          +---------+-----------+            +-+----------+--------+    
          +----------+----------+
>            |          |                     |            |                     |    
          |                     |
>            |          | HDFSSystemConsumer  |            |   HDFSSystemAdmin   |    
          | HDFSSystemProducer  |
>            +---------->                     |            |                     | 
             |                     |
>                       +---------+-----------+            +-----------+---------+    
          +----------+----------+
>                                 |                                    |              
                     |           
>                                 +------------------------------------+------------------------------------+
          
>                                                                      |              
                                 
>                              +---------------------------------------+--------------------------------------+
        
>                              |                                                      
                       |         
>                              |                              HDFSSystemFactory       
                       |         
>                              |                                                      
                       |         
>                              +------------------------------------------------------------------------------+
> 
> 
> Diffs
> -----
> 
>   build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java PRE-CREATION

>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java PRE-CREATION

>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptorUtil.java
PRE-CREATION 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
PRE-CREATION 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
PRE-CREATION 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
PRE-CREATION 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
PRE-CREATION 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
PRE-CREATION 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
PRE-CREATION 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
PRE-CREATION 
>   samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 61b7570afae3219b618c8830905035063941bdd7

>   samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 92eb4472533db67dca01f075cb460581b4bdac0d

>   samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7

>   samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java PRE-CREATION

>   samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptorUtil.java
PRE-CREATION 
>   samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
PRE-CREATION 
>   samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
PRE-CREATION 
>   samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
PRE-CREATION 
>   samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
PRE-CREATION 
>   samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
>   samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
>   samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
>   samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
>   samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
261310d03de204718621f601117f016da14841df 
>   samza-shell/src/main/bash/run-job-for-azkaban.sh PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/YarnContainerRunner.java dacc52de0a34498a715a299bc69c95aebd1b08ba

>   samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 4e328a5f8c2b496a71e36c106339b7af263c96c7

> 
> Diff: https://reviews.apache.org/r/51142/diff/
> 
> 
> Testing
> -------
> 
> unit tests pass.
> 
> manually tested by writing a real hdfs samza job and deploying to a yarn cluster.
> 
> 
> Thanks,
> 
> Hai Lu
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message