flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files
Date Sun, 24 Sep 2017 10:15:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178140#comment-16178140 ]

ASF GitHub Bot commented on FLINK-5944:
---------------------------------------

Github user mlipkovich commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4683#discussion_r140652438
  
    --- Diff: flink-core/pom.xml ---
    @@ -52,6 +52,12 @@ under the License.
     			<artifactId>flink-shaded-asm</artifactId>
     		</dependency>
     
    +		<dependency>
    +			<groupId>org.apache.flink</groupId>
    +			<artifactId>flink-shaded-hadoop2</artifactId>
    +			<version>${project.version}</version>
    +		</dependency>
    --- End diff --
    
    What do you think about making this dependency compile-time only?
    
    Regarding the difference between the codecs: as I understand it, the issue is that
Snappy-compressed files are not splittable. So Hadoop splits raw files into blocks and
compresses each block separately using regular Snappy. If you take the whole Hadoop
Snappy-compressed file, regular Snappy will not be able to decompress it, since it is not
aware of the block boundaries.
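
For illustration only, a minimal sketch (not part of the pull request) of the two read paths.
It assumes the Hadoop and xerial snappy-java dependencies are on the classpath (and, for
SnappyCodec, the native Snappy library); the class name and file paths are hypothetical:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.xerial.snappy.SnappyInputStream;

    public class SnappyFormatsSketch {

        // Hadoop-written .snappy files are block-compressed: each block is
        // compressed separately, so they must be read back through the codec,
        // which understands the block framing.
        static InputStream openHadoopSnappy(String path) throws Exception {
            SnappyCodec codec = new SnappyCodec();
            codec.setConf(new Configuration());
            return codec.createInputStream(new FileInputStream(path));
        }

        // Files written with plain snappy-java (xerial) use a single stream
        // format with no Hadoop block boundaries; SnappyInputStream handles
        // only this format and cannot decode the Hadoop block layout.
        static InputStream openXerialSnappy(String path) throws Exception {
            return new SnappyInputStream(new FileInputStream(path));
        }
    }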


> Flink should support reading Snappy Files
> -----------------------------------------
>
>                 Key: FLINK-5944
>                 URL: https://issues.apache.org/jira/browse/FLINK-5944
>             Project: Flink
>          Issue Type: New Feature
>          Components: Batch Connectors and Input/Output Formats
>            Reporter: Ilya Ganelin
>            Assignee: Mikhail Lipkovich
>              Labels: features
>
> Snappy is an extremely performant and widely used compression format that offers fast
> compression and decompression.
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory and updating
> initDefaultInflateInputStreamFactories in FileInputFormat (see the sketch below).
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we must provide
> two separate implementations, since Hadoop uses a different version of the Snappy format
> than Snappy Java (which is the xerial/snappy included in Flink).
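
For reference, a hedged sketch (not the actual code from the pull request) of what such a
factory could look like, based on Flink's InflaterInputStreamFactory interface and the xerial
snappy-java stream format; the class name is hypothetical:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Collection;
    import java.util.Collections;

    import org.apache.flink.api.common.io.compression.InflaterInputStreamFactory;
    import org.xerial.snappy.SnappyInputStream;

    public class SnappyInflaterInputStreamFactorySketch
            implements InflaterInputStreamFactory<SnappyInputStream> {

        @Override
        public SnappyInputStream create(InputStream in) throws IOException {
            // Wraps the raw stream; this handles the xerial stream format only,
            // not Hadoop's block-compressed Snappy files.
            return new SnappyInputStream(in);
        }

        @Override
        public Collection<String> getCommonFileExtensions() {
            // Associate the factory with the ".snappy" file extension.
            return Collections.singleton("snappy");
        }
    }

Such a factory would then need to be registered in FileInputFormat so that files ending in
".snappy" are decompressed transparently when read.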



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
