beam-commits mailing list archives

From "Eugene Kirpichov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-167) TextIO can't read concatenated gzip files
Date Mon, 04 Apr 2016 19:04:25 GMT

    [ https://issues.apache.org/jira/browse/BEAM-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224834#comment-15224834 ]

Eugene Kirpichov commented on BEAM-167:
---------------------------------------

Here's a test and a patch: https://gist.github.com/jkff/d8d984a33a41ec607328cee8e418c174
(I haven't yet gone through the contribution guide steps; I will do so as soon as I get to it.
Meanwhile, anybody who has should feel free to use this directly.)

> TextIO can't read concatenated gzip files
> -----------------------------------------
>
>                 Key: BEAM-167
>                 URL: https://issues.apache.org/jira/browse/BEAM-167
>             Project: Beam
>          Issue Type: Bug
>            Reporter: Eugene Kirpichov
>
> $ cat <<END > header.csv
> a,b,c
> END
> $ cat <<END > body.csv
> 1,2,3
> 4,5,6
> 7,8,9
> END
> $ gzip -c header.csv > file.gz
> $ gzip -c body.csv >> file.gz
> The file is well-formed:
> $ gzip -dc file.gz
> a,b,c
> 1,2,3
> 4,5,6
> 7,8,9
> However, TextIO.Read.from("/path/to/file.gz") will read only "a,b,c" - reproducible even
> when the file is on local disk and with the DirectPipelineRunner.
> The bug is in CompressedSource. It uses GzipCompressorInputStream, which by default reads
> only the first gzip stream in the file, but has an option to read all of them. Previously
> (in Dataflow SDK 1.4.0) we used GZIPInputStream which reads all streams.
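
For illustration, here is a minimal sketch of the behaviour described above, using only
java.util.zip from the JDK. It is not the patch from the gist. The class name
ConcatenatedGzipDemo and its helper methods are made up for this example. It builds the same
two-member gzip data as the shell commands in the report and reads it back with
GZIPInputStream, which continues past the first member. As far as I know, the "option"
mentioned for Commons Compress is the two-argument constructor
GzipCompressorInputStream(InputStream, boolean decompressConcatenated), but that is my
assumption about what a fix would use, not something taken from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Beam or the Dataflow SDK.
public class ConcatenatedGzipDemo {

    /** Compresses the text into one standalone gzip member (like `gzip -c`). */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    /** Appends one byte array to another (like `>>` in the shell example). */
    public static byte[] concat(byte[] first, byte[] second) {
        byte[] joined = new byte[first.length + second.length];
        System.arraycopy(first, 0, joined, 0, first.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    /** Decompresses the data with GZIPInputStream, reading until end of stream. */
    public static String gunzipAll(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] header = gzip("a,b,c\n");
        byte[] body = gzip("1,2,3\n4,5,6\n7,8,9\n");
        // Two gzip members back to back, as in file.gz from the report.
        System.out.print(gunzipAll(concat(header, body)));
    }
}
```

A reader that stops after the first member would return only "a,b,c" for the same bytes,
which matches the symptom reported for TextIO.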



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
