beam-commits mailing list archives

From "Olivier NGUYEN QUOC (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (BEAM-2490) ReadFromText function not taking all data with glob operator (*)
Date Wed, 21 Jun 2017 08:49:00 GMT

     [ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olivier NGUYEN QUOC updated BEAM-2490:
--------------------------------------
    Description: 
I run a very simple pipeline:
* Read my files from storage
* Split on the '\n' character
* Write the result to Google Cloud Storage

The following files match the pattern:
* my_files_2016090116_20160902_060051_xxxxxxxxxx.csv.gz (229.25 MB)
* my_files_2016090117_20160902_060051_xxxxxxxxxx.csv.gz (184.1 MB)
* my_files_2016090118_20160902_060051_xxxxxxxxxx.csv.gz (171.73 MB)
* my_files_2016090119_20160902_060051_xxxxxxxxxx.csv.gz (151.34 MB)
* my_files_2016090120_20160902_060051_xxxxxxxxxx.csv.gz (129.69 MB)
* my_files_2016090121_20160902_060051_xxxxxxxxxx.csv.gz (151.7 MB)
* my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (346.46 MB)
* my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (222.57 MB)

The pipeline runs without errors, but it outputs only a 288.62 MB file (instead of the expected ~1.5 GB file).

{code:python}
import apache_beam as beam

# p is an already-constructed beam.Pipeline
data = (p
        | 'ReadMyFiles' >> beam.io.ReadFromText(
            'gs://XXXX_folder1/my_files_20160901*.csv.gz',
            skip_header_lines=1,
            compression_type=beam.io.filesystem.CompressionTypes.GZIP)
        | 'SplitLines' >> beam.FlatMap(lambda x: x.split('\n')))

output = (data
          | 'Write' >> beam.io.WriteToText('gs://XXX_folder2/test.csv',
                                           num_shards=1))
{code}
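
A quick way to rule out the glob expansion itself (a minimal sketch, assuming the {{FileSystems}} API in {{apache_beam.io.filesystems}} behaves as documented for this SDK version):

{code:python}
# Diagnostic sketch (assumption: apache_beam.io.filesystems.FileSystems
# is available in this SDK version). List what the glob actually matches
# and total the file sizes, to compare against Dataflow's estimate.
from apache_beam.io.filesystems import FileSystems

result = FileSystems.match(['gs://XXXX_folder1/my_files_20160901*.csv.gz'])[0]
for metadata in result.metadata_list:
    print('%s %d' % (metadata.path, metadata.size_in_bytes))
print('total bytes: %d' % sum(m.size_in_bytes for m in result.metadata_list))
{code}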

Dataflow reports that the estimated size of the output of the ReadFromText step is only 602.29 MB, which corresponds neither to the size of any single input file nor to the total size of the files matching the pattern.
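
If the glob does match all files, a possible workaround while this is investigated (a sketch under the same {{FileSystems}} assumption, not a confirmed fix) is to expand the pattern up front and read each matched file as its own labeled source, then flatten:

{code:python}
# Workaround sketch (assumption, not a confirmed fix): expand the glob
# before pipeline construction and read each matched file separately,
# so no matched file can be silently dropped by the glob source.
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

match = FileSystems.match(['gs://XXXX_folder1/my_files_20160901*.csv.gz'])[0]
per_file = [
    p | 'Read_%d' % i >> beam.io.ReadFromText(
        metadata.path,
        skip_header_lines=1,
        compression_type=beam.io.filesystem.CompressionTypes.GZIP)
    for i, metadata in enumerate(match.metadata_list)
]
data = tuple(per_file) | 'FlattenAll' >> beam.Flatten()
{code}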


> ReadFromText function not taking all data with glob operator (*) 
> -----------------------------------------------------------------
>
>                 Key: BEAM-2490
>                 URL: https://issues.apache.org/jira/browse/BEAM-2490
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>    Affects Versions: 2.0.0
>         Environment: Usage with Google Cloud Platform: Dataflow runner
>            Reporter: Olivier NGUYEN QUOC
>            Assignee: Ahmet Altay
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
