beam-commits mailing list archives

From "Ahmet Altay (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-1386) Job hangs without warnings after reading ~20GB of gz csv
Date Fri, 03 Feb 2017 23:10:51 GMT

    [ https://issues.apache.org/jira/browse/BEAM-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852300#comment-15852300 ]

Ahmet Altay commented on BEAM-1386:
-----------------------------------

Hi Johan,

Yes, dividing the compressed file into smaller files will help. The Beam Python SDK does not
split compressed files, because there is no efficient way to seek within a compressed stream.
Another option would be to use a different file format, such as Avro (via AvroIO). I will close
this bug because this is working as intended for Beam.
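
For illustration, here is a minimal sketch of the first option, assuming the data has already
been re-split into gzipped shards; the bucket paths, shard naming, and transform labels below
are placeholders, and the exact API surface may differ slightly in the 0.5.0 SDK:

    import apache_beam as beam

    p = beam.Pipeline()

    (p
     # Each matched .gz shard is read as a single unit (gzip cannot be
     # split), so many smaller shards give the runner more parallel work
     # than one large compressed file.
     | 'ReadShards' >> beam.io.ReadFromText('gs://my-bucket/input/part-*.csv.gz')
     | 'WriteOut' >> beam.io.WriteToText('gs://my-bucket/output/rows'))

    # Avro-based alternative: data written with beam.io.WriteToAvro can be
    # read back with beam.io.ReadFromAvro, and Avro's block structure lets
    # reads be split within a single file.
    p.run()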

If you continue to have problems specific to Dataflow, you can reach out to us through
https://cloud.google.com/dataflow/support or on StackOverflow.

Ahmet

> Job hangs without warnings after reading ~20GB of gz csv
> --------------------------------------------------------
>
>                 Key: BEAM-1386
>                 URL: https://issues.apache.org/jira/browse/BEAM-1386
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>    Affects Versions: 0.5.0
>         Environment: Running on Google Dataflow with 'n1-standard-8' machines.
>            Reporter: Johan Brodin
>            Assignee: Ahmet Altay
>            Priority: Minor
>
> When running the job it works fine up until about 20GB, or around 23 million rows, read from a
> gzipped csv file (43M rows in total). I halted the job, so its statistics seem to have
> disappeared, but its id is "2017-02-03_04_25_41-15296331815975218867". Are there any built-in
> limitations on file size? Should I try to break the file up into several smaller files? Could
> the issue be related to the workers' disk size?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
