beam-commits mailing list archives

From "Ahmet Altay (JIRA)" <>
Subject [jira] [Commented] (BEAM-2208) Apache Beam Python SDK is atleast 5 times slower
Date Tue, 09 May 2017 05:40:04 GMT


Ahmet Altay commented on BEAM-2208:

[], do you have any recent job runs? Only the bare minimum of metadata is kept
beyond a few weeks for debugging, and I cannot tell what is causing the difference
from that information. One thing I noticed is that the biggest difference in time
comes from the reading step. Could you check whether GCS gzip compression is enabled
for this input file?
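
For reference, a minimal sketch of one way to make that check, using only the Python
standard library. It assumes you have already obtained the object's Content-Encoding
metadata value and the first few bytes of the stored object by some other means. The
helper names (looks_gzipped, is_gzip_input) are hypothetical and are not part of Beam
or of any GCS client library.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(first_bytes: bytes) -> bool:
    """Return True if the leading bytes start with the gzip magic number."""
    return first_bytes[:2] == GZIP_MAGIC


def is_gzip_input(content_encoding, first_bytes: bytes) -> bool:
    """Combine two signals: declared metadata and the payload itself.

    content_encoding: the object's Content-Encoding metadata value, or None
        if it is unset (assumed to be fetched separately).
    first_bytes: the first few bytes of the object as stored.
    """
    declared = (content_encoding or "").strip().lower() == "gzip"
    return declared or looks_gzipped(first_bytes)


if __name__ == "__main__":
    # Build a small gzip payload locally to stand in for the real input file.
    sample = gzip.compress(b"word count input\n")
    print(is_gzip_input(None, sample[:4]))
    print(is_gzip_input("gzip", b"plain text"))
    print(is_gzip_input(None, b"word,count\n"))
```

If either signal indicates gzip, the read step is decompressing the input, which may
account for part of the difference in read time.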

> Apache Beam Python SDK is atleast 5 times slower
> ------------------------------------------------
>                 Key: BEAM-2208
>                 URL:
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-dataflow, sdk-py
>    Affects Versions: 0.6.0
>            Reporter: Anant Bhandarkar
>            Assignee: Ahmet Altay
>            Priority: Critical
> I have been trying to run the Beam Word count example with a 2GB file.
> When I run the Java example for word count on this CSV file, the job completes in 7.15secs Mins.
> Job ID: 2017-04-18_23_57_02-2832613177376293063
> But the word count example with the same file using the Python SDK takes 28 to 35 mins.
> Job ID: 2017-04-20_04_48_27-8924552896141769408
> SDK version: Apache Beam SDK for Python 0.6.0

This message was sent by Atlassian JIRA
