beam-commits mailing list archives

From "Ahmet Altay (JIRA)" <>
Subject [jira] [Commented] (BEAM-2208) Python SDK wordcount on cloud Dataflow runner is slow
Date Tue, 09 May 2017 17:56:04 GMT


Ahmet Altay commented on BEAM-2208:

Thank you [].

From the linked job, a few things are happening:
- Autoscaling cannot scale beyond 8 workers. This might be a quota issue on your side.
- max_num_workers is not set. Without it, autoscaling is capped at 15 workers
(although you are not hitting that cap because of the limit above).
- It is possible that there is a hot key, which is adding to the execution time.
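To make the max_num_workers point concrete, here is a minimal sketch of how the stock wordcount example could be launched with that option set explicitly. The helper name wordcount_command and all project, bucket, and path values are hypothetical placeholders; the flag names are the ones I understand the Python SDK to accept, so please check them against your SDK version. Raising max_num_workers will not help while the quota still limits the job to 8 workers.

```python
import shlex


def wordcount_command(project, temp_location, input_path, output_path,
                      max_num_workers):
    """Build the argv for the stock wordcount example on the Dataflow runner.

    All arguments are caller-supplied placeholders; nothing here is taken
    from the job discussed in this thread.
    """
    return [
        "python", "-m", "apache_beam.examples.wordcount",
        "--runner", "DataflowRunner",
        "--project", project,
        "--temp_location", temp_location,
        "--input", input_path,
        "--output", output_path,
        # Explicit upper bound for autoscaling (assumed flag name).
        "--max_num_workers", str(max_num_workers),
    ]


# Hypothetical values, for illustration only.
argv = wordcount_command(
    project="my-project",
    temp_location="gs://my-bucket/tmp",
    input_path="gs://my-bucket/input.csv",
    output_path="gs://my-bucket/output/counts",
    max_num_workers=30,
)
print(" ".join(shlex.quote(arg) for arg in argv))
```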

It would be most helpful if I could reproduce this case. From the title of the issue I am assuming
that you are using the wordcount example as is. Would it be possible for you to share your
input file (if it only contains dummy information)? Otherwise, would you check your quota
and try running with another input file?
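If sharing the file is not possible, a quick local check on a sample of it could show whether a hot key is plausible. The sketch below is only an illustration: the function name top_key_share is hypothetical, and the tokenizing regex is my assumption of how words are split, not necessarily what your pipeline does. It reports what fraction of all tokens each of the most frequent words accounts for; a single word with a very large share would point toward a hot key.

```python
import re
from collections import Counter


def top_key_share(lines, top_n=5):
    """Return (word, share_of_all_tokens) for the top_n most frequent words.

    The regex below is an assumed tokenization, used for illustration only.
    """
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[A-Za-z']+", line))
    total = sum(counts.values())
    if total == 0:
        return []
    return [(word, n / total) for word, n in counts.most_common(top_n)]


# Tiny in-memory sample; in practice pass an open file object
# over a sample of the real input.
sample = ["a a a b", "a c"]
for word, share in top_key_share(sample, top_n=3):
    print(f"{word}: {share:.1%}")
```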

> Python SDK wordcount on cloud Dataflow runner is slow
> -----------------------------------------------------
>                 Key: BEAM-2208
>                 URL:
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-dataflow, sdk-py
>    Affects Versions: 0.6.0
>            Reporter: Anant Bhandarkar
>            Assignee: Ahmet Altay
>            Priority: Critical
> I have been trying to run the Beam Word count example with a 2GB file.
> When I run the Java Example for word count of this csv file the job gets completed in 7.15secs Mins.
> Job ID: 2017-04-18_23_57_02-2832613177376293063
> But the word count example with the same file using the Python SDK takes 28 to 35 mins. Job ID: 2017-04-20_04_48_27-8924552896141769408
> SDK version: Apache Beam SDK for Python 0.6.0

This message was sent by Atlassian JIRA
