From user-return-28050-apmail-spark-user-archive=spark.apache.org@spark.apache.org Wed Mar 4 10:20:19 2015 Return-Path: X-Original-To: apmail-spark-user-archive@minotaur.apache.org Delivered-To: apmail-spark-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4BDCF109A8 for ; Wed, 4 Mar 2015 10:20:19 +0000 (UTC) Received: (qmail 47757 invoked by uid 500); 4 Mar 2015 10:20:10 -0000 Delivered-To: apmail-spark-user-archive@spark.apache.org Received: (qmail 47685 invoked by uid 500); 4 Mar 2015 10:20:10 -0000 Mailing-List: contact user-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list user@spark.apache.org Received: (qmail 47675 invoked by uid 99); 4 Mar 2015 10:20:10 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 04 Mar 2015 10:20:10 +0000 X-ASF-Spam-Status: No, hits=2.1 required=5.0 tests=HK_RANDOM_ENVFROM,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of zjffdu@gmail.com designates 209.85.215.41 as permitted sender) Received: from [209.85.215.41] (HELO mail-la0-f41.google.com) (209.85.215.41) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 04 Mar 2015 10:20:06 +0000 Received: by labmn12 with SMTP id mn12so7254940lab.2 for ; Wed, 04 Mar 2015 02:17:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=D4DRcQ568lPxmI9YQFZ00oeKUgSNcIrReafCam2jZn4=; b=Nk0jLnlo73sQwRY/TZVbrRFmauGM1dYe6JLGRvDHqo2gcEnVz2Amn4BaOaSBTcf84s Xm/ovq4PWtgaaoQpNj8xXo3FDP6G3o5RnrCtd45L8uDU6prMNVI+fZD4655YlPd9vGxS gb2UecFQ/6tvCHSjWRSGguLHKxkHsMUhVc/ufx7aqmT+3q5+CuN/rHWVTUnwkLL6oNQ6 9d1/2JCNNs7UIYMKH/o6Ax67wp8RGZ96VXWe2gUHPE14EUnRZbGZAWww8foPAknXKtql E6yxOLjsoLDOkqyBenmL24eTrLCsuU0ZHvB0o7XqMwywWpsiwgxLnfafFcrlsEczfYFi bXlw== MIME-Version: 1.0 X-Received: by 10.152.170.199 with SMTP id ao7mr2606346lac.27.1425464250445; Wed, 04 Mar 2015 02:17:30 -0800 (PST) Received: by 10.25.30.10 with HTTP; Wed, 4 Mar 2015 02:17:30 -0800 (PST) In-Reply-To: References: Date: Wed, 4 Mar 2015 18:17:30 +0800 Message-ID: Subject: Re: Is the RDD's Partitions determined before hand ? From: Jeff Zhang To: Sean Owen Cc: "user@spark.apache.org" Content-Type: multipart/alternative; boundary=089e01176b2949b767051073c211 X-Virus-Checked: Checked by ClamAV on apache.org --089e01176b2949b767051073c211 Content-Type: text/plain; charset=UTF-8 Hi Sean, > If you know a stage needs unusually high parallelism for example you can repartition further for that stage. The problem is we may don't know whether high parallelism is needed. e.g. for the join operator, high parallelism may only be necessary for some dataset that lots of data can join together while for other dataset high parallelism may not be necessary if only a few data can join together. So my question is that unable changing parallelism at runtime dynamically may not be flexible. On Wed, Mar 4, 2015 at 5:36 PM, Sean Owen wrote: > Hm, what do you mean? You can control, to some extent, the number of > partitions when you read the data, and can repartition if needed. > > You can set the default parallelism too so that it takes effect for most > ops thay create an RDD. One # of partitions is usually about right for all > work (2x or so the number of execution slots). > > If you know a stage needs unusually high parallelism for example you can > repartition further for that stage. > On Mar 4, 2015 1:50 AM, "Jeff Zhang" wrote: > >> Thanks Sean. >> >> But if the partitions of RDD is determined before hand, it would not be >> flexible to run the same program on the different dataset. Although for the >> first stage the partitions can be determined by the input data set, for the >> intermediate stage it is not possible. Users have to create policy to >> repartition or coalesce based on the data set size. >> >> >> On Tue, Mar 3, 2015 at 6:29 PM, Sean Owen wrote: >> >>> An RDD has a certain fixed number of partitions, yes. You can't change >>> an RDD. You can repartition() or coalese() and RDD to make a new one >>> with a different number of RDDs, possibly requiring a shuffle. >>> >>> On Tue, Mar 3, 2015 at 10:21 AM, Jeff Zhang wrote: >>> > I mean is it possible to change the partition number at runtime. Thanks >>> > >>> > >>> > -- >>> > Best Regards >>> > >>> > Jeff Zhang >>> >> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > -- Best Regards Jeff Zhang --089e01176b2949b767051073c211 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hi Sean,

=C2=A0> If you know a stage needs unusually high parallel= ism for example you can repartition further for that stage.=C2=A0

= The problem is we may don'= t know whether high parallelism is needed. e.g. for the join operator, high= parallelism may only be=C2=A0necessary=C2=A0for some dataset that lots of = data can join together while for other dataset high parallelism may not be= =C2=A0necessary=C2=A0if=C2=A0only a few data can join together.=C2=A0

<= div>So my question is that unable changing parallelism at runtime dynamical= ly may not be flexible.



On Wed, Mar 4, 2015 at 5:36= PM, Sean Owen <sowen@cloudera.com> wrote:

Hm, what do you mean? You can control, to = some extent, the number of partitions when you read the data, and can repar= tition if needed.

You can set the default parallelism too so that it takes eff= ect for most ops thay create an RDD. One # of partitions is usually about r= ight for all work (2x or so the number of execution slots).

If you know a stage needs unusually high parallelism for exa= mple you can repartition further for that stage.

On Mar 4, 2015 1:50 AM, "Jeff Zhang" &= lt;zjffdu@gmail.com> wrote:
Thanks Sean.

But if the partitions of RDD is = determined before hand, it would not be flexible to run the same program on= the different dataset. Although for the first stage the partitions can be = determined by the input data set, for the intermediate stage it is not poss= ible. Users have to create policy to repartition or coalesce based on the d= ata set size.


On Tue, Mar 3, 2015 at 6:29 PM, Sean Owen <
sowen@= cloudera.com> wrote:
An RDD= has a certain fixed number of partitions, yes. You can't change
an RDD. You can repartition() or coalese() and RDD to make a new one
with a different number of RDDs, possibly requiring a shuffle.

On Tue, Mar 3, 2015 at 10:21 AM, Jeff Zhang <zjffdu@gmail.com> wrote:
> I mean is it possible to change the partition number at runtime. Thank= s
>
>
> --
> Best Regards
>
> Jeff Zhang



--
=
Best Regards

Jeff Zhang



--
=
Best Regards

Jeff Zhang
--089e01176b2949b767051073c211--