spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Siegmann <dsiegm...@securityscorecard.io>
Subject Re: Controlling number of spark partitions in dataframes
Date Thu, 26 Oct 2017 18:28:45 GMT
When working with datasets, Spark uses spark.sql.shuffle.partitions. It
defaults to 200. Between that and the default parallelism you can control
the number of partitions (except for the initial read).

More info here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options

I have no idea why it defaults to a fixed 200 (while default parallelism
defaults to a number scaled to your number of cores), or why there are two
separate configuration properties.


--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001


On Thu, Oct 26, 2017 at 9:53 AM, Deepak Sharma <deepakmca05@gmail.com>
wrote:

> I guess the issue is spark.default.parallelism is ignored when you are
> working with Data frames.It is supposed to work with only raw RDDs.
>
> Thanks
> Deepak
>
> On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
> noorul@noorul.com> wrote:
>
>> Hi all,
>>
>> I have the following spark configuration
>>
>> spark.app.name=Test
>> spark.cassandra.connection.host=127.0.0.1
>> spark.cassandra.connection.keep_alive_ms=5000
>> spark.cassandra.connection.port=10000
>> spark.cassandra.connection.timeout_ms=30000
>> spark.cleaner.ttl=3600
>> spark.default.parallelism=4
>> spark.master=local[2]
>> spark.ui.enabled=false
>> spark.ui.showConsoleProgress=false
>>
>> Because I am setting spark.default.parallelism to 4, I was expecting
>> only 4 spark partitions. But it looks like it is not the case
>>
>> When I do the following
>>
>>     df.foreachPartition { partition =>
>>       val groupedPartition = partition.toList.grouped(3).toList
>>       println("Grouped partition " + groupedPartition)
>>     }
>>
>> There are too many print statements with empty list at the top. Only
>> the relevant partitions are at the bottom. Is there a way to control
>> number of partitions?
>>
>> Regards,
>> Noorul
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>

Mime
View raw message