spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: configure number of cached partition in memory on SparkSQL
Date Mon, 16 Mar 2015 12:41:03 GMT
Hi Judy,

In the case of |HadoopRDD| and |NewHadoopRDD|, the number of partitions 
is determined by the |InputFormat| in use. Also, 
|spark.sql.inMemoryColumnarStorage.batchSize| is unrelated to the number 
of partitions; it controls the size of the in-memory columnar batches 
within a single partition.
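
As a rough illustration of that difference, here is a toy Python model. It uses plain lists rather than Spark, and the function names and partition sizes are made up for the example. Changing the batch size only changes how rows are grouped inside each partition; the number of partitions stays the same:

```python
# Toy model only: plain Python lists stand in for cached partitions.
# Nothing here is Spark API; all names are hypothetical.

def to_batches(partition, batch_size):
    """Split one partition's rows into consecutive batches of at most batch_size rows."""
    return [partition[i:i + batch_size] for i in range(0, len(partition), batch_size)]


def cache_partitions(partitions, batch_size):
    """Batch every partition independently; the outer list (the partitions) is untouched."""
    return [to_batches(p, batch_size) for p in partitions]


# Three partitions, as an InputFormat might have produced them (sizes are invented).
partitions = [list(range(0, 5)), list(range(5, 12)), list(range(12, 14))]

small = cache_partitions(partitions, batch_size=2)
large = cache_partitions(partitions, batch_size=1000)

# First number: partition count. List: batches per partition.
print(len(small), [len(p) for p in small])  # 3 [3, 4, 1]
print(len(large), [len(p) for p in large])  # 3 [1, 1, 1]
```

In both runs there are still three partitions; only the batch layout within each one differs.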

Also, what do you mean by “change the number of partitions /after/ 
caching the table”? Are you trying to re-cache an already cached table 
with a different partition number?

Currently, I don’t see a super intuitive pure SQL way to set the 
partition number in this case. Maybe you can try this (assuming table 
|t| has a column |s| which is expected to be sorted):

|SET spark.sql.shuffle.partitions=10;
CACHE TABLE cached_t AS SELECT * FROM t ORDER BY s;
|

In this way, we introduce a shuffle by sorting on a column, and the 
shuffle output is written with |spark.sql.shuffle.partitions| partitions, 
so the partition number can be raised or lowered at the same time. This 
might not be the best way out there, but it’s the first one that jumped 
into my head.
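
To show why the sort changes the partition count, here is another toy Python sketch. It is not Spark's implementation: the function name is hypothetical, and it simply cuts the sorted rows into equal slices, whereas Spark chooses its range boundaries differently, so real partition sizes will vary. The point is only that the output partition count comes from the shuffle setting, not from how the input was split:

```python
# Toy model only: not Spark's implementation; all names are hypothetical.
# A global sort has to redistribute every row by key range, so the number of
# output partitions is whatever the shuffle is configured to produce.

def sort_with_shuffle(partitions, num_shuffle_partitions):
    """Globally sort all rows, then cut them into contiguous key ranges."""
    rows = sorted(row for p in partitions for row in p)
    n = len(rows)
    # Slice boundaries for roughly equal ranges (a simplification).
    bounds = [n * i // num_shuffle_partitions for i in range(num_shuffle_partitions + 1)]
    return [rows[bounds[i]:bounds[i + 1]] for i in range(num_shuffle_partitions)]


# Three unsorted input partitions (contents are invented).
partitions = [[9, 3, 7], [1, 8], [4, 6, 2, 5, 0]]

ten = sort_with_shuffle(partitions, num_shuffle_partitions=10)
two = sort_with_shuffle(partitions, num_shuffle_partitions=2)

print(len(partitions), len(ten), len(two))  # 3 10 2
print(two)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

The same three input partitions come out as ten or two partitions depending only on the target passed in, which is the role |spark.sql.shuffle.partitions| plays in the SQL above.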

Cheng

On 3/5/15 3:51 AM, Judy Nash wrote:

> Hi,
>
> I am tuning a hive dataset on Spark SQL deployed via thrift server.
>
> How can I change the number of partitions after caching the table on 
> thrift server?
>
> I have tried the following, but I am still getting the same number of 
> partitions after caching:
>
> spark.default.parallelism
>
> spark.sql.inMemoryColumnarStorage.batchSize
>
> Thanks,
>
> Judy
>
