spark-user mailing list archives

From Archit Thakur <archit279tha...@gmail.com>
Subject Re: Parquet number of partitions
Date Thu, 07 May 2015 07:31:07 GMT
Hi.

The number of partitions is determined by the RDD that Spark uses in the plan it creates. It uses NewHadoopRDD, which gets its partitions from getSplits of the input format it is using, in this case FilteringParquetRowInputFormat, a subclass of ParquetInputFormat. To change the number of partitions, write a new input format and make the NewHadoopRDD in your plan use it; or, if you are willing to shuffle, you can use the repartition API without any code change.
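A minimal sketch of the repartition approach (Spark 1.x API, as in this thread; the path and target partition count below are purely illustrative). Note that repartition performs a full shuffle, while coalesce to a smaller count merges existing partitions without shuffling:

```scala
// Assumes a running SQLContext; the path is a placeholder, not a real dataset.
val df = sqlContext.parquetFile("/path/to/parquet")

// One partition per input split produced by the Parquet input format.
println(df.rdd.partitions.length)

// Full shuffle to exactly 200 partitions (available on DataFrame since 1.3).
val reshuffled = df.repartition(200)

// Narrow dependency: merges existing partitions, no shuffle, when 200 is
// smaller than the current partition count.
val narrowed = df.rdd.coalesce(200)
```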

Thanks & Regards.

On Tue, May 5, 2015 at 7:56 PM, Masf <masfworld@gmail.com> wrote:

> Hi Eric.
>
> Q1:
> When I read Parquet files, I've observed that Spark generates as many
> partitions as there are Parquet files in the path.
>
> Q2:
> To reduce the number of partitions you can use rdd.repartition(x), where x
> is the number of partitions. Depending on your case, repartition can be a
> heavy operation.
>
>
> Regards.
> Miguel.
>
> On Tue, May 5, 2015 at 3:56 PM, Eric Eijkelenboom <
> eric.eijkelenboom@gmail.com> wrote:
>
>> Hello guys
>>
>> Q1: How does Spark determine the number of partitions when reading a
>> Parquet file?
>>
>> val df = sqlContext.parquetFile(path)
>>
>> Is it some way related to the number of Parquet row groups in my input?
>>
>> Q2: How can I reduce this number of partitions? Doing this:
>>
>> df.rdd.coalesce(200).count
>>
>> from the spark-shell causes job execution to hang…
>>
>> Any ideas? Thank you in advance.
>>
>> Eric
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>
>
> --
>
>
> Saludos.
> Miguel Ángel
>
