spark-user mailing list archives

From Chanh Le <giaosu...@gmail.com>
Subject Re: Why so many parquet file part when I store data in Alluxio or File?
Date Fri, 01 Jul 2016 04:04:38 GMT
Hi Deepak,
Thanks for replying. The way I write into Alluxio is:
df.write.mode(SaveMode.Append).partitionBy("network_id", "time").parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")


I partition by two columns and store. I just want the write to automatically produce part files sized according to what I already set in Alluxio: 512 MB per block.
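For example, here is a rough sketch of what I could try (assuming Spark 1.6+ and the same df and Alluxio path as above): repartition by the same columns before the write, so each partition directory is written by a single task.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Repartition by the partitionBy columns first, so all rows for a given
// (network_id, time) end up in one task and each output directory gets
// one larger part file instead of many tiny ones.
df.repartition(col("network_id"), col("time"))
  .write
  .mode(SaveMode.Append)
  .partitionBy("network_id", "time")
  .parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")

Coalescing to 1 partition, as suggested below, would likewise give a single part file per directory, but then one task writes all of the data.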


> On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmca05@gmail.com> wrote:
> 
> Before writing, coalesce your RDD to 1 partition.
> It will create only 1 output file.
> Multiple part files happen because each of your executors writes its partitions to separate part files.
> 
> Thanks
> Deepak
> 
> On 1 Jul 2016 8:01 am, "Chanh Le" <giaosudau@gmail.com> wrote:
> Hi everyone,
> I am using Alluxio for storage, but I am a little bit confused: I set the Alluxio block size to 512 MB, yet each part file is only a few KB and there are far too many parts.
> Is that normal? I want reads to be fast; do that many parts affect the read operation?
> How do I set the size of the part files?
> 
> Thanks.
> Chanh
> 
> <Screen_Shot_2016-07-01_at_9_24_55_AM.png>

