spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Why so many parquet file part when I store data in Alluxio or File?
Date Fri, 01 Jul 2016 04:38:44 GMT
Looking at the Alluxio source, it seems only "fs.hdfs.impl.disable.cache" is
in use.
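
If you need to set that property from the Spark side, one way (an untested sketch, not
verified against this setup) is the spark.hadoop.* prefix, or the Hadoop configuration
on the SparkContext:

// via spark-submit: --conf spark.hadoop.fs.hdfs.impl.disable.cache=true
// or programmatically (sc is the SparkContext):
sc.hadoopConfiguration.set("fs.hdfs.impl.disable.cache", "true")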

FYI

On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma <deepakmca05@gmail.com>
wrote:

> Ok.
> I came across this issue.
> Not sure if you already assessed this:
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921
>
> The workaround mentioned there may work for you.
>
> Thanks
> Deepak
> On 1 Jul 2016 9:34 am, "Chanh Le" <giaosudau@gmail.com> wrote:
>
>> Hi Deepak,
>> Thanks for replying. This is how I write into Alluxio:
>> df.write.mode(SaveMode.Append).partitionBy("network_id", "time").parquet(
>> "alluxio://master1:19999/FACT_ADMIN_HOURLY")
>>
>>
>> I partition by two columns and store the data. I just want the write to automatically
>> produce part files sized to match the 512MB block size I already set in Alluxio.
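>> For example (just a sketch, I have not tried it yet): would repartitioning by the same
>> columns before the write give fewer, larger part files per partition directory?
>>
>> // assumes Spark 1.6+, import org.apache.spark.sql.functions.col
>> df.repartition(col("network_id"), col("time"))
>>   .write.mode(SaveMode.Append)
>>   .partitionBy("network_id", "time")
>>   .parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")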
>>
>>
>> On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmca05@gmail.com> wrote:
>>
>> Before writing, coalesce your RDD to 1 partition.
>> That will create only 1 output file.
>> Multiple part files happen because each of your executors writes its
>> partitions to separate part files.
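>>
>> For example, something like this (a sketch, assuming the DataFrame is named df):
>>
>> df.coalesce(1)
>>   .write.mode(SaveMode.Append)
>>   .partitionBy("network_id", "time")
>>   .parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")
>>
>> Note that coalescing to 1 funnels the whole write through a single task, so it trades
>> parallelism for fewer, larger files.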
>>
>> Thanks
>> Deepak
>> On 1 Jul 2016 8:01 am, "Chanh Le" <giaosudau@gmail.com> wrote:
>>
>> Hi everyone,
>> I am using Alluxio for storage, but I am a bit confused: I set the Alluxio block size
>> to 512MB, yet my part files are only a few KB each and there are too many of them.
>> Is that normal? I want reads to be fast; do that many parts affect the read
>> operation?
>> How do I set the size of the part files?
>>
>> Thanks.
>> Chanh
>>
>> [Attachment: Screen_Shot_2016-07-01_at_9_24_55_AM.png]
