spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Why so many parquet file part when I store data in Alluxio or File?
Date Fri, 01 Jul 2016 10:31:50 GMT
The comment from zhangxiongfei was from a year ago.

Maybe something has changed since then?

On Fri, Jul 1, 2016 at 12:07 AM, Chanh Le <giaosudau@gmail.com> wrote:

> Hi Ted,
> I set
>
> sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
>
> but it does not seem to be working.
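The two keys above control filesystem caching and the local filesystem block size; the size of each Parquet file is governed by the Parquet writer's own row-group setting, not by the filesystem. A minimal sketch, assuming the parquet-mr `parquet.block.size` key (in bytes) is honored by the Parquet writer in use:

```scala
// Sketch, not from the thread: parquet-mr reads its row-group ("block") size
// from the Hadoop configuration key "parquet.block.size", in bytes.
// 512 MB here, to match the Alluxio block size discussed in this thread.
sc.hadoopConfiguration.setInt("parquet.block.size", 512 * 1024 * 1024)
```

Note this bounds the size of a row group within a file; it does not by itself merge many small part files into large ones.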
>
>
>
> On Jul 1, 2016, at 11:38 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> Looking under Alluxio source, it seems only "fs.hdfs.impl.disable.cache"
> is in use.
>
> FYI
>
> On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma <deepakmca05@gmail.com>
> wrote:
>
>> Ok.
>> I came across this issue.
>> Not sure if you already assessed this:
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921
>>
>> The workaround mentioned there may work for you.
>>
>> Thanks
>> Deepak
>> On 1 Jul 2016 9:34 am, "Chanh Le" <giaosudau@gmail.com> wrote:
>>
>>> Hi Deepak,
>>> Thanks for replying. The way I write into Alluxio is
>>> df.write.mode(SaveMode.Append).partitionBy("network_id", "time").parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")
>>>
>>>
>>> I partition by two columns and store. I just want the write to
>>> automatically produce file sizes matching what I already set in
>>> Alluxio: 512 MB per block.
>>>
>>>
>>> On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmca05@gmail.com>
>>> wrote:
>>>
>>> Before writing, coalesce your RDD to 1 partition.
>>> It will create only one output file.
>>> Multiple part files happen because each of your executors writes its
>>> partitions to separate part files.
>>>
>>> Thanks
>>> Deepak
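The suggestion above can be sketched as follows, against the thread's DataFrame `df` (a sketch, not a tested fix; coalescing to 1 funnels the entire write through a single task, so it only suits small outputs):

```scala
import org.apache.spark.sql.SaveMode

// coalesce(1) merges all partitions into one, so exactly one part file is
// written per output directory. Caution: a single task then writes everything,
// which removes write parallelism.
df.coalesce(1)
  .write.mode(SaveMode.Append)
  .partitionBy("network_id", "time")
  .parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")
```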
>>> On 1 Jul 2016 8:01 am, "Chanh Le" <giaosudau@gmail.com> wrote:
>>>
>>> Hi everyone,
>>> I am using Alluxio for storage, but I am a little confused: I set the
>>> Alluxio block size to 512 MB, yet each file part is only a few KB and
>>> there are too many parts.
>>> Is that normal? I want reads to be fast; do that many parts affect the
>>> read operation?
>>> How do I set the size of the file parts?
>>>
>>> Thanks.
>>> Chanh
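There is no single setting that directly fixes the output part size; the usual approach is to derive a partition count from the data size and the target block size, then repartition before writing. A hypothetical helper (the function name and the size estimate are illustrative, not from the thread):

```scala
// Hypothetical helper: choose a partition count so each output part file is
// roughly one Alluxio block. `totalBytes` must come from your own estimate of
// the data's serialized size; Spark does not provide it for free.
def targetPartitions(totalBytes: Long, blockBytes: Long): Int =
  math.max(1, math.ceil(totalBytes.toDouble / blockBytes).toInt)

// 10 GB of data with 512 MB blocks -> 20 partitions, used as e.g.
// df.repartition(targetPartitions(estimatedBytes, 512L * 1024 * 1024))
```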
>>>
>>>
>>>
>>>
>>>
>>> <Screen_Shot_2016-07-01_at_9_24_55_AM.png>
>>>
>>>
>>>
>
>
