spark-user mailing list archives

From Gene Pang <gene.p...@gmail.com>
Subject Re: Why so many parquet file part when I store data in Alluxio or File?
Date Fri, 08 Jul 2016 13:33:27 GMT
Hi Chanh,

You should be able to set the Alluxio block size with:

sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb")

I think you end up with many parquet files because each Spark executor writes
out its own partition of the data as a separate file.
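Gene's two points above (raising the Alluxio block size and the one-file-per-executor explanation) can be combined into a short sketch. This is a hedged illustration, not code from the thread: the SparkSession setup, the source path, and the partition count of 16 are assumptions; only the block-size key, the output path, and the `partitionBy` columns appear in the thread.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteFewerParts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-fewer-parts").getOrCreate()

    // Block-size key from Gene's mail; "256mb" is the value he suggested.
    spark.sparkContext.hadoopConfiguration
      .set("alluxio.user.block.size.bytes.default", "256mb")

    // Hypothetical source path -- a stand-in for however `df` is built.
    val df = spark.read.parquet("alluxio://master1:19999/SOURCE")

    // Fewer partitions => fewer part files per partitionBy directory.
    // 16 is an arbitrary illustrative count, not a recommendation.
    df.repartition(16)
      .write.mode(SaveMode.Append)
      .partitionBy("network_id", "time")
      .parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")
  }
}
```

Note that `repartition` triggers a full shuffle; `coalesce` avoids one but can only reduce the partition count.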

Hope that helps,
Gene

On Sun, Jul 3, 2016 at 8:02 PM, Chanh Le <giaosudau@gmail.com> wrote:

> Hi Gene,
> Could you give some suggestions on that?
>
>
>
> On Jul 1, 2016, at 5:31 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> The comment from zhangxiongfei was from a year ago.
>
> Maybe something changed since then?
>
> On Fri, Jul 1, 2016 at 12:07 AM, Chanh Le <giaosudau@gmail.com> wrote:
>
>> Hi Ted,
>> I set sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache",
>> true)
>>
>> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
>>
>> but it doesn't seem to work.
>>
>> <Screen_Shot_2016-07-01_at_2_06_27_PM.png>
>>
>>
>> On Jul 1, 2016, at 11:38 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>> Looking under Alluxio source, it seems only "fs.hdfs.impl.disable.cache"
>> is in use.
>>
>> FYI
>>
>> On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma <deepakmca05@gmail.com>
>> wrote:
>>
>>> Ok.
>>> I came across this issue.
>>> Not sure if you already assessed this:
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921
>>>
>>> The workaround mentioned may work for you.
>>>
>>> Thanks
>>> Deepak
>>> On 1 Jul 2016 9:34 am, "Chanh Le" <giaosudau@gmail.com> wrote:
>>>
>>>> Hi Deepak,
>>>> Thanks for replying. The way I write into Alluxio is
>>>> df.write.mode(SaveMode.Append).partitionBy("network_id", "time"
>>>> ).parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")
>>>>
>>>>
>>>> I partition by two columns and store the result. I just want the write
>>>> to automatically produce files sized to match what I already set in
>>>> Alluxio: 512MB per block.
>>>>
>>>>
>>>> On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmca05@gmail.com>
>>>> wrote:
>>>>
>>>> Before writing, coalesce your RDD to 1 partition.
>>>> That will create only one output file.
>>>> Multiple part files happen because all your executors write their
>>>> partitions to separate part files.
>>>>
>>>> Thanks
>>>> Deepak
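Deepak's coalesce suggestion can be sketched as a minimal Scala helper. This is a hypothetical illustration: `writeSinglePart` and its `DataFrame` argument are names introduced here, and the output path and columns are borrowed from Chanh's later mail in this thread.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch of the coalesce workaround: funnel the write through one task so
// each partitionBy directory gets a single part file. This can be slow for
// large data; repartition(n) with a small n is the usual compromise.
def writeSinglePart(df: DataFrame): Unit =
  df.coalesce(1)
    .write.mode(SaveMode.Append)
    .partitionBy("network_id", "time")
    .parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")
```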
>>>> On 1 Jul 2016 8:01 am, "Chanh Le" <giaosudau@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>> I am using Alluxio for storage, but I am a little confused: I set the
>>>> Alluxio block size to 512MB, yet my file parts are only a few KB each
>>>> and there are too many of them.
>>>> Is that normal? I want reads to be fast; do that many parts affect the
>>>> read operation?
>>>> How do I set the size of the file parts?
>>>>
>>>> Thanks.
>>>> Chanh
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> <Screen_Shot_2016-07-01_at_9_24_55_AM.png>
>>>>
>>>>
>>>>
>>
>>
>
>
