spark-user mailing list archives

From Arnaud LARROQUE <alarro...@gmail.com>
Subject Re: Persist Dataframe to HDFS considering HDFS Block Size.
Date Mon, 21 Jan 2019 08:37:01 GMT
Hi Shivam,

In the end, a file takes only the space it needs, regardless of the block
size. So if your file is just a few KB, it will take only those few KB.
But I've noticed that while a file is being written, a block is somehow
allocated and the Namenode considers the whole block size to be used. I ran
into this problem when writing a dataset with far too many partitions!
But as soon as the file was written, the Namenode seemed to know its true
size and dropped the "default block size".
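
To put rough numbers on this, here is a minimal sketch. The file count, file
size, 128 MB block size, and replication factor of 3 are hypothetical values
chosen for illustration (the last two are the usual HDFS defaults), and the
helper names are made up for this example.

```python
KB = 1024
MB = 1024 * KB


def disk_bytes_used(file_sizes, replication=3):
    """Bytes physically stored on DataNodes: actual file length times replication."""
    return sum(file_sizes) * replication


def blocks_tracked(file_sizes, block_size=128 * MB):
    """Blocks the Namenode has to track: ceil(size / block_size) per file."""
    return sum(-(-size // block_size) for size in file_sizes)


# Hypothetical job output: 400 files of 10 KB each.
files = [10 * KB] * 400
print(disk_bytes_used(files))  # 12288000 bytes (about 11.7 MB), not 400 * 128 MB * 3
print(blocks_tracked(files))   # 400 block entries, one per small file
```

So small files do not waste disk space, but each one still costs at least one
block entry in the Namenode's metadata.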

Arnaud

On Mon, Jan 21, 2019 at 9:01 AM Shivam Sharma <28shivamsharma@gmail.com>
wrote:

> Don't we have any property for it?
>
> One more quick question: if the files created by Spark are smaller than the
> HDFS block size, will the rest of the block space become unavailable and
> remain unutilized, or will it be shared with other files?
>
>> On Sun, Jan 20, 2019 at 12:47 AM Hichame El Khalfi <hichame@elkhalfi.com>
>> wrote:
>>
>>> You can do this in 2 passes (not one):
>>> A) Save your dataset into HDFS with what you have.
>>> B) Calculate the number of partitions, n = (size of your dataset) / (HDFS
>>> block size).
>>> Then run a simple Spark job to read the data and repartition it based on 'n'.
>>>
>>> Hichame
>>>
>>> *From:* felixcheung_m@hotmail.com
>>> *Sent:* January 19, 2019 2:06 PM
>>> *To:* 28shivamsharma@gmail.com; user@spark.apache.org
>>> *Subject:* Re: Persist Dataframe to HDFS considering HDFS Block Size.
>>>
>>> You can call coalesce to combine partitions.
>>>
>>>
>>> ------------------------------
>>> *From:* Shivam Sharma <28shivamsharma@gmail.com>
>>> *Sent:* Saturday, January 19, 2019 7:43 AM
>>> *To:* user@spark.apache.org
>>> *Subject:* Persist Dataframe to HDFS considering HDFS Block Size.
>>>
>>> Hi All,
>>>
>>> I want to persist a dataframe to HDFS. Basically, I am inserting data
>>> into a Hive table using Spark. Currently, at the time of writing to the
>>> Hive table, I have set the total shuffle partitions to 400, so 400 files
>>> are created, which does not take the HDFS block size into account. How
>>> can I tell Spark to persist according to the HDFS block size?
>>>
>>> We have something like this in Hive, which solves this problem:
>>>
>>> set hive.merge.sparkfiles=true;
>>> set hive.merge.smallfiles.avgsize=2048000000;
>>> set hive.merge.size.per.task=4096000000;
>>>
>>> Thanks
>>>
>>> --
>>> Shivam Sharma
>>> Indian Institute Of Information Technology, Design and Manufacturing
>>> Jabalpur
>>> Mobile No- (+91) 8882114744
>>> Email:- 28shivamsharma@gmail.com
>>> LinkedIn: https://www.linkedin.com/in/28shivamsharma
>>>
>>
>>
>>
>
>
>
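
The two-pass approach quoted above (write first, then compute n = dataset
size / HDFS block size and rewrite) could be sketched roughly as follows in
PySpark. This is only an illustration under assumptions: the function names
and paths are hypothetical, the data is assumed to be Parquet, `spark` is an
existing SparkSession, and the Hadoop FileSystem calls are reached through
PySpark's internal `_jvm` / `_jsc` attributes, which are not a public API.

```python
def target_partitions(dataset_bytes: int, block_bytes: int) -> int:
    """Step B: n = (size of dataset) / (HDFS block size), rounded up, minimum 1."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    # Integer ceiling division avoids float rounding on very large sizes.
    return max(1, -(-dataset_bytes // block_bytes))


def compact_to_block_size(spark, src_path: str, dst_path: str) -> int:
    """Second pass: re-read what step A wrote and rewrite it as about n files.

    src_path and dst_path are hypothetical locations; Parquet is an assumption.
    """
    # Internal PySpark handles to the JVM and the Hadoop configuration.
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path(src_path)
    fs = path.getFileSystem(conf)

    # Size on HDFS of everything written in the first pass, and the block size.
    dataset_bytes = fs.getContentSummary(path).getLength()
    block_bytes = fs.getDefaultBlockSize(path)

    n = target_partitions(dataset_bytes, block_bytes)
    df = spark.read.parquet(src_path)
    # repartition(n) shuffles into n roughly even partitions; coalesce(n)
    # skips the shuffle when the partition count only needs to go down.
    df.repartition(n).write.mode("overwrite").parquet(dst_path)
    return n
```

For example, `target_partitions(1024**3, 128 * 1024**2)` gives 8 for a 1 GB
dataset with 128 MB blocks. Whether `repartition` or `coalesce` is the better
fit depends on how skewed the existing partitions are.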
