spark-user mailing list archives

From Jung <jb_j...@naver.com>
Subject Re: dfs.blocksize is not applicable to some cases
Date Tue, 01 Dec 2015 08:05:09 GMT
I have additional information. The second case works normally if I set dfs.blocksize in hdfs-site.xml
to 512MB and restart all NameNodes and DataNodes.

  231.0 M  /user/hive/warehouse/partition_test3/part-r-00000-d2e4ee9e-0a5f-4ee1-b511-88848a7a92d4.gz.parquet
  /user/hive/warehouse/partition_test3/part-r-00000-d2e4ee9e-0a5f-4ee1-b511-88848a7a92d4.gz.parquet 242202275 bytes, 1 block(s):  OK
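
For reference, the hdfs-site.xml change I mean amounts to something like this (value in bytes, i.e. 512MB), followed by the NameNode/DataNode restart:

  <property>
    <name>dfs.blocksize</name>
    <value>536870912</value> <!-- 512 * 1024 * 1024 bytes -->
  </property>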

It seems dfs.blocksize from sc.hadoopConfiguration gets ignored somewhere when the parent
RDD is a managed table or of Parquet type.
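
A quick sanity check (just a sketch in the same spark-shell session) shows the values are still present in the job-level Configuration right before the write, so they must be getting dropped somewhere further down in the Parquet output path:

  // Sketch: print what the job-level Hadoop Configuration actually carries
  // at write time. Both were set via setInt in the scripts quoted below.
  val hc = sc.hadoopConfiguration
  println("dfs.blocksize      = " + hc.get("dfs.blocksize"))      // "536870912"
  println("parquet.block.size = " + hc.get("parquet.block.size")) // "536870912"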

-----Original Message-----
From: "Jung"<jb_jung@naver.com> 
To: "Ted Yu"<yuzhihong@gmail.com>; <user@spark.apache.org>; 
Cc: 
Sent: 2015-12-01 (Tue) 10:22:25
Subject: Re: dfs.blocksize is not applicable to some cases
 
Yes, I can reproduce it in Spark 1.5.2.
These are the results.

1. first case (1 block)
  221.1 M  /user/hive/warehouse/partition_test/part-r-00000-b0e5ecd3-75a3-4c92-94ec-59353d08067a.gz.parquet
  221.1 M  /user/hive/warehouse/partition_test/part-r-00001-b0e5ecd3-75a3-4c92-94ec-59353d08067a.gz.parquet
  221.1 M  /user/hive/warehouse/partition_test/part-r-00002-b0e5ecd3-75a3-4c92-94ec-59353d08067a.gz.parquet

  /user/hive/warehouse/partition_test/part-r-00000-b0e5ecd3-75a3-4c92-94ec-59353d08067a.gz.parquet 231863863 bytes, 1 block(s):  OK

2. second case (2 blocks)
   231.0 M  /user/hive/warehouse/partition_test2/part-r-00000-b7486a52-cfb9-4db0-8d94-377c039026ef.gz.parquet
  
  /user/hive/warehouse/partition_test2/part-r-00000-b7486a52-cfb9-4db0-8d94-377c039026ef.gz.parquet 242201812 bytes, 2 block(s):  OK

Regarding PARQUET-166, I think it only discusses row group performance. Should I set dfs.blocksize
to a little more than parquet.block.size?
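
For concreteness, the arrangement I am asking about would be roughly this (the 64MB margin is only a guess):

  // Sketch: keep the Parquet row group size strictly below the HDFS block
  // size so one row group never straddles an HDFS block boundary.
  val PARQUET_BLOCK_SIZE = 512L * 1024 * 1024                      // 512MB row groups
  val DFS_BLOCK_SIZE     = PARQUET_BLOCK_SIZE + 64L * 1024 * 1024  // 576MB HDFS blocks
  sc.hadoopConfiguration.setLong("parquet.block.size", PARQUET_BLOCK_SIZE)
  sc.hadoopConfiguration.setLong("dfs.blocksize", DFS_BLOCK_SIZE)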

Thanks

-----Original Message-----
From: "Ted Yu"<yuzhihong@gmail.com> 
To: "Jung"<jb_jung@naver.com>; 
Cc: "user"<user@spark.apache.org>; 
Sent: 2015-12-01 (Tue) 03:09:58
Subject: Re: dfs.blocksize is not applicable to some cases
 
I am not an expert in Parquet. Looking at PARQUET-166, it seems that parquet.block.size should
be lower than dfs.blocksize. Have you tried Spark 1.5.2 to see if the problem persists? Cheers
On Mon, Nov 30, 2015 at 1:55 AM, Jung <jb_jung@naver.com> wrote:
Hello,
I use Spark 1.4.1 and Hadoop 2.2.0.
It may be a stupid question, but I cannot understand why the Hadoop option "dfs.blocksize"
sometimes doesn't affect the number of blocks.
When I run the script below,

  val BLOCK_SIZE = 1024 * 1024 * 512 // set to 512MB, hadoop default is 128MB
  sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE)
  sc.hadoopConfiguration.setInt("dfs.blocksize",BLOCK_SIZE)
  sc.parallelize(1 to 500000000, 24).repartition(3).toDF.saveAsTable("partition_test")

it creates 3 files like this.

  221.1 M  /user/hive/warehouse/partition_test/part-r-00001.gz.parquet
  221.1 M  /user/hive/warehouse/partition_test/part-r-00002.gz.parquet
  221.1 M  /user/hive/warehouse/partition_test/part-r-00003.gz.parquet

To check how many blocks in a file, I enter the command "hdfs fsck /user/hive/warehouse/partition_test/part-r-00001.gz.parquet
-files -blocks".

  Total blocks (validated):      1 (avg. block size 231864402 B)

This is the expected case, because the maximum block size changed from 128MB to 512MB.
In the real world, I have a bunch of files.

  14.4 M  /user/hive/warehouse/data_1g/part-r-00001.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00002.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00003.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00004.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00005.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00006.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00007.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00008.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00009.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00010.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00011.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00012.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00013.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00014.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00015.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00016.gz.parquet

Each file consists of 1 block (avg. block size 15141395 B), and I run almost the same code as
in the first case.

  val BLOCK_SIZE = 1024 * 1024 * 512 // set to 512MB, hadoop default is 128MB
  sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE)
  sc.hadoopConfiguration.setInt("dfs.blocksize",BLOCK_SIZE)
  sqlContext.table("data_1g").repartition(1).saveAsTable("partition_test2")

It creates one file.

 231.0 M  /user/hive/warehouse/partition_test2/part-r-00001.gz.parquet

But it consists of 2 blocks, so it seems dfs.blocksize is not being applied; the first block is exactly 134217728 bytes, i.e. the 128MB default.

  /user/hive/warehouse/partition_test2/part-r-00001.gz.parquet 242202143 bytes, 2 block(s):  OK
  0. BP-2098986396-192.168.100.1-1389779750403:blk_1080124727_6385839 len=134217728 repl=2
  1. BP-2098986396-192.168.100.1-1389779750403:blk_1080124728_6385840 len=107984415 repl=2

Because of this, Spark reads it as 2 partitions even though I repartitioned the data into 1 partition.
If the file size after repartitioning is a little more than 128MB and I save it again, it writes
2 files of roughly 128MB and 1MB.
This is very important for me because I use the repartition method many times. Please help me
figure it out.
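
A quick way to see the read-side effect (sketch, using the same table as above): the partition count Spark derives when reading the table back follows the HDFS block count, not the single file I wrote:

  // Sketch: Spark creates one input partition per HDFS block here,
  // even though the data was repartition(1)'d before writing.
  val df = sqlContext.table("partition_test2")
  println(df.rdd.partitions.length)  // 2, one per block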

  Jung 