spark-dev mailing list archives

From zhangxiongfei <zhangxiongfei0...@163.com>
Subject Spark SQL 1.3.1 "saveAsParquetFile" will output tachyon file with different block size
Date Mon, 13 Apr 2015 11:13:32 GMT
Hi experts,
I ran the code below in the Spark shell to access Parquet files stored in Tachyon.
1. First, I created a DataFrame by loading a set of Parquet files from Tachyon:
   val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
2. Second, I set "fs.local.block.size" to 256 MB to make sure the block size of the output files in Tachyon would be 256 MB:
   sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
3. Third, I saved the above DataFrame back to Parquet files in Tachyon:
   ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
After the code ran successfully, the output Parquet files were stored in Tachyon, but the files have different block sizes. Below is the information for the files in the path "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
    File Name               Size         Block Size    In-Memory    Pin    Creation Time
    _SUCCESS                0.00 B       256.00 MB     100%         NO     04-13-2015 17:48:23:519
    _common_metadata        1088.00 B    256.00 MB     100%         NO     04-13-2015 17:48:23:741
    _metadata               22.71 KB     256.00 MB     100%         NO     04-13-2015 17:48:23:646
    part-r-00001.parquet    177.19 MB    32.00 MB      100%         NO     04-13-2015 17:46:44:626
    part-r-00002.parquet    177.21 MB    32.00 MB      100%         NO     04-13-2015 17:46:44:636
    part-r-00003.parquet    177.02 MB    32.00 MB      100%         NO     04-13-2015 17:46:45:439
    part-r-00004.parquet    177.21 MB    32.00 MB      100%         NO     04-13-2015 17:46:44:845
    part-r-00005.parquet    177.40 MB    32.00 MB      100%         NO     04-13-2015 17:46:44:638
    part-r-00006.parquet    177.33 MB    32.00 MB      100%         NO     04-13-2015 17:46:44:648

It seems that the saveAsParquetFile API does not distribute/broadcast the Hadoop configuration to the executors the way other APIs such as saveAsTextFile do; the configuration "fs.local.block.size" only takes effect on the driver, so the part files written by the executors keep the default 32 MB block size.
If I set that configuration before loading the Parquet files, the problem goes away.
Could anyone help me verify this problem?
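For reference, the workaround that works for me is to reorder the steps so the block size is set before the first read. This is a minimal spark-shell sketch using the same paths as above; my assumption is that the Hadoop configuration is captured and shipped to the executors when the first job is set up, so a setting applied afterwards never reaches them:

```scala
// Workaround sketch (assumption: the Hadoop configuration is snapshotted for the
// executors at job setup time, so it must be set before the first Parquet read).
sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456L) // 256 MB

// Load and re-save; with the setting in place up front, the output part files
// are written with 256 MB blocks instead of the default 32 MB.
val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
```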

Thanks
Zhang Xiongfei