spark-dev mailing list archives

From zhangxiongfei <>
Subject Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?
Date Fri, 17 Apr 2015 09:51:20 GMT
I ran some tests on Parquet files with the Spark SQL DataFrame API.
I generated 36 gzip-compressed Parquet files with Spark SQL and stored them on Tachyon; each file is
about 222 MB. Then I read them back with the code below:
val tfs = sqlContext.parquetFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick");
Next, I saved this DataFrame onto HDFS. That likewise produced 36 Parquet files, but each of those
files is about 265 MB.
My question is: why do the files on HDFS have a different size from those on Tachyon, even though
they come from the same original data?
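For reference, a minimal sketch of the round trip being described, using the Spark 1.3-era DataFrame API (`parquetFile` / `saveAsParquetFile`). The HDFS output path and the explicit codec setting are assumptions, not from the original message; pinning `spark.sql.parquet.compression.codec` on the writing side is one variable worth checking, since two writes of the same data with different codecs (e.g. gzip vs. snappy) would produce different file sizes.

```scala
// Sketch only: reproduces the read/write round trip from the message.
// The HDFS path and codec setting below are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetRoundTrip"))
val sqlContext = new SQLContext(sc)

// Fix the compression codec explicitly so the HDFS write uses the same
// compression as the files on Tachyon; if the codec differs between the
// two writes, output sizes will differ even for identical data.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

// Read the gzip-compressed Parquet files from Tachyon.
val tfs = sqlContext.parquetFile(
  "tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick")

// Write the same DataFrame back out as Parquet on HDFS (assumed path).
tfs.saveAsParquetFile("hdfs:///apps/adClick_copy")
```

Even with the codec pinned, row-group layout can still change on rewrite (Spark re-partitions the data and buffers rows per output file), so some size difference is not unusual; comparing the Parquet metadata of one file from each side would show whether the codec or the row-group structure accounts for it.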

Zhang Xiongfei
