spark-user mailing list archives

From "Sam Liu" <liuqiyun_sp...@sina.com>
Subject Strange results of running Spark GenSort.scala
Date Sun, 28 Dec 2014 12:57:31 GMT
Hi Experts,
I am confused about the input parameters of GenSort.scala and ran into some strange results.
It requires 3 parameters: "[num-parts] [records-per-part] [output-path]".
Like Hadoop, I assume each row (record) of the sort file is 100 bytes. So if I want to
generate and sort 100 GB of data using 4 partitions, is it correct to set the parameters
to '4, 268435456, /tmp/sort-output'? I computed the number of records (rows) as follows:

100 GB = 107374182400 bytes
       = 1073741824 rows * 100 bytes/row
       = 4 partitions * 268435456 rows/partition * 100 bytes/row

So each partition should generate 268435456 rows (records), right?
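
In Scala, the same check looks like this (a minimal sketch of my arithmetic; the 100-byte
record size is an assumption borrowed from Hadoop's sort benchmark):

// Sanity check of the sizing arithmetic (plain Scala, no Spark needed).
object SortSizing {
  def main(args: Array[String]): Unit = {
    val totalBytes     = 100L * 1024 * 1024 * 1024   // 100 GB = 107374182400 bytes
    val bytesPerRecord = 100L                        // assumed record size
    val numParts       = 4L
    val totalRecords   = totalBytes / bytesPerRecord // 1073741824 records
    val recordsPerPart = totalRecords / numParts     // 268435456 records per partition
    println(s"records-per-part = $recordsPerPart")
  }
}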


However, if I save the output as a sequence file, the total size of the output files is
only 20.8 GB (5.2 GB * 4 partitions). And if I save the output as a text file instead of
a sequence file, the total size is 309.2 GB (77.3 GB * 4 partitions), but NOT 100 GB. Why?
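
For context, this is roughly how I write the output in both formats (a minimal sketch; the
stand-in records and paths are illustrative, not GenSort's actual output; saveAsSequenceFile
and saveAsTextFile are the standard RDD save APIs):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicits for saveAsSequenceFile (Spark 1.x)

object SaveBothFormats {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SaveBothFormats"))
    // Stand-in records: 10-byte key + 90-byte value = 100 bytes of payload each.
    val records = sc.parallelize(0 until 1000000, 4).map { i =>
      (f"$i%010d", "x" * 90)
    }
    records.saveAsSequenceFile("/tmp/sort-output-seq")  // Hadoop SequenceFile output
    records.map { case (k, v) => k + v }
           .saveAsTextFile("/tmp/sort-output-text")     // one line of text per record
    sc.stop()
  }
}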

Thanks!

--------------------------------
Sam Liu
