spark-user mailing list archives

From matthes <mdiekst...@sensenetworks.com>
Subject Re: Is it possible to use Parquet with Dremel encoding
Date Mon, 29 Sep 2014 15:51:40 GMT
Thank you so much, guys, for helping me, but I have some more questions about
it!

Do we have to presort the columns to get the benefits of run-length
encoding, or do I have to group the data first and wrap it in a case class?

I tried sorting the data first and then writing it out, and I get different
file sizes as a result:
65.191.222 Bytes	unsorted
62.576.598 Bytes	sorted
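
To illustrate why sort order can matter for run-length encoding at all, here is a minimal sketch using a toy encoder. It is not Parquet's actual RLE/bit-packing hybrid, and the column below (10,000 values over 50 distinct keys) is hypothetical, not the data from this thread. It only shows that sorting turns many short runs into a few long ones:

```python
import random
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


random.seed(0)
# Hypothetical low-cardinality column: 10,000 values drawn from 50 distinct keys.
column = [random.randrange(50) for _ in range(10_000)]

unsorted_runs = run_length_encode(column)
sorted_runs = run_length_encode(sorted(column))

# Sorted input yields at most one run per distinct value.
print(len(unsorted_runs), "runs when unsorted")
print(len(sorted_runs), "runs when sorted")
```

A column with mostly unique values (as col1 and col2 appear to be, judging by their size) would gain little from this, since there are few repeats to collapse even after sorting.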

I see no run-length encoding in the debug output:

14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 4.572.354B for
[col1] INT64: 683.189 values, 5.465.512B raw, 4.572.211B comp, 6 pages,
encodings: [PLAIN, BIT_PACKED]
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 4.687.432B for
[col2] INT64: 683.189 values, 5.465.512B raw, 4.687.289B comp, 6 pages,
encodings: [PLAIN, BIT_PACKED]
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 847.267B for
[col3] INT32: 683.189 values, 852.104B raw, 847.198B comp, 3 pages,
encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 713 entries, 2.852B raw,
713B comp}
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 796.082B for
[col4] INT32: 683.189 values, 907.744B raw, 796.013B comp, 3 pages,
encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1.262 entries, 5.048B raw,
1.262B comp}
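
As a rough way to read the log lines above, the following sketch computes bytes per value for each column from the logged figures. The numbers are copied from the log (the dots there are thousands separators); the dictionary layout and the 8-byte and 4-byte widths for INT64 and INT32 are the only things added:

```python
# Figures copied from the ColumnChunkPageWriteStore log lines;
# "width" is the plain-encoded size of one value (INT64 = 8, INT32 = 4).
columns = {
    "col1": {"width": 8, "values": 683_189, "raw": 5_465_512, "comp": 4_572_211},
    "col2": {"width": 8, "values": 683_189, "raw": 5_465_512, "comp": 4_687_289},
    "col3": {"width": 4, "values": 683_189, "raw": 852_104, "comp": 847_198},
    "col4": {"width": 4, "values": 683_189, "raw": 907_744, "comp": 796_013},
}

for name, c in columns.items():
    plain_size = c["width"] * c["values"]      # size if every value were stored plainly
    raw_per_value = c["raw"] / c["values"]     # bytes per value before compression
    share_of_plain = c["raw"] / plain_size     # 1.0 means no savings from the encoding
    print(f"{name}: {raw_per_value:.2f} raw bytes/value, "
          f"{share_of_plain:.0%} of plain size, {c['comp']} B compressed")
```

For col1 and col2 the raw size is exactly 8 bytes per value, which matches the PLAIN encoding reported in the log. For col3 and col4 the dictionary-encoded raw size is well under the 4 bytes per value that plain INT32 would need.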


By the way, why is the schema wrong? I included repeated values there, and
I'm very confused!

Thanks 
Matthes
