spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Spark/Parquet/Statistics question
Date Tue, 17 Jan 2017 15:17:43 GMT
Hello,

I am not sure what you mean by min/max for strings; I do not know whether that makes sense. What the ORC format does have is bloom filters for strings etc. - is that what you are referring to?

In order to apply min/max filters, Spark needs to read the metadata of the file. Whether or not the filter is applied, you can see from the number of bytes read.
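
One quick way to compare (a minimal sketch, assuming the spark-shell and the standard SparkListener / TaskMetrics API; the accumulator name is arbitrary): sum the bytes read by all tasks, then run the same query with and without the filter.

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulate the bytes read by every task, so a filtered query can be
// compared against a full scan of the same path.
val bytesRead = spark.sparkContext.longAccumulator("bytesRead")
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for failed tasks
    if (taskEnd.taskMetrics != null) {
      bytesRead.add(taskEnd.taskMetrics.inputMetrics.bytesRead)
    }
  }
})

spark.sql("select * from parquet.`/secret/spark21-sortById` where id = 4").collect()
println(s"bytes read: ${bytesRead.value}")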


Best regards

> On 17 Jan 2017, at 15:28, djiang <djiang@dataxu.com> wrote:
> 
> Hi, 
> 
> I have been looking into how Spark stores statistics (min/max) in Parquet, as
> well as how it uses that information for query optimization, and I have a few
> questions.
> 
> First, the setup: Spark 2.1.0. The following creates a DataFrame of 1000 rows
> with a long column and a string column, written out twice, sorted by a
> different column each time:
> 
> scala> spark.sql("select id, cast(id as string) text from range(1000)").sort("id").write.parquet("/secret/spark21-sortById")
> scala> spark.sql("select id, cast(id as string) text from range(1000)").sort("text").write.parquet("/secret/spark21-sortByText")
> 
> I added some code to parquet-tools to print out the statistics and examined the
> generated Parquet files:
> 
> hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar meta /secret/spark21-sortById/part-00000-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet
> 
> file:        file:/secret/spark21-sortById/part-00000-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet
> creator:     parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
> extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"text","type":"string","nullable":false,"metadata":{}}]}
> 
> file schema: spark_schema
> --------------------------------------------------------------------------------
> id:          REQUIRED INT64 R:0 D:0
> text:        REQUIRED BINARY O:UTF8 R:0 D:0
> 
> row group 1: RC:5 TS:133 OFFSET:4
> --------------------------------------------------------------------------------
> id:          INT64 SNAPPY DO:0 FPO:4 SZ:71/81/1.14 VC:5 ENC:PLAIN,BIT_PACKED STA:[min: 0, max: 4, num_nulls: 0]
> text:        BINARY SNAPPY DO:0 FPO:75 SZ:53/52/0.98 VC:5 ENC:PLAIN,BIT_PACKED
> 
> hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar meta /secret/spark21-sortByText/part-00000-3d7eac74-5ca0-44a0-b8a6-d67cc38a2bde.snappy.parquet
> 
> file:        file:/secret/spark21-sortByText/part-00000-3d7eac74-5ca0-44a0-b8a6-d67cc38a2bde.snappy.parquet
> creator:     parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
> extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"text","type":"string","nullable":false,"metadata":{}}]}
> 
> file schema: spark_schema
> --------------------------------------------------------------------------------
> id:          REQUIRED INT64 R:0 D:0
> text:        REQUIRED BINARY O:UTF8 R:0 D:0
> 
> row group 1: RC:5 TS:140 OFFSET:4
> --------------------------------------------------------------------------------
> id:          INT64 SNAPPY DO:0 FPO:4 SZ:71/81/1.14 VC:5 ENC:PLAIN,BIT_PACKED STA:[min: 0, max: 101, num_nulls: 0]
> text:        BINARY SNAPPY DO:0 FPO:75 SZ:60/59/0.98 VC:5 ENC:PLAIN,BIT_PACKED
> 
> So the question is: why does Spark, 2.1.0 in particular, only generate min/max
> statistics for numeric columns but not for string (BINARY) fields, even when the
> string field is included in the sort? Did I miss a configuration?
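> 
> One way to look for such a setting (a minimal sketch, assuming the spark-shell's
> default implicits for the $"key" column syntax) is to list the Parquet-related
> SQL configuration keys and their descriptions:
> 
> scala> spark.sql("SET -v").filter($"key".contains("parquet")).show(false)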
> 
> The second issue is: how can I confirm Spark is utilizing the min/max?
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select * from parquet.`/secret/spark21-sortById` where id=4").show
> I got many lines like this:
> 17/01/17 09:23:35 INFO FilterCompat: Filtering using predicate: and(noteq(id, null), eq(id, 4))
> 17/01/17 09:23:35 INFO FileScanRDD: Reading File path: file:///secret/spark21-sortById/part-00000-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet, range: 0-558, partition values: [empty row]
> ...
> 17/01/17 09:23:35 INFO FilterCompat: Filtering using predicate: and(noteq(id, null), eq(id, 4))
> 17/01/17 09:23:35 INFO FileScanRDD: Reading File path: file:///secret/spark21-sortById/part-00193-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet, range: 0-574, partition values: [empty row]
> ...
> 
> 
> The question is that it looks like Spark is scanning every file, even though,
> from the min/max, Spark should be able to determine that only part-00000 has the
> relevant data. Or am I reading it wrong and Spark is in fact skipping the files?
> Or can Spark only use partition values for data skipping?
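> 
> One way I could check this (a minimal sketch, assuming the file layout above):
> explain() prints the physical plan, where the FileScan node lists the predicates
> handed to the Parquet reader as something like
> PushedFilters: [IsNotNull(id), EqualTo(id,4)], and the scan stage's Input Size in
> the web UI shows how much data was actually read.
> 
> scala> spark.sql("select * from parquet.`/secret/spark21-sortById` where id=4").explain()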
> 
> Thanks,
> 
> Dong
> 
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Parquet-Statistics-question-tp28312.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

