spark-user mailing list archives

From Vishnusaran Ramaswamy <vishnusa...@gmail.com>
Subject Re: SchemaRDD + SQL , loading projection columns
Date Wed, 03 Dec 2014 18:01:40 GMT
Thanks for the help. Let me find more info on how to enable statistics in
parquet.

-Vishnu
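
For reference, here is a minimal sketch of turning on the predicate pushdown
Michael describes below. It is only an illustration: it assumes the Spark 1.2
SQLContext/SchemaRDD API, an existing SparkContext `sc`, and the /tmp/logs
data from the original question, and it filters on request_id (the original
query filters on `rid`, which does not appear in the stated schema).

import org.apache.spark.sql.SQLContext

// Sketch only: spark.sql.parquet.filterPushdown is off by default in
// Spark 1.2 because of known bugs in the parquet library.
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

val files = sqlContext.parquetFile("/tmp/logs")
files.registerTempTable("logs")

// With pushdown enabled, the predicates on request_id/module can be handed
// to the parquet reader instead of being applied after rows are materialized.
val filteredLogs = sqlContext.sql(
  "SELECT request_id, payload FROM logs " +
  "WHERE request_id = 'dd4455ee' AND module = 'query'")
filteredLogs.collect.foreach(println)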


Michael Armbrust wrote
> There is not a super easy way to do what you are asking, since in general
> parquet needs to read all the data in a column. As far as I understand, it
> does not have indexes that would allow you to jump to a specific entry in a
> column.  There is support for pushing down predicates, but unfortunately
> this is turned off by default (in Spark 1.2) due to bugs in the parquet
> library.  Even with this feature, though, I believe you still read the data
> and just skip the cost of materializing the row.
> 
> One thing that could speed up that particular query is to sort by 'rid
> before storing to parquet.  Then (when filter pushdown is turned on),
> parquet will keep statistics on the min/max value for each column in a
> given row group.  That would allow it to completely skip row groups that
> cannot contain a given 'rid.
> 
> Michael
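
A rough illustration of the sort-before-write suggestion above, written
against the Spark 1.2 SchemaRDD API; the output path /tmp/logs_sorted is made
up for the example, and it reuses the sqlContext from the sketch earlier in
this message.

// Rewrite the data sorted on the filter column so each parquet row group
// covers a narrow min/max range of request_id; with filter pushdown enabled,
// row groups whose range cannot contain the requested id can be skipped.
val raw = sqlContext.parquetFile("/tmp/logs")
raw.registerTempTable("raw_logs")
sqlContext
  .sql("SELECT request_id, module, payload FROM raw_logs ORDER BY request_id")
  .saveAsParquetFile("/tmp/logs_sorted")  // query this sorted copy instead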
> 
> On Tue, Dec 2, 2014 at 12:43 PM, Vishnusaran Ramaswamy <vishnusaran@...>
> wrote:
> 
>> Hi,
>>
>> I have 16 GB of parquet files in /tmp/logs/ folder with the following
>> schema
>>
>> request_id(String), module(String), payload(Array[Byte])
>>
>> Most of my 16 GB of data is in the payload field; the request_id and
>> module fields take less than 200 MB.
>>
>> I want to load the payload only when my filter condition matches.
>>
>> val sqlContext = new SQLContext(sc)
>> val files = sqlContext.parquetFile("/tmp/logs")
>> files.registerTempTable("logs")
>> val filteredLogs = sqlContext.sql("select request_id, payload from logs
>> where rid = 'dd4455ee' and module = 'query' ")
>>
>> When I run filteredLogs.collect.foreach(println), I see all 16 GB of
>> data being loaded.
>>
>> How do I load only the columns used in the filters first, and then load
>> the payload only for the rows matching the filter criteria?
>>
>> Let me know if this can be done in a different way.
>>
>> Thank you,
>> Vishnu.