spark-user mailing list archives

From Vishnusaran Ramaswamy <vishnusa...@gmail.com>
Subject SchemaRDD + SQL , loading projection columns
Date Tue, 02 Dec 2014 20:43:53 GMT
Hi,

I have 16 GB of parquet files in /tmp/logs/ folder with the following schema 

request_id(String), module(String), payload(Array[Byte])

Most of the 16 GB is the payload field; the request_id and module
fields together take less than 200 MB.

I want to load the payload only when my filter condition matches. 

val sqlContext = new SQLContext(sc)
val files = sqlContext.parquetFile("/tmp/logs")
files.registerTempTable("logs")
val filteredLogs = sqlContext.sql(
  "select request_id, payload from logs where request_id = 'dd4455ee' and module = 'query'")

When I run filteredLogs.collect.foreach(println), I see all 16 GB of
data being loaded.

How do I load only the columns used in filters first and then load the
payload for the row matching the filter criteria?

Let me know if this can be done in a different way.
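One way to sketch the two-phase load described above (this is an untested sketch against the Spark 1.x SQLContext API used earlier; it assumes the set of matching request ids is small enough to collect to the driver and inline into an IN list):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val files = sqlContext.parquetFile("/tmp/logs")
files.registerTempTable("logs")

// Phase 1: scan only the small columns to find matching request ids.
// Parquet is columnar, so this scan should not need to read the
// payload column at all.
val ids = sqlContext
  .sql("select request_id from logs where module = 'query'")
  .map(_.getString(0))
  .collect()

// Phase 2: load the payload only for the rows that matched.
val idList = ids.map(id => s"'$id'").mkString(",")
val payloads = sqlContext.sql(
  s"select request_id, payload from logs where request_id in ($idList)")
```

Whether this actually avoids reading the payload column in phase 1 depends on Parquet column pruning being applied by the planner, which is worth verifying on your version.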

Thank you,
Vishnu.






