spark-user mailing list archives

From Vishnusaran Ramaswamy <>
Subject SchemaRDD + SQL , loading projection columns
Date Tue, 02 Dec 2014 20:43:53 GMT

I have 16 GB of Parquet files in the /tmp/logs/ folder with the following schema:

request_id (String), module (String), payload (Array[Byte])

Most of my 16 GB of data is in the payload field; the request_id and module
fields together take less than 200 MB.

I want to load the payload only when my filter condition matches. 

val sqlContext = new SQLContext(sc)
val logs = sqlContext.parquetFile("/tmp/logs")
logs.registerTempTable("logs")
val filteredLogs = sqlContext.sql(
  "select request_id, payload from logs " +
  "where request_id = 'dd4455ee' and module = 'query'")

When I run filteredLogs.collect.foreach(println), I see all of the 16 GB of
data being loaded.

How do I load only the columns used in the filters first, and then load the
payload only for the rows matching the filter criteria?
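For context, here is a sketch of what I expect to work. I am assuming that selecting only two columns already prunes the others at read time (since Parquet is columnar), and that predicate pushdown can be enabled via the spark.sql.parquet.filterPushdown setting, though I am not certain that setting applies to my release:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext()
val sqlContext = new SQLContext(sc)

// Assumption: this config asks the Parquet reader to evaluate the
// where-clause predicates itself, skipping row groups that cannot match.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

val logs = sqlContext.parquetFile("/tmp/logs")
logs.registerTempTable("logs")

// Only request_id and payload should be read from disk; the module
// column would be touched only to evaluate the filter.
val filteredLogs = sqlContext.sql(
  "select request_id, payload from logs " +
  "where request_id = 'dd4455ee' and module = 'query'")
```

If that assumption is wrong, is there a two-phase way to do this: scan just the small request_id/module columns to find matching rows, then fetch payload only for those?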

Let me know if this can be done in a different way.

Thank you,

