From Xuelin Cao <xuelin...@yahoo.com.INVALID>
Subject Spark SQL: The cached columnar table is not columnar?
Date Wed, 07 Jan 2015 09:37:40 GMT

Hi,
      Curiouser and curiouser. I'm puzzled by Spark SQL's cached tables.
      In theory, the cached table should be stored in a columnar format, and only the columns
referenced in my SQL should be scanned.
      However, in my test, the whole table is always scanned even though I only "select"
one column in my SQL.
      Here is my code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
sqlContext.cacheTable("adTable")  // The table has > 10 columns

// First run: materializes the table in memory
sqlContext.sql("select * from adTable").collect

// Second run: only one column is used, so it should only scan a small fraction of the data
sqlContext.sql("select adId from adTable").collect
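      For reference, this is how I try to check whether the query actually reads from the
in-memory columnar cache (a minimal sketch, assuming the SchemaRDD explain/queryExecution
API of Spark 1.2.x; the exact plan node names may differ between versions):

// Print the plan of the single-column query. If the cached columnar store is
// used, the physical plan should contain an InMemoryColumnarTableScan node
// instead of a scan of the original JSON file.
val singleColumn = sqlContext.sql("select adId from adTable")
singleColumn.explain(true)                              // logical + physical plans
println(singleColumn.queryExecution.executedPlan)       // executed physical plan only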

        What I found is that every time I run the SQL, the Web UI shows the same total amount
of input data --- the full size of the table.
        Is anything wrong? My expectation is:
        1. The cached table is stored as a columnar table.
        2. Since I only need one column in my SQL, the amount of input data shown in the Web UI
should be very small.
        But what I found is totally not the case. Why?
        Thanks
