spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Spark SQL: The cached columnar table is not columnar?
Date Thu, 08 Jan 2015 13:36:24 GMT
Weird, which version did you use? I just tried a small snippet in the Spark 
1.2.0 shell as follows, and the result shown in the web UI matches the 
expectation quite well:

    import org.apache.spark.sql.SQLContext
    import sc._

    val sqlContext = new SQLContext(sc)
    import sqlContext._

    jsonFile("file:///tmp/p.json").registerTempTable("tbl")
    cacheTable("tbl")
    sql("select * from tbl").collect()
    sql("select name from tbl").collect()

The input data of the first query is 292KB; for the second it is 49.1KB.

The JSON file I used is examples/src/main/resources/people.json; I 
copied its contents multiple times to generate a larger file.
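
For anyone who wants to reproduce this, here is a minimal sketch of one way to build such a file from the Scala shell. The helper name repeatFile and the copy count in the usage comment are hypothetical and are not necessarily how the file behind the numbers above was produced. It assumes the source file holds one JSON object per line, which is what jsonFile expects.

```scala
import java.nio.file.{Files, Path, Paths, StandardOpenOption}

// Hypothetical helper: write `copies` back-to-back copies of `src` into `dst`.
// A trailing newline is added to each copy if the source lacks one, so the
// last record of one copy never fuses with the first record of the next.
def repeatFile(src: Path, dst: Path, copies: Int): Unit = {
  val bytes = Files.readAllBytes(src)
  val needsNewline = bytes.nonEmpty && bytes.last != '\n'.toByte
  val out = Files.newOutputStream(
    dst,
    StandardOpenOption.CREATE,
    StandardOpenOption.TRUNCATE_EXISTING,
    StandardOpenOption.WRITE)
  try {
    (1 to copies).foreach { _ =>
      out.write(bytes)
      if (needsNewline) out.write('\n'.toInt)
    }
  } finally {
    out.close()
  }
}

// Example call (paths assumed; run from the Spark source root):
// repeatFile(Paths.get("examples/src/main/resources/people.json"),
//            Paths.get("/tmp/p.json"), 10000)
```

The copy count only needs to be large enough for the per-query input sizes to be easy to tell apart in the web UI.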

Cheng

On 1/8/15 7:43 PM, Xuelin Cao wrote:

>
>
> Hi, Cheng
>
>      I checked the input data for each stage. For example, in my 
> attached screenshot, the input data is 1212.5MB, which is the 
> total size of the whole table.
>
> Inline image 1
>
>      I also checked the input data for each task (on the stage 
> detail page), and the sum of the input data over all tasks is also 1212.5MB.
>
>
>
>
> On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>     Hey Xuelin, which data item in the Web UI did you check?
>
>
>     On 1/7/15 5:37 PM, Xuelin Cao wrote:
>>
>>     Hi,
>>
>>           Curious and curious. I'm puzzled by the Spark SQL cached table.
>>
>>           Theoretically, the cached table should be a columnar table,
>>     and a query should scan only the columns included in my SQL.
>>
>>           However, in my test, I always see the whole table being
>>     scanned even though I only "select" one column in my SQL.
>>
>>           Here is my code:
>>
>>     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>     import sqlContext._
>>     sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
>>     sqlContext.cacheTable("adTable")  // The table has > 10 columns
>>
>>     // First run, cache the table into memory
>>     sqlContext.sql("select * from adTable").collect
>>
>>     // Second run, only one column is used. It should only scan a
>>     // small fraction of data
>>     sqlContext.sql("select adId from adTable").collect
>>     sqlContext.sql("select adId from adTable").collect
>>     sqlContext.sql("select adId from adTable").collect
>>
>>             What I found is that every time I run the SQL, the web UI
>>     shows the same total amount of input data: the total size of
>>     the table.
>>
>>             Is anything wrong? My expectation is:
>>             1. The cached table is stored as a columnar table
>>             2. Since I only need one column in my SQL, the total
>>     amount of input data shown in the web UI should be very small
>>
>>             But what I found is not the case at all. Why?
>>
>>             Thanks
>>
>
>