spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Spark SQL: The cached columnar table is not columnar?
Date Thu, 08 Jan 2015 14:42:09 GMT
Ah, my bad... You're absolutely right!

Just checked how this number is computed. It turns out that once an RDD
block is retrieved from the block manager, the size of the whole block is
added to the input bytes. Spark SQL's in-memory columnar format stores
all columns within a single partition in a single RDD block, which is why
the reported input bytes always equal the size of the whole table.
However, when decompressing and reading values from the columnar byte
buffers, only the required byte buffer(s) are actually touched.
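
To make that concrete, here is a tiny self-contained sketch (not Spark's
actual code; the names are made up for illustration) of the accounting
described above: the whole cached block is fetched from the block manager
and counted as input bytes, while only the projected column buffers are
actually read:

object ColumnarInputBytesSketch {
  // One cached partition: all column buffers packed into a single RDD block.
  case class CachedBlock(columns: Map[String, Array[Byte]]) {
    def sizeInBytes: Long = columns.values.map(_.length.toLong).sum
  }

  // Returns (bytes reported as input, bytes actually read).
  def scan(block: CachedBlock, projected: Seq[String]): (Long, Long) = {
    val reported = block.sizeInBytes                                       // whole block counted as input
    val touched  = projected.map(c => block.columns(c).length.toLong).sum  // only projected columns read
    (reported, touched)
  }

  def main(args: Array[String]): Unit = {
    val block = CachedBlock(Map(
      "adId" -> new Array[Byte](100),
      "name" -> new Array[Byte](900)
    ))
    println(scan(block, Seq("adId")))  // prints (1000,100)
  }
}

So the web UI metric reflects the full block size even when a single-column
projection only materializes that one column's buffers.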

Cheng

On 1/8/15 10:13 PM, Xuelin Cao wrote:
>
> Hi, Cheng
>
>       In your code:
>
> cacheTable("tbl")
> sql("select * from tbl").collect() sql("select name from tbl").collect()
>
>      When running the first SQL statement, the whole table is not cached
>      yet, so *the input data will be the original JSON file*.
>      After the table is cached, the JSON-format data is no longer scanned,
>      so the total amount of input data also drops.
>
>      If you try it like this:
>
> cacheTable("tbl")
> sql("select * from tbl").collect()
> sql("select name from tbl").collect()
> sql("select * from tbl").collect()
>
>      Is the input data of the third SQL statement bigger than 49.1 KB?
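>
>      (As an aside, a minimal sketch reusing the same "tbl" as above: since
>      cacheTable is lazy, forcing one full pass first materializes the
>      in-memory columnar cache, so even the first projection afterwards
>      reads from the cache instead of the JSON file:)
>
> cacheTable("tbl")
> sql("select count(*) from tbl").collect()   // forces the lazy cache to materialize
> sql("select name from tbl").collect()       // now served from the cached columnar data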
>
>
>
>
> On Thu, Jan 8, 2015 at 9:36 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>     Weird, which version did you use? I just tried a small snippet in
>     the Spark 1.2.0 shell as follows, and the result shown in the web UI
>     matches the expectation quite well:
>
>     import org.apache.spark.sql.SQLContext
>     import sc._
>
>     val sqlContext = new SQLContext(sc)
>     import sqlContext._
>
>     jsonFile("file:///tmp/p.json").registerTempTable("tbl")
>     cacheTable("tbl")
>     sql("select * from tbl").collect()
>     sql("select name from tbl").collect()
>
>     The input data of the first statement is 292 KB, and that of the
>     second is 49.1 KB.
>
>     The JSON file I used is examples/src/main/resources/people.json;
>     I copied its contents multiple times to generate a larger file.
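>
>     (For reference, one quick way to build such a file, a sketch assuming
>     the paths above and nothing Spark-specific:)
>
>     import java.nio.charset.StandardCharsets.UTF_8
>     import java.nio.file.{Files, Paths}
>     import scala.collection.JavaConverters._
>
>     // Repeat the sample file's lines to build a larger line-delimited JSON file.
>     val lines = Files.readAllLines(
>       Paths.get("examples/src/main/resources/people.json"), UTF_8).asScala
>     Files.write(Paths.get("/tmp/p.json"), Seq.fill(1000)(lines).flatten.asJava, UTF_8)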
>
>     Cheng
>
>     On 1/8/15 7:43 PM, Xuelin Cao wrote:
>
>>
>>
>>     Hi, Cheng
>>
>>          I checked the input data for each stage. For example, in my
>>     attached screen snapshot, the input data is 1212.5 MB, which is
>>     the total size of the whole table.
>>
>>     [inline image: screenshot of the stage's input size]
>>
>>          I also checked the input data for each task (on the stage
>>     detail page), and the sum of the input data across the tasks is
>>     also 1212.5 MB.
>>
>>
>>
>>
>>     On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>>
>>         Hey Xuelin, which data item in the Web UI did you check?
>>
>>
>>         On 1/7/15 5:37 PM, Xuelin Cao wrote:
>>>
>>>         Hi,
>>>
>>>               Curiouser and curiouser. I'm puzzled by the Spark SQL
>>>         cached table.
>>>
>>>         Theoretically, the cached table should be a columnar table,
>>>         and only the columns included in my SQL should be scanned.
>>>
>>>               However, in my test, I always see the whole table being
>>>         scanned even though I only "select" one column in my SQL.
>>>
>>>               Here is my code:
>>>
>>>         val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>>         import sqlContext._
>>>
>>>         sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
>>>         sqlContext.cacheTable("adTable")  // The table has > 10 columns
>>>
>>>         // First run, cache the table into memory
>>>         sqlContext.sql("select * from adTable").collect
>>>
>>>         // Second run, only one column is used. It should only scan
>>>         // a small fraction of data
>>>         sqlContext.sql("select adId from adTable").collect
>>>         sqlContext.sql("select adId from adTable").collect
>>>         sqlContext.sql("select adId from adTable").collect
>>>
>>>                 What I found is that every time I run the SQL, the web
>>>         UI shows that the total amount of input data is always the
>>>         same: the total size of the table.
>>>
>>>                 Is anything wrong? My expectation is:
>>>                 1. The cached table is stored as a columnar table
>>>                 2. Since I only need one column in my SQL, the total
>>>         amount of input data shown in the web UI should be very small
>>>
>>>                 But what I found is totally not the case. Why?
>>>
>>>                 Thanks
>>>
>>
>>
>
>

