spark-dev mailing list archives

From qingyang li <liqingyang1...@gmail.com>
Subject Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Date Wed, 16 Jul 2014 07:06:22 GMT
Let me describe my scenario:
----------------------
I have a cluster of 8 machines (24 cores and 16 GB of memory per machine)
running both Spark and Tachyon. On Tachyon, I created a table containing
800M of data. When I run a SQL query against it on Shark, it takes 2.43s,
but when I create the same table in Spark memory and run the same SQL, it
takes 1.56s. So the data on Tachyon takes more time than the data in Spark
memory. Both runs have 150 map tasks, with 16-20 map tasks per node.
I think the reason is that when the data is on Tachyon, Shark has each Spark
slave load the data from the Tachyon slave on the same node.
I have tried setting some configuration options to tune Shark and Tachyon,
but I still cannot get the Tachyon case below 2.43s.
Does anyone have any ideas?
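
As a rough sketch of the arithmetic behind these numbers (the variable names
are illustrative only and do not come from Shark or Tachyon), the gap and the
task distribution can be worked out like this:

```python
# Timings and cluster shape as reported in this message.
tachyon_s = 2.43        # query time with the table cached in Tachyon
memory_s = 1.56         # query time with the table cached in Spark memory
map_tasks = 150         # map tasks per query in both cases
machines = 8
cores_per_machine = 24

overhead_s = tachyon_s - memory_s          # absolute extra time
slowdown = tachyon_s / memory_s            # relative slowdown
tasks_per_machine = map_tasks / machines   # average, if spread evenly

print(f"extra time with Tachyon: {overhead_s:.2f}s")
print(f"slowdown: {slowdown:.2f}x")
print(f"average map tasks per machine: {tasks_per_machine:.2f}")
# If every task gets its own core, all tasks can run in a single wave.
print("single wave:", tasks_per_machine <= cores_per_machine)
```

The average of 18.75 tasks per machine matches the observed 16-20 per node
and is below the 24 cores available, so the difference is unlikely to come
from task scheduling waves alone.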

By the way, my Tachyon block size is 1GB now and I want to change it. Will
setting tachyon.user.default.block.size.byte=8M work? If not, what does
tachyon.user.default.block.size.byte mean?
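
To illustrate why the block size matters, here is a minimal sketch of how
many blocks a file would occupy, assuming a file is split into fixed-size
blocks and assuming "800M" means 800 MB stored as a single file (both are
assumptions, and block_count is a hypothetical helper, not a Tachyon API).
It says nothing about whether the property name above is the right one.

```python
MB = 1024 * 1024


def block_count(file_bytes: int, block_bytes: int) -> int:
    """Number of fixed-size blocks needed to hold file_bytes (ceiling division)."""
    return -(-file_bytes // block_bytes)


table_bytes = 800 * MB  # assumption: "800M" means 800 MB in one file

print(block_count(table_bytes, 1024 * MB))  # with 1 GB blocks
print(block_count(table_bytes, 8 * MB))     # with 8 MB blocks
```

Under these assumptions the file fits in 1 block at 1 GB and needs 100 blocks
at 8 MB.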


2014-07-14 13:13 GMT+08:00 qingyang li <liqingyang1985@gmail.com>:

> Shark,  thanks for replying.
> Let me clarify my question again.
> ----------------------------------------------
> I create a table using "create table xxx1
> tblproperties("shark.cache"="tachyon") as select * from xxx2".
> When executing some SQL (for example, select * from xxx1) using Shark,
> Shark will read data from Tachyon's memory into Shark's memory.
> I think that if Shark always loads data from Tachyon each time we execute
> SQL, it is less efficient.
> Could we use some cache policy (such as CacheAllPolicy, FIFOCachePolicy,
> or LRUCachePolicy) to cache the data and avoid reading it from Tachyon
> for each SQL query?
> ----------------------------------------------
>
>
>
> 2014-07-14 2:47 GMT+08:00 Haoyuan Li <haoyuan.li@gmail.com>:
>
> Qingyang,
>>
>> Are you asking Spark or Shark (The first email was "Shark", the last email
>> was "Spark".)?
>>
>> Best,
>>
>> Haoyuan
>>
>>
>> On Wed, Jul 9, 2014 at 7:40 PM, qingyang li <liqingyang1985@gmail.com>
>> wrote:
>>
>> > Could I set some cache policy to let Spark load data from Tachyon only
>> > once for all SQL queries? For example, by using CacheAllPolicy,
>> > FIFOCachePolicy, or LRUCachePolicy. I have tried those three policies,
>> > but they did not help.
>> > I think that if Spark always loads data for each SQL query, it will
>> > hurt query speed; it will take more time than when the data is managed
>> > by Spark itself.
>> >
>> >
>> >
>> >
>> > 2014-07-09 1:19 GMT+08:00 Haoyuan Li <haoyuan.li@gmail.com>:
>> >
>> > > Yes. For Shark, two modes, "shark.cache=tachyon" and
>> > > "shark.cache=memory", have the same ser/de overhead. Shark loads data
>> > > from outside of the process in Tachyon mode with the following
>> > > benefits:
>> > >
>> > >
>> > >    - In-memory data sharing across multiple Shark instances (i.e.
>> > >    stronger isolation)
>> > >    - Instant recovery of in-memory tables
>> > >    - Reduce heap size => faster GC in shark
>> > >    - If the table is larger than the memory size, only the hot columns
>> > >    will be cached in memory
>> > >
>> > > from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html
>> > > and https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon
>> > >
>> > > Haoyuan
>> > >
>> > >
>> > > On Tue, Jul 8, 2014 at 9:58 AM, Aaron Davidson <ilikerps@gmail.com>
>> > > wrote:
>> > >
>> > > > Shark's in-memory format is already serialized (it's compressed and
>> > > > column-based).
>> > > >
>> > > >
>> > > > On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan
>> > > > <mridul@gmail.com> wrote:
>> > > >
>> > > > > You are ignoring serde costs :-)
>> > > > >
>> > > > > - Mridul
>> > > > >
>> > > > > On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson
>> > > > > <ilikerps@gmail.com> wrote:
>> > > > > > Tachyon should only be marginally less performant than
>> > > > > > memory_only, because we mmap the data from Tachyon's ramdisk.
>> > > > > > We do not have to, say, transfer the data over a pipe from
>> > > > > > Tachyon; we can directly read from the buffers in the same way
>> > > > > > that Shark reads from its in-memory columnar format.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li
>> > > > > > <liqingyang1985@gmail.com> wrote:
>> > > > > >
>> > > > > >> Hi, when I create a table, I can specify the cache strategy
>> > > > > >> using shark.cache.
>> > > > > >> I think "shark.cache=memory_only" means the data is managed by
>> > > > > >> Spark and is in the same JVM as the executor, while
>> > > > > >> "shark.cache=tachyon" means the data is managed by Tachyon,
>> > > > > >> which is off heap, and is not in the same JVM as the executor.
>> > > > > >> So Spark will load data from Tachyon for each SQL query. Is
>> > > > > >> Tachyon therefore less efficient than the memory_only cache
>> > > > > >> strategy?
>> > > > > >> If yes, can we let Spark load all the data from Tachyon once
>> > > > > >> for all SQL queries, if I want to use the Tachyon cache
>> > > > > >> strategy because Tachyon is more HA than memory_only?
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Haoyuan Li
>> > > AMPLab, EECS, UC Berkeley
>> > > http://www.cs.berkeley.edu/~haoyuan/
>> > >
>> >
>>
>>
>>
>> --
>> Haoyuan Li
>> AMPLab, EECS, UC Berkeley
>> http://www.cs.berkeley.edu/~haoyuan/
>>
>
>
