spark-dev mailing list archives

From qingyang li <liqingyang1...@gmail.com>
Subject Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Date Tue, 29 Jul 2014 02:36:42 GMT
Hi Haoyuan, thanks for replying.


2014-07-21 16:29 GMT+08:00 Haoyuan Li <haoyuan.li@gmail.com>:

> Qingyang,
>
> Aha. Got it.
>
> 800MB of data is pretty small. Loading from Tachyon does have a bit of extra
> overhead, but the benefit grows as the data size gets larger. Also, if you
> store the table in Tachyon, different Shark servers can query the data at
> the same time. For more on the trade-offs, please refer to this page:
> http://tachyon-project.org/Running-Shark-on-Tachyon.html
>
> Best,
>
> Haoyuan
>
>
> On Wed, Jul 16, 2014 at 12:06 AM, qingyang li <liqingyang1985@gmail.com>
> wrote:
>
> > Let me describe my setup:
> > ----------------------
> > I have 8 machines (24 cores and 16 GB of memory each) running a Spark
> > cluster and a Tachyon cluster. On Tachyon, I created one table containing
> > 800 MB of data. When I run a SQL query on Shark, it takes 2.43s, but when I
> > create the same table in Spark memory and run the same SQL, it takes 1.56s.
> > So data on Tachyon takes more time than data in Spark memory. Both runs
> > have 150 map processes, with 16-20 map processes per node.
> > I think the reason is that when the data is on Tachyon, Shark makes each
> > Spark slave load data from the Tachyon slave on the same node.
> > I have tried some configuration settings to tune Shark and Tachyon, but
> > still cannot get the former faster than 2.43s.
> > Does anyone have any ideas?
> >
> > By the way, my Tachyon block size is 1 GB now and I want to change it.
> > Will setting tachyon.user.default.block.size.byte=8M work? If not, what
> > does tachyon.user.default.block.size.byte mean?
> >
> >
> > 2014-07-14 13:13 GMT+08:00 qingyang li <liqingyang1985@gmail.com>:
> >
> > > Shark. Thanks for replying.
> > > Let me clarify my question.
> > > ----------------------------------------------
> > > I create a table using "create table xxx1
> > > tblproperties("shark.cache"="tachyon") as select * from xxx2".
> > > When executing some SQL (for example, select * from xxx1) using Shark,
> > > Shark reads data from Tachyon's memory into Shark's memory.
> > > I think that if Shark loads data from Tachyon every time we execute SQL,
> > > it is less efficient.
> > > Could we use some cache policy (such as CacheAllPolicy, FIFOCachePolicy,
> > > or LRUCachePolicy) to cache data and avoid reading from Tachyon for
> > > each SQL query?
> > > ----------------------------------------------
> > >
> > >
> > >
> > > 2014-07-14 2:47 GMT+08:00 Haoyuan Li <haoyuan.li@gmail.com>:
> > >
> > > Qingyang,
> > >>
> > >> Are you asking about Spark or Shark? (The first email said "Shark", the
> > >> last email said "Spark".)
> > >>
> > >> Best,
> > >>
> > >> Haoyuan
> > >>
> > >>
> > >> On Wed, Jul 9, 2014 at 7:40 PM, qingyang li <liqingyang1985@gmail.com>
> > >> wrote:
> > >>
> > >> > Could I set some cache policy to make Spark load data from Tachyon
> > >> > only once for all SQL queries? For example, by using CacheAllPolicy,
> > >> > FIFOCachePolicy, or LRUCachePolicy. I have tried all three policies,
> > >> > but they did not help.
> > >> > I think that if Spark loads the data for every SQL query, it will
> > >> > hurt query speed: it will take more time than when the data is
> > >> > managed by Spark itself.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > 2014-07-09 1:19 GMT+08:00 Haoyuan Li <haoyuan.li@gmail.com>:
> > >> >
> > >> > > Yes. For Shark, the two modes, "shark.cache=tachyon" and
> > >> > > "shark.cache=memory", have the same ser/de overhead. In Tachyon
> > >> > > mode, Shark loads data from outside of the process, with the
> > >> > > following benefits:
> > >> > >
> > >> > >
> > >> > >    - In-memory data sharing across multiple Shark instances (i.e.
> > >> > >    stronger isolation)
> > >> > >    - Instant recovery of in-memory tables
> > >> > >    - Reduced heap size => faster GC in Shark
> > >> > >    - If the table is larger than the memory size, only the hot
> > >> > >    columns will be cached in memory
> > >> > >
> > >> > > from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html
> > >> > > and https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon
> > >> > >
> > >> > > Haoyuan
> > >> > >
> > >> > >
> > >> > > On Tue, Jul 8, 2014 at 9:58 AM, Aaron Davidson <ilikerps@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Shark's in-memory format is already serialized (it's compressed
> > >> > > > and column-based).
> > >> > > >
> > >> > > >
> > >> > > > On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan
> > >> > > > <mridul@gmail.com> wrote:
> > >> > > >
> > >> > > > > You are ignoring serde costs :-)
> > >> > > > >
> > >> > > > > - Mridul
> > >> > > > >
> > >> > > > > On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson
> > >> > > > > <ilikerps@gmail.com> wrote:
> > >> > > > > > Tachyon should only be marginally less performant than
> > >> > > > > > memory_only, because we mmap the data from Tachyon's ramdisk.
> > >> > > > > > We do not have to, say, transfer the data over a pipe from
> > >> > > > > > Tachyon; we can directly read from the buffers in the same
> > >> > > > > > way that Shark reads from its in-memory columnar format.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li
> > >> > > > > > <liqingyang1985@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > >> Hi, when I create a table, I can choose the cache strategy
> > >> > > > > >> using shark.cache. I think "shark.cache=memory_only" means
> > >> > > > > >> the data is managed by Spark and lives in the same JVM as
> > >> > > > > >> the executor, while "shark.cache=tachyon" means the data is
> > >> > > > > >> managed by Tachyon, which is off-heap, so the data is not in
> > >> > > > > >> the same JVM as the executor and Spark will load data from
> > >> > > > > >> Tachyon for each SQL query. So, is Tachyon less efficient
> > >> > > > > >> than the memory_only cache strategy?
> > >> > > > > >> If yes, can we make Spark load all the data from Tachyon
> > >> > > > > >> once for all SQL queries, if I want to use the Tachyon cache
> > >> > > > > >> strategy because Tachyon is more highly available than
> > >> > > > > >> memory_only?
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Haoyuan Li
> > >> > > AMPLab, EECS, UC Berkeley
> > >> > > http://www.cs.berkeley.edu/~haoyuan/
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Haoyuan Li
> > >> AMPLab, EECS, UC Berkeley
> > >> http://www.cs.berkeley.edu/~haoyuan/
> > >>
> > >
> > >
> >
>
>
>
> --
> Haoyuan Li
> AMPLab, EECS, UC Berkeley
> http://www.cs.berkeley.edu/~haoyuan/
>
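
A rough sketch of the block-size arithmetic behind the question quoted above. It assumes, without confirmation from the thread, that tachyon.user.default.block.size.byte expects a plain byte count rather than a string such as "8M". It also treats the 800 MB table as one contiguous file, which is a simplification. The helper blocks_needed is hypothetical and not part of Tachyon.

```python
MB = 1024 * 1024
GB = 1024 * MB


def blocks_needed(data_bytes: int, block_bytes: int) -> int:
    """Number of fixed-size blocks needed to hold data_bytes (integer ceiling)."""
    return -(-data_bytes // block_bytes)


table_bytes = 800 * MB  # table size mentioned in the thread

# Byte count that would correspond to an 8 MB block size.
eight_mb_in_bytes = 8 * MB

# Block counts for the current 1 GB setting and the proposed 8 MB setting.
blocks_at_1gb = blocks_needed(table_bytes, 1 * GB)
blocks_at_8mb = blocks_needed(table_bytes, 8 * MB)

print(eight_mb_in_bytes, blocks_at_1gb, blocks_at_8mb)
```

Under these assumptions, 8 MB is 8388608 bytes, and an 800 MB file fits in a single 1 GB block but spans 100 blocks of 8 MB.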

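To illustrate the mmap point quoted in the thread, here is a minimal Python sketch of reading a file through a memory mapping instead of streaming it through a pipe. A temporary file stands in for a file on Tachyon's ramdisk; the payload and the function name read_mapped_prefix are made up for illustration, and this is not Shark's or Tachyon's actual read path.

```python
import mmap
import os
import tempfile


def read_mapped_prefix():
    """Write a small file, map it read-only, and read bytes from the mapping."""
    payload = b"column-block" * 1024  # stand-in for a serialized column block

    # Create the backing file; it is closed when the with-block exits.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    try:
        with open(path, "rb") as f:
            # Length 0 maps the whole file; ACCESS_READ makes it read-only.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                # Slicing copies bytes out of the mapped pages directly,
                # without an explicit read() call for each access.
                first = mapped[:12]
                size = len(mapped)
    finally:
        os.remove(path)

    return first, size


first, size = read_mapped_prefix()
print(first, size)
```

The mapping lets the reader address file contents like an in-memory buffer, which is the property the thread relies on when it says Tachyon should be only marginally slower than memory_only.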