spark-dev mailing list archives

From qingyang li <liqingyang1...@gmail.com>
Subject Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Date Mon, 14 Jul 2014 05:13:10 GMT
Shark. Thanks for replying.
Let me clarify my question.
----------------------------------------------
I created a table using "create table xxx1
tblproperties("shark.cache"="tachyon") as select * from xxx2".
When executing SQL (for example, select * from xxx1) with Shark,
Shark reads the data from Tachyon's memory into its own memory.
I think that if Shark loads the data from Tachyon every time we execute a
query, it is less efficient.
Could we use a cache policy (such as CacheAllPolicy, FIFOCachePolicy, or
LRUCachePolicy) to cache the data and avoid reading from Tachyon for each
SQL query?
----------------------------------------------



2014-07-14 2:47 GMT+08:00 Haoyuan Li <haoyuan.li@gmail.com>:

> Qingyang,
>
> Are you asking about Spark or Shark? (The first email said "Shark", the
> last email said "Spark".)
>
> Best,
>
> Haoyuan
>
>
> On Wed, Jul 9, 2014 at 7:40 PM, qingyang li <liqingyang1985@gmail.com>
> wrote:
>
> > Could I set a cache policy so that Spark loads data from Tachyon only
> > once for all SQL queries? For example, by using CacheAllPolicy,
> > FIFOCachePolicy, or LRUCachePolicy. I have tried those three policies,
> > but they did not help.
> > I think that if Spark always loads the data for each SQL query, it will
> > hurt query speed; it will take more time than when the data is managed
> > by Spark itself.
> >
> >
> >
> >
> > 2014-07-09 1:19 GMT+08:00 Haoyuan Li <haoyuan.li@gmail.com>:
> >
> > > Yes. For Shark, the two modes, "shark.cache=tachyon" and
> > > "shark.cache=memory", have the same ser/de overhead. In Tachyon mode,
> > > Shark loads data from outside of the process, with the following
> > > benefits:
> > >
> > >
> > >    - In-memory data sharing across multiple Shark instances (i.e.
> > >    stronger isolation)
> > >    - Instant recovery of in-memory tables
> > >    - Reduced heap size => faster GC in Shark
> > >    - If the table is larger than the memory size, only the hot columns
> > >    will be cached in memory
> > >
> > > from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html and
> > > https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon
> > >
> > > Haoyuan
> > >
> > >
> > > On Tue, Jul 8, 2014 at 9:58 AM, Aaron Davidson <ilikerps@gmail.com> wrote:
> > >
> > > > Shark's in-memory format is already serialized (it's compressed and
> > > > column-based).
> > > >
> > > >
> > > > On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan <mridul@gmail.com> wrote:
> > > >
> > > > > You are ignoring serde costs :-)
> > > > >
> > > > > - Mridul
> > > > >
> > > > > On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
> > > > > > Tachyon should only be marginally less performant than memory_only,
> > > > > > because we mmap the data from Tachyon's ramdisk. We do not have to,
> > > > > > say, transfer the data over a pipe from Tachyon; we can directly read
> > > > > > from the buffers in the same way that Shark reads from its in-memory
> > > > > > columnar format.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li <liqingyang1985@gmail.com> wrote:
> > > > > >
> > > > > >> Hi, when I create a table, I can specify the cache strategy using
> > > > > >> shark.cache.
> > > > > >> I think "shark.cache=memory_only" means the data is managed by Spark
> > > > > >> and lives in the same JVM as the executor, while "shark.cache=tachyon"
> > > > > >> means the data is managed by Tachyon, which is off-heap, and does not
> > > > > >> live in the same JVM as the executor, so Spark loads the data from
> > > > > >> Tachyon for each SQL query. So, is Tachyon less efficient than the
> > > > > >> memory_only cache strategy?
> > > > > >> If yes, can we have Spark load all the data from Tachyon once for all
> > > > > >> SQL queries? I want to use the Tachyon cache strategy because Tachyon
> > > > > >> has better HA than memory_only.
> > > > > >>
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Haoyuan Li
> > > AMPLab, EECS, UC Berkeley
> > > http://www.cs.berkeley.edu/~haoyuan/
> > >
> >
>
>
>
> --
> Haoyuan Li
> AMPLab, EECS, UC Berkeley
> http://www.cs.berkeley.edu/~haoyuan/
>
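
For readers following the mmap point made earlier in the thread (reading
directly from buffers backed by a ramdisk instead of transferring data over a
pipe), here is a minimal sketch of a memory-mapped read. It is not Shark or
Tachyon code. The helper names (write_column, read_slice_mmap), the file name,
and the use of a temporary directory as a stand-in for a ramdisk mount are all
assumptions made for illustration.

```python
import mmap
import os
import tempfile


def write_column(path, values):
    """Write small integers (0-255) as raw bytes; a stand-in for a cached column block."""
    with open(path, "wb") as f:
        f.write(bytes(values))


def read_slice_mmap(path, start, stop):
    """Map the file read-only and return bytes [start, stop) from the mapping.

    Only the pages that are touched get loaded; the file is not read
    into a separate buffer first.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            return bytes(buf[start:stop])


# Assumption: a temp directory stands in for a ramdisk mount point.
tmp_dir = tempfile.mkdtemp()
block_path = os.path.join(tmp_dir, "column_block.bin")  # hypothetical file name

write_column(block_path, range(100))
first_ten = read_slice_mmap(block_path, 0, 10)
```

On a real ramdisk the mapped pages are already in memory, which is why the
read cost was described in the thread as only marginally higher than
memory_only.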
