spark-dev mailing list archives

From Haoyuan Li <haoyuan...@gmail.com>
Subject Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Date Tue, 08 Jul 2014 17:19:06 GMT
Yes. For Shark, the two modes, "shark.cache=tachyon" and "shark.cache=memory",
have the same ser/de overhead. In Tachyon mode, Shark loads data from outside
of the process, with the following benefits:


   - In-memory data sharing across multiple Shark instances (i.e. stronger
   isolation)
   - Instant recovery of in-memory tables
   - Reduced heap size => faster GC in Shark
   - If the table is larger than the memory size, only the hot columns will
   be cached in memory

from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html and
https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon
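
To make the mmap point from Aaron's message (quoted below) concrete, here is a
minimal Python sketch. It is not Shark or Tachyon code. It assumes a cached
block is simply a file of bytes on a ramdisk; the helper names write_column and
sum_column_mmap are made up for illustration, and any ordinary file path can
stand in for the ramdisk when trying it out.

```python
import mmap
import os
import tempfile


def write_column(path, values):
    """Write small integers (0-255) as raw bytes.

    Hypothetical stand-in for a serialized column block that an
    out-of-process store has placed on a ramdisk.
    """
    with open(path, "wb") as f:
        f.write(bytes(values))


def sum_column_mmap(path):
    """Map the file into this process and read the bytes in place."""
    total = 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            # Indexing the mapping touches the mapped pages directly,
            # without first copying the file into a separate buffer
            # via read() or a pipe.
            for i in range(len(buf)):
                total += buf[i]
    return total
```

For example, writing [1, 2, 3, 250] with write_column to a file under
tempfile.mkdtemp() and passing that path to sum_column_mmap returns 256. The
reader works on the mapped bytes much as it would on an in-process buffer,
which is why the extra cost over an in-heap cache is expected to be small.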

Haoyuan


On Tue, Jul 8, 2014 at 9:58 AM, Aaron Davidson <ilikerps@gmail.com> wrote:

> Shark's in-memory format is already serialized (it's compressed and
> column-based).
>
>
> On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan <mridul@gmail.com>
> wrote:
>
> > You are ignoring serde costs :-)
> >
> > - Mridul
> >
> > On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
> > > Tachyon should only be marginally less performant than memory_only,
> > > because we mmap the data from Tachyon's ramdisk. We do not have to, say,
> > > transfer the data over a pipe from Tachyon; we can directly read from the
> > > buffers in the same way that Shark reads from its in-memory columnar
> > > format.
> > >
> > >
> > >
> > > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li <liqingyang1985@gmail.com>
> > > wrote:
> > >
> > >> Hi, when I create a table, I can set the cache strategy using
> > >> shark.cache. I think "shark.cache=memory_only" means the data is managed
> > >> by Spark and lives in the same JVM as the executor, while
> > >> "shark.cache=tachyon" means the data is managed by Tachyon, which is off
> > >> heap, and does not live in the same JVM as the executor, so Spark will
> > >> load data from Tachyon for each SQL query. So, is Tachyon less efficient
> > >> than the memory_only cache strategy?
> > >> If yes, can we let Spark load all data from Tachyon once for all SQL
> > >> queries, if I want to use the Tachyon cache strategy because Tachyon is
> > >> more HA than memory_only?
> > >>
> >
>



-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
