spark-dev mailing list archives

From Haoyuan Li <haoyuan...@gmail.com>
Subject Re: off-heap RDDs
Date Mon, 26 Aug 2013 00:06:04 GMT
Hi Imran,

One possible solution is to use Tachyon <https://github.com/amplab/tachyon>.
When data is in Tachyon, Spark jobs read it from off-heap memory.
Internally, Tachyon uses direct byte buffers to store memory-serialized
RDDs, as you mentioned. Different Spark jobs can also share the same data
in Tachyon's memory. Here is a presentation we gave in May (slides:
<https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf>).
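[Editor's note: the off-heap mechanism described above — serialize a value, then park the bytes in a direct `ByteBuffer` so the GC never scans them — can be sketched roughly as below. This is an illustrative sketch using plain Java serialization, not Tachyon's or Spark's actual API; the class and method names are invented for the example.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

// Hypothetical sketch: cache a value off-heap by serializing it into a
// direct ByteBuffer. The buffer's storage lives outside the JVM heap,
// so a long-lived cached partition adds nothing to GC scan time.
public class OffHeapCache {

    // Serialize a value and copy the bytes into a direct (off-heap) buffer.
    static ByteBuffer put(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        byte[] bytes = bos.toByteArray();
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length); // off-heap allocation
        buf.put(bytes);
        buf.flip(); // make the buffer readable from position 0
        return buf;
    }

    // Copy the bytes back on-heap and deserialize; duplicate() leaves the
    // cached buffer's position untouched for other readers.
    static Object get(ByteBuffer buf) throws IOException, ClassNotFoundException {
        ByteBuffer view = buf.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteBuffer cached = put("hello off-heap");
        System.out.println(get(cached)); // prints: hello off-heap
    }
}
```

The trade-off this illustrates: reads pay a deserialize (and copy-on-heap) cost, in exchange for keeping the cached bytes invisible to the garbage collector.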

Haoyuan


On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <imran@therashids.com> wrote:

> Hi,
>
> I was wondering if anyone has thought about putting the cached data in
> an RDD into off-heap memory, e.g. with direct byte buffers.  For really
> long-lived RDDs that use a lot of memory, this seems like a huge
> improvement, since all that memory is then ignored entirely during GC
> (and reading from direct byte buffers is potentially faster as well,
> but that's just a nice bonus).
>
> The easiest thing to do is to store memory-serialized RDDs in direct
> byte buffers, but I guess we could also store the serialized RDD on
> disk and use a memory-mapped file.  Serializing into off-heap buffers
> is a really simple patch; I just changed a few lines (I haven't done
> any real tests with it yet, though).  But I don't really have a ton of
> experience with off-heap memory, so I thought I would ask what others
> think of the idea, whether it makes sense, and whether there are any
> gotchas I should be aware of, etc.
>
> thanks,
> Imran
>
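[Editor's note: the memory-mapped-file alternative Imran floats above can be sketched as follows. This is an illustrative round trip using plain Java serialization and NIO, not anything from the Spark codebase; the class name and temp-file naming are invented for the example.]

```java
import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: serialize a "partition" to disk, then read it
// back through a memory-mapped view of the file. The mapped pages are
// managed by the OS page cache, not the JVM heap, so the GC never
// scans them, and warm reads avoid explicit disk I/O.
public class MappedRddStore {

    static int[] roundTrip(int[] partition) throws Exception {
        Path path = Files.createTempFile("rdd-partition", ".bin");
        try {
            // Serialize the partition (a plain int[] stands in for real
            // RDD data) to the backing file.
            try (ObjectOutputStream oos =
                     new ObjectOutputStream(Files.newOutputStream(path))) {
                oos.writeObject(partition);
            }

            // Map the file read-only and deserialize from the mapping.
            try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
                MappedByteBuffer mapped =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                byte[] bytes = new byte[mapped.remaining()];
                mapped.get(bytes);
                try (ObjectInputStream ois =
                         new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                    return (int[]) ois.readObject();
                }
            }
        } finally {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws Exception {
        int[] back = roundTrip(new int[]{1, 2, 3, 4});
        System.out.println(back[0] + "," + back[3]); // prints: 1,4
    }
}
```

Compared with direct byte buffers, the mapped-file approach survives executor restarts (the bytes are on disk) but leaves eviction timing up to the OS rather than the application.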
