spark-dev mailing list archives

From Imran Rashid <im...@therashids.com>
Subject Re: off-heap RDDs
Date Wed, 28 Aug 2013 08:21:30 GMT
Thanks Haoyuan.  It seems like we should try out Tachyon -- it sounds
like it is what we are looking for.

On Wed, Aug 28, 2013 at 8:18 AM, Haoyuan Li <haoyuan.li@gmail.com> wrote:
> Response inline.
>
>
> On Tue, Aug 27, 2013 at 1:37 AM, Imran Rashid <imran@therashids.com> wrote:
>
>> Thanks for all the great comments & discussion.  Let me expand a bit
>> on our use case, and then I'm gonna combine responses to various
>> questions.
>>
>> In general, when we use spark, we have some really big RDDs that use
>> up a lot of memory (10s of GB per node) that are really our "core"
>> data sets.  We tend to start up a spark application, immediately load
>> all those data sets, and just leave them loaded for the lifetime of
>> that process.  We definitely create a lot of other RDDs along the way,
>> and lots of intermediate objects that we'd like to go through normal
>> garbage collection.  But those all require much less memory, maybe
>> 1/10th of the big RDDs that we just keep around.  I know this is a bit
>> of a special case, but it seems like it probably isn't that different
>> from a lot of use cases.
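>>
>> (Concretely -- with a made-up path and a made-up parseEvent, and `sc`
>> being the usual SparkContext -- our applications look roughly like
>> this at startup:)
>>
>>   // load the big "core" data set once, up front, and keep it cached
>>   // for the lifetime of the application
>>   val events = sc.textFile("hdfs:///data/events").map(parseEvent).cache()
>>   events.count()   // materialize the cache right away
>>
>>   // everything derived from `events` later is comparatively small and
>>   // goes through normal garbage collection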
>>
>> Reynold Xin wrote:
>> > This is especially attractive if the application can read directly
>> > from a byte buffer without generic serialization (like Shark).
>>
>> Interesting -- can you explain how this works in Shark?  Do you have
>> some general way of storing data in byte buffers that avoids
>> serialization?  Or do you mean that if the user is effectively
>> creating an RDD of ints, you create an RDD[ByteBuffer] and
>> then read / write the ints into the byte buffer yourself?
>> Sorry, I'm familiar with the basic idea of Shark but not with the code
>> at all -- even a pointer to the code would be helpful.
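>>
>> (For illustration, here's roughly what I imagine for the second case --
>> definitely not Shark's actual code, just a sketch: a partition of ints
>> packed into one ByteBuffer and read back without any generic serializer
>> involved.)
>>
>>   import java.nio.ByteBuffer
>>
>>   // pack an Array[Int] into a single buffer -- the "RDD[ByteBuffer]" idea
>>   def pack(ints: Array[Int]): ByteBuffer = {
>>     val buf = ByteBuffer.allocate(ints.length * 4)
>>     ints.foreach(i => buf.putInt(i))
>>     buf.flip()
>>     buf
>>   }
>>
>>   // read the ints straight back out of the buffer, no deserialization
>>   def sumInts(buf: ByteBuffer): Long = {
>>     val b = buf.duplicate()
>>     var total = 0L
>>     while (b.hasRemaining) total += b.getInt()
>>     total
>>   }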
>>
>> Haoyuan Li wrote:
>> > One possible solution is to use Tachyon
>> > (https://github.com/amplab/tachyon).
>>
>> This is a good idea that I had probably overlooked.  There are two
>> potential issues that I can think of with this approach, though:
>> 1) I was under the impression that Tachyon is still not really tested
>> in production systems, and I need something a bit more mature.  Of
>> course, my changes wouldn't be thoroughly tested either, but somehow I
>> feel better about deploying my 5-line patch to a codebase I understand
>> than adding another entire system.  (This isn't a good reason to add
>> this to Spark in general, though -- it just might be a temporary patch
>> we deploy locally.)
>>
>
> This is a legitimate concern. The good news is that several companies have
> been testing it for a while, and some are close to putting it into
> production. For example, as Yahoo mentioned in today's meetup, we are
> working to integrate Shark and Tachyon closely, and the results are very
> promising. It will be in production soon.
>
>
>> 2) I may have misunderstood Tachyon, but it seems there is a big
>> difference in data locality between these two approaches.  On a large
>> cluster, HDFS will spread the data all over the cluster, so any
>> particular piece of on-disk data will only live on a few machines.
>> When you start a Spark application, which only uses a small subset of
>> the nodes, odds are the data you want is *not* on those nodes.  So
>> even if Tachyon caches data from HDFS into memory, it won't be on the
>> same nodes as the Spark application.  Which means that when the Spark
>> application reads data from the RDD, even though the data is in memory
>> on some node in the cluster, it will need to be read over the network
>> by the actual Spark worker assigned to the application.
>>
>> Is my understanding correct?  I haven't done any measurements at all
>> of a difference in performance, but it seems this would be much
>> slower.
>>
>
> This is a great question. Actually, from a data locality perspective, the
> two approaches are no different. Tachyon does client-side caching, which
> means that if a client on a node reads data that is not on its local
> machine, the first read caches the data on that node. Therefore, all future
> accesses on that node will read the data from its local memory. For
> example, suppose you have a cluster with 100 nodes, all running HDFS and
> Tachyon, and you launch a Spark job running on only 20 of them. When the
> job reads or caches the data for the first time, all of the data will be
> cached on those 20 nodes. After that, when the Spark master tries to
> schedule tasks, it will query Tachyon for the data locations and take
> advantage of data locality automatically.
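>
> (From the Spark side it's just a path. For example -- with a made-up
> master host and file name, and assuming the Tachyon client jar is on
> the classpath -- it would look roughly like:)
>
>   // Tachyon exposes a Hadoop-compatible FileSystem, so a file in it
>   // can be read like any other URI; the first pass caches blocks on
>   // the reading nodes, and later passes are served from local memory
>   val data = sc.textFile("tachyon://tachyon-master:19998/datasets/input")
>   data.count()  // first read pulls and caches; later reads are node-local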
>
> Best,
>
> Haoyuan
>
>
>>
>>
>> Mark Hamstra wrote:
>> > What worries me is the
>> > combinatoric explosion of different caching and persistence mechanisms.
>>
>> great points, and I have no idea of the real advantages yet.  I agree
>> we'd need to actually observe an improvement to add yet another option.
>> (I would really like some alternative to what I'm doing now, but maybe
>> Tachyon is all I need ...)
>>
>> Reynold Xin wrote:
>> > Mark - you don't necessarily need to construct a separate storage level.
>> > One simple way to accomplish this is for the user application to pass
>> > Spark a DirectByteBuffer.
>>
>> Hmm, that's true I suppose, but I had originally thought of making it
>> another storage level, just for convenience & consistency.  Couldn't
>> you get rid of all the storage levels and just have the user apply
>> various transformations to an RDD?  e.g.
>>
>> rdd.cache(MEMORY_ONLY_SER)
>>
>> could be
>>
>> rdd.map{x => serializer.serialize(x)}.cache(MEMORY_ONLY)
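>>
>> (Combining that with Reynold's DirectByteBuffer suggestion, I imagine
>> something roughly like this -- just a sketch, with a made-up
>> packPartition helper that writes a partition's records out as bytes:)
>>
>>   import java.nio.ByteBuffer
>>
>>   // each partition becomes one direct (off-heap) buffer, which is then
>>   // cached as an ordinary RDD element on the heap side
>>   val offHeap = rdd.mapPartitions { iter =>
>>     val bytes = packPartition(iter)   // hypothetical: Iterator[T] => Array[Byte]
>>     val buf = ByteBuffer.allocateDirect(bytes.length)
>>     buf.put(bytes)
>>     buf.flip()
>>     Iterator(buf)
>>   }.cache()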
>>
>>
>> And I agree with all of Lijie's comments that using off-heap memory is
>> unsafe & difficult.  But I feel that isn't a reason to completely
>> disallow it, if there is a significant performance improvement.  It
>> would need to be clearly documented as an advanced feature with some
>> risks involved.
>>
>> thanks,
>> imran
>>
>> On Mon, Aug 26, 2013 at 4:38 AM, Lijie Xu <csxulijie@gmail.com> wrote:
>> > I remember that I talked about this off-heap approach with Reynold in
>> > person several months ago. I think this approach is attractive to
>> > Spark/Shark, since there are many large objects in the JVM. But the main
>> > problem in the original Spark (without Tachyon support) is that it uses
>> > the same memory space both for storing critical data and for processing
>> > temporary data. Separating storage and processing is more important than
>> > looking for a memory-efficient storage technique. So I think this
>> > separation is the main contribution of Tachyon.
>> >
>> >
>> > As for the off-heap approach, we are not the first to recognize this
>> > problem. Apache DirectMemory is promising, though not mature yet.
>> > However, I think there are some problems with using direct memory.
>> >
>> > 1) Unsafe. As in C++, there may be memory leaks. Users may also be
>> > confused about how to set the right memory-related configurations, such
>> > as -Xmx and -XX:MaxDirectMemorySize.
>> >
>> > 2) Difficult. Designing an effective and efficient memory management
>> > system is not an easy job. How to allocate, replace, and reclaim objects
>> > at the right time and in the right place is challenging. It's a bit
>> > similar to GC algorithms.
>> >
>> > 3) Limited usage. It's useful for write-once-read-many-times large
>> > objects, but not for others.
>> >
>> >
>> >
>> > I also have two related questions:
>> >
>> > 1) Can the JVM's heap use virtual memory, or only physical memory?
>> >
>> > 2) Can direct memory use virtual memory, or only physical memory?
>> >
>> >
>> >
>> >
>> > On Mon, Aug 26, 2013 at 8:06 AM, Haoyuan Li <haoyuan.li@gmail.com> wrote:
>> >
>> >> Hi Imran,
>> >>
>> >> One possible solution is to use Tachyon
>> >> (https://github.com/amplab/tachyon).
>> >> When data is in Tachyon, Spark jobs will read it from off-heap memory.
>> >> Internally, it uses direct byte buffers to store memory-serialized RDDs,
>> >> as you mentioned. Also, different Spark jobs can share the same data in
>> >> Tachyon's memory. Here are the slides from a presentation we did in May:
>> >> https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf
>> >>
>> >> Haoyuan
>> >>
>> >>
>> >> On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <imran@therashids.com>
>> >> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > I was wondering if anyone has thought about putting the cached data
>> >> > in an RDD into off-heap memory, e.g. with direct byte buffers.  For
>> >> > really long-lived RDDs that use a lot of memory, this seems like a
>> >> > huge improvement, since all that memory is now totally ignored during
>> >> > GC (and reading data from direct byte buffers is potentially faster
>> >> > as well, but that's just a nice bonus).
>> >> >
>> >> > The easiest thing to do is to store memory-serialized RDDs in direct
>> >> > byte buffers, but I guess we could also store the serialized RDD on
>> >> > disk and use a memory-mapped file.  Serializing into off-heap buffers
>> >> > is a really simple patch -- I just changed a few lines (I haven't done
>> >> > any real tests with it yet, though).  But I don't really have a ton of
>> >> > experience with off-heap memory, so I thought I would ask what others
>> >> > think of the idea, whether it makes sense, and whether there are any
>> >> > gotchas I should be aware of, etc.
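>> >> >
>> >> > (To make it concrete, the core of what I changed is basically just
>> >> > this -- a rough sketch, not the actual patch:)
>> >> >
>> >> >   import java.nio.ByteBuffer
>> >> >
>> >> >   // copy a partition's serialized bytes into off-heap memory, so
>> >> >   // the GC never has to scan or copy them again
>> >> >   def toOffHeap(serialized: Array[Byte]): ByteBuffer = {
>> >> >     val buf = ByteBuffer.allocateDirect(serialized.length)
>> >> >     buf.put(serialized)
>> >> >     buf.flip()
>> >> >     buf
>> >> >   }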
>> >> >
>> >> > thanks,
>> >> > Imran
>> >> >
>> >>
>>
