spark-dev mailing list archives

From Holden Karau <hol...@pigscanfly.ca>
Subject Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?
Date Fri, 05 Aug 2016 21:22:44 GMT
I don't think there is an approximate timescale right now, and it's likely
any implementation would depend on a solid Java implementation of Arrow
being ready first. There isn't even a guarantee that it necessarily will
happen, although I'm interested in making it happen in some places where it
makes sense.

On Fri, Aug 5, 2016 at 2:18 PM, Jim Pivarski <jpivarski@gmail.com> wrote:

> I see. I've already started working with Arrow-C++ and talking to members
> of the Arrow community, so I'll keep doing that.
>
> As a follow-up question, is there an approximate timescale for when Spark
> will support Arrow? I'd just like to know that all the pieces will come
> together eventually.
>
> (In this forum, most of the discussion about Arrow is about PySpark and
> Pandas, not Spark in general.)
>
> Best,
> Jim
>
> On Aug 5, 2016 2:43 PM, "Holden Karau" <holden@pigscanfly.ca> wrote:
>
>> Spark does not currently support Apache Arrow. A good place to chat would
>> probably be the Arrow mailing list, where they are making progress towards
>> unified JVM & Python/R support, which is sort of a precondition for a
>> functioning Arrow interface between Spark and Python.
>>
>> On Fri, Aug 5, 2016 at 12:40 PM, jpivarski@gmail.com <jpivarski@gmail.com> wrote:
>>
>>> In a few earlier posts [1
>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-td13898.html>]
>>> [2
>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-td17701.html>],
>>> I asked about moving data from C++ into a Spark data source (RDD,
>>> DataFrame, or Dataset). The issue is that even the off-heap cache might not
>>> have a stable representation: it might change from one version to the next.
>>>
>>> I recently learned about Apache Arrow, a data layer that Spark currently
>>> shares, or will someday share, with Pandas, Impala, etc. Suppose that I
>>> can fill a buffer (such as a direct ByteBuffer) with Arrow-formatted data.
>>> Is there an easy, or even zero-copy, way to use that in Spark? Is that an
>>> API that could be developed?
>>>
>>> I'll be at the KDD Spark 2.0 tutorial on August 15. Is that a good place to
>>> ask this question?
>>>
>>> Thanks,
>>> -- Jim
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Arrow-data-in-buffer-to-RDD-DataFrame-Dataset-tp18563.html
>>> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
