spark-user mailing list archives

From Tang Jinxin <xiaoxingst...@gmail.com>
Subject Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?
Date Wed, 22 Apr 2020 13:15:11 GMT
Maybe you could try something like foreachPartition (inside foreachRDD, if you are streaming), which does not gather the data to the driver and so avoids the extra memory consumption.

xiaoxingstack
Email: xiaoxingstack@gmail.com (signature customized by NetEase Mail Master)
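For concreteness, a rough sketch of that suggestion, assuming the PySpark API (since the downstream consumer here is the Python TensorFlow API), pyarrow available on the executors, and a hypothetical TensorFlow-side socket endpoint. Each executor serializes its own partition and pushes it out, so nothing is collected on the driver:

import socket

import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("push-partitions").getOrCreate()
df = spark.range(1000000).selectExpr("id", "id * 2 AS feature")

def send_partition(rows):
    # Only this partition's rows are materialized, and only in executor memory.
    records = [r.asDict() for r in rows]
    if not records:
        return
    table = pa.Table.from_pydict(
        {name: [rec[name] for rec in records] for name in records[0]})
    # "tf-host.example.com:9099" is a placeholder for wherever TensorFlow listens.
    with socket.create_connection(("tf-host.example.com", 9099)) as sock:
        sink = sock.makefile("wb")
        writer = pa.ipc.new_stream(sink, table.schema)
        writer.write_table(table)
        writer.close()
        sink.flush()

df.foreachPartition(send_partition)

Whether this beats collecting depends on how the TensorFlow side wants to receive the data; the point is only that the serialization and the send happen where the partitions already live.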
On 04/22/2020 21:02, Andrew Melo wrote:

Hi Maqy

On Wed, Apr 22, 2020 at 3:24 AM maqy <454618260@qq.com> wrote:
>
> I will traverse this Dataset to convert it to Arrow and send it to Tensorflow through Socket.

(I presume you're using the python tensorflow API; if you're not, please ignore.) There is a JIRA/PR ([1] [2]) which proposes to add native support for Arrow serialization to python. Under the hood, Spark is already serializing into Arrow format to transmit to python; it's just additionally doing an unconditional conversion to pandas once it reaches the python runner -- which is good if you're using pandas, not so great if pandas isn't what you operate on. The JIRA above would let you receive the arrow buffers (that already exist) directly.

Cheers,
Andrew

[1] https://issues.apache.org/jira/browse/SPARK-30153
[2] https://github.com/apache/spark/pull/26783
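To make that round trip concrete, a tiny sketch of the status quo, assuming PySpark and the Spark 2.4-era setting spark.sql.execution.arrow.enabled (the names below are placeholders): Arrow batches already cross out of the JVM, but toPandas() always lands in pandas, so getting Arrow back currently costs a second conversion.

import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-status-quo").getOrCreate()
df = spark.range(1000).selectExpr("id", "id * 2 AS feature")

spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # Arrow-based transfer out of the JVM
pdf = df.toPandas()                # ...followed by an unconditional conversion to pandas
table = pa.Table.from_pandas(pdf)  # back to Arrow again -- the extra hop SPARK-30153 would remove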
>
> I tried to use toLocalIterator() to traverse the dataset instead of collect to the driver, but toLocalIterator() will create a lot of jobs and will bring a lot of time consumption.
>
> Best regards,
> maqy
>
> From: Michael Artz
> Sent: April 22, 2020 16:09
> To: maqy
> Cc: user@spark.apache.org
> Subject: Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?
>
> What would you do with it once you get it into driver in a Dataset[Row]?
>
> Sent from my iPhone
>
> > On Apr 22, 2020, at 3:06 AM, maqy <454618260@qq.com> wrote:
> >
> > When the data is stored in the Dataset[Row] format, the memory usage is very small.
> > When I use collect() to collect data to the driver, each line of the dataset will be converted to a Row and stored in an array, which will bring great memory overhead.
> > So, can I collect Dataset[Row] to driver and keep its data format?
> >
> > Best regards,
> > maqy
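For reference, a small sketch (again assuming PySpark) of the trade-off discussed in this thread: collect() materializes every row as a driver-side object, while toLocalIterator() fetches one partition at a time, which keeps driver memory bounded but schedules roughly one job per partition.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-vs-iterator").getOrCreate()
df = spark.range(1000000).selectExpr("id", "id % 10 AS bucket")

rows = df.collect()  # every row becomes a Row object on the driver: large memory footprint
total = 0
for row in df.toLocalIterator():  # streamed partition by partition, bounded driver memory
    total += row.bucket
print(total)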
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org