spark-user mailing list archives

From "Taotao.Li" <charles.up...@gmail.com>
Subject Re: Saving data frames on Spark Master/Driver
Date Fri, 15 Jul 2016 05:13:08 GMT
Hi, consider converting the DataFrame to an RDD and then using *rdd.toLocalIterator* to
pull the data to the driver node one partition at a time.
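
A minimal sketch of that approach in PySpark, assuming the goal is a single CSV file on the driver's local disk. The helper name `write_rows_to_csv`, the variable `joined_df`, and the output path are hypothetical and do not come from this thread. The helper accepts any iterable of rows, so the iterator returned by `rdd.toLocalIterator()` can be passed in directly. Driver memory then only needs to hold roughly the largest partition rather than the whole dataset.

```python
import csv
from typing import Iterable, Sequence


def write_rows_to_csv(rows: Iterable[Sequence], path: str, header: Sequence[str]) -> int:
    """Stream rows into a local CSV file and return the number of data rows written."""
    count = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)  # header line, like option("header", "true")
        for row in rows:
            # Each row is written as soon as it arrives; nothing is accumulated.
            writer.writerow(row)
            count += 1
    return count


# Hypothetical usage on the driver (assumes `joined_df` is the joined DataFrame):
#   selected = joined_df.select("col1", "col2")
#   write_rows_to_csv(selected.rdd.toLocalIterator(), "/tmp/joined.csv", selected.columns)
```

The trade-off is speed: partitions are fetched one after another, so this is typically slower than a distributed write, and the file ends up only on the driver's local filesystem.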

On Fri, Jul 15, 2016 at 9:05 AM, Pedro Rodriguez <ski.rodriguez@gmail.com>
wrote:

> Out of curiosity, is there a way to pull all the data back to the driver
> to save without collect()? That is, stream the data in chunks back to the
> driver so that the maximum memory used is comparable to a single node’s data, but
> all the data is saved on one node.
>
> —
> Pedro Rodriguez
> PhD Student in Large-Scale Machine Learning | CU Boulder
> Systems Oriented Data Scientist
> UC Berkeley AMPLab Alumni
>
> pedrorodriguez.io | 909-353-4423
> github.com/EntilZha | LinkedIn
> <https://www.linkedin.com/in/pedrorodriguezscience>
>
> On July 14, 2016 at 6:02:12 PM, Jacek Laskowski (jacek@japila.pl) wrote:
>
> Hi,
>
> Please reconsider, since this moves the entire distributed dataset to
> the single driver machine and may lead to an OutOfMemoryError (OOME).
> It is better practice to save your result to HDFS, S3, or any other
> distributed filesystem (one that is accessible by the driver and
> executors).
>
> If you insist...
>
> Use collect() after select() and work with Array[T].
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 15, 2016 at 12:15 AM, vr.n. nachiappan
> <nachiappan_vrn@yahoo.com.invalid> wrote:
> > Hello,
> >
> > I am using data frames to join two cassandra tables.
> >
> > Currently, when I invoke save on data frames as shown below, it saves the
> > join results on the executor nodes.
> >
> > joineddataframe.select(<col1>, <col2>
> > ...).write.format("com.databricks.spark.csv").option("header",
> > "true").save(<path>)
> >
> > I would like to persist the results of the join on the Spark Master/Driver
> > node. Is it possible to save the results on the Spark Master/Driver, and if
> > so, how?
> >
> > I appreciate your help.
> >
> > Nachi
> >
>


-- 
*___________________*
Quant | Engineer | Boy
*___________________*
*blog*:    http://litaotao.github.io
*github*: www.github.com/litaotao
