spark-user mailing list archives

From Marcin Tustin <marcin.tus...@bluevoyant.com.INVALID>
Subject Re: Collecting large dataset
Date Thu, 05 Sep 2019 18:27:07 GMT
Stop using collect for this purpose. Either continue your further
processing in Spark (maybe you need to use streaming), or sink the data to
something that can accept it (GCS/S3/Azure
storage/Redshift/Elasticsearch/whatever), and have the further processing
read from that sink.
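
A minimal sketch of the sink-then-read pattern, in case it helps. The
function names and the sink directory are made up for illustration, and I am
assuming `df` is a PySpark DataFrame and that the sink path is on storage
both the executors and the consuming application can reach. CSV is used here
only so the reading side needs nothing beyond the standard library; Parquet
or one of the stores above would be the more usual choice.

```python
import csv
import glob
import os
from typing import Dict, Iterator


def write_to_sink(df, sink_dir: str) -> None:
    """Write the DataFrame to the sink instead of collecting it.

    Assumes `df` is a pyspark.sql.DataFrame. Each executor writes its own
    partitions as part-*.csv files, so no result set passes through the
    driver's block manager.
    """
    df.write.mode("overwrite").option("header", True).csv(sink_dir)


def iter_rows_from_sink(sink_dir: str) -> Iterator[Dict[str, str]]:
    """Yield rows one at a time from the part files Spark wrote.

    Only one row is held in memory at a time, so the consuming application
    does not need heap proportional to the dataset size.
    """
    # Spark's marker files (such as _SUCCESS) do not match this pattern.
    for part in sorted(glob.glob(os.path.join(sink_dir, "part-*.csv"))):
        with open(part, newline="") as handle:
            # Each part file carries its own header row (header=True above).
            yield from csv.DictReader(handle)
```

The application would call write_to_sink(df, some_dir) once, then loop over
iter_rows_from_sink(some_dir) wherever it previously used the collected
rows. All values come back as strings with this CSV approach, so any typed
columns would need converting on the reading side.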

On Thu, Sep 5, 2019 at 2:23 PM Rishikesh Gawade <rishikeshg1996@gmail.com>
wrote:

> Hi.
> I have been trying to collect a large dataset (about 2 GB in size, 30
> columns, more than a million rows) onto the driver. I am aware that
> collecting such a large dataset is not recommended; however, the application
> within which the Spark driver is running requires that data.
> While collecting the DataFrame, the Spark job fails with the error
> TaskResultLost (result lost from block manager).
> I searched for solutions and set the following properties:
> spark.blockManager.port, spark.driver.blockManager.port, and
> spark.driver.maxResultSize to 0 (unlimited). The application within which
> the Spark driver is running has a maximum heap size of 28 GB.
> And yet the error arises again.
> There are 22 executors running in my cluster.
> Is there any config or necessary step that I am missing before collecting
> such large data?
> Or is there any other effective approach that would guarantee collecting
> such large data without failure?
>
> Thanks,
> Rishikesh
>
