I have been trying to collect a large dataset (about 2 GB in size, 30 columns, more than a million rows) to the driver. I am aware that collecting such a large dataset is not recommended; however, the application in which the Spark driver runs requires that data.
While collecting the DataFrame, the Spark job fails with the error TaskResultLost (result lost from block manager).
I searched for solutions to this and set the following properties:
spark.blockManager.port, spark.driver.blockManager.port, and spark.driver.maxResultSize set to 0 (unlimited). The application in which the Spark driver runs has a maximum heap size of 28 GB.
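As a rough sketch, the settings described above could be written out as follows, assuming that maxResultSize refers to spark.driver.maxResultSize. The port numbers are placeholder values, not the ones from the actual setup, and the helper to_submit_args is hypothetical; it only shows how the properties would look as --conf flags.

```python
# Sketch of the properties listed above. Port values are placeholders.
spark_conf = {
    "spark.driver.maxResultSize": "0",          # 0 removes the limit on collected results
    "spark.blockManager.port": "40000",         # placeholder port
    "spark.driver.blockManager.port": "40001",  # placeholder port
}


def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


print(" ".join(to_submit_args(spark_conf)))
```

The 28 GB heap is configured on the JVM of the host application rather than through a Spark property, so it is not part of this sketch.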
Even with these settings, the error still occurs.
There are 22 executors running in my cluster.
Is there any configuration or necessary step that I am missing before collecting this much data?
Or is there another, more effective approach that would reliably collect this much data without failure?