spark-dev mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Spark SQL - Long running job
Date Mon, 23 Feb 2015 13:24:24 GMT
I meant using saveAsParquetFile. As for the partition number, you can
always control it with the spark.sql.shuffle.partitions property.
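For illustration, here is a rough sketch against the Spark 1.2 API (the
table names, the HDFS path, and the Record case class are made up; the
property only sets the partition count of shuffled results, such as the
aggregation below):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(id: Int, value: String)

    val sc = new SparkContext(new SparkConf().setAppName("ParquetRoundTrip"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD

    // The GROUP BY below shuffles, so its result has exactly this many
    // partitions -- and hence this many Parquet part files when saved.
    sqlContext.setConf("spark.sql.shuffle.partitions", "8")

    sc.parallelize(1 to 1000).map(i => Record(i, "v" + (i % 10)))
      .registerTempTable("raw")

    val processedSchemaRdd =
      sqlContext.sql("SELECT value, COUNT(*) AS n FROM raw GROUP BY value")

    processedSchemaRdd.saveAsParquetFile("hdfs:///tmp/processed.parquet")

    // After a driver restart: reload with the schema recovered from the
    // Parquet file metadata; no RDD lineage is needed.
    val reloaded = sqlContext.parquetFile("hdfs:///tmp/processed.parquet")
    reloaded.registerTempTable("processed")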

Cheng

On 2/23/15 1:38 PM, nitin wrote:

> I believe calling processedSchemaRdd.persist(DISK) and
> processedSchemaRdd.checkpoint() only persists the data; I will lose all the
> RDD metadata, so when I restart my driver that data is essentially useless
> for me (correct me if I am wrong).
>
> I thought of doing processedSchemaRdd.saveAsParquetFile (on HDFS), but I
> fear that if my HDFS block size is larger than my partition file size, I
> will get more partitions when reading back than the original schemaRdd had.