spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: partitioning of small data sets
Date Tue, 15 Apr 2014 16:53:51 GMT
Take a look at the minSplits argument for SparkContext#textFile [1] -- the
default value is 2 (more precisely, min(defaultParallelism, 2)). You can
simply set this to 1 if you'd prefer not to split your data.

[1]
http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
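
To illustrate why a single-block file still ends up in two partitions, here
is a rough sketch of the split arithmetic as I understand it from Hadoop's
old-API FileInputFormat.getSplits, which textFile relies on. It is a
simplified model, not the actual Hadoop or Spark code: compute_splits is a
hypothetical helper, the 64 MB block size is an assumed HDFS default, the
2662-byte file size is a guess at "2.6kb", and it handles only a single
uncompressed file.

```python
# Simplified model of Hadoop's old-API FileInputFormat.getSplits for ONE
# uncompressed file. compute_splits is a hypothetical name, not a Spark API.

SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def compute_splits(file_size, min_splits, block_size=64 * 1024 * 1024, min_size=1):
    """Return a list of (offset, length) pairs, one per input split."""
    # minSplits is only a hint: it sets the target size of each split.
    goal_size = file_size // max(min_splits, 1)
    # A split never exceeds one block and never drops below min_size.
    split_size = max(min_size, min(goal_size, block_size))

    splits = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_size - remaining, split_size))
        remaining -= split_size
    if remaining > 0:
        splits.append((file_size - remaining, remaining))
    return splits


file_size = 2662  # assumed byte count for the 2.6 KB file
print(len(compute_splits(file_size, min_splits=2)))  # 2
print(len(compute_splits(file_size, min_splits=1)))  # 1
```

With minSplits = 2 the target split size is half the file, far below the
block size, so the file is cut in two even though it occupies one HDFS
block. With minSplits = 1 the target is the whole file and you get a single
partition.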


On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarroll@cloudera.com> wrote:

> I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB.
>
> Given the size, and that it is a single file, I assumed it would only be
> in a single partition.  But when I cache it,  I can see in the Spark App UI
> that it actually splits it into two partitions:
>
> [image: Inline image 1]
>
> Is this correct behavior?  How does Spark decide how big a partition
> should be, or how many partitions to create for an RDD?
>
> If it matters, I have only a single worker in my "cluster", so both
> partitions are stored on the same worker.
>
> The file was on HDFS and was only a single block.
>
> Thanks for any insight.
>
> Diana
>
>
>
