Hi everyone,
I am very new to Spark, so as a learning exercise I've set up a small cluster of 4 EC2 m1.large instances (1 master, 3 slaves), which I'm hoping to use to calculate ngram frequencies from text files of various sizes (I'm not doing anything with the ngrams themselves; I just thought this would be slightly more interesting than the usual 'word count' example).  Currently I'm trying to process a 1 GB text file, but I'm running into memory issues.  I'm wondering what parameters I should set (in spark-env.sh) to properly utilize the cluster.  Right now I'd be happy just to have the process complete successfully with the 1 GB file, so I'd really appreciate any suggestions you all might have.

Here's a summary of the code I'm running through the spark shell on the master:

// Produce the n-grams of a string (default: trigrams), joined with spaces.
// Note the leading s.trim: without it, a line starting with whitespace
// splits into an empty first token, which would pollute the ngrams.
def ngrams(s: String, n: Int = 3): Seq[String] = {
  s.trim.split("\\s+").sliding(n).filter(_.length == n).map(_.mkString(" ")).toList
}
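
For reference, here's what the function produces in the shell on a small example:

scala> ngrams("the quick brown fox jumps")
res0: Seq[String] = List(the quick brown, quick brown fox, brown fox jumps)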

val text = sc.textFile("s3n://my-bucket/my-1gb-text-file")

val mapped = text.filter(_.trim.length > 0).flatMap(ngrams(_, 3))

So far so good; the problems come during the reduce phase.  With small files, I was able to issue the following to calculate the most frequently occurring trigram:

val topNgram = mapped.countByValue().reduce((a: (String, Long), b: (String, Long)) => if (a._2 > b._2) a else b)

With the 1 GB file, though, I've been running into OutOfMemory errors, so I decided to split the reduction into several steps, starting with simply calling countByValue on my "mapped" RDD, but I have yet to get even that to complete successfully.  (My understanding is that countByValue returns a map of every distinct trigram and its count to the driver, so with this many distinct trigrams that step alone may be the problem.)
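
In case it helps clarify what I'm after, here's the rewrite I've been considering (a sketch only, untested at this scale), which keeps the counting distributed via reduceByKey instead of collecting every distinct trigram's count to the driver:

// Count each trigram across the cluster; reduceByKey combines counts
// within each partition before shuffling, so the full table of counts
// never has to fit on the driver.
val counts = mapped.map(ngram => (ngram, 1L)).reduceByKey(_ + _)

// Then pick the trigram with the highest count, still as a distributed reduce.
val topNgram = counts.reduce((a, b) => if (a._2 > b._2) a else b)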

SPARK_MEM is currently set to 6154m.  I also bumped the spark.akka.frameSize setting up to 500 (though at that point I was grasping at straws; I'm not sure what a "proper" value would be).  What properties should I be setting for a job of this size on a cluster of 3 m1.large slaves?  (The cluster was initially configured using the spark-ec2 scripts.)  Also, programmatically, what should I be doing differently?  For example, should I be setting the minimum number of splits when reading the text file, and if so, what would be a good default?
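
To make that last question concrete, here's what I have in mind (if I'm reading the API right, textFile takes an optional minSplits argument; the 64 below is an arbitrary guess on my part, not a recommendation):

// Hypothetical: request more input splits up front so each task
// processes a smaller chunk of the file.
val text = sc.textFile("s3n://my-bucket/my-1gb-text-file", 64)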

I apologize for what I'm sure are very naive questions.  I think Spark is a fantastic project and have enjoyed working with it, but I'm still very much a newbie and would appreciate any help you all can provide (as well as any 'rules-of-thumb' or best practices I should be following).

Thanks,
Tim Perrigo