spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: spark is running extremely slow with larger data set, like 2G
Date Fri, 24 Oct 2014 07:22:52 GMT
Try passing an explicit level of parallelism (the numPartitions argument) to
your reduceByKey operation.
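
For example, something like this (a sketch based on your snippet below; the
partition count of 100 is only an illustrative starting point, tune it to
the number of cores in your cluster):

records = file.flatMap(lambda line: Undirect(line)) \
              .reduceByKey(lambda a, b: a + "\t" + b, 100)

Without it, reduceByKey inherits the partition count of the input, which
can leave too few (and too large) partitions for a 2G shuffle.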

Thanks
Best Regards

On Fri, Oct 24, 2014 at 3:44 AM, xuhongnever <xuhongnever@gmail.com> wrote:

> my code is here:
>
> from pyspark import SparkConf, SparkContext
>
> def Undirect(edge):
>     # Emit a (src, dst) pair only when the first field is a numeric node
>     # id; all other lines (e.g. comment lines) are dropped.
>     vector = edge.strip().split('\t')
>     if vector[0].isdigit():
>         return [(vector[0], vector[1])]
>     return []
>
>
> conf = SparkConf()
> conf.setMaster("spark://compute-0-14:7077")
> conf.setAppName("adjacencylist")
> conf.set("spark.executor.memory", "1g")
>
> sc = SparkContext(conf = conf)
>
> file = sc.textFile("file:///home/xzhang/data/soc-LiveJournal1.txt")
> records = file.flatMap(lambda line: Undirect(line)).reduceByKey(lambda a, b: a + "\t" + b)
> #print(records.count())
> #records = records.sortByKey()
> records = records.map(lambda line: line[0] + "\t" + line[1])
> records.saveAsTextFile("file:///home/xzhang/data/result")
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-is-running-extremely-slow-with-larger-data-set-like-2G-tp17152p17153.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
