spark-user mailing list archives

From Andre Bois-Crettez <andre.b...@kelkoo.com>
Subject Re: Spark running slow for small hadoop files of 10 mb size
Date Tue, 22 Apr 2014 10:07:04 GMT

By default, the data is partitioned *according to the number of
HDFS blocks* of the source.
You can change the partitioning with .repartition, either to increase or
decrease the level of parallelism:

val recordsRDD =
  SparkContext.sequenceFile[NullWritable,BytesWritable](FilePath,256)
// spread the records over 4 nodes * 32 cores
val recordsRDDInParallel = recordsRDD.repartition(4*32)
val infoRDD = recordsRDDInParallel.map(f => info_func())
val hdfs_RDD = infoRDD.reduceByKey(_+_,48) /* makes 48 partitions */
hdfs_RDD.saveAsNewAPIHadoopFile
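
To see the effect of .repartition on the partition count, a minimal sketch along these lines may help. It is not the job from this thread: the object name RepartitionSketch, the helper partitionCounts, the local[4] master and the parallelize call standing in for a one-block file are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed on Spark versions before 1.3

object RepartitionSketch {
  // Returns the partition counts (as read, after repartition, after reduceByKey).
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    // Hypothetical stand-in for a small file that fits in one HDFS block:
    // everything lands in a single partition, so a single core does the work.
    val single = sc.parallelize(1 to 1000, numSlices = 1)

    // Shuffle the records into 8 partitions so later stages can run in parallel.
    val spread = single.repartition(8)

    // The second argument of reduceByKey sets the partition count of the result.
    val counts = spread.map(x => (x % 10, 1)).reduceByKey(_ + _, 4)

    (single.partitions.length, spread.partitions.length, counts.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // local[4] is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("repartition-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc)) // prints (1,8,4)
    finally sc.stop()
  }
}
```

Note that .repartition always triggers a shuffle, so it pays off mainly when the per-record work (info_func here) costs more than moving the data.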



André
On 2014-04-21 13:21, neeravsalaria wrote:
> Hi,
>
>    I have been using MapReduce to analyze multiple files whose size can range
> from 10 MB to 200 MB per file. Recently I planned to move to Spark, but my
> Spark job takes too much time processing a single file when the file
> size is 10 MB and the HDFS block size is 64 MB. It executes on a single
> datanode and on a single core (my cluster is a 4-node setup, each node having
> 32 cores). Each file has 3 million rows and I have to analyze each
> row (ignoring none) and create a set of info from it.
>
> Is there a way I can parallelize the processing of the file, either
> on other nodes or using the remaining cores of the same node?
>
>
>
> demo code :
>
>       val recordsRDD =
> SparkContext.sequenceFile[NullWritable,BytesWritable](FilePath,256) /*to
> parallelize */
>
>       infoRDD = recordsRDD.map(f => info_func())
>
>       hdfs_RDD = infoRDD.reduceByKey(_+_,48)  /* makes 48 partitions */
>
>      hdfs_RDD.saveAsNewAPIHadoopFile
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-running-slow-for-small-hadoop-files-of-10-mb-size-tp4526.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


--
André Bois-Crettez

Software Architect
Big Data Developer
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Share capital: € 4.168.964,30
Registered office: 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended solely
for their addressees. If you are not the intended recipient of this message, please destroy it
and notify the sender.
