spark-user mailing list archives

From Sebastián Ramírez <>
Subject Re: pyspark sc.textFile uses only 4 out of 32 threads per node
Date Tue, 16 Dec 2014 22:25:01 GMT
Are you reading the file from your driver (main / master) program?

Is your file in a distributed file system like HDFS, available to all your nodes?

It might be due to the laziness of transformations:

"Transformations" are lazy and aren't applied until they are needed by an
"action" (and, in my experience, that happened for reads too some time ago).
You can try calling a .first() on your RDD once in a while to force it
to load the RDD onto your cluster (but it might not be the cleanest way to do
it).
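A minimal sketch of the idea above, assuming a live SparkContext `sc`; the input path and app name are placeholders, not from the original thread:

```python
# Hypothetical sketch: forcing early evaluation of a lazy textFile read.
# Assumes pyspark is installed and a cluster (or local mode) is available.
from pyspark import SparkContext

sc = SparkContext(appName="force-eval-sketch")

# sc.textFile is lazy: no data is read here, this only records the lineage.
lines = sc.textFile("hdfs:///data/*.gz")  # placeholder path

# Calling an action such as first() (or count()) forces Spark to actually
# start loading the data onto the cluster.
lines.first()

# Note (not from the original reply): gzip files are not splittable, so each
# .gz file becomes a single partition; repartition() can spread the loaded
# data across more cores for the downstream processing.
lines = lines.repartition(sc.defaultParallelism)
```

This only triggers evaluation; it does not by itself change how many threads the initial read uses.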

*Sebastián Ramírez*
Algorithm Designer

 Tel: (+571) 795 7950 ext: 1012
 Cel: (+57) 300 370 77 10
 Calle 73 No 7 - 06  Piso 4
 Twitter: @tiangolo <>

On Tue, Dec 9, 2014 at 1:59 PM, Gautham <> wrote:
> I am having an issue with pyspark launched in ec2 (using spark-ec2) with 5
> r3.4xlarge machines where each has 32 threads and 240GB of RAM. When I do
> sc.textFile to load data from a number of gz files, it does not progress as
> fast as expected. When I log in to a child node and run top, I see only 4
> threads at 100% CPU; the remaining 28 cores are idle. This is not an issue
> when processing the strings after loading, when all the cores are used to
> process the data.
> Can anyone help me with this? What setting can be changed to bring CPU usage
> back up to full?

