spark-user mailing list archives

From swastik mittal <>
Subject Read Time from a remote data source
Date Tue, 18 Dec 2018 20:20:52 GMT

I am new to Spark. I am running an HDFS file system on a remote cluster,
while my Spark workers are on a different cluster. When my textFile RDD is
executed, do the Spark workers read the file task by task, according to its
HDFS partitions, or do they read it once, when the BlockManager is set up
after the first task starts, and distribute it across the memory of the
Spark cluster?

I ask because, when only one worker executes a job, the history server
shows a shorter run time per task than when two workers execute the same
job in parallel, even though the total duration is almost the same.

I am running a simple grep application with no shuffles within the cluster.
The text file is on the remote HDFS cluster and is 813 MB, split into 7
blocks: six of 128 MB and a last block holding the remaining 45 MB.
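
To make the block layout concrete, here is a minimal sketch of the
arithmetic in plain Python. It assumes the 128 MB block size stated above;
the helper name hdfs_block_sizes is hypothetical and is not part of Spark
or HDFS. It only reproduces the split of the file into blocks and says
nothing about how Spark schedules the reads.

```python
def hdfs_block_sizes(file_size_mb, block_size_mb=128):
    """Return the size in MB of each block of a file (hypothetical helper)."""
    # Whole blocks first, then whatever is left over goes in a final block.
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


sizes = hdfs_block_sizes(813)
print(len(sizes), sizes)  # 7 [128, 128, 128, 128, 128, 128, 45]
```

If each block becomes one task, which is an assumption here, the last task
would handle roughly a third of the data of the others.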

