spark-user mailing list archives

From Ashok Kumar
Subject Clarification on RDD
Date Fri, 26 Feb 2016 17:40:48 GMT
The Spark documentation says:
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed
Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming
other RDDs.
val textFile = sc.textFile("")

My question is: when an RDD is created as above from a file stored on HDFS, is the data
distributed among all the nodes in the cluster, or is the data from the md file copied
to each node of the cluster so that each node has a complete copy? And is the data actually
moved at that point, or is nothing copied until an action such as count() is performed on the RDD?

