spark-user mailing list archives

From Marcelo Vanzin <van...@cloudera.com>
Subject Re: Spark using HDFS data [newb]
Date Fri, 24 Oct 2014 01:19:00 GMT
Your assessment is mostly correct. I think the only thing I'd reword is
the comment about splitting the data, since Spark itself doesn't do
that, but read on.

On Thu, Oct 23, 2014 at 6:12 PM, matan <dev.matan@gmail.com> wrote:
> In case I nailed it, how then does it handle a distributed hdfs file? does
> it pull all of the file to/through one Spark server

Well, Spark here just piggybacks on what HDFS already gives you, since
it's a distributed file system. In HDFS, files are broken into blocks
and each block is stored on one or more machines. Spark uses Hadoop
classes that understand this and give you the information about where
those blocks are.
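
To make the block-splitting concrete, here is a small illustrative sketch (not the actual HDFS API) of how a file's byte range is carved into fixed-size blocks. The block size is an assumption: commonly 128 MB in modern Hadoop, 64 MB in older versions.

```python
# Illustrative only: how HDFS conceptually divides a file into blocks.
# Each resulting block is then stored (replicated) on one or more machines.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed default block size, 128 MB

def block_ranges(file_size, block_size=BLOCK_SIZE):
    """Return the (start, end) byte ranges of the blocks for a file."""
    return [(start, min(start + block_size, file_size))
            for start in range(0, file_size, block_size)]

# A 300 MB file occupies three blocks: two full ones and one partial.
ranges = block_ranges(300 * 1024 * 1024)
print(len(ranges))  # 3
```

The Hadoop input-format classes Spark relies on expose, for each such block, the list of hosts that hold a replica.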

If there are Spark executors on those machines holding the blocks,
Spark will try to run tasks on those executors. Otherwise, it will
assign some other executor to do the computation, and that executor
will pull that particular block from HDFS over the network.
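
A hypothetical sketch of that scheduling idea, assuming we just have a mapping from hosts to executors (the names `NODE_LOCAL` and `ANY` mirror Spark's locality-level terminology, but this is not Spark's actual scheduler code):

```python
# Hypothetical sketch: prefer an executor co-located with a block replica;
# otherwise fall back to any executor, which must fetch the block remotely.
def assign_executor(block_hosts, executors_by_host):
    """block_hosts: hosts holding replicas of the block.
    executors_by_host: mapping of host -> executor id."""
    for host in block_hosts:
        if host in executors_by_host:
            # A replica lives on this host, so run the task there.
            return executors_by_host[host], "NODE_LOCAL"
    # No executor shares a host with a replica; pick any executor,
    # which will read the block from HDFS over the network.
    fallback = next(iter(executors_by_host.values()))
    return fallback, "ANY"

executors = {"host1": "exec-1", "host3": "exec-3"}
print(assign_executor(["host1", "host2"], executors))  # ('exec-1', 'NODE_LOCAL')
print(assign_executor(["host4"], executors))
```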

It can be a lot more complicated than that (since each file format may
have different ways of partitioning data, or you can create your own
way, or repartition data, or Spark may give up waiting for the right
executor, or...), but that's a good starting point.

-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

