spark-user mailing list archives

From matan <>
Subject Re: Spark using non-HDFS data on a distributed file system cluster
Date Fri, 24 Oct 2014 21:35:47 GMT
Thanks Marcelo,

Let me spin this in a parallel direction then, as the title change
implies. I will read further into some of the articles, but basically, I
understand Spark keeps the data in memory and only pulls from HDFS, at
most writing the final output of a job back to it, rather than depositing
intermediate step outputs to HDFS files like Hadoop MapReduce would
typically do (?).
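
For concreteness, this is roughly the pipeline shape I have in mind (a
minimal sketch; the paths and app name are made up):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: only textFile and saveAsTextFile touch HDFS; the
// intermediate RDDs (words, counts) stay in memory (spilling to local
// disk for the shuffle), and are never written back to HDFS.
object InMemoryPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("in-memory-pipeline"))
    val lines  = sc.textFile("hdfs:///data/input.txt")   // single read from HDFS
    val words  = lines.flatMap(_.split("\\s+"))          // in-memory transformation
    val counts = words.map((_, 1)).reduceByKey(_ + _)    // in-memory transformation
    counts.saveAsTextFile("hdfs:///data/output")         // single final write
    sc.stop()
  }
}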

Just a small parting question then - does it also work with the *Gluster*
or *Ceph* distributed file systems, not just HDFS? Reading some of the
documentation, I think that if my Scala code can read a file from either of
those, and I have a Spark standalone cluster, or one managed by Mesos, then
I am bound neither to HDFS nor to Hadoop.
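
For illustration, this is what I imagine (a sketch, assuming a
hypothetical mount point /mnt/glusterfs that resolves to the same volume
on every worker):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: reading straight off a POSIX mount instead of HDFS. The file://
// path must exist identically on every worker node, e.g. a GlusterFS or
// CephFS mount; /mnt/glusterfs is a made-up example.
val sc = new SparkContext(new SparkConf().setAppName("posix-read"))
val events = sc.textFile("file:///mnt/glusterfs/events.log")
println(events.count())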

I assume that architecture would consume a lot of bandwidth pulling the
inputs from the file system, without leveraging the placement of its
workers/executors alongside the data nodes as it does with Hadoop/HDFS
(... thus acting as a compute cluster that takes very long to bootstrap an
application). Perhaps, however, Hadoop's ability to work over GlusterFS or
CephFS would provide that data-locality benefit after all? Or is it bound
specifically to Hadoop's HDFS API for performing a local data pull on
the storage cluster machines?
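
To make the question concrete, I understand the locality information
surfaces through the Hadoop FileSystem API, along these lines (a sketch;
the path is hypothetical, and whether non-HDFS connectors report
meaningful hosts here is exactly what I'm asking):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: block locations as reported by the Hadoop FileSystem API. For
// hdfs:// these are the datanodes holding each block; I don't know what a
// Gluster/Ceph-backed FileSystem implementation would return here.
val fs     = FileSystem.get(new Configuration())
val status = fs.getFileStatus(new Path("/data/input.txt"))
val blocks = fs.getFileBlockLocations(status, 0L, status.getLen)
blocks.foreach(b => println(b.getOffset + " -> " + b.getHosts.mkString(", ")))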


On Fri, Oct 24, 2014 at 4:19 AM, Marcelo Vanzin wrote:

> Your assessment is mostly correct. I think the only thing I'd reword is
> the comment about splitting the data, since Spark itself doesn't do
> that, but read on.
> On Thu, Oct 23, 2014 at 6:12 PM, matan <[hidden email]> wrote:
> > In case I nailed it, how then does it handle a distributed hdfs file? Does
> > it pull all of the file to/through one Spark server?
> Well, Spark here just piggybacks on what HDFS already gives you, since
> it's a distributed file system. In HDFS, files are broken into blocks
> and each block is stored on one or more machines. Spark uses Hadoop
> classes that understand this and give you the information about where
> those blocks are.
> If there are Spark executors on those machines holding the blocks,
> Spark will try to run tasks on those executors. Otherwise, it will
> assign some other executor to do the computation, and that executor
> will pull that particular block from HDFS over the network.
> It can be a lot more complicated than that (since each file format may
> have different ways of partitioning data, or you can create your own
> way, or repartition data, or Spark may give up waiting for the right
> executor, or...), but that's a good first approximation.
> --
> Marcelo