spark-user mailing list archives

From Larry White
Subject Using spark to distribute jobs to standalone servers
Date Mon, 22 Aug 2016 14:59:41 GMT

I have a bit of an unusual use-case and would *greatly* *appreciate* some
feedback as to whether it is a good fit for spark.

I have a network of compute/data servers configured as a tree, as shown below:

   - controller
      - server 1
      - server 2
      - server 3
      - etc.

There are ~20 servers, but the number is increasing to ~100.

Each server contains a different dataset, all in the same format. Each is
hosted by a different organization, and the data on every individual server
is unique to that organization.

Data *cannot* be replicated across servers using RDDs or any other means,
for privacy/ownership reasons.

Data *cannot* be retrieved to the controller, except in aggregate form, as
the result of a query, for example.

Because of this, there are currently no operations that treat the data as
if it were a single data set: we could run a classifier on each site
individually, but cannot, for legal reasons, pull all the data into a single
*physical* dataframe to run the classifier on all of it together.

The servers are located across a wide geographic region (1,000s of miles).

We would like to send jobs from the controller to be executed in parallel
on all the servers, and retrieve the results to the controller. The jobs
would consist of SQL-heavy Java code for 'production' queries, and Python
or R code for ad-hoc queries and predictive modeling.
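
As a rough illustration of the pattern described above, here is a minimal
sketch in plain Python (not Spark). It assumes each server can run a function
locally and return only aggregates. The names SITE_DATA, run_on_site and
controller_mean are hypothetical, and the in-memory dictionary merely stands
in for data that would really stay on each organization's server.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-organization datasets. In the real
# system these rows would live on separate servers and never leave them.
SITE_DATA = {
    "server 1": [4.0, 6.0],
    "server 2": [10.0],
    "server 3": [1.0, 2.0, 3.0],
}


def run_on_site(site):
    """Conceptually runs on the server: reads raw rows, returns aggregates only."""
    rows = SITE_DATA[site]
    return {"site": site, "count": len(rows), "total": sum(rows)}


def controller_mean(sites):
    """Conceptually runs on the controller: fans out the job, combines aggregates."""
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        partials = list(pool.map(run_on_site, sites))
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    # The global mean is derived from per-site counts and totals alone.
    return total / count, partials


mean, partials = controller_mean(list(SITE_DATA))
print(mean)
```

The controller only ever sees a count and a total per site, which is the
"aggregate form" constraint; anything that needs row-level access across
sites (such as training one classifier on pooled data) does not fit this
shape.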

Spark seems to have the capability to meet many of the individual
requirements, but is it a reasonable platform overall for building this?

Thank you very much for your assistance.
