spark-user mailing list archives

From nsareen <>
Subject Spark Concepts
Date Wed, 15 Oct 2014 08:39:13 GMT
Hi,

I'm pretty new to both Big Data and Spark. I've just started POC work on Spark, and my team and I are evaluating it against other in-memory computing tools such as GridGain, BigMemory, Aerospike and some others, specifically to solve two sets of problems.

1) Data storage: Our current application runs on a single node with a heavy configuration of 24 cores and 350 GB of memory. The application loads all the datamart data, including multiple cubes, into memory, converts it, and keeps it in a Trove collection in the form of a key/value map. This immutable collection takes about 15-20 GB of memory. We anticipate the data will grow 10-15 fold over the next year or so, and we are not very confident that Trove can scale to that level.

2) Compute: Ours is a natively analytical
application doing predictive analytics, with lots of simulations and optimizations of scenarios. At the heart of all this are the Trove collections, over which we run our mathematical algorithms to calculate the end result; in doing so, the application's memory consumption goes beyond 250-300 GB. This is because of the many intermediate computing results (collections), which are further broken down to a granular level and then searched in the Trove collection. All of this happens on a single node, which naturally starts to perform slowly over time. Given the large volume of data coming in over the next year or so, our current architecture will not be able to handle such a massive in-memory data set or such computing demands. We are therefore targeting a move to a cluster-based, in-memory, distributed-computing architecture, and we are evaluating all of these products along with Apache Spark. We were very excited by Apache Spark after watching the videos and reading some online resources, but when it came down to
doing hands-on work, we ran into a lot of issues:

1) What are a standalone cluster's limitations? Can I configure a cluster on a single node, with multiple processes for worker nodes, executors, etc.? Is this supported even though the IP address would be the same?

2) Why so many Java processes? Why are there so many Java processes (worker nodes, executors)? Will the communication between them not slow down overall performance?

3) How is parallelism achieved on partitioned data? This one is really important for us to understand, since we are doing our benchmarking on partitioned data, and we do not know how to configure partitions in Spark. Any help here would be appreciated. We want to partition the data present in our cubes, with each cube as a separate partition.

4) What is the difference between multiple nodes executing jobs and multiple tasks executing jobs? How do these handle partitioning and parallelism?

Help with these questions would be much appreciated, and would give us a better sense of Apache Spark.

Thanks,
Nitin
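
On question 1, for what it's worth: Spark's standalone mode does support running an entire cluster on one machine, and `conf/spark-env.sh` can request several worker JVMs per host via `SPARK_WORKER_INSTANCES`. A minimal sketch for a single 24-core box (the core and memory splits below are illustrative assumptions, not recommendations):

```shell
# conf/spark-env.sh -- standalone mode on one 24-core node (illustrative values)
SPARK_WORKER_INSTANCES=4   # four worker JVMs on this single host
SPARK_WORKER_CORES=6       # each worker offers 6 of the 24 cores
SPARK_WORKER_MEMORY=64g    # memory each worker may hand out to executors
```

All of these processes share the same IP address and are distinguished by port, so a single-node cluster is a supported configuration.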

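On question 3: Spark splits an RDD into partitions and runs one task per partition, so parallelism comes from the partition count. For key/value RDDs, `partitionBy` with the default `HashPartitioner` sends a key to partition `hash(key) mod numPartitions`; a cube-per-partition layout would need a custom `Partitioner` that returns the cube's own index. A plain-Java sketch of that mapping logic (no Spark dependency; the cube names and `cubeIndex` table are made-up illustrations):

```java
import java.util.HashMap;
import java.util.Map;

public class CubePartitioning {
    // The logic of Spark's HashPartitioner: a non-negative
    // hash code modulo the number of partitions.
    static int hashPartition(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // A cube-per-partition scheme: every key belonging to the same
    // cube maps to that cube's dedicated partition index.
    static final Map<String, Integer> cubeIndex = new HashMap<>();
    static {
        cubeIndex.put("salesCube", 0);
        cubeIndex.put("inventoryCube", 1);
        cubeIndex.put("forecastCube", 2);
    }

    static int cubePartition(String cubeId) {
        return cubeIndex.get(cubeId);
    }

    public static void main(String[] args) {
        // Keys from the same cube land in the same partition, so the
        // task scheduled for that partition sees the whole cube.
        System.out.println(cubePartition("salesCube"));     // 0
        System.out.println(cubePartition("forecastCube"));  // 2
        System.out.println(hashPartition("someKey", 8));    // always in [0, 8)
    }
}
```

In Spark itself this mapping would live in a subclass of `org.apache.spark.Partitioner` (overriding `numPartitions` and `getPartition`), passed to `partitionBy` on the cube-keyed pair RDD.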