spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: Comparative study
Date Tue, 08 Jul 2014 20:16:59 GMT
>
> Not sure exactly what is happening but perhaps there are ways to
> restructure your program for it to work better. Spark is definitely able to
> handle much, much larger workloads.


+1 @Reynold

Spark can handle big "big data". There are known issues with informing the
user about what went wrong and how to fix it, and we're actively working on
them, but the first impulse when a job fails should be "what did I do
wrong?" rather than "Spark can't handle this workload." Messaging is a huge
part of making this clear -- things like a hanging job or an out-of-memory
error can be very difficult to debug, and improving this is one of our
highest priorities.


On Tue, Jul 8, 2014 at 12:47 PM, Reynold Xin <rxin@databricks.com> wrote:

> Not sure exactly what is happening but perhaps there are ways to
> restructure your program for it to work better. Spark is definitely able to
> handle much, much larger workloads.
>
> I've personally run a workload that shuffled 300 TB of data. I've also run
> something that shuffled 5 TB per node and stuffed my disks so full that the
> file system was close to breaking.
>
> We can definitely do a better job in Spark of emitting more meaningful
> diagnostics and being more robust with partitions of data that don't fit in
> memory, though. A lot of the work in the next few releases will be on that.
>
>
>
> On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman <
> suren.hiraman@velos.io> wrote:
>
>> I'll respond for Dan.
>>
>> Our test dataset was a total of 10 GB of input data (the full production
>> dataset for this particular dataflow would be roughly 60 GB).
>>
>> I'm not sure what the size of the final output data was, but I think it
>> was on the order of 20 GB for the given 10 GB of input. Also, when we were
>> experimenting with persist(DISK_ONLY), the size of all RDDs on disk was
>> around 200 GB, which gives a sense of the overall transient memory usage
>> when nothing is persisted.
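>>
>> For reference, that experiment looked roughly like the following (a
>> minimal sketch rather than our actual code -- the RDD name and input path
>> are hypothetical, and sc is an existing SparkContext):
>>
>>   import org.apache.spark.storage.StorageLevel
>>
>>   // Keep every partition on local disk instead of in memory, so the
>>   // fully materialized size shows up in the storage UI.
>>   val records = sc.textFile("hdfs:///data/input")  // hypothetical path
>>   records.persist(StorageLevel.DISK_ONLY)
>>   records.count()  // first action materializes the RDD on disk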
>>
>> In terms of our test cluster, we had 15 nodes. Each node had 24 cores and
>> ran 2 workers, and each executor got 14 GB of memory.
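>>
>> (For context, that layout corresponds roughly to standalone-mode settings
>> like the following in conf/spark-env.sh -- a sketch inferred from the
>> numbers above, not our exact config:
>>
>>   SPARK_WORKER_INSTANCES=2   # 2 workers per node
>>   SPARK_WORKER_CORES=12      # 24 cores split across the 2 workers
>>   SPARK_WORKER_MEMORY=14g    # room for one 14 GB executor per worker
>>
>> plus spark.executor.memory=14g on the application side.)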
>>
>> -Suren
>>
>>
>>
>> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.markey@oracle.com>
>> wrote:
>>
>>>  When you say "large data sets", how large?
>>> Thanks
>>>
>>>
>>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
>>>
>>> From a development perspective, I vastly prefer Spark to MapReduce. The
>>> MapReduce API is very constrained; Spark's API feels much more natural
>>> to me. Testing and local development are also very easy - creating a
>>> local Spark context is trivial, and it reads local files. For your unit
>>> tests you can just have them create a local context and execute your
>>> flow with some test data. Even better, you can do ad hoc work in the
>>> Spark shell, and if you want that code in production it will look
>>> exactly the same.
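>>>
>>> For example, a unit test can stand up a throwaway local context in a few
>>> lines (a minimal sketch; the word-count flow is just a stand-in for
>>> whatever flow you're testing):
>>>
>>>   import org.apache.spark.SparkContext
>>>   import org.apache.spark.SparkContext._  // pair-RDD implicits
>>>
>>>   val sc = new SparkContext("local", "unit-test")  // local master
>>>   val input = sc.parallelize(Seq("a", "b", "a"))   // in-memory test data
>>>   val counts = input.map(w => (w, 1)).reduceByKey(_ + _).collectAsMap()
>>>   assert(counts("a") == 2)
>>>   sc.stop()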
>>>
>>> Unfortunately, the picture isn't so rosy when you get to production. In
>>> my experience, Spark simply doesn't scale to the volumes that MapReduce
>>> will handle - not with a standalone cluster, anyway. Maybe Mesos or YARN
>>> would be better, but I haven't had the opportunity to try them. I find
>>> jobs tend to just hang forever for no apparent reason on large data sets
>>> (though smaller than what I push through MapReduce).
>>>
>>> I am hopeful the situation will improve - Spark is developing quickly -
>>> but if you have large amounts of data you should proceed with caution.
>>>
>>> Keep in mind there are frameworks for Hadoop which can hide the ugly
>>> MapReduce API behind something very similar in form to Spark's; e.g.
>>> Apache Crunch. So you might consider those as well.
>>>
>>> (Note: the above is with Spark 1.0.0.)
>>>
>>>
>>>
>>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanathan@accenture.com>
>>> wrote:
>>>
>>>> Hello Experts,
>>>>
>>>> I am doing a comparative study of the following:
>>>>
>>>> Spark vs Impala
>>>>
>>>> Spark vs MapReduce. Is it worth migrating from an existing MR
>>>> implementation to Spark?
>>>>
>>>> Please share your thoughts and expertise.
>>>>
>>>> Thanks,
>>>> Santosh
>>>>
>>>
>>>
>>>
>>> --
>>> Daniel Siegmann, Software Developer
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>> E: daniel.siegmann@velos.io W: www.velos.io
>>>
>>>
>>>
>>
>>
>> --
>>
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hiraman@velos.io
>> W: www.velos.io
>>
>>
>
