spark-user mailing list archives

From Roger Hoover <roger.hoo...@gmail.com>
Subject Re: Using Spark on Data size larger than Memory size
Date Fri, 06 Jun 2014 00:16:18 GMT
I think it would be very handy to be able to specify that you want sorting
during a partitioning stage.
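
For concreteness, something along these lines (a sketch of the kind of API I
mean; Spark later shipped essentially this as
repartitionAndSortWithinPartitions):

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

    // Shuffle into 2 partitions and sort by key within each partition.
    // The sort runs inside the shuffle machinery, which can spill to disk,
    // so a partition does not have to fit in memory just to be sorted.
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))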


On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover <roger.hoover@gmail.com> wrote:

> Hi Aaron,
>
> When you say that sorting is being worked on, can you elaborate a little
> more please?
>
> In particular, I want to sort the items within each partition (not
> globally) without necessarily bringing them all into memory at once.
>
> Thanks,
>
> Roger
>
>
> On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson <ilikerps@gmail.com>
> wrote:
>
>> There is no fundamental issue if you're running on data that is larger
>> than cluster memory size. Many operations can stream data through, and thus
>> memory usage is independent of input data size. Certain operations require
>> an entire *partition* (not dataset) to fit in memory, but there are not
>> many instances of this left (sorting comes to mind, and this is being
>> worked on).
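>>
>> To illustrate the streaming point, a pipeline such as the following
>> (paths are placeholders) pulls one record at a time through each
>> partition, so its memory use stays essentially flat regardless of input
>> size:
>>
>>   sc.textFile("hdfs:///in")
>>     .map(_.toUpperCase)
>>     .filter(_.nonEmpty)
>>     .saveAsTextFile("hdfs:///out")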
>>
>> In general, one problem with Spark today is that you *can* OOM under
>> certain configurations, and it's possible you'll need to change from the
>> default configuration if you're doing very memory-intensive jobs.
>> However, there are very few cases where Spark would simply fail as a matter
>> of course -- for instance, you can always increase the number of
>> partitions to decrease the size of any given one, or repartition the data
>> to eliminate skew.
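>>
>> For instance (a sketch; the input path is a placeholder):
>>
>>   val rdd = sc.textFile("hdfs:///path/to/big-input")
>>   // Quadruple the partition count: each partition carries about a
>>   // quarter of the data, and the full shuffle also spreads out skewed keys.
>>   val finer = rdd.repartition(rdd.partitions.size * 4)
>>   // Shuffle operations accept an explicit partition count as well:
>>   val counts = finer.map(word => (word, 1)).reduceByKey(_ + _, 400)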
>>
>> Regarding impact on performance, as Mayur said, there may absolutely be
>> an impact depending on your jobs. If you're doing a join on a very large
>> amount of data with few partitions, then we'll have to spill to disk. If
>> you can't cache your working set of data in memory, you will also see a
>> performance degradation. Spark enables the use of memory to make things
>> fast, but if you just don't have enough memory, it won't be terribly fast.
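>>
>> One knob for the "working set slightly larger than memory" case is to
>> cache with a storage level that spills to local disk instead of
>> recomputing evicted partitions (a sketch, not a blanket recommendation):
>>
>>   import org.apache.spark.storage.StorageLevel
>>
>>   // Partitions that fit stay deserialized in memory; the rest are
>>   // written to local disk and read back on access.
>>   val working = rdd.persist(StorageLevel.MEMORY_AND_DISK)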
>>
>>
>> On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi <mayur.rustagi@gmail.com>
>> wrote:
>>
>>> Clearly there will be an impact on performance, but frankly it depends
>>> on what you are trying to achieve with the dataset.
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga <vibhorbanga@gmail.com>
>>> wrote:
>>>
>>>> Any inputs would be really helpful.
>>>>
>>>> Thanks,
>>>> -Vibhor
>>>>
>>>>
>>>> On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga <vibhorbanga@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am planning to use Spark with HBase, generating an RDD by reading
>>>>> data from an HBase table.
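>>>>>
>>>>> For context, the standard way to build such an RDD is via HBase's
>>>>> TableInputFormat (a minimal sketch; "my_table" is a placeholder):
>>>>>
>>>>>   import org.apache.hadoop.hbase.HBaseConfiguration
>>>>>   import org.apache.hadoop.hbase.client.Result
>>>>>   import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>>   import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>>>
>>>>>   val hbaseConf = HBaseConfiguration.create()
>>>>>   hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")
>>>>>
>>>>>   // Each HBase region becomes one Spark partition.
>>>>>   val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
>>>>>     classOf[ImmutableBytesWritable], classOf[Result])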
>>>>>
>>>>> I want to know: in the case where the HBase table grows larger than
>>>>> the RAM available in the cluster, will the application fail, or will
>>>>> there just be an impact on performance?
>>>>>
>>>>> Any thoughts in this direction will be helpful and are welcome.
>>>>>
>>>>> Thanks,
>>>>> -Vibhor
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Vibhor Banga
>>>> Software Development Engineer
>>>> Flipkart Internet Pvt. Ltd., Bangalore
>>>>
>>>>
>>>
>>
>
