spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Pros and Cons
Date Thu, 26 May 2016 06:07:48 GMT

It is true that Spark can handle this, but it is optimized for working on the same full
dataset in memory, because machine learning algorithms are iterative by nature.
Of course, you can spill over to disk, but you should avoid that.

That being said, you should have read my final sentence about this. Both systems develop and
change.


> On 25 May 2016, at 22:14, Reynold Xin <rxin@databricks.com> wrote:
> 
> 
>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>> Spark is more for machine learning working iteratively over the same whole dataset
>> in memory. Additionally it has streaming and graph processing capabilities that can be used
>> together.
> 
> Hi Jörn,
> 
> The first part is actually not true. Spark can handle data far greater than the aggregate
> memory available on a cluster. The more recent versions (1.3+) of Spark have external operations
> for almost all built-in operators, and while things may not be perfect, those external operators
> are becoming more and more robust with each version of Spark.
> 
