systemml-dev mailing list archives

From Matthias Boehm <>
Subject Re: Comparing scikit-learn, Mahout Samsara and SystemML
Date Tue, 06 Jun 2017 05:56:03 GMT
Thanks for reaching out Gustavo. An objective discussion of how exactly
SystemML and Mahout Samsara compare will probably help other people too. In
order to remove bias, I'm cc'ing Dmitriy and Sebastian from the Samsara
team, so that they can correct me if needed. Scikit-learn is a great and
very popular library of algorithms (which nicely integrates with NumPy),
but I'm excluding it here because it does not focus on large-scale ML.

Fundamentally, SystemML and Mahout Samsara have very different histories
and represent different points in the design space for custom
large-scale machine learning (ML). Mahout started as a library of
algorithms on Hadoop MapReduce and is, as an overall project, certainly
more mature, with a larger community. Samsara itself is a more recent
extension for custom large-scale ML on Spark and Flink. In contrast,
SystemML was built from scratch for custom large-scale ML, originally on
MapReduce and later on Spark. After SystemML's initial open source release
in 2015, it became a top-level Apache project just two weeks ago, and
we're actively working on growing our community.

From a technical perspective, SystemML follows a compiler approach where
scripts with R- or Python-like syntax (but only syntax) are automatically
compiled to hybrid runtime plans, composed of in-memory, single-node
operations and operations on MapReduce or Spark. At script level, users
work with matrices, frames, and scalars without specifying physical data
properties such as dense/sparse representations, local/distributed storage,
partitioning, or caching. The major advantages are (1) the ability to easily
write custom large-scale ML algorithms, (2) automatic adaptation to
different data characteristics (compile distributed operations only if
needed), and (3) simplified deployment (because the same script can be used
for large-scale or local computations).
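
To make point (2) a bit more concrete, here is a toy Python sketch of the
general idea of picking a local or a distributed plan from size estimates.
It is only an illustration: the function names, the 16-bytes-per-non-zero
figure, and the 1 GB budget are assumptions made up for this example and do
not describe SystemML's actual compiler or cost model.

```python
# Toy sketch of hybrid plan selection. All names and constants here are
# hypothetical and do not reflect SystemML's real compiler or cost model.

def estimate_size_bytes(rows, cols, sparsity):
    """Rough in-memory size estimate for a matrix of doubles."""
    dense = rows * cols * 8
    # Assumed cost of ~16 bytes per non-zero in a sparse format.
    sparse = int(rows * cols * sparsity) * 16
    return min(dense, sparse)

def choose_backend(operands, memory_budget_bytes):
    """Return 'local' if all operands fit into the budget, else 'distributed'."""
    total = sum(estimate_size_bytes(r, c, s) for (r, c, s) in operands)
    return "local" if total <= memory_budget_bytes else "distributed"

budget = 1024 ** 3  # assumed 1 GB budget for in-memory operations

small_dense = [(10_000, 100, 1.0)]          # about 8 MB
large_dense = [(100_000_000, 100, 1.0)]     # about 80 GB
large_sparse = [(100_000_000, 100, 0.001)]  # about 160 MB of non-zeros

plan_small = choose_backend(small_dense, budget)
plan_large = choose_backend(large_dense, budget)
plan_sparse = choose_backend(large_sparse, budget)
```

In this sketch the same "script" (the call to choose_backend) yields a local
plan for the small and the very sparse input and a distributed plan for the
large dense one, which is the kind of adaptation to data characteristics
meant above.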

In contrast, Samsara is a domain-specific language (DSL), embedded in the
host language Scala. Users can either use local matrices or so-called
Distributed Row Matrices (DRM) for distributed computation. Operations over
local matrices are executed as is, without further optimization.
Operations over DRMs, by comparison, are collected into a DAG of operations
and are lazily optimized and executed on triggering actions such as full
aggregations, write, or explicit collect into a local matrix. Hence, the
user is in charge of deciding between local and distributed operations,
caching, and other data flow properties. At the same time, this lower-level
specification allows for more control and the ability to escape to explicit
distributed operations over rows of the DRM if needed.
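
The lazy evaluation model can be pictured with a small Python toy. This is
not Samsara's API (which is a Scala DSL); the Lazy class, source(), and
collect() below are hypothetical names for this sketch, and the optimization
step that would run before execution is left out.

```python
# Toy model of lazy DAG construction with a triggering action.
# The Lazy class and its methods are hypothetical, not Samsara's API.

class Lazy:
    """A node in an operation DAG; nothing is computed until collect()."""
    evaluations = 0  # counts how many node evaluations were executed

    def __init__(self, op, inputs=(), data=None):
        self.op, self.inputs, self.data = op, inputs, data

    def __matmul__(self, other):
        return Lazy("matmul", (self, other))

    def __add__(self, other):
        return Lazy("add", (self, other))

    @property
    def T(self):
        return Lazy("transpose", (self,))

    def collect(self):
        """Triggering action: evaluate the DAG into a local matrix."""
        Lazy.evaluations += 1
        if self.op == "source":
            return self.data
        args = [node.collect() for node in self.inputs]
        if self.op == "transpose":
            return [list(col) for col in zip(*args[0])]
        if self.op == "add":
            a, b = args
            return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
        if self.op == "matmul":
            a, b = args
            cols = list(zip(*b))
            return [[sum(x * y for x, y in zip(row, col)) for col in cols]
                    for row in a]
        raise ValueError("unknown operation: " + self.op)

def source(rows):
    """Wrap a local list-of-rows matrix as a leaf of the DAG."""
    return Lazy("source", data=rows)

X = source([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
expr = X.T @ X                                # only builds the DAG
built_without_work = (Lazy.evaluations == 0)  # no node has run yet
result = expr.collect()                       # the action triggers execution
```

Because the whole expression is available as a DAG before anything runs, an
optimizer has the chance to rewrite it (for example, to recognize the
t(X) %*% X pattern) before the action executes it.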

At compiler and runtime level, there are a number of similarities but also
major differences. For example, both systems provide different physical
operators (for instance, for matrix multiplication), chosen depending on
operation patterns as well as data and cluster characteristics. This
includes local operators, operators for special patterns like t(X)%*%X,
broadcast-based, co-partitioning, and shuffle-based operators.
Additionally, SystemML uses a variety of simplification rewrites, a
different distributed matrix representation of binary block matrices (with
various dense, sparse, and ultra-sparse formats), and fused operators to
reduce scans and intermediates and to exploit sparsity across chains of
operators. Regarding GPUs, we recently added a GPU backend for
deep-learning and generally compute-intensive operations as an experimental
feature in SystemML, and we're actively working on making it
production-ready. I heard that Mahout is similarly working on GPU support
but I am not sure about the details.
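
As an illustration of why a pattern like t(X)%*%X is worth a dedicated
operator: the product equals the sum of t(B)%*%B over any partitioning of X
into row blocks B, so each partition can be processed independently and only
small n-by-n partial results need to be added up. The pure-Python sketch
below shows just this identity; the function names and the blocking scheme
are hypothetical and do not mirror the actual operators in either system.

```python
# Illustrative only: t(X) %*% X computed per row block and then summed.
# Function names and the partitioning scheme are hypothetical.

def tsmm_block(block):
    """Compute t(B) %*% B for one row block B, given as a list of rows."""
    n = len(block[0])
    out = [[0.0] * n for _ in range(n)]
    for row in block:
        for i in range(n):
            if row[i] == 0.0:
                continue  # skip zero entries, a simple way to exploit sparsity
            for j in range(n):
                out[i][j] += row[i] * row[j]
    return out

def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def tsmm_partitioned(X, rows_per_block):
    """Split X into row blocks, compute partial products, and add them up."""
    blocks = [X[i:i + rows_per_block] for i in range(0, len(X), rows_per_block)]
    partials = [tsmm_block(b) for b in blocks]  # independent per partition
    result = partials[0]
    for p in partials[1:]:                      # final aggregation
        result = add_matrices(result, p)
    return result

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 1.0]]
two_blocks = tsmm_partitioned(X, 2)
one_block = tsmm_partitioned(X, 4)
```

Both calls give the same 2-by-2 result regardless of how the rows are
partitioned, and t(X) is never materialized.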

To summarize, SystemML and Samsara aim at different abstraction
levels, and differ substantially in their compiler and runtime internals.
Of course, there are also shared goals and motivations (such as simplifying
custom, large-scale ML), but competition is good as it drives improvements.
I hope this gives a high-level comparison. If you have additional specific
questions, feel free to ask.


On Mon, Jun 5, 2017 at 6:56 PM, Gustavo Frederico <> wrote:

> Greetings,
> I worked with the theory of SVMs during my Graduate studies and I’m
> relatively new to existing ML software. Assuming that I want to create new
> scalable ML algorithms starting with the Math, the question is: how do
> scikit-learn, Mahout Samsara and SystemML compare to each other?
> I see interesting Python-based frameworks such as scikit-learn, but then I
> read SystemML's article on Wikipedia that made me question the distributive
> scalability of ("pure") Python for large amounts of data:
> "[...] It was observed that data scientists would write machine learning
> algorithms in languages such as R and Python for small data. When it came
> time to scale to big data, a systems programmer would be needed to scale
> the algorithm in a language such as Scala. This process typically involved
> days or weeks per iteration, and errors would occur translating the
> algorithms to operate on big data. " (
> Apache_SystemML )
> And the article starts stating that Apache SystemML has "algorithm
> customizability via [...] Python-like languages".
> Mahout Samsara is based on Scala. PredictionIO (predictionio.incubator.
> algorithms are based on Mahout Samsara and Scala.  I asked
> Mr. Matthias Boehm at a conference how one could compare Mahout Samsara to
> SystemML. From what I understood, Samsara needs "explicit declarations" in
> expressions for distributed computing, while SystemML doesn’t — please
> correct me if I’m wrong. Also, SystemML will optimize the entire script,
> while Samsara will optimize expressions — again, please correct me if I’m
> wrong.
> While my main criterion is scalability (cluster, GPU support etc), other
> criteria to evaluate these frameworks may be: a) public adoption, b) active
> dev community, c) quality of tools for development, d) backing of big
> companies e) simplicity working with clusters (delegating the complexities
> of clustering to the framework, “hiding” them from the user), f) quality of
> documentation, g) quality of the software itself
> ( My question was deleted from for being
> off-topic and deleted from Stack Overflow for being bound to get answers
> with "opinions rather than facts" [sic]. I’m very much interested in
> hearing balanced and insightful comments from the list. )
> Thank you,
> Gustavo
