systemml-dev mailing list archives

From "Niketan Pansare" <>
Subject Re: Distinct Item of a column
Date Mon, 17 Apr 2017 19:25:28 GMT

Hi Arijit,

PySpark and SystemML are complementary and serve different purposes.
PySpark primarily operates on a collection of data points (i.e. an RDD) or a
DataFrame and exposes the Spark programming model (i.e. transformations and
actions). SystemML primarily operates on matrices and provides a wide variety
of linear algebra operators required for implementing machine learning
algorithms. Personally, I would use PySpark for data preprocessing and
SystemML for training/prediction (YMMV!!). As an example: in our breast
cancer project, we use PySpark APIs in
 and SystemML APIs in
 ... Yes, some operations (such as distinct) can be done in both SystemML
and PySpark, in which case you should choose the one that best fits your

PySpark ML (or MLlib) is closer to SystemML. I agree with you that
there are not enough comparisons out there, probably because benchmarking ML
systems is non-trivial. For an apples-to-apples comparison, you need to
compare both the accuracy and the runtime performance of a given ML model on
a variety of datasets. I am using the term "accuracy" broadly, so please refer to
Also, since different ML systems use different optimization algorithms
(e.g. SGD, conjugate gradient, direct solve, ...), one needs to reason
about hyperparameters as well as convergence behavior before making a
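
To make the point about optimization algorithms concrete, here is a minimal sketch in plain Python (no SystemML or PySpark involved). It fits the same one-variable linear model with a closed-form direct solve and with batch gradient descent. The toy data, the function names, and the learning rates are all assumptions chosen for illustration; nothing here comes from either system.

```python
# Hypothetical toy data generated from y = 3x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 1.0 for x in xs]


def direct_solve(xs, ys):
    """Closed-form least squares fit of y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def gradient_descent(xs, ys, learning_rate, iterations):
    """Batch gradient descent on mean squared error, starting at w = b = 0."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the fitted line on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


exact = direct_solve(xs, ys)
# Same algorithm, two hyperparameter settings (both assumed for illustration).
tuned = gradient_descent(xs, ys, learning_rate=0.05, iterations=2000)
untuned = gradient_descent(xs, ys, learning_rate=0.0001, iterations=100)

for name, (w, b) in [("direct", exact), ("tuned GD", tuned), ("untuned GD", untuned)]:
    print(name, "w =", w, "b =", b, "mse =", mse(xs, ys, w, b))
```

With a suitable learning rate and enough iterations, gradient descent reaches essentially the same fit as the direct solve. With a tiny learning rate and a small iteration budget it stops far from the optimum. A benchmark that timed the second run against a direct solve would be comparing an unconverged model with a converged one.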


Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At

PS: SystemML has recently added support for frames (
) that simplify common data transformation operations such as recoding,
dummy coding, binning and handling of missing values.

From:	arijit chakraborty <>
To:	""
Date:	04/17/2017 08:50 AM
Subject:	Distinct Item of a column


I'm curious to know what the advantage of SystemML over PySpark is,
especially in terms of performance. I tried looking for some reading on it,
but could hardly find any.

Thank you!

