spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Apache Spark (Jira)" <>
Subject [jira] [Assigned] (SPARK-31454) An optimized K-Means based on DenseMatrix and GEMM
Date Fri, 01 May 2020 16:52:02 GMT


Apache Spark reassigned SPARK-31454:

    Assignee: Apache Spark

> An optimized K-Means based on DenseMatrix and GEMM
> --------------------------------------------------
>                 Key: SPARK-31454
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: Xiaochang Wu
>            Assignee: Apache Spark
>            Priority: Major
>              Labels: performance
> The main computations in K-Means are calculating distances between individual points
and center points. Currently K-Means implementation is vector-based which can't take advantage
of optimized native BLAS libraries.
> When the original points are represented as dense vectors, our approach is to modify
the original input data structures to a DenseMatrix-based one by grouping several points together.
The original distance calculations can be translated into a Matrix multiplication then optimized
native GEMM routines (Intel MKL, OpenBLAS etc.) can be used. This approach can also work with
sparse vectors despite having larger memory consumption when translating sparse vectors to
dense matrix.
> Our preliminary benchmark shows this DenseMatrix+GEMM approach can boost the training
performance by *3.5x* with Intel MKL, looks very promising!
> To minimize end user impact, proposed changes are to use config parameters to control
if turn on this implementation without modifying public interfaces. Parameter rowsPerMatrix
is used to control how many points are grouped together to build a DenseMatrix. An example:
> $ spark-submit --master $SPARK_MASTER \
>     --conf "" \
>     --conf "" \
>     --class 
> Several code changes are made in "" namespace as we think "spark.mllib" is in
maintenance mode, some are duplications from spark.mllib for using private definitions in
the same package: 
>  - Modified: KMeans.scala, DatasetUtils.scala
>  - Added: KMeansMatrixImpl.scala
>  - Duplications: DistanceMeasure.scala, LocalKMeans.scala
> If this general idea is accepted by community, we are willing to contribute our code
to upstream and polish the implementation according to feedbacks and produce benchmarks.

This message was sent by Atlassian Jira

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message