spark-issues mailing list archives

From "Meethu Mathew (JIRA)" <>
Subject [jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
Date Wed, 01 Oct 2014 06:30:33 GMT


Meethu Mathew commented on SPARK-3588:

Ok. We will start implementing the Scala version of Gaussian Mixture Model.

> Gaussian Mixture Model clustering
> ---------------------------------
>                 Key: SPARK-3588
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib, PySpark
>            Reporter: Meethu Mathew
>            Assignee: Meethu Mathew
>         Attachments:
> Gaussian Mixture Model (GMM) is a popular technique for soft clustering. GMM models
the entire data set as a finite mixture of Gaussian distributions, each parameterized by a
mean vector µ, a covariance matrix Σ and a mixture weight π. In this technique, the probability
that each point belongs to each cluster is computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark where the
parameters are estimated using the Expectation-Maximization algorithm. Our current implementation
considers a diagonal covariance matrix for each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster where each
node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also evaluated the Python
version of k-means available in Spark on the same datasets.
> Below are the results from this benchmark study. The reported stats are averages over
10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
> || Dataset || Mixture model || Kmeans (Python) ||
> | Instances | Dimensions | Avg time per iteration | Time for 100 iterations | Avg time per iteration | Time for 100 iterations |
> | 0.7 million | 13 | 7s | 12min | 13s | 26min |
> | 1.8 million | 11 | 17s | 29min | 33s | 53min |
> | 10 million | 16 | 1.6min | 2.7hr | 1.2min | 2hr |
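
For readers unfamiliar with the technique described in the quoted issue, here is a minimal single-machine sketch of Expectation-Maximization for a GMM with one diagonal covariance per component. It is not the code attached to SPARK-3588; the function names (log_gaussian_diag, e_step, m_step, fit_gmm_diag), the unit initial variances, the uniform initial weights and the min_var floor are all assumptions made for illustration. In a distributed version the per-point sums would presumably be computed per partition and then combined, but that part is not shown.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian whose covariance is diagonal (var is the diagonal)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def e_step(points, weights, means, variances):
    """E-step: soft assignment of each point to each component, plus log-likelihood."""
    resp, log_lik = [], 0.0
    for x in points:
        logs = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        top = max(logs)  # log-sum-exp trick to avoid underflow
        norm = top + math.log(sum(math.exp(l - top) for l in logs))
        resp.append([math.exp(l - norm) for l in logs])
        log_lik += norm
    return resp, log_lik


def m_step(points, resp, min_var=1e-6):
    """M-step: re-estimate weights, means and diagonal variances from responsibilities."""
    n, d, k = len(points), len(points[0]), len(resp[0])
    weights, means, variances = [], [], []
    for j in range(k):
        # Floor the effective count so an empty component does not divide by zero.
        nj = max(sum(r[j] for r in resp), 1e-12)
        mean = [sum(r[j] * x[i] for r, x in zip(resp, points)) / nj for i in range(d)]
        var = [
            max(sum(r[j] * (x[i] - mean[i]) ** 2 for r, x in zip(resp, points)) / nj, min_var)
            for i in range(d)
        ]
        weights.append(nj / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances


def fit_gmm_diag(points, initial_means, iterations=20):
    """Run a fixed number of EM iterations starting from the given means (assumed init)."""
    k, d = len(initial_means), len(points[0])
    weights = [1.0 / k] * k
    means = [list(m) for m in initial_means]
    variances = [[1.0] * d for _ in range(k)]
    history = []  # log-likelihood before each update
    for _ in range(iterations):
        resp, log_lik = e_step(points, weights, means, variances)
        history.append(log_lik)
        weights, means, variances = m_step(points, resp)
    return weights, means, variances, history
```

Calling fit_gmm_diag(points, initial_means) with points given as a list of equal-length lists of floats returns the mixture weights π, the mean vectors µ, the diagonals of Σ, and the log-likelihood recorded at each iteration. The responsibilities computed in e_step are the per-point cluster membership probabilities that the description refers to as soft clustering.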

This message was sent by Atlassian JIRA
