spark-user mailing list archives

From Sean Owen
Subject Re: Customizing K-Means for Anomaly Detection
Date Tue, 12 Jan 2021 19:11:14 GMT
You could fit the k-means pipeline, get the cluster centers, create a
Transformer using that info, then create a new PipelineModel including all
the original elements and the new Transformer. Does that work?
It's not out of the question to expose a new parameter in KMeansModel that
lets you also add a column with the cost; I'd review that kind of PR.

On Tue, Jan 12, 2021 at 12:59 PM Artemis User wrote:

> First some background:
>    - We want to use the k-means model for anomaly detection against a
>    multi-dimensional dataset.  The current k-means implementation in Spark is
>    designed for clustering purposes, not exactly for anomaly detection.  Once a
>    model is trained and the pipeline is instantiated, the prediction data frame
>    generated by the transform function only associates each data point with
>    an individual cluster.  To enable anomaly detection, we would need to
>    calculate the distance of each data point to its assigned (nearest)
>    cluster centroid, and compare it with a predefined threshold value to
>    determine anomalies (e.g. normal = distance <= threshold, and anomaly =
>    distance > threshold).
>    - The anomaly detection procedure (e.g. calculating the distances and
>    comparing them with the threshold) occurs outside the ML pipeline (e.g. after
>    invoking the transform method).  This causes problems when we try to
>    persist the pipeline model and later retrieve, instantiate, and use it in
>    production.  We really would like one Estimator to do this whole process,
>    from ingesting data to anomaly detection in a single pipeline, without the
>    extra code at the end (e.g. after pipeline.transform() is called).
> Questions:
>    - We wanted to just make a custom Transformer to append to the end of
>    the Pipeline to enable anomaly detection for the test dataset, BUT it
>    requires the clusterCenters from the KMeansModel stage.  We can’t figure
>    out how to pass this data, which comes from a fitted stage, to a later
>    stage during runtime. Any ideas?
>    - Is there a way to add a callback to the KMeansModel to persist the
>    clusterCenters in the dataframe, or in a file?  Or to add a ParamMap to
>    dynamically set this parameter during runtime?
> Thanks a lot in advance!
> -- ND
