spark-user mailing list archives

From Artemis User <>
Subject Customizing K-Means for Anomaly Detection
Date Tue, 12 Jan 2021 18:58:06 GMT
First some background:

  * We want to use the k-means model for anomaly detection against a
    multi-dimensional dataset.  The current k-means implementation in
    Spark is designed for clustering, not for anomaly detection.  Once
    a model is trained and the pipeline is instantiated, the prediction
    data frame generated by the transform function only associates each
    data point with a cluster.  To enable anomaly detection, we need to
    calculate the distance from each data point to its assigned (i.e.
    nearest) cluster centroid and compare it with a predefined
    threshold value to determine anomalies (e.g. normal = distance <=
    threshold, and anomaly = distance > threshold).
  * The anomaly detection procedure (i.e. calculating the distances and
    comparing them with the threshold) occurs outside the ML pipeline
    (i.e. after invoking the transform method). This causes problems
    when we try to persist the pipeline model and later load and use
    it in production. We would really like a single Estimator to
    handle the whole process, from ingesting data to anomaly
    detection, in one pipeline, without the extra code at the end
    (i.e. after pipeline.transform() is called).
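
To make the procedure concrete, here is a minimal plain-Python sketch of
the distance-and-threshold step, independent of Spark.  The function
names, sample centroids, sample points, and the threshold of 2.0 are all
made up for illustration, and Euclidean distance is assumed; in our real
setup the centroids would come from the fitted KMeansModel.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Return (cluster index, Euclidean distance) for the closest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]

def flag_anomalies(points, centroids, threshold):
    """Label each point: anomaly if its nearest-centroid distance > threshold."""
    rows = []
    for p in points:
        cluster, d = nearest_centroid_distance(p, centroids)
        rows.append({"point": p, "cluster": cluster,
                     "distance": d, "anomaly": d > threshold})
    return rows

# Hypothetical values, standing in for the model's cluster centers and test data.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.5), (9.0, 10.0), (5.0, 5.0)]
results = flag_anomalies(points, centroids, threshold=2.0)

for r in results:
    print(r)
```

The first two points sit close to a centroid and are labelled normal; the
third is far from both centroids and is labelled an anomaly.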


  * We wanted to write a custom Transformer and append it to the end of
    the Pipeline to enable anomaly detection for the test dataset,
    BUT it requires the clusterCenters from the KMeansModel stage.  We
    can’t figure out how to pass this data, which comes from a fitted
    stage, to a later stage at runtime. Any ideas?
  * Is there a way to add a callback to the KMeansModel to persist the
    clusterCenters in the dataframe or in a file?  Or to use a ParamMap
    to set this parameter dynamically at runtime?

Thanks a lot in advance!

-- ND
