spark-issues mailing list archives

From Sergio Ramírez-Gallego (JIRA)
Subject [jira] [Created] (SPARK-6509) MDLP discretizer
Date Tue, 24 Mar 2015 18:19:53 GMT
Sergio Ramírez-Gallego created SPARK-6509:

             Summary: MDLP discretizer
                 Key: SPARK-6509
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
            Reporter: Sergio Ramírez-Gallego

Minimum Description Length Discretizer

This method implements Fayyad and Irani's discretizer [1], which is based on the Minimum
Description Length Principle (MDLP), in order to handle continuous (non-discrete) datasets in a
distributed setting. We have developed a distributed version of the original algorithm, with
some important changes.
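
For readers unfamiliar with [1], the following is a minimal single-machine sketch of the MDLP
acceptance test that decides whether a candidate cut point is kept. It is not the distributed
implementation proposed in this issue; the function names (entropy, mdlp_accepts) are
hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mdlp_accepts(left, right):
    """Return True if splitting the labels into `left` and `right` passes
    the MDLP criterion of Fayyad and Irani (hypothetical helper name)."""
    whole = list(left) + list(right)
    n = len(whole)
    if not left or not right:
        return False  # a cut must leave examples on both sides

    ent, ent_l, ent_r = entropy(whole), entropy(left), entropy(right)
    # Information gain of the candidate cut.
    gain = ent - (len(left) / n) * ent_l - (len(right) / n) * ent_r

    # Number of distinct classes in the whole set and on each side.
    k, k_l, k_r = len(set(whole)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent - k_l * ent_l - k_r * ent_r)

    # Accept the cut only if the gain exceeds the description-length cost.
    return gain > (math.log2(n - 1) + delta) / n
```

A cut that separates the classes cleanly, such as mdlp_accepts(["a"] * 10, ["b"] * 10), is
accepted, while a cut that leaves both sides equally mixed has zero gain and is rejected.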

-- Improvements to the discretizer:

    Support for sparse data.
    Multi-attribute processing. The whole process is carried out in a single step when the
    boundary points of each attribute fit in one partition (<= 100K boundary points per
    attribute).
    Support for attributes with a huge number of boundary points (> 100K boundary points
    per attribute). This situation is extremely rare.
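
As an illustration of what a boundary point is, here is a small single-machine sketch that
collects the candidate cut points of one attribute, following the definition in [1]: a midpoint
between two adjacent distinct values is a candidate only if the examples at those two values do
not all share one class. The names boundary_points, fits_in_one_partition and
MAX_POINTS_PER_PARTITION are hypothetical; only the 100K figure comes from the description above,
and the actual partitioning logic of the proposed implementation is not shown here.

```python
from collections import defaultdict

# Assumption: threshold mirrors the "100K boundary points per attribute" figure.
MAX_POINTS_PER_PARTITION = 100_000


def boundary_points(values, labels):
    """Return candidate cut points for a single attribute.

    A midpoint between two adjacent distinct values is kept when the
    examples at those values belong to more than one class.
    """
    classes_at = defaultdict(set)
    for v, y in zip(values, labels):
        classes_at[v].add(y)

    distinct = sorted(classes_at)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        if len(classes_at[lo] | classes_at[hi]) > 1:
            cuts.append((lo + hi) / 2)
    return cuts


def fits_in_one_partition(cuts, limit=MAX_POINTS_PER_PARTITION):
    """Hypothetical check for the single-step, one-partition case."""
    return len(cuts) <= limit
```

For example, boundary_points([1, 2, 3, 4], ["a", "a", "b", "b"]) returns [2.5], because the
class only changes between the values 2 and 3.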

This software has been tested on two large real-world datasets:

    A dataset selected for the GECCO-2014 competition (Vancouver, July 13th, 2014), which
    comes from the Protein Structure Prediction field. The dataset has 32 million instances,
    631 attributes, and 2 classes; 98% of its examples are negative, and it occupies about
    56GB of disk space when uncompressed.
    Epsilon dataset: 400K instances and 2K attributes.

We have demonstrated that our method runs 300 times faster than the sequential version
on the first dataset, and that it also improves the accuracy of Naive Bayes.


[1] Fayyad, U., & Irani, K. (1993).
"Multi-interval discretization of continuous-valued attributes for classification learning."
