spark-issues mailing list archives

From Sergio Ramírez-Gallego (JIRA) <>
Subject [jira] [Commented] (SPARK-6509) MDLP discretizer
Date Wed, 25 Mar 2015 10:50:53 GMT


Sergio Ramírez-Gallego commented on SPARK-6509:

Same answer. The reviewer from Spark suggested a new issue for this proposal instead of using
the general discussion thread.

> MDLP discretizer
> ----------------
>                 Key: SPARK-6509
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Sergio Ramírez-Gallego
> Minimum Description Length Discretizer
> This method implements Fayyad's discretizer [1], based on the Minimum Description Length Principle
(MDLP), to discretize continuous-valued datasets in a distributed setting. We have developed
a distributed version of the original algorithm, with some important changes.
> -- Improvements on discretizer:
>     Support for sparse data.
>     Multi-attribute processing. The whole process is carried out in a single step when
the number of boundary points per attribute fits well in one partition (<= 100K boundary
points per attribute).
>     Support for attributes with a huge number of boundary points (> 100K boundary
points per attribute). This situation is rare.
> This software has been tested on two large real-world datasets:
>     A dataset selected for the GECCO-2014 competition (Vancouver, July 13th, 2014),
which comes from the Protein Structure Prediction field (
The dataset has 32 million instances, 631 attributes, and 2 classes, with 98% negative examples,
and occupies about 56GB of disk space when uncompressed.
>     Epsilon dataset:
400K instances and 2K attributes
> We have demonstrated that our method runs 300 times faster than the sequential version
on the first dataset, and also improves the accuracy of Naive Bayes.
> References
> [1] Fayyad, U., & Irani, K. (1993).
> "Multi-interval discretization of continuous-valued attributes for classification learning."
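For readers unfamiliar with the criterion in [1], the following is a minimal single-machine sketch in Python of how boundary points and the MDLP stopping rule fit together. It is not the distributed implementation proposed in this issue. The function names (`entropy`, `boundary_points`, `mdlp_accepts`, `mdlp_cuts`) are hypothetical and chosen only for illustration. It assumes one numeric attribute and class labels held in plain lists.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a non-empty sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def boundary_points(values, labels):
    """Midpoints between adjacent distinct values, skipping pairs where
    both values carry the same single class (no class change there)."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, set()).add(y)
    distinct = sorted(groups)
    return [(lo + hi) / 2
            for lo, hi in zip(distinct, distinct[1:])
            if len(groups[lo] | groups[hi]) > 1]


def mdlp_accepts(left, right):
    """MDLP test from [1]: accept the split only if the information gain
    exceeds (log2(N - 1) + delta) / N. Both sides must be non-empty."""
    whole = list(left) + list(right)
    n = len(whole)
    ent_s, ent_l, ent_r = entropy(whole), entropy(left), entropy(right)
    gain = ent_s - (len(left) / n) * ent_l - (len(right) / n) * ent_r
    k, k1, k2 = len(set(whole)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * ent_l - k2 * ent_r)
    return gain > (math.log2(n - 1) + delta) / n


def mdlp_cuts(values, labels):
    """Recursively pick the boundary point with the lowest weighted class
    entropy and keep it only while the MDLP test accepts the split."""
    pairs = sorted(zip(values, labels))
    best = None
    for cut in boundary_points(values, labels):
        left = [y for v, y in pairs if v <= cut]
        right = [y for v, y in pairs if v > cut]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[0]:
            best = (score, cut, left, right)
    if best is None or not mdlp_accepts(best[2], best[3]):
        return []
    cut = best[1]
    lo = [(v, y) for v, y in pairs if v <= cut]
    hi = [(v, y) for v, y in pairs if v > cut]
    return (mdlp_cuts([v for v, _ in lo], [y for _, y in lo])
            + [cut]
            + mdlp_cuts([v for v, _ in hi], [y for _, y in hi]))
```

Only boundary points are evaluated as candidate cuts, which is why the issue description measures the workload in boundary points per attribute rather than in raw instances.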

This message was sent by Atlassian JIRA
