spark-dev mailing list archives

From "Ulanov, Alexander" <alexander.ula...@hp.com>
Subject Feature selection interface
Date Thu, 10 Jul 2014 17:38:33 GMT
Hi,

I've implemented a class that performs chi-squared feature selection on RDD[LabeledPoint]. It
also computes basic class/feature occurrence statistics, and other measures such as mutual information
or information gain can easily be implemented on top of it. I would like to submit a pull request. However,
the MLlib master branch doesn't have any feature selection methods yet, so I need to
create a proper interface that my class will extend or mix in. It should be easy to use from
both the developer's and the user's perspective.

I was thinking that there should be a FeatureEvaluator that, for each feature in an RDD[LabeledPoint],
returns RDD[((featureIndex: Int, label: Double), value: Double)].
Then there should be a FeatureSelector that selects the top N features, or the top N features grouped by
class, etc.
And the simplest one, a FeatureFilter that filters the data based on a set of feature indices.

Additionally, there should be an interface for FeatureEvaluators that don't use class labels,
i.e. for RDD[Vector].
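
To make the three pieces concrete, here is a minimal sketch of how they could fit together. It is only an illustration under stated assumptions: it uses plain Scala collections (Seq) instead of RDDs so that it runs without Spark, LabeledPoint is a local stand-in for the MLlib class, and the names TopN and OccurrenceCount are hypothetical.

```scala
// Local stand-in for MLlib's LabeledPoint, with Array instead of Vector (assumption).
case class LabeledPoint(label: Double, features: Array[Double])

// Scores every (featureIndex, label) pair; the proposed version would return an RDD.
trait FeatureEvaluator {
  def evaluate(data: Seq[LabeledPoint]): Seq[((Int, Double), Double)]
}

// Turns the scores into the set of feature indices to keep.
trait FeatureSelector {
  def select(scores: Seq[((Int, Double), Double)]): Set[Int]
}

// Hypothetical evaluator: counts how often a non-zero feature co-occurs with a label.
object OccurrenceCount extends FeatureEvaluator {
  def evaluate(data: Seq[LabeledPoint]): Seq[((Int, Double), Double)] =
    data
      .flatMap(p => p.features.indices.filter(i => p.features(i) != 0.0).map(i => (i, p.label)))
      .groupBy(identity)
      .map { case (key, hits) => (key, hits.size.toDouble) }
      .toSeq
}

// Hypothetical selector: keeps the n features whose best score over all labels is highest.
// Ties are broken by the lower feature index.
class TopN(n: Int) extends FeatureSelector {
  def select(scores: Seq[((Int, Double), Double)]): Set[Int] =
    scores
      .groupBy { case ((feature, _), _) => feature }
      .map { case (feature, group) =>
        feature -> group.map(_._2).foldLeft(Double.NegativeInfinity)((a, b) => math.max(a, b))
      }
      .toSeq
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(n)
      .map(_._1)
      .toSet
}

// Projects every point onto the selected feature indices, in ascending index order.
object FeatureFilter {
  def apply(data: Seq[LabeledPoint], keep: Set[Int]): Seq[LabeledPoint] = {
    val indices = keep.toSeq.sorted
    data.map(p => LabeledPoint(p.label, indices.map(i => p.features(i)).toArray))
  }
}
```

With these definitions, FeatureFilter(data, new TopN(10).select(OccurrenceCount.evaluate(data))) chains all three steps, which shows both that the pieces compose and that the user has to wire them together by hand.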

I am concerned that such a design looks rather "disconnected" because it consists of three separate
objects.

In use, I would like to see something like "val filteredData = Filter(data, ChiSquared(data).selectTop(100))".
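
A rough sketch of what could sit behind that one-liner follows. It is not the class described above: it assumes plain Scala collections instead of RDDs, a local stand-in for LabeledPoint, and features that are treated as present when non-zero. The chi-squared statistic is computed from the presence/label contingency table, and selectTop ranks by one score per feature rather than per (feature, label) pair.

```scala
// Local stand-in for MLlib's LabeledPoint, with Array instead of Vector (assumption).
case class LabeledPoint(label: Double, features: Array[Double])

// Chi-squared statistic of each feature's presence (non-zero value) against the label.
case class ChiSquared(data: Seq[LabeledPoint]) {
  private val n = data.size.toDouble
  private val labels = data.map(_.label).distinct
  private val numFeatures = data.headOption.map(_.features.length).getOrElse(0)

  // One score per feature index: sum over all (presence, label) cells of (O - E)^2 / E.
  val scores: Map[Int, Double] = (0 until numFeatures).map { f =>
    val stat = (for {
      present <- Seq(true, false)
      label <- labels
    } yield {
      val rowTotal = data.count(p => (p.features(f) != 0.0) == present)
      val colTotal = data.count(_.label == label)
      val observed = data.count(p => (p.features(f) != 0.0) == present && p.label == label)
      val expected = rowTotal * colTotal / n
      // Cells with zero expected count contribute nothing.
      if (expected == 0.0) 0.0 else math.pow(observed - expected, 2) / expected
    }).sum
    f -> stat
  }.toMap

  // Indices of the k highest-scoring features; ties go to the lower index.
  def selectTop(k: Int): Set[Int] =
    scores.toSeq
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
      .map(_._1)
      .toSet
}

// Keeps only the selected feature indices, in ascending index order.
object Filter {
  def apply(data: Seq[LabeledPoint], keep: Set[Int]): Seq[LabeledPoint] = {
    val indices = keep.toSeq.sorted
    data.map(p => LabeledPoint(p.label, indices.map(i => p.features(i)).toArray))
  }
}

val data = Seq(
  LabeledPoint(1.0, Array(1.0, 1.0, 1.0)),
  LabeledPoint(1.0, Array(1.0, 1.0, 0.0)),
  LabeledPoint(0.0, Array(0.0, 1.0, 1.0)),
  LabeledPoint(0.0, Array(0.0, 1.0, 0.0))
)
val filteredData = Filter(data, ChiSquared(data).selectTop(1))
```

Here the evaluator and the selector are folded into one object, which gives the compact call but makes it harder to swap the selection strategy independently of the scoring measure.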

Any ideas or suggestions?

Best regards, Alexander
