spark-dev mailing list archives

From "Ulanov, Alexander" <alexander.ula...@hp.com>
Subject RE: Feature selection interface
Date Fri, 18 Jul 2014 11:42:29 GMT
FYI, this is my first take on feature selection, filtering, and chi-squared:
https://github.com/apache/spark/pull/1484


-----Original Message-----
From: Ulanov, Alexander 
Sent: Thursday, July 10, 2014 9:39 PM
To: dev@spark.apache.org
Subject: Feature selection interface

Hi,

I've implemented a class that does chi-squared feature selection for RDD[LabeledPoint]. It
also computes basic class/feature occurrence statistics, and other methods such as mutual information
or information gain can be easily implemented. I would like to make a pull request. However,
the MLlib master branch doesn't have any feature selection methods implemented, so I need to
create a proper interface that my class will extend or mix in. It should be easy to use from
both the developer's and the user's perspective.

I was thinking that there should be a FeatureEvaluator that, for each feature in an RDD[LabeledPoint],
returns RDD[((featureIndex: Int, label: Double), value: Double)].
Then there should be a FeatureSelector that selects the top N features, or the top N features grouped by
class, etc.
And the simplest one, a FeatureFilter that filters the data based on a set of feature indices.

Additionally, there should be an interface for FeatureEvaluators that don't use class labels,
i.e. for RDD[Vector].
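
A minimal sketch of how the three pieces described above might be expressed as Scala traits. Everything here is an assumption made for illustration: Seq stands in for RDD so the code runs without Spark, LabeledPoint is a simplified local stand-in for the MLlib class, and the names UnlabeledFeatureEvaluator and TopN are hypothetical.

```scala
// Hypothetical sketch: Seq stands in for RDD so the example runs without Spark.
// LabeledPoint is a simplified local stand-in, not MLlib's class.
case class LabeledPoint(label: Double, features: Array[Double])

// Scores each (featureIndex, label) pair, mirroring the proposed return shape.
trait FeatureEvaluator {
  def evaluate(data: Seq[LabeledPoint]): Seq[((Int, Double), Double)]
}

// Label-free counterpart for data without class labels (RDD[Vector] in the proposal).
trait UnlabeledFeatureEvaluator {
  def evaluate(data: Seq[Array[Double]]): Seq[(Int, Double)]
}

// Turns evaluator scores into the set of feature indices to keep.
trait FeatureSelector {
  def select(scores: Seq[((Int, Double), Double)]): Set[Int]
}

// Keeps the n features with the highest score across all labels.
class TopN(n: Int) extends FeatureSelector {
  def select(scores: Seq[((Int, Double), Double)]): Set[Int] =
    scores
      .groupBy { case ((feature, _), _) => feature }
      .map { case (feature, group) => feature -> group.map(_._2).max }
      .toSeq
      .sortBy { case (_, score) => -score }
      .take(n)
      .map(_._1)
      .toSet
}

// Projects every point onto the selected feature indices, in ascending index order.
object FeatureFilter {
  def apply(data: Seq[LabeledPoint], indices: Set[Int]): Seq[LabeledPoint] = {
    val keep = indices.toSeq.sorted
    data.map(p => LabeledPoint(p.label, keep.map(i => p.features(i)).toArray))
  }
}
```

Keeping the selector independent of how the scores were produced is one possible way to let chi-squared, mutual information, or information gain share the same selection and filtering code.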

I am concerned that such a design looks rather "disconnected" because there are three separate
objects.

In use, I would like to see something like "val filteredData = Filter(data, ChiSquared(data).selectTop(100))".
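
One way the call shape above could be made to work is sketched below. This is an illustration only, not the code from the pull request: Seq stands in for RDD, LabeledPoint is a local stand-in, a feature is assumed to count as "present" when it is non-zero, and the score method and the bodies of ChiSquared and Filter are hypothetical.

```scala
// Hypothetical sketch: Seq stands in for RDD; all names below are illustrative.
case class LabeledPoint(label: Double, features: Array[Double])

// Chi-squared score per feature, treating a feature as "present" when it is non-zero.
case class ChiSquared(data: Seq[LabeledPoint]) {
  private val n = data.size.toDouble
  private val labels = data.map(_.label).distinct
  private val numFeatures = data.headOption.map(_.features.length).getOrElse(0)

  // Sums (observed - expected)^2 / expected over the presence-by-label contingency table.
  def score(feature: Int): Double = {
    val cells = for {
      present <- Seq(true, false)
      label <- labels
    } yield {
      val rowTotal = data.count(p => (p.features(feature) != 0.0) == present)
      val colTotal = data.count(_.label == label)
      val observed =
        data.count(p => (p.features(feature) != 0.0) == present && p.label == label)
      val expected = rowTotal * colTotal / n
      // Skip empty rows or columns to avoid dividing by zero.
      if (expected == 0.0) 0.0 else math.pow(observed - expected, 2) / expected
    }
    cells.sum
  }

  // Indices of the k highest-scoring features.
  def selectTop(k: Int): Set[Int] =
    (0 until numFeatures).sortBy(i => -score(i)).take(k).toSet
}

// Keeps only the selected feature indices, in ascending index order.
object Filter {
  def apply(data: Seq[LabeledPoint], indices: Set[Int]): Seq[LabeledPoint] = {
    val keep = indices.toSeq.sorted
    data.map(p => LabeledPoint(p.label, keep.map(i => p.features(i)).toArray))
  }
}

// Feature 0 follows the label exactly; feature 1 is independent of it.
val data = Seq(
  LabeledPoint(1.0, Array(1.0, 1.0)),
  LabeledPoint(1.0, Array(1.0, 0.0)),
  LabeledPoint(0.0, Array(0.0, 1.0)),
  LabeledPoint(0.0, Array(0.0, 0.0))
)
val filteredData = Filter(data, ChiSquared(data).selectTop(1))
```

The sketch rescans the data for every table cell, which is fine for a toy example; a distributed version would presumably compute the occurrence counts in a single pass.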

Any ideas or suggestions?

Best regards, Alexander
