spark-user mailing list archives

From Rohit Chaddha <rohitchaddha1...@gmail.com>
Subject Re: Machine learning question (using Spark) - removing redundant factors while doing clustering
Date Tue, 09 Aug 2016 05:02:19 GMT
@Peyman - do any of the clustering algorithms have a "feature importance"
or "feature selection" ability? I can't seem to pinpoint one.



On Tue, Aug 9, 2016 at 8:49 AM, Peyman Mohajerian <mohajeri@gmail.com>
wrote:

> You can try 'feature importances' or 'feature selection', depending on what
> else you want to do with the remaining features. Let's say you are trying
> to do classification: some of the Spark libraries have a model attribute
> called 'featureImportances' that tells you which feature(s) are more
> dominant in your classification. You can then run your model again with
> the smaller set of features.
> The two approaches are quite different: what I'm suggesting involves
> training (supervised learning) in the context of a target function, whereas
> with SVD you are doing unsupervised learning.
>
> On Mon, Aug 8, 2016 at 7:23 PM, Rohit Chaddha <rohitchaddha1234@gmail.com>
> wrote:
>
>> I would rather have fewer features, so that I can make better inferences
>> on the data based on the smaller number of factors.
>> Any suggestions, Sean?
>>
>> On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> Yes, that's exactly what PCA is for as Sivakumaran noted. Do you
>>> really want to select features or just obtain a lower-dimensional
>>> representation of them, with less redundancy?
>>>
>>> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane <tonylane.nyc@gmail.com>
>>> wrote:
>>> > There must be an algorithmic way to figure out which of these factors
>>> > contribute the least and remove them from the analysis.
>>> > I am hoping someone can offer some insight on this.
>>> >
>>> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S <siva.kumaran@me.com>
>>> > wrote:
>>> >>
>>> >> Not an expert here, but the first step would be to devote some time to
>>> >> identifying which of these 112 factors are actually causative. Some
>>> >> domain knowledge of the data may be required. Then you can start off
>>> >> with PCA.
>>> >>
>>> >> HTH,
>>> >>
>>> >> Regards,
>>> >>
>>> >> Sivakumaran S
>>> >>
>>> >> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane.nyc@gmail.com>
>>> >> wrote:
>>> >>
>>> >> Great question, Rohit. I am in my early days of ML as well, and it
>>> >> would be great to get some ideas on this from the experts in this group.
>>> >>
>>> >> I know we can reduce dimensions by using PCA, but I think that does not
>>> >> let us understand which of the original factors we are using in the
>>> >> end.
>>> >>
>>> >> - Tony L.
>>> >>
>>> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha
>>> >> <rohitchaddha1234@gmail.com> wrote:
>>> >>>
>>> >>>
>>> >>> I have a data-set where each data-point has 112 factors.
>>> >>>
>>> >>> I want to remove the factors which are not relevant, say reduce the
>>> >>> 112 factors to 20, and then cluster the data-points using these 20
>>> >>> factors.
>>> >>>
>>> >>> How do I do this, and how do I figure out which 20 factors are useful
>>> >>> for analysis?
>>> >>>
>>> >>> I see SVD and PCA implementations, but I am not sure whether these
>>> >>> tell me which elements are removed and which remain.
>>> >>>
>>> >>> Can someone please help me understand what to do here?
>>> >>>
>>> >>> thanks,
>>> >>> -Rohit
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
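
Editor's note: the thread asks whether PCA reveals which original factors end up being used. PCA does not drop factors; each component is a weighted mix of all of them, and the weights (loadings) show how much each original factor contributes. The sketch below illustrates this in plain Python rather than Spark. It assumes a made-up three-factor data set and uses power iteration to approximate only the first component; every function name here is hypothetical and not part of any Spark API. In spark.ml, the fitted PCA model is understood to expose a comparable loading matrix as `pc`, but treat that as something to confirm against the documentation for your Spark version.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(cov, iterations=500):
    """Approximate the leading eigenvector of `cov` by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data (an assumption for illustration only):
#   factor 0: a random signal
#   factor 1: roughly twice factor 0, so it is redundant with factor 0
#   factor 2: small independent noise
random.seed(0)
rows = []
for _ in range(200):
    f0 = random.gauss(0.0, 1.0)
    f1 = 2.0 * f0 + random.gauss(0.0, 0.1)
    f2 = random.gauss(0.0, 0.1)
    rows.append([f0, f1, f2])

# One loading per original factor for the first principal component.
loadings = first_principal_component(covariance_matrix(rows))

for index, weight in enumerate(loadings):
    print(f"factor {index}: loading {weight:+.3f}")
```

Factors 0 and 1 carry almost all of the weight in the first component, while factor 2 contributes very little. Note that PCA here merges the two correlated factors into one direction; it does not select one of them. Because loadings depend on the scale of each factor, standardizing the columns first is usually advisable on real data.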
