spark-user mailing list archives

From Rohit Chaddha <rohitchaddha1...@gmail.com>
Subject Re: Machine learning question (using Spark) - removing redundant factors while doing clustering
Date Tue, 09 Aug 2016 02:23:27 GMT
I would rather have fewer features, so that I can make better inferences
from the data based on a smaller number of factors.
Any suggestions, Sean?

On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen <sowen@cloudera.com> wrote:

> Yes, that's exactly what PCA is for as Sivakumaran noted. Do you
> really want to select features or just obtain a lower-dimensional
> representation of them, with less redundancy?
>
> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane <tonylane.nyc@gmail.com> wrote:
> > There must be an algorithmic way to figure out which of these factors
> > contribute the least and remove them in the analysis.
> > I am hoping someone can offer some insight into this.
> >
> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S <siva.kumaran@me.com> wrote:
> >>
> >> Not an expert here, but the first step would be to devote some time to
> >> identifying which of these 112 factors are actually causative. Some domain
> >> knowledge of the data may be required. Then you can start off with PCA.
> >>
> >> HTH,
> >>
> >> Regards,
> >>
> >> Sivakumaran S
> >>
> >> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane.nyc@gmail.com> wrote:
> >>
> >> Great question, Rohit. I am in my early days of ML as well, and it would be
> >> great to get some ideas on this from the experts in this group.
> >>
> >> I know we can reduce dimensions using PCA, but I think that does not let us
> >> understand which of the original factors we are using in the end.
> >>
> >> - Tony L.
> >>
> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha <rohitchaddha1234@gmail.com> wrote:
> >>>
> >>>
> >>> I have a data-set where each data-point has 112 factors.
> >>>
> >>> I want to remove the factors that are not relevant, reducing these 112
> >>> to, say, 20 factors, and then cluster the data-points using those 20
> >>> factors.
> >>>
> >>> How do I do this, and how do I figure out which 20 factors are useful
> >>> for the analysis?
> >>>
> >>> I see SVD and PCA implementations, but I am not sure whether they tell me
> >>> which elements are removed and which remain.
> >>>
> >>> Can someone please help me understand what to do here?
> >>>
> >>> thanks,
> >>> -Rohit
> >>>
> >>
> >>
> >
>
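
To make the distinction in this thread concrete: PCA gives a lower-dimensional representation, whereas keeping a subset of the original factors calls for a selection step. One simple way to do that, sketched below, is a correlation filter that drops constant columns and any column that is almost a linear copy of one already kept. This is not PCA and not a Spark API; it is a plain-Python illustration, and the names pearson and select_features, the 0.9 threshold, and the toy data are all assumptions made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def select_features(rows, threshold=0.9):
    """Return the indices of the original columns to keep.

    Hypothetical helper: skips constant columns, then greedily keeps a column
    only if its absolute correlation with every kept column is below threshold.
    """
    columns = list(zip(*rows))
    kept = []
    for j, col in enumerate(columns):
        if len(set(col)) <= 1:
            continue  # a constant column carries no information for clustering
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


# Toy data: column 1 is twice column 0, column 2 is constant.
rows = [
    [1.0, 2.0, 5.0, 7.0],
    [2.0, 4.0, 5.0, 1.0],
    [3.0, 6.0, 5.0, 4.0],
    [4.0, 8.0, 5.0, 2.0],
]
kept = select_features(rows)
print(kept)
```

Because the result is a list of original column indices, the surviving factors stay interpretable, which is the property Tony was asking about. The greedy order means the first of two redundant columns wins, so the column order matters, and the filter only catches pairwise linear redundancy.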
