spark-user mailing list archives

From Rohit Chaddha <rohitchaddha1...@gmail.com>
Subject Re: Machine learning question (using spark) - removing redundant factors while doing clustering
Date Wed, 10 Aug 2016 11:46:07 GMT
Hi Sean,

So basically I am trying to cluster a number of elements (each is a domain
object called PItem) based on the quality factors of these items.
Each element has 112 quality factors.

Now the issue is that when I scale the factors using StandardScaler, I
get a Sum of Squared Errors of 13300.
When I don't use scaling, the Sum of Squared Errors is 5.

I was always of the opinion that factors on different scales should
always be normalized, but I am confused by the results above, and I am
wondering which factors should be removed to get a meaningful result
(maybe with 5% less accuracy).

Will appreciate any help here.

-Rohit
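
One thing worth checking with the two SSE figures above: the sum of squared errors is measured in squared feature units, so a value computed on raw factors and one computed on standardized factors are not on the same scale. The sketch below is a minimal plain-Python illustration, not Spark code; the toy data, the fixed cluster labels, and the helper names (standardize, sse, total_ss) are all assumptions made up for the example. It holds the cluster assignment fixed and shows the SSE changing purely because of rescaling, while the ratio of within-cluster SSE to total sum of squares stays small in both cases.

```python
from statistics import mean, stdev

def standardize(rows):
    """Scale each column to zero mean and unit sample standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def sse(rows, labels):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for k in set(labels):
        members = [r for r, l in zip(rows, labels) if l == k]
        centroid = [mean(c) for c in zip(*members)]
        total += sum((x - c) ** 2 for r in members for x, c in zip(r, centroid))
    return total

def total_ss(rows):
    """Total sum of squares: the SSE when everything is in one cluster."""
    return sse(rows, [0] * len(rows))

# Hypothetical toy data: two small-valued factors, two obvious clusters.
rows = [[0.010, 0.20], [0.012, 0.22], [0.011, 0.21],
        [0.050, 0.80], [0.052, 0.82], [0.051, 0.81]]
labels = [0, 0, 0, 1, 1, 1]  # fixed assignment, so only the units change

raw_sse = sse(rows, labels)
scaled = standardize(rows)
scaled_sse = sse(scaled, labels)

# A unit-free way to compare: fraction of total variation left inside clusters.
raw_ratio = raw_sse / total_ss(rows)
scaled_ratio = scaled_sse / total_ss(scaled)

print(raw_sse, scaled_sse)      # scaled SSE is much larger here
print(raw_ratio, scaled_ratio)  # both ratios are small
```

In this toy case the raw factors have small magnitudes, so the raw SSE is tiny and standardizing inflates it, even though the clusters are equally tight. Comparing a normalized quantity such as the ratio above (or a silhouette-style score) may be more informative than comparing the two raw SSE values.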

On Tue, Aug 9, 2016 at 12:55 PM, Sean Owen <sowen@cloudera.com> wrote:

> Fewer features doesn't necessarily mean better predictions, because indeed
> you are subtracting data. It might, because when done well you subtract
> more noise than signal. It is usually done to make data sets smaller or
> more tractable or to improve explainability.
>
> But you have an unsupervised clustering problem where talking about
> feature importance doesn't make as much sense. Important to what? There is
> no target variable.
>
> PCA will not 'improve' clustering per se but can make it faster.
> You may want to specify what you are actually trying to optimize.
>
>
> On Tue, Aug 9, 2016, 03:23 Rohit Chaddha <rohitchaddha1234@gmail.com>
> wrote:
>
>> I would rather have fewer features, to make better inferences on the
>> data based on the smaller number of factors.
>> Any suggestions, Sean?
>>
>> On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> Yes, that's exactly what PCA is for as Sivakumaran noted. Do you
>>> really want to select features or just obtain a lower-dimensional
>>> representation of them, with less redundancy?
>>>
>>> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane <tonylane.nyc@gmail.com>
>>> wrote:
>>> > There must be an algorithmic way to figure out which of these factors
>>> > contribute the least and remove them from the analysis.
>>> > I am hoping someone can throw some insight on this.
>>> >
>>> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S <siva.kumaran@me.com>
>>> > wrote:
>>> >>
>>> >> Not an expert here, but the first step would be to devote some time to
>>> >> identifying which of these 112 factors are actually causative. Some
>>> >> domain knowledge of the data may be required. Then you can start off
>>> >> with PCA.
>>> >>
>>> >> HTH,
>>> >>
>>> >> Regards,
>>> >>
>>> >> Sivakumaran S
>>> >>
>>> >> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane.nyc@gmail.com>
>>> >> wrote:
>>> >>
>>> >> Great question, Rohit. I am in my early days of ML as well and it
>>> >> would be great if we get some idea on this from other experts on
>>> >> this group.
>>> >>
>>> >> I know we can reduce dimensions by using PCA, but I think that does
>>> >> not allow us to understand which of the original factors we are
>>> >> using in the end.
>>> >>
>>> >> - Tony L.
>>> >>
>>> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha
>>> >> <rohitchaddha1234@gmail.com> wrote:
>>> >>>
>>> >>>
>>> >>> I have a data-set where each data-point has 112 factors.
>>> >>>
>>> >>> I want to remove the factors which are not relevant, say reduce
>>> >>> these 112 factors to 20, and then do clustering of data-points
>>> >>> using these 20 factors.
>>> >>>
>>> >>> How do I do this, and how do I figure out which 20 factors are
>>> >>> useful for analysis?
>>> >>>
>>> >>> I see SVD and PCA implementations, but I am not sure if these tell
>>> >>> me which elements are removed and which remain.
>>> >>>
>>> >>> Can someone please help me understand what to do here?
>>> >>>
>>> >>> thanks,
>>> >>> -Rohit
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>
>>
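
The thread asks for a way to drop redundant factors while still knowing which original factors remain, which PCA alone does not give. One simple alternative, named plainly here as a swapped-in technique and not something proposed in the thread, is a pairwise correlation filter: keep a factor only if it is not highly correlated with a factor already kept. The sketch below is plain Python on made-up toy data; the function names, the factor names f1..f3, and the 0.95 threshold are all assumptions for illustration, and a real 112-factor data set in Spark would more likely start from a correlation matrix computed by the library.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.
    Assumes neither column is constant (otherwise this divides by zero)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def drop_redundant(rows, names, threshold=0.95):
    """Greedily keep a factor only if its |correlation| with every
    already-kept factor is below the (hypothetical) threshold."""
    cols = list(zip(*rows))
    kept = []
    for j in range(len(cols)):
        if all(abs(pearson(cols[j], cols[i])) < threshold for i in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Hypothetical toy data: f2 is roughly 2 * f1, f3 is unrelated.
rows = [[1.0, 2.1, 5.0],
        [2.0, 4.0, 3.0],
        [3.0, 6.2, 6.0],
        [4.0, 7.9, 2.0],
        [5.0, 10.1, 4.0]]
names = ["f1", "f2", "f3"]

kept = drop_redundant(rows, names)
print(kept)  # ['f1', 'f3']
```

The result depends on column order (the first of a correlated pair is the one kept) and only catches pairwise linear redundancy, so it is a rough first pass and not a substitute for the domain knowledge mentioned above.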
