spark-user mailing list archives

From Tobi Bosede <ani.to...@gmail.com>
Subject Re: Standardization with Sparse Vectors
Date Thu, 11 Aug 2016 15:29:49 GMT
Opening this follow-up question to the entire mailing list. Anyone
have thoughts
on how I can add a column of dense vectors (created by converting a column
of sparse features) to a data frame? My efforts are below.

Although I know this is not the best approach for something I plan to put
in production, I have been trying to write a udf to turn the sparse vector
into a dense one and apply the udf in withColumn(). withColumn() complains
that the data is a tuple. I think the issue might be the datatype
parameter. The function returns a vector of doubles but there is no type
that would be adequate for this.


sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])), DoubleType())
denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures", sparseToDense("features"))
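
One thing I have not tried yet (a rough, untested sketch): since the function
returns a whole vector of doubles rather than a single double, the udf's return
type may need to be VectorUDT() instead of DoubleType(), with the float()/list
wrapping dropped:

from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, VectorUDT  # pyspark.mllib.linalg on pre-2.0

# Return the DenseVector itself; declare the vector type as the udf's return type.
sparseToDense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
denseTrainingRdf = trainingRdfAssemb.withColumn("denseFeatures", sparseToDense("features"))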

The function works outside the udf, but I am unable to add an arbitrary
column to the data frame I started out working with.

denseFeatures=TrainingRdf.select("features").map(lambda data: DenseVector([data.features.toArray()]))
denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures", denseFeatures)
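
Another untested idea: withColumn() only accepts a Column expression, not a
separately computed RDD, so one workaround might be to zip the original rows
with the computed vectors and rebuild the data frame. This assumes the two RDDs
have identical partitioning and element order, which zip() requires:

from pyspark.sql import Row

def with_dense(row, vec):
    # Copy the original row's fields and attach the dense vector as a new field.
    d = row.asDict()
    d["denseFeatures"] = vec
    return Row(**d)

pairs = trainingRdfAssemb.rdd.zip(denseFeatures)
denseTrainingRdf = pairs.map(lambda rv: with_dense(rv[0], rv[1])).toDF()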

Thanks,
Tobi

On Thu, Aug 11, 2016 at 5:02 AM, Sean Owen <sowen@cloudera.com> wrote:

> No, that doesn't describe the change being discussed, since you've
> copied the discussion about adding an 'offset'. That's orthogonal.
> You're also suggesting making withMean=True the default, which we
> don't want. The point is that if this is *explicitly* requested, the
> scaler shouldn't refuse to subtract the mean from a sparse vector, and
> fail.
>
> On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede <ani.tobib@gmail.com> wrote:
> > Sean,
> >
> > I have created a jira; I hope you don't mind that I borrowed your
> > explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
> >
> > So what did you do to standardize your data, if you didn't use
> > standardScaler? Did you write a udf to subtract mean and divide by
> > standard deviation?
> >
> > Although I know this is not the best approach for something I plan to put
> > in production, I have been trying to write a udf to turn the sparse vector
> > into a dense one and apply the udf in withColumn(). withColumn() complains
> > that the data is a tuple. I think the issue might be the datatype parameter.
> > The function returns a vector of doubles but there is no type that would be
> > adequate for this.
> >
> > sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])), DoubleType())
> > denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures", sparseToDense("features"))
> >
> > The function works outside the udf, but I am unable to add an arbitrary
> > column to the data frame I started out working with. Thoughts?
> >
> > denseFeatures=TrainingRdf.select("features").map(lambda data: DenseVector([data.features.toArray()]))
> > denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures", denseFeatures)
> >
> > Thanks,
> > Tobi
> >
> >
> > On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
> >>
> >> Ah right, got it. As you say for storage it helps significantly, but for
> >> operations I suspect it puts one back in a "dense-like" position. Still,
> >> for online / mini-batch algorithms it may still be feasible I guess.
> >> On Wed, 10 Aug 2016 at 19:50, Sean Owen <sowen@cloudera.com> wrote:
> >>>
> >>> All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
> >>> represents 0 3 0 7. Imagine it also has an offset stored which applies to
> >>> all elements. If it is -2 then it now represents -2 1 -2 5, but this
> >>> requires just one extra value to store. It only helps with storage of a
> >>> shifted sparse vector; iterating still typically requires iterating all
> >>> elements.
> >>>
> >>> Probably, where this would help, the caller can track this offset and
> >>> even more efficiently apply this knowledge. I remember digging into this
> >>> in how sparse covariance matrices are computed. It almost but not quite
> >>> enabled an optimization.
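
(Just to check I follow the idea -- a toy sketch, not an existing Spark type,
of a sparse vector carrying such an offset:)

import numpy as np

class OffsetSparseVector(object):
    """Sparse vector plus one scalar 'offset' conceptually added to every element."""
    def __init__(self, size, indices, values, offset=0.0):
        self.size, self.indices = size, list(indices)
        self.values, self.offset = list(values), offset

    def to_dense(self):
        arr = np.zeros(self.size)
        arr[self.indices] = self.values
        return arr + self.offset  # the shift applies to stored values and implicit zeros alike

# Sean's example: {1: 3, 3: 7} over 4 elements, shifted by -2 -> [-2, 1, -2, 5]
v = OffsetSparseVector(4, [1, 3], [3.0, 7.0], offset=-2.0)
print(v.to_dense())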
> >>>
> >>>
> >>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentreath@gmail.com>
> >>> wrote:
> >>>>
> >>>> Sean by 'offset' do you mean basically subtracting the mean but only
> >>>> from the non-zero elements in each row?
> >>>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <sowen@cloudera.com> wrote:
> >>>>>
> >>>>> Yeah I had thought the same, that perhaps it's fine to let the
> >>>>> StandardScaler proceed, if it's explicitly asked to center, rather
> >>>>> than refuse to. It's not really much more rope to let a user hang
> >>>>> herself with, and refusing blocks legitimate usages (we ran into this last
> >>>>> week and couldn't use StandardScaler as a result).
> >>>>>
> >>>>> I'm personally supportive of the change and don't see a JIRA. I think
> >>>>> you could at least make one.
> >>>>>
> >>>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.tobib@gmail.com>
> >>>>> wrote:
> >>>>> > Thanks Sean, I agree 100% that the math is math and dense vs sparse
> >>>>> > is just a matter of representation. I was trying to convince a
> >>>>> > co-worker of this to no avail. Sending this email was mainly a sanity
> >>>>> > check.
> >>>>> >
> >>>>> > I think having an offset would be a great idea, although I am not
> >>>>> > sure how to implement this. However, if anything should be done to
> >>>>> > rectify this issue, it should be done in the standardScaler, not
> >>>>> > vectorAssembler. There should not be any forcing of vectorAssembler
> >>>>> > to produce only dense vectors so as to avoid performance problems
> >>>>> > with data that does not fit in memory. Furthermore, not every machine
> >>>>> > learning algo requires standardization. Instead, standardScaler
> >>>>> > should have withMean=True as default and should apply an offset if
> >>>>> > the vector is sparse, whereas there would be normal subtraction if
> >>>>> > the vector is dense. This way the default behavior of standardScaler
> >>>>> > will always be what is generally understood to be standardization, as
> >>>>> > opposed to people thinking they are standardizing when they actually
> >>>>> > are not.
> >>>>> >
> >>>>> > Can anyone confirm whether there is a jira already?
> >>>>> >
> >>>>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <sowen@cloudera.com>
> >>>>> > wrote:
> >>>>> >>
> >>>>> >> Dense vs sparse is just a question of representation, so doesn't
> >>>>> >> make an operation on a vector more or less important as a result.
> >>>>> >> You've identified the reason that subtracting the mean can be
> >>>>> >> undesirable: a notionally billion-element sparse vector becomes too
> >>>>> >> big to fit in memory at once.
> >>>>> >>
> >>>>> >> I know this came up as a problem recently (I think there's a JIRA?)
> >>>>> >> because VectorAssembler will *sometimes* output a small dense vector
> >>>>> >> and sometimes output a small sparse vector based on how many zeroes
> >>>>> >> there are. But that's bad because then the StandardScaler can't
> >>>>> >> process the output at all. You can work on this if you're interested;
> >>>>> >> I think the proposal was to be able to force a dense representation
> >>>>> >> only in VectorAssembler. I don't know if that's the nature of the
> >>>>> >> problem you're hitting.
> >>>>> >>
> >>>>> >> It can be meaningful to only scale the dimension without centering
> >>>>> >> it, but it's not the same thing, no. The math is the math.
> >>>>> >>
> >>>>> >> This has come up a few times -- it's necessary to center a sparse
> >>>>> >> vector but prohibitive to do so. One idea I'd toyed with in the past
> >>>>> >> was to let a sparse vector have an 'offset' value applied to all
> >>>>> >> elements. That would let you shift all values while preserving a
> >>>>> >> sparse representation. I'm not sure if it's worth implementing but
> >>>>> >> would help this case.
> >>>>> >>
> >>>>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.tobib@gmail.com>
> >>>>> >> wrote:
> >>>>> >> > Hi everyone,
> >>>>> >> >
> >>>>> >> > I am doing some standardization using standardScaler on data from
> >>>>> >> > VectorAssembler which is represented as sparse vectors. I plan to
> >>>>> >> > fit a regularized model. However, standardScaler does not allow the
> >>>>> >> > mean to be subtracted from sparse vectors. It will only divide by
> >>>>> >> > the standard deviation, which I understand is to keep the vector
> >>>>> >> > sparse. Thus I am trying to convert my sparse vectors into dense
> >>>>> >> > vectors, but this may not be worthwhile.
> >>>>> >> >
> >>>>> >> > So my questions are:
> >>>>> >> > Is subtracting the mean during standardization only important when
> >>>>> >> > working with dense vectors? Does it not matter for sparse vectors?
> >>>>> >> > Is just dividing by the standard deviation with sparse vectors
> >>>>> >> > equivalent to also dividing by standard deviation and subtracting
> >>>>> >> > mean with dense vectors?
> >>>>> >> >
> >>>>> >> > Thank you,
> >>>>> >> > Tobi
> >>>>> >
> >>>>> >
> >>>>>
