spark-user mailing list archives

From Tobi Bosede <ani.to...@gmail.com>
Subject Re: Standardization with Sparse Vectors
Date Thu, 11 Aug 2016 18:21:32 GMT
Can someone also provide input on why my code may not be working? Below I
have pasted the part of my previous reply that describes the issue. I am
really more perplexed about the first set of code (in bold); I know why the
second set doesn't work, as it was just something I initially tried.

>> Although I know this is not the best approach for something I plan to
>> put in production, I have been trying to write a udf to turn the sparse
>> vector into a dense one and apply the udf in withColumn(). withColumn()
>> complains that the data is a tuple. I think the issue might be the
>> datatype parameter. The function returns a vector of doubles but there
>> is no type that would be adequate for this.
>>
*>> sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
>> DoubleType())
>> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>> sparseToDense("features"))*
>>
>> However the function works outside the udf, but I am unable to add an
>> arbitrary column to the data frame I started out working with. *Thoughts?*
>>
>> denseFeatures=TrainingRdf.select("features").map(lambda data:
>> DenseVector([data.features.toArray()]))
>> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>> denseFeatures)
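For reference, a minimal sketch of a udf that should work here (an untested
sketch assuming Spark 2.0's pyspark.ml.linalg): return the vector itself
rather than wrapping it in float(), and declare VectorUDT() as the return
type, since DoubleType() cannot describe a vector of doubles.

    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import DenseVector, VectorUDT

    # Return the DenseVector itself and declare a matching return type;
    # VectorUDT() is the SQL type that describes ML vectors.
    sparseToDense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
    denseTrainingRdf = trainingRdfAssemb.withColumn("denseFeatures",
                                                    sparseToDense("features"))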

On Thu, Aug 11, 2016 at 12:55 PM, Sean Owen <sowen@cloudera.com> wrote:

> I should be more clear, since the outcome of the discussion above was
> not that obvious actually.
>
> - I agree a change should be made to StandardScaler, and not
> VectorAssembler
> - However I do think withMean should still be false by default and be
> explicitly enabled
> - The 'offset' idea is orthogonal, and as Nick says may be problematic
> anyway a step or two down the line. I'm proposing just converting to
> dense vectors if asked to center (which is why it shouldn't be the
> default)
>
> Indeed to answer your question, that's how I had resolved this in user
> code earlier. It's the same thing you're suggesting here, to make a
> UDF that converts the vectors to dense vectors manually.
>
> I updated the JIRA accordingly, to suggest converting to DenseVector
> in StandardScaler if withMean is set explicitly to true. I think we
> should consider something like the 'offset' idea separately if at all.
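For context, a sketch of the usage that change would enable (hypothetical
until the JIRA is resolved; today this combination fails on sparse input):

    from pyspark.ml.feature import StandardScaler

    # Explicitly opting in to centering would densify internally
    # rather than raise an error on sparse vectors.
    scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                            withMean=True, withStd=True)
    scaled = scaler.fit(trainingRdfAssemb).transform(trainingRdfAssemb)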
>
> On Thu, Aug 11, 2016 at 11:02 AM, Sean Owen <sowen@cloudera.com> wrote:
> > No, that doesn't describe the change being discussed, since you've
> > copied the discussion about adding an 'offset'. That's orthogonal.
> > You're also suggesting making withMean=True the default, which we
> > don't want. The point is that if this is *explicitly* requested, the
> > scaler shouldn't refuse to subtract the mean from a sparse vector, and
> > fail.
> >
> > On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede <ani.tobib@gmail.com> wrote:
> >> Sean,
> >>
> >> I have created a jira; I hope you don't mind that I borrowed your
> >> explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
> >>
> >> So what did you do to standardize your data, if you didn't use
> >> StandardScaler? Did you write a udf to subtract mean and divide by
> >> standard deviation?
> >>
> >> Although I know this is not the best approach for something I plan to
> >> put in production, I have been trying to write a udf to turn the sparse
> >> vector into a dense one and apply the udf in withColumn(). withColumn()
> >> complains that the data is a tuple. I think the issue might be the
> >> datatype parameter. The function returns a vector of doubles but there
> >> is no type that would be adequate for this.
> >>
> >> sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
> >> DoubleType())
> >> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> >> sparseToDense("features"))
> >>
> >> However the function works outside the udf, but I am unable to add an
> >> arbitrary column to the data frame I started out working with. Thoughts?
> >>
> >> denseFeatures=TrainingRdf.select("features").map(lambda data:
> >> DenseVector([data.features.toArray()]))
> >> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> >> denseFeatures)
> >>
> >> Thanks,
> >> Tobi
> >>
> >>
> >> On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <nick.pentreath@gmail.com>
> >> wrote:
> >>>
> >>> Ah right, got it. As you say for storage it helps significantly, but
> >>> for operations I suspect it puts one back in a "dense-like" position.
> >>> Still, for online / mini-batch algorithms it may still be feasible I guess.
> >>> On Wed, 10 Aug 2016 at 19:50, Sean Owen <sowen@cloudera.com> wrote:
> >>>>
> >>>> All elements, I think. Imagine a sparse vector 1:3 3:7 which
> >>>> conceptually represents 0 3 0 7. Imagine it also has an offset stored
> >>>> which applies to all elements. If it is -2 then it now represents
> >>>> -2 1 -2 5, but this requires just one extra value to store. It only
> >>>> helps with storage of a shifted sparse vector; iterating still
> >>>> typically requires iterating all elements.
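A hypothetical sketch of that representation, purely illustrative (no such
class exists in Spark):

    # One extra stored value conceptually shifts every element, so
    # centering costs O(1) extra storage instead of densifying.
    class OffsetSparseVector:
        def __init__(self, size, indices, values, offset=0.0):
            self.size = size
            self.nonzeros = dict(zip(indices, values))  # stored entries only
            self.offset = offset

        def __getitem__(self, i):
            # Stored entries and implicit zeros are both shifted.
            return self.nonzeros.get(i, 0.0) + self.offset

    v = OffsetSparseVector(4, [1, 3], [3.0, 7.0], offset=-2.0)
    print([v[i] for i in range(v.size)])  # [-2.0, 1.0, -2.0, 5.0]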
> >>>>
> >>>> Probably, where this would help, the caller can track this offset
> >>>> and even more efficiently apply this knowledge. I remember digging
> >>>> into this in how sparse covariance matrices are computed. It almost
> >>>> but not quite enabled an optimization.
> >>>>
> >>>>
> >>>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentreath@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Sean by 'offset' do you mean basically subtracting the mean but
> >>>>> only from the non-zero elements in each row?
> >>>>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <sowen@cloudera.com> wrote:
> >>>>>>
> >>>>>> Yeah I had thought the same, that perhaps it's fine to let the
> >>>>>> StandardScaler proceed, if it's explicitly asked to center, rather
> >>>>>> than refuse to. It's not really much more rope to let a user hang
> >>>>>> herself with, and it blocks legitimate usages (we ran into this
> >>>>>> last week and couldn't use StandardScaler as a result).
> >>>>>>
> >>>>>> I'm personally supportive of the change and don't see a JIRA. I
> >>>>>> think you could at least make one.
> >>>>>>
> >>>>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.tobib@gmail.com>
> >>>>>> wrote:
> >>>>>> > Thanks Sean, I agree 100% that the math is math and dense vs
> >>>>>> > sparse is just a matter of representation. I was trying to
> >>>>>> > convince a co-worker of this to no avail. Sending this email was
> >>>>>> > mainly a sanity check.
> >>>>>> >
> >>>>>> > I think having an offset would be a great idea, although I am not
> >>>>>> > sure how to implement this. However, if anything should be done
> >>>>>> > to rectify this issue, it should be done in StandardScaler, not
> >>>>>> > VectorAssembler. There should not be any forcing of VectorAssembler
> >>>>>> > to produce only dense vectors, so as to avoid performance problems
> >>>>>> > with data that does not fit in memory. Furthermore, not every
> >>>>>> > machine learning algo requires standardization. Instead,
> >>>>>> > StandardScaler should have withMean=True as default and should
> >>>>>> > apply an offset if the vector is sparse, whereas there would be
> >>>>>> > normal subtraction if the vector is dense. This way the default
> >>>>>> > behavior of StandardScaler will always be what is generally
> >>>>>> > understood to be standardization, as opposed to people thinking
> >>>>>> > they are standardizing when they actually are not.
> >>>>>> >
> >>>>>> > Can anyone confirm whether there is a jira already?
> >>>>>> >
> >>>>>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <sowen@cloudera.com>
> >>>>>> > wrote:
> >>>>>> >>
> >>>>>> >> Dense vs sparse is just a question of representation, so it
> >>>>>> >> doesn't make an operation on a vector more or less important as
> >>>>>> >> a result. You've identified the reason that subtracting the mean
> >>>>>> >> can be undesirable: a notionally billion-element sparse vector
> >>>>>> >> becomes too big to fit in memory at once.
> >>>>>> >>
> >>>>>> >> I know this came up as a problem recently (I think there's a
> >>>>>> >> JIRA?) because VectorAssembler will *sometimes* output a small
> >>>>>> >> dense vector and sometimes output a small sparse vector based on
> >>>>>> >> how many zeroes there are. But that's bad because then the
> >>>>>> >> StandardScaler can't process the output at all. You can work on
> >>>>>> >> this if you're interested; I think the proposal was to be able to
> >>>>>> >> force a dense representation only in VectorAssembler. I don't
> >>>>>> >> know if that's the nature of the problem you're hitting.
> >>>>>> >>
> >>>>>> >> It can be meaningful to only scale the dimension without
> >>>>>> >> centering it, but it's not the same thing, no. The math is
> >>>>>> >> the math.
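A quick numeric illustration of that point (illustrative only, using the
1:3 3:7 example from earlier in the thread):

    import numpy as np

    x = np.array([0.0, 3.0, 0.0, 7.0])  # dense view of sparse vector 1:3 3:7
    mu, sigma = x.mean(), x.std()

    scaled_only = x / sigma           # divide by std only; zeros stay zero
    standardized = (x - mu) / sigma   # subtract the mean as well

    print(scaled_only.mean())    # not 0: scaling alone does not center
    print(standardized.mean())   # 0 (up to floating point)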
> >>>>>> >>
> >>>>>> >> This has come up a few times -- it's necessary to center a
> >>>>>> >> sparse vector but prohibitive to do so. One idea I'd toyed with
> >>>>>> >> in the past was to let a sparse vector have an 'offset' value
> >>>>>> >> applied to all elements. That would let you shift all values
> >>>>>> >> while preserving a sparse representation. I'm not sure if it's
> >>>>>> >> worth implementing but would help this case.
> >>>>>> >>
> >>>>>> >>
> >>>>>> >>
> >>>>>> >>
> >>>>>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.tobib@gmail.com>
> >>>>>> >> wrote:
> >>>>>> >> > Hi everyone,
> >>>>>> >> >
> >>>>>> >> > I am doing some standardization using StandardScaler on data
> >>>>>> >> > from VectorAssembler which is represented as sparse vectors. I
> >>>>>> >> > plan to fit a regularized model. However, StandardScaler does
> >>>>>> >> > not allow the mean to be subtracted from sparse vectors. It
> >>>>>> >> > will only divide by the standard deviation, which I understand
> >>>>>> >> > is to keep the vector sparse. Thus I am trying to convert my
> >>>>>> >> > sparse vectors into dense vectors, but this may not be
> >>>>>> >> > worthwhile.
> >>>>>> >> >
> >>>>>> >> > So my questions are:
> >>>>>> >> > Is subtracting the mean during standardization only important
> >>>>>> >> > when working with dense vectors? Does it not matter for sparse
> >>>>>> >> > vectors? Is just dividing by the standard deviation with sparse
> >>>>>> >> > vectors equivalent to dividing by the standard deviation and
> >>>>>> >> > subtracting the mean with dense vectors?
> >>>>>> >> >
> >>>>>> >> > Thank you,
> >>>>>> >> > Tobi
> >>>>>> >
> >>>>>> >
> >>>>>>
