spark-user mailing list archives

From Nick Pentreath <nick.pentre...@gmail.com>
Subject Re: can mllib Logistic Regression package handle 10 million sparse features?
Date Tue, 11 Oct 2016 10:09:10 GMT
That's a good point about shuffle data compression. Still, it would be good
to benchmark the ideas behind https://github.com/apache/spark/pull/12761 I
think.

For many datasets, even within one partition the gradient sums etc. can
remain very sparse. For example, the Criteo DAC data is extremely sparse,
with roughly 5% of features active per partition. However, you're correct
that as the coefficients (and intermediate stats counters) get aggregated
they will become denser and denser. There is also the intermediate
memory overhead of the dense structures, though that only comes into play
in the hundreds-of-millions to billions feature range.
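To make the densification argument concrete, here is a toy sketch (plain Scala, no Spark; all names and sizes are illustrative) of how per-partition sparse gradient sums grow denser as they are merged up the aggregation tree:

```scala
object SparsityDemo {
  // A sparse gradient sum represented as index -> value.
  type SparseGrad = Map[Int, Double]

  // Merging two partial sums unions their active index sets,
  // so density can only grow as aggregation proceeds.
  def merge(a: SparseGrad, b: SparseGrad): SparseGrad =
    (a.keySet ++ b.keySet)
      .map(i => i -> (a.getOrElse(i, 0.0) + b.getOrElse(i, 0.0)))
      .toMap

  def density(g: SparseGrad, dim: Int): Double = g.size.toDouble / dim

  def main(args: Array[String]): Unit = {
    val dim = 1000
    val rng = new scala.util.Random(42)
    // 8 partitions, each with ~5% of features active (as with Criteo DAC).
    val partials: Seq[SparseGrad] =
      Seq.fill(8)(Seq.fill(50)(rng.nextInt(dim) -> rng.nextDouble()).toMap)
    val total = partials.reduce(merge)
    println(f"per-partition density ~ ${density(partials.head, dim)}%.3f")
    println(f"after merging 8 partitions ~ ${density(total, dim)}%.3f")
  }
}
```

Whether the merged result stays sparse enough to pay off depends on how much the partitions' active index sets overlap, which is the crux of the disagreement here.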

The situation in the PR above is actually different in that even the
coefficient vector itself is truly sparse (through some encoding they did,
IIRC). This is not an uncommon scenario, however: for high-dimensional
features, users may want to use feature hashing, which can result in
genuinely sparse coefficient vectors. With hashing, the feature dimension
is often chosen as a power of 2 and higher (in some cases significantly)
than the true feature dimension to reduce collisions. So sparsity is
critical here for storage efficiency.
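As a rough illustration of that point (plain Scala; the hash function and names here are placeholders, not Spark's actual `HashingTF` internals), hashing into a power-of-2 dimension lets a cheap bitmask replace the modulo, and a dimension well above the true feature count keeps collisions, and hence accidental density, low:

```scala
object HashingSketch {
  // Choose numFeatures as a power of 2 so `hash & mask` replaces `hash % n`.
  val numFeatures: Int = 1 << 20     // ~1M buckets
  val mask: Int = numFeatures - 1

  def featureIndex(feature: String): Int =
    (feature.hashCode & Int.MaxValue) & mask  // force non-negative, then mask into [0, numFeatures)

  def main(args: Array[String]): Unit = {
    val active = Seq("user=42", "item=99", "country=DE").map(featureIndex).distinct
    // Only a handful of the 2^20 coefficients are ever touched, so the
    // coefficient vector itself stays genuinely sparse.
    println(s"active indices: $active out of $numFeatures")
  }
}
```

With only a few active buckets out of 2^20, storing the coefficients densely would waste memory by orders of magnitude, which is why sparsity matters here.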

Your result for the final stage does seem to indicate something can be
improved. Perhaps it is due to the level of fetch parallelism: more
partitions may fetch more data in parallel? With just the default
settings for `treeAggregate` I was seeing much faster times for the final
stage with a 34 million feature dimension (though my final shuffle size
seems to be 50% of yours with 2x the features; this is with Spark 2.0.1,
and I haven't tested master with this data yet).
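For intuition, here is a toy simulation (plain Scala, no Spark; the fan-in and partition counts are made up) of what `treeAggregate`'s multi-level reduction buys: intermediate combines between "executors" shrink the fan-in at the final "driver" step, which is exactly the stage that looked slow above:

```scala
object TreeAggSketch {
  // Simulate treeAggregate(depth = 2): partial sums are first combined
  // in groups (the executor-to-executor shuffle), and only the group
  // results are sent to the driver for the final combine.
  def treeSum(partials: Seq[Double], fanIn: Int): (Double, Int) = {
    val groups = partials.grouped(fanIn).map(_.sum).toSeq
    (groups.sum, groups.size)  // total, and how many values the driver sees
  }

  def main(args: Array[String]): Unit = {
    val partials = Seq.tabulate(64)(_.toDouble)  // one partial sum per partition
    val (flatTotal, flatFanIn) = (partials.sum, partials.size)   // depth = 1
    val (treeTotal, treeFanIn) = treeSum(partials, fanIn = 8)    // depth = 2
    assert(flatTotal == treeTotal)  // same answer, far smaller driver fan-in
    println(s"driver fan-in: $flatFanIn (flat) vs $treeFanIn (tree)")
  }
}
```

The real `treeAggregate` does this over the network with full gradient vectors rather than doubles, so the driver fan-in directly governs how much data the final stage must fetch.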

[image: Screen Shot 2016-10-11 at 12.03.55 PM.png]



On Fri, 7 Oct 2016 at 08:11 DB Tsai <dbtsai@dbtsai.com> wrote:

> Hi Nick,
>
> I'm also working on benchmarks of linear models in Spark. :)
>
> One thing I saw is that for sparse features (14 million features) with
> multi-depth aggregation, the final aggregation to the driver is
> extremely slow. See the attachment. The amount of data exchanged
> between executors is significantly larger than what is collected to
> the driver, but collecting the data back to the driver takes 4 minutes
> while the aggregation between executors takes only 20 seconds. It
> seems the code path is different, and I suspect there may be something
> in Spark core that we can optimize.
>
> Regarding using a sparse data structure for aggregation, I'm not so
> sure how much this will improve performance. After computing the
> gradient sum for all the data in one partition, the vector will no
> longer be very sparse. Even if it is sparse, after a couple of levels
> of aggregation it will be very dense. Also, we perform compression in
> the shuffle phase, so if there are many zeros, a dense vector
> representation should take around the same size as a sparse
> representation. I could be wrong since I've never studied this, and I
> wonder how much performance we can gain in practice by using sparse
> vectors for aggregating the gradients.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Thu, Oct 6, 2016 at 4:09 AM, Nick Pentreath <nick.pentreath@gmail.com>
> wrote:
>
> > I'm currently working on various performance tests for large, sparse
> > feature spaces.
> >
> > For the Criteo DAC data (45.8 million rows, 34.3 million features,
> > categorical, extremely sparse), the time per iteration for
> > ml.LogisticRegression is about 20-30s.
> >
> > This is with 4 worker nodes, 48 cores & 120GB RAM each. I haven't yet
> > tuned the tree aggregation depth. But the number of partitions can make
> > a difference; generally fewer is better, since the cost is mostly
> > communication of the gradient (the gradient computation is < 10% of the
> > per-iteration time).
> >
> > Note that the current impl forces dense arrays for intermediate data
> > structures, increasing the communication cost significantly. See this PR
> > for info: https://github.com/apache/spark/pull/12761. Once sparse data
> > structures are supported for this, the linear models will be orders of
> > magnitude more scalable for sparse data.
> >
> > On Wed, 5 Oct 2016 at 23:37 DB Tsai <dbtsai@dbtsai.com> wrote:
>
> >> With the latest code in the current master, we're successfully
> >> training LOR using Spark ML's implementation with 14M sparse features.
> >> You need to tune the depth of aggregation to make it efficient.
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> ----------------------------------------------------------
> >> Web: https://www.dbtsai.com
> >> PGP Key ID: 0x9DCC1DBD7FC7BBB2
> >>
> >> On Wed, Oct 5, 2016 at 12:00 PM, Yang <teddyyyy123@gmail.com> wrote:
> >> > anybody had actual experience applying it to real problems of this
> >> > scale?
> >> >
> >> > thanks
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
