spark-dev mailing list archives

From Sean Owen <>
Subject Re: Question about differences between batch and streaming training of LogisticRegression Algorithm in Spark3.0
Date Wed, 09 Sep 2020 12:45:55 GMT
I'm not sure that second count can be optimized away, as it's used a few times.
Are you sure it takes that long? How are you measuring that, and could it
instead be the effect of caching the data the first time?
What is the nature of the data, such that it takes that long?
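
One way to check the caching question is to time the same actions on a persisted RDD more than once. The following is only a minimal sketch, not the code path inside StreamingLogisticRegressionWithSGD: the object name TimeActions, the timed helper, the local[2] master, and the synthetic (label, features) batch are all assumptions made for illustration. The isEmpty call and the first count have to compute the data, while a repeated count can read the cached partitions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object TimeActions {
  // Hypothetical helper: run a block and return (result, elapsed milliseconds).
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session stands in for the real streaming job.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("time-actions")
      .getOrCreate()
    val sc = spark.sparkContext

    // Synthetic stand-in for one batch of (label, [feature values]) pairs.
    val batch = sc.parallelize(1 to 200000, 4).map { i =>
      val label = (i % 2).toDouble
      val features = Array.tabulate(10)(j => math.sin(i.toDouble * (j + 1)))
      (label, features)
    }.persist(StorageLevel.MEMORY_ONLY)

    // isEmpty only needs one element, so it usually computes a single partition.
    val (empty, tEmpty) = timed(batch.isEmpty())
    // The first count computes (and caches) the partitions not yet materialized.
    val (n1, tFirst) = timed(batch.count())
    // The second count can be served from the cache.
    val (n2, tSecond) = timed(batch.count())

    println(s"isEmpty=$empty took ${tEmpty} ms")
    println(s"first count=$n1 took ${tFirst} ms")
    println(s"second count=$n2 took ${tSecond} ms")

    batch.unpersist()
    spark.stop()
  }
}
```

If the second count is much faster than the first, most of the measured time is probably the cost of producing the data rather than the count itself. The absolute numbers depend on the cluster and the data, so they are not shown here.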

On Wed, Sep 9, 2020 at 6:21 AM cfang1109 <> wrote:
> We want to train an LR model on socket streaming data with StreamingLogisticRegressionWithSGD, and we have some questions.
> 1. The trainOn method of StreamingLogisticRegressionWithSGD contains code like this:
> data.foreachRDD{ (rdd, time) =>
>        if (!rdd.isEmpty) { ... }
> }
> We found that rdd.isEmpty costs too much time: 2s, while training on this batch RDD costs 9s. We believe this is a point that we could optimize, but we don't know how.
> 2. The Optimizer instances of LogisticRegressionWithSGD and LogisticRegressionWithLBFGS are different: the former uses GradientDescent, while the latter uses LBFGS.
> Now the following is interesting. We found that GradientDescent contains code like this:
> val numExamples = data.count()
> // if no data, return initial weights to avoid NaNs
> if (numExamples == 0) {
>   logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found")
>   return (initialWeights, stochasticLossHistory.toArray)
> }
> if (numExamples * miniBatchFraction < 1) {
>   logWarning("The miniBatchFraction is too small")
> }
> where data is the input training data in the form (label, [feature values]).
> We found that the data.count() action costs too much time: 5s, while training on this data costs 9s.
> However, the other Optimizer implementation, LBFGS, does not have this problem.
> The interesting point is that the streaming implementation of LR is StreamingLogisticRegressionWithSGD, whose inner algorithm is LogisticRegressionWithSGD with the GradientDescent Optimizer, while the batch implementation of LR is LogisticRegressionWithLBFGS with the LBFGS Optimizer. The result is that the batch implementation of LR performs better. I think that's unacceptable; please help me. Any comment is appreciated.
