spark-user mailing list archives

From DB Tsai <dbt...@stanford.edu>
Subject Re: Logistic Regression MLLib Slow
Date Thu, 05 Jun 2014 06:55:56 GMT
Hi Krishna,

It should work, and we use it in production with great success.
However, the constructor of LogisticRegressionModel is private[mllib],
so you have to write your own code in a source file whose package is
under org.apache.spark.mllib, instead of using the Scala console.
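
For readers following the quoted snippet below: it treats the last element of weightsWithIntercept as the intercept and everything before it as the feature weights. A minimal sketch of that split in plain Python, without Spark, might look like the following. The names split_weights, sigmoid and predict are hypothetical helpers used only to illustrate the layout; they are not part of MLlib, and the 0.5 threshold is an assumption.

```python
import math


def split_weights(weights_with_intercept):
    """Return (weights, intercept), treating the last element as the intercept."""
    values = list(weights_with_intercept)
    if len(values) < 2:
        raise ValueError("need at least one weight plus an intercept")
    return values[:-1], values[-1]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, intercept, features, threshold=0.5):
    """Hypothetical helper: 1.0 if the logistic score exceeds threshold, else 0.0."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 if sigmoid(margin) > threshold else 0.0


# Example: three feature weights followed by the intercept.
weights, intercept = split_weights([0.5, -1.0, 2.0, 0.25])
print(weights, intercept)
```

This only mirrors the slicing in the quoted Scala code; constructing an actual LogisticRegressionModel in Scala still needs the package workaround described above.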

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Jun 4, 2014 at 11:47 PM, Srikrishna S <srikrishna097@gmail.com> wrote:
> Does L-BFGS work with Spark 1.0? (See the code sample below.)
>
> Eventually, I would like to have L-BFGS working, but I was facing an issue
> where 10 passes over the data were taking forever. I ran Spark in standalone
> mode and the performance is much better!
>
> Regards,
> Krishna
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> I am using http://spark.apache.org/docs/latest/mllib-optimization.html
>
> scala> val model = new LogisticRegressionModel(
>      |   Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)),
>      |   weightsWithIntercept(weightsWithIntercept.size - 1))
> <console>:20: error: constructor LogisticRegressionModel in class
> LogisticRegressionModel cannot be accessed in class $iwC
>        val model = new LogisticRegressionModel(
>
>
> Based on the documentation, it would seem like LogisticRegressionModel
> doesn't have a constructor:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel
>
> LogisticRegressionWithSGD *does* have a constructor:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>
>
>
> On Wed, Jun 4, 2014 at 11:33 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>
>> Hi Krishna,
>>
>> Also, the default optimizer with SGD converges really slowly. If you are
>> willing to write Scala code, there is a full working example for
>> training Logistic Regression with L-BFGS (a quasi-Newton method) in
>> Scala. It converges way faster than SGD.
>>
>> See
>> http://spark.apache.org/docs/latest/mllib-optimization.html
>> for detail.
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>> > Hi Krishna,
>> >
>> > Specifying executor memory in local mode has no effect, because all of
>> > the threads run inside the same JVM. You can either try
>> > --driver-memory 60g or start a standalone server.
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>> >> 80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
>> >> take that long, even on a single executor. Besides what Matei
>> >> suggested, could you also verify the executor memory in
>> >> http://localhost:4040 in the Executors tab. It is very likely the
>> >> executors do not have enough memory. In that case, caching may be
>> >> slower than reading directly from disk. -Xiangrui
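
One way to arrive at the "about 2.5GB" figure above is to assume each of the 4 feature values is held as an 8-byte double. The sketch below is only that arithmetic; the function name dense_size_bytes is hypothetical, and it ignores the label column and any JVM or Python object overhead, so real cached size may be larger.

```python
def dense_size_bytes(rows, features, bytes_per_value=8):
    """Rough size of a dense numeric dataset, assuming fixed-width values."""
    return rows * features * bytes_per_value


# 80M rows by 4 features, 8 bytes per value (assumption: double precision).
size = dense_size_bytes(80 * 10**6, 4)
print(size / 1e9, "GB")
```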
>> >>
>> >> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaharia@gmail.com>
>> >> wrote:
>> >>> Ah, is the file gzipped by any chance? We can’t decompress gzipped files in
>> >>> parallel so they get processed by a single task.
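
On the gzip question above: if the file extension is missing or unreliable, one way to check is to look at the first two bytes of the file, which are 0x1f 0x8b for any gzip stream. The sketch below uses only the Python standard library; the helper name is_gzipped and the throwaway file names are hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Small demonstration on throwaway files (hypothetical names).
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "data.csv")
gz_path = os.path.join(tmp_dir, "data.csv.gz")
with open(plain_path, "w") as f:
    f.write("1.0,0.1,0.2,0.3,0.4\n")
with gzip.open(gz_path, "wt") as f:
    f.write("1.0,0.1,0.2,0.3,0.4\n")

print(is_gzipped(plain_path), is_gzipped(gz_path))
```

If the input does turn out to be gzipped, decompressing it before loading is one option for letting it be read as more than one partition.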
>> >>>
>> >>> It may also be worth looking at the application UI (http://localhost:4040)
>> >>> to see 1) whether all the data fits in memory in the Storage tab (maybe it
>> >>> somehow becomes larger, though it seems unlikely that it would exceed 20 GB)
>> >>> and 2) how many parallel tasks run in each iteration.
>> >>>
>> >>> Matei
>> >>>
>> >>> On Jun 4, 2014, at 6:56 PM, Srikrishna S <srikrishna097@gmail.com>
>> >>> wrote:
>> >>>
>> >>> I am using the MLlib one (LogisticRegressionWithSGD) with PySpark. I am
>> >>> running only 10 iterations.
>> >>>
>> >>> The MLlib version of logistic regression doesn't seem to use all the cores
>> >>> on my machine.
>> >>>
>> >>> Regards,
>> >>> Krishna
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
>> >>>>
>> >>>> Are you using the logistic_regression.py in examples/src/main/python or
>> >>>> examples/src/main/python/mllib? The first one is an example of writing
>> >>>> logistic regression by hand and won’t be as efficient as the MLlib one. I
>> >>>> suggest trying the MLlib one.
>> >>>>
>> >>>> You may also want to check how many iterations it runs. By default I
>> >>>> think it runs 100, which may be more than you need.
>> >>>>
>> >>>> Matei
>> >>>>
>> >>>> On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna097@gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>> > Hi All,
>> >>>> >
>> >>>> > I am new to Spark and I am trying to run LogisticRegression (with SGD)
>> >>>> > using MLlib on a beefy single machine with about 128 GB RAM. The dataset
>> >>>> > has about 80M rows with only 4 features, so it barely occupies 2 GB on disk.
>> >>>> >
>> >>>> > I am running the code using all 8 cores with 20G memory using
>> >>>> >     spark-submit --executor-memory 20G --master local[8] logistic_regression.py
>> >>>> >
>> >>>> > It seems to take about 3.5 hours without caching and over 5 hours with
>> >>>> > caching.
>> >>>> >
>> >>>> > What is the recommended use for Spark on a beefy single machine?
>> >>>> >
>> >>>> > Any suggestions will help!
>> >>>> >
>> >>>> > Regards,
>> >>>> > Krishna
>> >>>> >
>> >>>> >
>> >>>> > Code sample:
>> >>>> >
>> >>>> >
>> >>>> > ---------------------------------------------------------------------------------------------------------------------
>> >>>> > # Dataset
>> >>>> > d = sys.argv[1]
>> >>>> > data = sc.textFile(d)
>> >>>> >
>> >>>> > # Load and parse the data
>> >>>> > # ----------------------------------------------------------------------------------------------------------
>> >>>> > def parsePoint(line):
>> >>>> >     values = [float(x) for x in line.split(',')]
>> >>>> >     return LabeledPoint(values[0], values[1:])
>> >>>> > _parsedData = data.map(parsePoint)
>> >>>> > parsedData = _parsedData.cache()
>> >>>> > results = {}
>> >>>> >
>> >>>> > # Spark
>> >>>> > # ----------------------------------------------------------------------------------------------------------
>> >>>> > start_time = time.time()
>> >>>> > # Build the gl_model
>> >>>> > niters = 10
>> >>>> > spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
>> >>>> >
>> >>>> > # Evaluate the gl_model on training data
>> >>>> > labelsAndPreds = parsedData.map(lambda p: (p.label, spark_model.predict(p.features)))
>> >>>> > trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
>> >>>> >
>> >>>>
>> >>>
>> >>>
>
>
