spark-user mailing list archives

From "Venkat, Ankam" <Ankam.Ven...@centurylink.com>
Subject Python Logistic Regression error
Date Sun, 23 Nov 2014 19:38:04 GMT
Can you please suggest sample data for running the logistic_regression.py?

I am trying to use the sample data file at https://github.com/apache/spark/blob/master/data/mllib/sample_linear_regression_data.txt

I am running this on CDH5.2 Quickstart VM.

[cloudera@quickstart mllib]$ spark-submit logistic_regression.py lr.txt 3

But I am getting the error below.

14/11/23 11:23:55 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
14/11/23 11:23:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/23 11:23:55 INFO TaskSchedulerImpl: Cancelling stage 0
14/11/23 11:23:55 INFO DAGScheduler: Failed to run runJob at PythonRDD.scala:296
Traceback (most recent call last):
  File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 50, in <module>
    model = LogisticRegressionWithSGD.train(points, iterations)
  File "/usr/lib/spark/python/pyspark/mllib/classification.py", line 110, in train
    initialWeights)
  File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 430, in _regression_train_wrapper
    initial_weights = _get_initial_weights(initial_weights, data)
  File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 415, in _get_initial_weights
    initial_weights = _convert_vector(data.first().features)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1127, in first
    rs = self.take(1)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1109, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/usr/lib/spark/python/pyspark/context.py", line 770, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.139.145): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/usr/lib/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1105, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 37, in parsePoint
    values = [float(s) for s in line.split(' ')]
ValueError: invalid literal for float(): 1:0.4551273600657362
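
The ValueError above is raised when float() receives the token 1:0.4551273600657362, which looks like an index:value pair from a LIBSVM-style sparse file, while the traceback shows parsePoint calling float() on every space-separated token. Below is a minimal sketch, assuming each input line has the form "label index:value index:value ..." with 1-based indices, of converting such a line into the plain space-separated numbers that parsePoint can read. The names parse_libsvm_line, to_dense_line and num_features, and the sample line itself, are hypothetical and not part of the Spark example.

```python
def parse_libsvm_line(line, num_features):
    """Turn 'label idx:val idx:val ...' into [label, f1, ..., fn].

    Assumption: indices are 1-based and features that are not listed are 0.0.
    """
    parts = line.strip().split()
    label = float(parts[0])
    features = [0.0] * num_features
    for token in parts[1:]:
        index, value = token.split(':')
        features[int(index) - 1] = float(value)  # shift to 0-based position
    return [label] + features


def to_dense_line(values):
    """Join numbers with single spaces, the layout parsePoint splits on."""
    return ' '.join(str(v) for v in values)


# Hypothetical sample line; only its second token is taken from the error above.
sample = "1 1:0.4551273600657362 3:2.0"
dense = to_dense_line(parse_libsvm_line(sample, num_features=3))
print(dense)
```

Note that a file meant for linear regression is likely to carry real-valued labels, so even after conversion it may not be meaningful input for a classifier. If the installed PySpark version provides MLUtils.loadLibSVMFile in pyspark.mllib.util, loading the file with that helper may be an alternative to converting it by hand.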

Regards,
Venkat
