spark-user mailing list archives

From: Holden Karau <hol...@pigscanfly.ca>
Subject: Re: MLlib Spam example gets stuck in Stage X
Date: Mon, 30 Mar 2015 19:10:54 GMT
Thanks for pointing that out. I've updated the ham & spam example files;
they should be good on master now.

On Mon, Mar 30, 2015 at 10:16 AM, Xiangrui Meng <mengxr@gmail.com> wrote:

> +Holden, Joseph
>
> It seems that there is something wrong with the sample data file:
> https://github.com/databricks/learning-spark/blob/master/files/ham.txt
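>
> (One thing worth checking: sc.textFile creates one RDD element per *line*,
> not per word, so a sample file that is a single long line becomes a single
> training example. A minimal sketch, using the file linked above:
>
> val ham = sc.textFile("files/ham.txt")  // one element per line
> println(ham.count())  // prints 1 if the whole file is a single line
>
> That would also be consistent with trainingData.count() returning 2.)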
>
> -Xiangrui
>
> On Fri, Mar 27, 2015 at 1:03 PM, Su She <suhshekar52@gmail.com> wrote:
>
>> Hello Xiangrui,
>>
>> Hmm, yes, I have run other Spark examples (word count, Spark
>> Streaming/Kafka, etc.) locally, the same way I'm trying to run this MLlib
>> example (I've tried local[2] and local[4]).
>>
>> 1) I did trainingData.count() and the job completed. The output was
>> 2...should this be only 2, or 400 (since each text file has 200 words)?
>>
>> 2) I noticed the code says: val trainingData = positiveExamples ++
>> negativeExamples
>>
>> I'm not very familiar with Scala, and the ++ operator looks odd to me,
>> but when I tried a single +, it did not build.
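>>
>> From what I can tell from the docs, ++ on RDDs is just an alias for
>> union. A tiny sketch of what I think it does (toy data, just for my own
>> understanding):
>>
>> val a = sc.parallelize(Seq(1, 2))
>> val b = sc.parallelize(Seq(3, 4))
>> val c = a ++ b  // same as a.union(b); c.count() == 4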
>>
>> 3) I found a similar thread here...
>> http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3CCAFRXrqf6DRxLCSB7q1-W1PuAYAdPnB8WOmwiWE_8++Okq2cArg@mail.gmail.com%3E
>>
>> it looks like Emily had the same problem (count at
>> DataValidators.scala:38
>> <http://localhost:4040/stages/stage?id=2&attempt=0>), but it doesn't seem
>> like a solution was found. Also, I don't get any of those errors printed
>> to the console.
>>
>> 4) Sorry, not sure what else to say, as this is a pretty basic example.
>> Thank you for the help!
>>
>> best,
>>
>> Su
>>
>> On Fri, Mar 27, 2015 at 11:23 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>
>>> Hi Su,
>>>
>>> I'm not sure what the problem is. Did you try other Spark examples on
>>> your cluster? Did they work? Could you try
>>>
>>> trainingData.count()
>>>
>>> before calling lrLearner.run()? Just want to check whether this is an
>>> MLlib issue.
>>>
>>> Thanks,
>>> Xiangrui
>>>
>>> On Wed, Mar 25, 2015 at 3:27 PM, Su She <suhshekar52@gmail.com> wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> I was hoping to see if anyone has additional thoughts on this, as I
>>>> was able to find hardly anything related to this error online (something
>>>> related to dependencies/Breeze?)...thank you!
>>>>
>>>> Best,
>>>>
>>>> Su
>>>>
>>>> On Thu, Mar 19, 2015 at 10:54 AM, Su She <suhshekar52@gmail.com> wrote:
>>>>
>>>>> Hello Akhil,
>>>>>
>>>>> I tried running it in an application, and I got the same result. The
>>>>> app gets stuck in Stage 1 at MLlib.scala, line 32, which in my app
>>>>> corresponds to: val model = lrLearner.run(trainingData).
>>>>>
>>>>> These are the details:
>>>>>
>>>>> org.apache.spark.rdd.RDD.count(RDD.scala:910)
>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
>>>>> scala.collection.immutable.List.forall(List.scala:84)
>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
>>>>> MLlib$.main(MLlib.scala:32)
>>>>> MLlib.main(MLlib.scala)
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> java.lang.reflect.Method.invoke(Method.java:606)
>>>>> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>>>>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>>>> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>
>>>>>
>>>>> Thank you for the help Akhil!
>>>>>
>>>>> Best,
>>>>>
>>>>> Su
>>>>>
>>>>>
>>>>> On Thu, Mar 19, 2015 at 1:27 AM, Akhil Das <akhil@sigmoidanalytics.com> wrote:
>>>>>
>>>>>> It seems it's stuck doing a count? What's happening at line 38? I
>>>>>> don't see a count operation anywhere in this code:
>>>>>> https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
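>>>>>>
>>>>>> My guess is the count comes from inside MLlib itself, not from the
>>>>>> example: judging by the class names in your stage details,
>>>>>> LogisticRegressionWithSGD validates the labels before training,
>>>>>> roughly like this (a paraphrased sketch of
>>>>>> DataValidators.binaryLabelValidator, not the exact source):
>>>>>>
>>>>>> // imports assumed: org.apache.spark.rdd.RDD,
>>>>>> //                   org.apache.spark.mllib.regression.LabeledPoint
>>>>>> val binaryLabelValidator: RDD[LabeledPoint] => Boolean = { data =>
>>>>>>   // count() here triggers a full pass over the input RDD
>>>>>>   data.filter(x => x.label != 1.0 && x.label != 0.0).count() == 0
>>>>>> }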
>>>>>>
>>>>>> Thanks
>>>>>> Best Regards
>>>>>>
>>>>>> On Thu, Mar 19, 2015 at 1:32 PM, Su She <suhshekar52@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Akhil,
>>>>>>>
>>>>>>> Thanks for the info! Here is my UI...I am not sure what to make of
>>>>>>> the information here:
>>>>>>>
>>>>>>> [image: Inline image 1]
>>>>>>>
>>>>>>> [image: Inline image 2]
>>>>>>>
>>>>>>> Details of active stage:
>>>>>>>
>>>>>>> org.apache.spark.rdd.RDD.count(RDD.scala:910)
>>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
>>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>>> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
>>>>>>> scala.collection.immutable.List.forall(List.scala:84)
>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
>>>>>>> $line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
>>>>>>> $line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
>>>>>>> $line21.$read$$iwC$$iwC.<init>(<console>:40)
>>>>>>> $line21.$read$$iwC.<init>(<console>:42)
>>>>>>> $line21.$read.<init>(<console>:44)
>>>>>>> $line21.$read$.<init>(<console>:48)
>>>>>>> $line21.$read$.<clinit>(<console>)
>>>>>>> $line21.$eval$.<init>(<console>:7)
>>>>>>> $line21.$eval$.<clinit>(<console>)
>>>>>>> $line21.$eval.$print(<console>)
>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>
>>>>>>>
>>>>>>> Thank you for the help Akhil!
>>>>>>>
>>>>>>> -Su
>>>>>>>
>>>>>>> On Thu, Mar 19, 2015 at 12:49 AM, Akhil Das <akhil@sigmoidanalytics.com> wrote:
>>>>>>>
>>>>>>>> To get these metrics, you need to open the driver UI running on
>>>>>>>> port 4040. In there you will see stage information, and for each
>>>>>>>> stage you can see how much time it is spending on GC, etc. In your
>>>>>>>> case the parallelism seems to be 4; the higher the parallelism, the
>>>>>>>> more tasks you will see.
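>>>>>>>>
>>>>>>>> For example (a sketch; the exact values depend on your setup):
>>>>>>>>
>>>>>>>> ./bin/spark-shell --master local[4]
>>>>>>>> scala> sc.defaultParallelism         // the level the shell will use
>>>>>>>> scala> trainingData.partitions.size  // tasks per stage on that RDD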
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Best Regards
>>>>>>>>
>>>>>>>> On Thu, Mar 19, 2015 at 1:15 PM, Su She <suhshekar52@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Akhil,
>>>>>>>>>
>>>>>>>>> 1) How could I see how much time it is spending on Stage 1? Or
>>>>>>>>> what if, like above, it doesn't get past Stage 1?
>>>>>>>>>
>>>>>>>>> 2) How could I check if it's GC time? And where would I increase
>>>>>>>>> the parallelism for the model? I have a Spark Master and 2 Workers
>>>>>>>>> running on CDH 5.3...what would the default spark-shell level of
>>>>>>>>> parallelism be? I thought it would be 3.
>>>>>>>>>
>>>>>>>>> Thank you for the help!
>>>>>>>>>
>>>>>>>>> -Su
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Mar 19, 2015 at 12:32 AM, Akhil Das <akhil@sigmoidanalytics.com> wrote:
>>>>>>>>>
>>>>>>>>>> Can you see where exactly it is spending time? Like you said, it
>>>>>>>>>> goes to Stage 2, so you should be able to see how much time it
>>>>>>>>>> spent on Stage 1. If it's GC time, then try increasing the level
>>>>>>>>>> of parallelism, or repartition to something like
>>>>>>>>>> sc.defaultParallelism * 3.
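>>>>>>>>>>
>>>>>>>>>> For example (a sketch, reusing the names from your snippet):
>>>>>>>>>>
>>>>>>>>>> val repartitioned = trainingData.repartition(sc.defaultParallelism * 3)
>>>>>>>>>> val model = new LogisticRegressionWithSGD().run(repartitioned)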
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Best Regards
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 19, 2015 at 12:15 PM, Su She <suhshekar52@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Everyone,
>>>>>>>>>>>
>>>>>>>>>>> I am trying to run this MLlib example from Learning Spark:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
>>>>>>>>>>>
>>>>>>>>>>> Things I'm doing differently:
>>>>>>>>>>>
>>>>>>>>>>> 1) Using spark shell instead of an application
>>>>>>>>>>>
>>>>>>>>>>> 2) Instead of their spam.txt and normal.txt, I have text files
>>>>>>>>>>> with 3700 and 2700 words...nothing huge at all, just plain text
>>>>>>>>>>>
>>>>>>>>>>> 3) I've used numFeatures = 100, 1000 and 10,000
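>>>>>>>>>>>
>>>>>>>>>>> For reference, my shell session looks roughly like the book's
>>>>>>>>>>> example (a sketch; my file names stand in for theirs):
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.spark.mllib.feature.HashingTF
>>>>>>>>>>> import org.apache.spark.mllib.regression.LabeledPoint
>>>>>>>>>>> import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>>>>>>>>>>>
>>>>>>>>>>> val tf = new HashingTF(numFeatures = 10000)
>>>>>>>>>>> val spam = sc.textFile("spam.txt")      // my ~3700-word file
>>>>>>>>>>> val normal = sc.textFile("normal.txt")  // my ~2700-word file
>>>>>>>>>>> val positiveExamples =
>>>>>>>>>>>   spam.map(s => LabeledPoint(1, tf.transform(s.split(" "))))
>>>>>>>>>>> val negativeExamples =
>>>>>>>>>>>   normal.map(n => LabeledPoint(0, tf.transform(n.split(" "))))
>>>>>>>>>>> val trainingData = positiveExamples ++ negativeExamples
>>>>>>>>>>> trainingData.cache()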
>>>>>>>>>>>
>>>>>>>>>>> *Error:* I keep getting stuck when I try to run the model:
>>>>>>>>>>>
>>>>>>>>>>> val model = new LogisticRegressionWithSGD().run(trainingData)
>>>>>>>>>>>
>>>>>>>>>>> It will freeze on something like this:
>>>>>>>>>>>
>>>>>>>>>>> [Stage 1:==============>                              (1 + 0) / 4]
>>>>>>>>>>>
>>>>>>>>>>> Sometimes it's Stage 1, 2, or 3.
>>>>>>>>>>>
>>>>>>>>>>> I am not sure what I am doing wrong...any help is much
>>>>>>>>>>> appreciated, thank you!
>>>>>>>>>>>
>>>>>>>>>>> -Su
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Cell : 425-233-8271
