Hi Su,

I'm not sure what the problem is. Did you try other Spark examples on your cluster? Did they work? Could you try

trainingData.count()

before calling lrLearner.run()? Just want to check whether this is an MLlib issue.
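For example, in the spark-shell (a sketch assuming trainingData is the LabeledPoint RDD from the example and lrLearner is your LogisticRegressionWithSGD instance):

```scala
// Force materialization of the input RDD before training.
// If this count() also hangs, the problem is in data loading or the
// cluster itself, not in MLlib.
val n = trainingData.count()
println(s"trainingData has $n examples")

// Only then run the learner:
val model = lrLearner.run(trainingData)
```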

Thanks,
Xiangrui

On Wed, Mar 25, 2015 at 3:27 PM, Su She <suhshekar52@gmail.com> wrote:
Hello Everyone,

I was hoping to see if anyone has any additional thoughts on this, as I was able to find barely anything related to this error online (something related to dependencies/breeze?)... thank you!

Best,

Su

On Thu, Mar 19, 2015 at 10:54 AM, Su She <suhshekar52@gmail.com> wrote:
Hello Akhil,

I tried running it in an application, and I got the same result. The app gets stuck in Stage 1 at MLlib.scala line 32, which in my app corresponds to: val model = lrLearner.run(trainingData).

These are the details:

org.apache.spark.rdd.RDD.count(RDD.scala:910)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
scala.collection.immutable.List.forall(List.scala:84)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
MLlib$.main(MLlib.scala:32)
MLlib.main(MLlib.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Thank you for the help Akhil!

Best,

Su


On Thu, Mar 19, 2015 at 1:27 AM, Akhil Das <akhil@sigmoidanalytics.com> wrote:
It seems it's stuck doing a count? What's happening at line 38? I'm not seeing a count operation anywhere in this code: https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48

Thanks
Best Regards

On Thu, Mar 19, 2015 at 1:32 PM, Su She <suhshekar52@gmail.com> wrote:
Hello Akhil,

Thanks for the info! Here is my UI...I am not sure what to make of the information here:

[Inline image 1: Spark UI screenshot]

[Inline image 2: Spark UI screenshot]

Details of active stage:

org.apache.spark.rdd.RDD.count(RDD.scala:910)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
scala.collection.immutable.List.forall(List.scala:84)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
$line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
$line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
$line21.$read$$iwC$$iwC.<init>(<console>:40)
$line21.$read$$iwC.<init>(<console>:42)
$line21.$read.<init>(<console>:44)
$line21.$read$.<init>(<console>:48)
$line21.$read$.<clinit>(<console>)
$line21.$eval$.<init>(<console>:7)
$line21.$eval$.<clinit>(<console>)
$line21.$eval.$print(<console>)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Thank you for the help Akhil!

-Su

On Thu, Mar 19, 2015 at 12:49 AM, Akhil Das <akhil@sigmoidanalytics.com> wrote:
To get these metrics, you need to open the driver UI running on port 4040. There you will see the Stages information, and for each stage you can see how much time it is spending on GC, etc. In your case the parallelism seems to be 4; the higher the parallelism, the more tasks you will see.

Thanks
Best Regards

On Thu, Mar 19, 2015 at 1:15 PM, Su She <suhshekar52@gmail.com> wrote:
Hi Akhil,

1) How could I see how much time it is spending on Stage 1? Or what if, like above, it doesn't get past Stage 1?

2) How could I check if it's GC time? And where would I increase the parallelism for the model? I have a Spark Master and 2 Workers running on CDH 5.3... what would the default spark-shell level of parallelism be? I thought it would be 3.

Thank you for the help!

-Su


On Thu, Mar 19, 2015 at 12:32 AM, Akhil Das <akhil@sigmoidanalytics.com> wrote:
Can you see where exactly it is spending time? Since, as you said, it goes to Stage 2, you should be able to see how much time it spent on Stage 1. If it's GC time, try increasing the level of parallelism, or repartition the data to something like sc.defaultParallelism * 3.
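A sketch of that suggestion in the spark-shell (assuming trainingData and lrLearner as in Su's code; the factor of 3 is just a starting point, not a rule):

```scala
// Spread the data over more partitions so each task does less work
// and GC pressure per task goes down.
val repartitioned = trainingData.repartition(sc.defaultParallelism * 3)
repartitioned.cache()

// Train on the repartitioned RDD instead.
val model = lrLearner.run(repartitioned)
```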

Thanks
Best Regards

On Thu, Mar 19, 2015 at 12:15 PM, Su She <suhshekar52@gmail.com> wrote:
Hello Everyone,

I am trying to run this MLlib example from Learning Spark:

Things I'm doing differently:

1) Using spark shell instead of an application

2) Instead of their spam.txt and normal.txt, I have text files with 3,700 and 2,700 words... nothing huge at all, just plain text

3) I've used numFeatures = 100, 1000 and 10,000
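For reference, here are the relevant spark-shell steps (a sketch based on the linked Learning Spark example; file names are placeholders, and sc is the shell's SparkContext):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Load the two plain-text files (paths are placeholders).
val spam = sc.textFile("spam.txt")
val normal = sc.textFile("normal.txt")

// Map each message to a vector of hashed term frequencies.
val tf = new HashingTF(numFeatures = 10000)
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))

// Label spam as 1 and normal as 0, then combine and cache.
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()

// This is the call that hangs:
val model = new LogisticRegressionWithSGD().run(trainingData)
```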

Error: I keep getting stuck when I try to run the model:

val model = new LogisticRegressionWithSGD().run(trainingData)

It will freeze on something like this:

[Stage 1:==============>                                            (1 + 0) / 4]

Sometimes it's Stage 1, 2, or 3.

I am not sure what I am doing wrong...any help is much appreciated, thank you!

-Su