mahout-user mailing list archives

From: Pat Ferrel <...@occamsmachete.com>
Subject: Re: spark-itemsimilarity can't launch on a Spark cluster?
Date: Tue, 14 Oct 2014 13:03:40 GMT
So that is 1g per core? That doesn’t sound like enough. Look for a way to use fewer cores and allocate more memory per core, maybe.
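
For example (a rough sketch only; the exact keys depend on how the driver is launched and whether it picks up spark-defaults.conf), on a standalone cluster you can cap how many cores a job grabs and let -sem give each executor more memory:

    # conf/spark-defaults.conf on the launching machine (illustrative value)
    spark.cores.max   6

    # or conf/spark-env.sh on each worker, then restart the workers
    export SPARK_WORKER_CORES=4
    export SPARK_WORKER_MEMORY=12g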

On Oct 13, 2014, at 8:01 PM, chepoo <swallow_pulm@163.com> wrote:

Hi Pat,
	I don’t have enough memory. There are only three machines in total, each with just 16g of memory. There will be about two million users and about one million items, so the history data is about 2g.

On Oct 13, 2014, at 23:34, Pat Ferrel <pat@occamsmachete.com> wrote:

> You have 256G of memory in each node, partitioned as 16g per core?
> 
> If so you should set -sem to 256g or a little less, since that is how much memory to allocate per node. All cores of a node will share this memory.
> 
> The only unusual memory consideration is the dictionaries, which are broadcast to each node and shared by each task on the node during read and write. So there needs to be enough memory to store one copy of each dictionary per node. A dictionary is a bi-directional hashmap. At most one item-id and one user-id dictionary are broadcast for the duration of the read and write tasks. If a problem is occurring during reading or writing it might be the dictionaries, but with 256g per node this seems unlikely. How many users and items?
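> 
> As a very rough sketch: if there were, say, 3 million ids in total at roughly 100 bytes per dictionary entry (id string plus hashmap overhead), that is about 3,000,000 * 100 B ≈ 300 MB per node for the dictionaries, which is small next to 256g.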
> 
> 
> On Oct 13, 2014, at 2:30 AM, pol <swallow_pulm@163.com> wrote:
> 
> Hi Pat,
> 	Yes, I stopped it manually, but something is wrong; maybe a configuration error, maybe insufficient memory. I have asked the Spark mailing list for help.
> 	I am asking about another spark-itemsimilarity problem in a separate mail. Thank you.
> 
> 
> On Oct 11, 2014, at 09:22, Pat Ferrel <pat@occamsmachete.com> wrote:
> 
>> Did you stop the 1.6g job or did it fail?
>> 
>> I see task failures but no stage failures.
>> 
>> 
>> On Oct 10, 2014, at 8:49 AM, pol <swallow_pulm@163.com> wrote:
>> 
>> Hi Pat,
>> 	Yes, spark-itemsimilarity can work ok; it finished the calculation on a 150m dataset.
>> 
>> 	For the problem above, the 1.6g dataset can’t finish the calculation. I have three machines (16 cores and 16g of memory each) for this test; is this environment unable to finish the calculation?
>> 	The dataset had been archived into one file with the hadoop archive tool, so only one machine is in the processing state. I did this because without archiving some errors occur; the details are in the attachments.
>> 	<spark1.png>
>> 
>> <spark2.png>
>> 
>> <spark3.png>
>> 
>> 
>> 	If you need it, I can provide the test dataset to you.
>> 
>> 	Thank you again.
>> 
>> 
>> On Oct 10, 2014, at 22:07, Pat Ferrel <pat@occamsmachete.com> wrote:
>> 
>>> So it is completing some of the spark-itemsimilarity jobs now? That is better at least.
>>> 
>>> Yes. More data means you may need more memory or more nodes in your cluster.
This is how to scale Spark and Hadoop. Spark in particular needs core memory since it tries
to avoid disk read/write.
>>> 
>>> Try increasing -sem as far as you can first; then you may need to add machines to your cluster to speed it up. Do you need results faster than 15 hours?
>>> 
>>> Remember that the way the Solr recommender works allows you to make recommendations to new users and train less often. The new user data does not have to be in the training/indicator data. You retrain partly based on how many new users there are, but partly based on how many new items are added to the catalog.
>>> 
>>> On Oct 10, 2014, at 1:47 AM, pol <swallow_pulm@163.com> wrote:
>>> 
>>> Hi Pat,
>>> 	Because of a holiday, I am only replying now.
>>> 
>>> 	I changed 1.0.2 back to 1.0.1 for mahout-1.0-SNAPSHOT and used Spark 1.0.1 and Hadoop 2.4.0; spark-itemsimilarity now works ok. But I have a new question:
>>> 	mahout spark-itemsimilarity -i /view_input,/purchase_input -o /output -os -ma spark://recommend1:7077 -sem 15g -f1 purchase -f2 view -ic 2 -fc 1 -m 36
>>> 
>>> 	With "view" data of 1.6g and "purchase" data of 60m, this command did not finish after 15 hours (the "indicator-matrix" had been computed and the "cross-indicator-matrix" was still computing), but with "view" data of 100m it finished in 2 minutes. Is the data the reason?
>>> 
>>> 
>>> On Oct 1, 2014, at 01:10, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> 
>>>> This will not be fixed in Mahout 1.0 unless we can find a problem in Mahout
now. I am the one who would fix it. At present it looks to me like a Spark version or setup
problem.
>>>> 
>>>> These errors seem to indicate that the build or setup has a problem. It seems that you cannot use Spark 1.1.0. Set up your cluster to use mahout-1.0-SNAPSHOT with the pom set back to spark-1.0.1, a Spark 1.0.1 build for Hadoop 2.4, and Hadoop 2.4. This is the only combination that is supposed to work together.
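>>>> 
>>>> Roughly, something like this (paths are examples, and the Spark profile flags mirror the 1.1.0 build you showed earlier, so they may differ slightly for 1.0.1):
>>>> 
>>>>     # in mahout/pom.xml set <spark.version>1.0.1</spark.version>, then rebuild Mahout
>>>>     cd mahout-1.0-SNAPSHOT && mvn -DskipTests clean package
>>>> 
>>>>     # build (or download) Spark 1.0.1 for Hadoop 2.4
>>>>     cd spark-1.0.1 && mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package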
>>>> 
>>>> If this still fails it may be a setup problem, since I can run on a cluster just fine with my setup. When you get an error from this config, send it to me and the Spark user list to see if they can give us a clue.
>>>> 
>>>> Question: Do you have mahout-1.0-SNAPSHOT and spark installed on all your
cluster machines, with the correct environment variables and path?
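>>>> 
>>>> A typical per-machine setup looks something like this (the paths are illustrative, not necessarily your actual layout), set on every machine in a profile that both the daemons and your login see:
>>>> 
>>>>     export HADOOP_HOME=/usr/hadoop-2.4.0
>>>>     export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
>>>>     export SPARK_HOME=/usr/spark-1.0.1-bin-hadoop2.4
>>>>     export MAHOUT_HOME=/usr/mahout-1.0-SNAPSHOT
>>>>     export PATH=$PATH:$MAHOUT_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin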
>>>> 
>>>> 
>>>> On Sep 30, 2014, at 12:47 AM, pol <swallow_pulm@163.com> wrote:
>>>> 
>>>> Hi Pat, 
>>>> 	It was a problem with the Spark version, but spark-itemsimilarity still can’t complete normally.
>>>> 
>>>> 1. Changing 1.0.1 to 1.1.0 in mahout-1.0-SNAPSHOT/pom.xml: the Spark version compatibility is no longer a problem, but the program has a problem:
>>>> --------------------------------------------------------------
>>>> 14/09/30 11:26:04 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 10.1
(TID 31, Hadoop.Slave1): java.lang.NoClassDefFoundError:  
>>>>     org/apache/commons/math3/random/RandomGenerator
>>>>     org.apache.mahout.common.RandomUtils.getRandom(RandomUtils.java:65)
>>>>     org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:228)
>>>>     org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:223)
>>>>     org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
>>>>     org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
>>>>     scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>>     scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>>     org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
>>>>     org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>>>>     org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>>>>     org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>>>>     org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>>     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>     org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>>     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>     org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>>>     org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>>     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>>>     java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>     java.lang.Thread.run(Thread.java:662)
>>>> --------------------------------------------------------------
>>>> I tried adding commons-math3-3.2.jar to mahout-1.0-SNAPSHOT/lib, but the result is still the same. (RandomUtils.java:65 does not use RandomGenerator directly.)
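>>>> 
>>>> Perhaps the jar also needs to be on the executors' classpath, not just the driver's; an unverified guess would be setting something like spark.executor.extraClassPath=/path/to/commons-math3-3.2.jar (with the jar present at that path on every worker), but I have not tried this.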
>>>> 
>>>> 
>>>> 2. Changing 1.0.1 to 1.0.2 in mahout-1.0-SNAPSHOT/pom.xml: there are still other errors:
>>>> --------------------------------------------------------------
>>>> 14/09/30 14:36:57 WARN scheduler.TaskSetManager: Lost TID 427 (task 7.0:51)
>>>> 14/09/30 14:36:57 WARN scheduler.TaskSetManager: Loss was due to java.lang.ClassCastException
>>>> java.lang.ClassCastException: scala.Tuple1 cannot be cast to scala.Tuple2
>>>>     at org.apache.mahout.drivers.TDIndexedDatasetReader$$anonfun$4.apply(TextDelimitedReaderWriter.scala:75)
>>>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>>     at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>>>>     at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>>>>     at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>     at java.lang.Thread.run(Thread.java:662)
>>>> --------------------------------------------------------------
>>>> Please refer to the attachment for the full log.
>>>> <screenlog_bash.log>
>>>> 
>>>> 
>>>> 
>>>> In addition, I used 66 files on HDFS, each file 20 to 30 M; if necessary I will provide the data.
>>>> The shell command is: mahout spark-itemsimilarity -i /rec/input/ss/others,/rec/input/ss/weblog -o /rec/output/ss -os -ma spark://recommend1:7077 -sem 4g -f1 purchase -f2 view -ic 2 -fc 1
>>>> Spark cluster: 8 workers, 32 cores total, 32G memory total, across two machines.
>>>> 
>>>> After a few days this is still not solved; it may be better to wait for the Mahout 1.0 release version or to use the existing mahout itemsimilarity instead.
>>>> 
>>>> 
>>>> Thank you again, Pat.
>>>> 
>>>> 
>>>> On Sep 29, 2014, at 00:02, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>> 
>>>>> It looks like the cluster version of spark-itemsimilarity is never accepted by the Spark master. It fails in TextDelimitedReaderWriter.scala because all the work uses “lazy” evaluation, and until the write no actual work is done on the Spark cluster.
>>>>> 
>>>>> However your cluster seems to be working with the Pi example. Therefore
there must be something wrong with the Mahout build or config. Some ideas:
>>>>> 
>>>>> 1) Mahout 1.0-SNAPSHOT is targeted at Spark 1.0.1. However I use 1.0.2 and it seems to work. You might try changing the version in the pom.xml and doing a clean build of Mahout. Change the version number in mahout/pom.xml:
>>>>> 
>>>>> mahout/pom.xml
>>>>> -     <spark.version>1.0.1</spark.version>
>>>>> +    <spark.version>1.1.0</spark.version>
>>>>> 
>>>>> This may not be needed but it is easier than installing Spark 1.0.1.
>>>>> 
>>>>> 2) Try installing and building Mahout on all cluster machines. I do this
so I can run the Mahout spark-shell on any machine but it may be needed. The Mahout jars,
path setup, and directory structure should be the same on all cluster machines.
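>>>>> 
>>>>> For example, something like rsync keeps the installs identical (the hostnames here are placeholders):
>>>>> 
>>>>>     rsync -a /usr/mahout-1.0-SNAPSHOT/ slave1:/usr/mahout-1.0-SNAPSHOT/
>>>>>     rsync -a /usr/mahout-1.0-SNAPSHOT/ slave2:/usr/mahout-1.0-SNAPSHOT/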
>>>>> 
>>>>> 3) Try making -sem larger. I usually make it as large as I can on the cluster, then try smaller values until it affects performance. The epinions dataset that I use for testing on my cluster requires -sem 6g.
>>>>> 
>>>>> My cluster has 3 machines with Hadoop 1.2.1 and Spark 1.0.2.  I can try
running your data through spark-itemsimilarity on my cluster if you can share it. I will sign
an NDA and destroy it after the test.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sep 27, 2014, at 5:28 AM, pol <swallow_pulm@163.com> wrote:
>>>>> 
>>>>> Hi Pat,
>>>>> 	Thanks for your reply. It still can’t work normally. I tested it on a Spark standalone cluster; I did not test it on a YARN cluster.
>>>>> 
>>>>> First, a check that the cluster configuration is correct. http://Hadoop.Master:8080 shows:
>>>>> -----------------------------------
>>>>> URL: spark://Hadoop.Master:7077
>>>>> Workers: 2
>>>>> Cores: 4 Total, 0 Used
>>>>> Memory: 2.0 GB Total, 0.0 B Used
>>>>> Applications: 0 Running, 1 Completed
>>>>> Drivers: 0 Running, 0 Completed
>>>>> Status: ALIVE
>>>>> ----------------------------------
>>>>> 
>>>>> Environment
>>>>> ----------------------------------
>>>>> OS: CentOS release 6.5 (Final)
>>>>> JDK: 1.6.0_45
>>>>> Mahout: mahout-1.0-SNAPSHOT(mvn -Dhadoop2.version=2.4.1 -DskipTests clean
package)
>>>>> Hadoop: 2.4.1
>>>>> Spark: spark-1.1.0-bin-2.4.1(mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1
-Phive -DskipTests clean package)
>>>>> ----------------------------------
>>>>> 
>>>>> Shell:
>>>>>  spark-submit --class org.apache.spark.examples.SparkPi --master spark://Hadoop.Master:7077
--executor-memory 1g --total-executor-cores 2 /root/spark-examples_2.10-1.1.0.jar 1000
>>>>> 
>>>>> It works ok; part of the log for this command:
>>>>> ----------------------------------
>>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 995.0
in stage 0.0 (TID 995) in 17 ms on Hadoop.Slave1 (996/1000)
>>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Starting task 998.0
in stage 0.0 (TID 998, Hadoop.Slave2, PROCESS_LOCAL, 1225 bytes)
>>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 996.0
in stage 0.0 (TID 996) in 20 ms on Hadoop.Slave2 (997/1000)
>>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Starting task 999.0
in stage 0.0 (TID 999, Hadoop.Slave1, PROCESS_LOCAL, 1225 bytes)
>>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 997.0
in stage 0.0 (TID 997) in 27 ms on Hadoop.Slave1 (998/1000)
>>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 998.0
in stage 0.0 (TID 998) in 31 ms on Hadoop.Slave2 (999/1000)
>>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 999.0
in stage 0.0 (TID 999) in 20 ms on Hadoop.Slave1 (1000/1000)
>>>>> 14/09/19 19:48:00 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:35)
finished in 25.109 s
>>>>> 14/09/19 19:48:00 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0,
whose tasks have all completed, from pool
>>>>> 14/09/19 19:48:00 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:35,
took 26.156022565 s
>>>>> Pi is roughly 3.14156112
>>>>> ----------------------------------
>>>>> 
>>>>> Second, testing spark-itemsimilarity on "local": it works ok. Shell:
>>>>>  mahout spark-itemsimilarity -i /test/ss/input/data.txt -o /test/ss/output
-os -ma local[2] -sem 512m -f1 purchase -f2 view -ic 2 -fc 1
>>>>> 
>>>>> Third, testing spark-itemsimilarity on the cluster. Shell:
>>>>>  mahout spark-itemsimilarity -i /test/ss/input/data.txt -o /test/ss/output
-os -ma spark://Hadoop.Master:7077 -sem 512m -f1 purchase -f2 view -ic 2 -fc 1
>>>>> 
>>>>> It can’t work; full logs:
>>>>> ----------------------------------
>>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>>> SLF4J: Found binding in [jar:file:/usr/mahout-1.0-SNAPSHOT/mrlegacy/target/mahout-mrlegacy-1.0-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: Found binding in [jar:file:/usr/mahout-1.0-SNAPSHOT/spark/target/mahout-spark_2.10-1.0-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: Found binding in [jar:file:/usr/spark-1.1.0-bin-2.4.1/lib/spark-assembly-1.1.0-hadoop2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>> 14/09/19 20:31:07 INFO spark.SecurityManager: Changing view acls to:
root
>>>>> 14/09/19 20:31:07 INFO spark.SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(root)
>>>>> 14/09/19 20:31:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>>>> 14/09/19 20:31:08 INFO Remoting: Starting remoting
>>>>> 14/09/19 20:31:08 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@Hadoop.Master:47597]
>>>>> 14/09/19 20:31:08 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@Hadoop.Master:47597]
>>>>> 14/09/19 20:31:08 INFO spark.SparkEnv: Registering MapOutputTracker
>>>>> 14/09/19 20:31:08 INFO spark.SparkEnv: Registering BlockManagerMaster
>>>>> 14/09/19 20:31:08 INFO storage.DiskBlockManager: Created local directory
at /tmp/spark-local-20140919203108-e4e3
>>>>> 14/09/19 20:31:08 INFO storage.MemoryStore: MemoryStore started with
capacity 2.3 GB.
>>>>> 14/09/19 20:31:08 INFO network.ConnectionManager: Bound socket to port
47186 with id = ConnectionManagerId(Hadoop.Master,47186)
>>>>> 14/09/19 20:31:08 INFO storage.BlockManagerMaster: Trying to register
BlockManager
>>>>> 14/09/19 20:31:08 INFO storage.BlockManagerInfo: Registering block manager
Hadoop.Master:47186 with 2.3 GB RAM
>>>>> 14/09/19 20:31:08 INFO storage.BlockManagerMaster: Registered BlockManager
>>>>> 14/09/19 20:31:08 INFO spark.HttpServer: Starting HTTP Server
>>>>> 14/09/19 20:31:08 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>>> 14/09/19 20:31:08 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:41116
>>>>> 14/09/19 20:31:08 INFO broadcast.HttpBroadcast: Broadcast server started
at http://192.168.204.128:41116
>>>>> 14/09/19 20:31:08 INFO spark.HttpFileServer: HTTP File server directory
is /tmp/spark-10744709-bbeb-4d79-8bfe-d64d77799fb3
>>>>> 14/09/19 20:31:08 INFO spark.HttpServer: Starting HTTP Server
>>>>> 14/09/19 20:31:08 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>>> 14/09/19 20:31:08 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:59137
>>>>> 14/09/19 20:31:09 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>>> 14/09/19 20:31:09 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
>>>>> 14/09/19 20:31:09 INFO ui.SparkUI: Started SparkUI at http://Hadoop.Master:4040
>>>>> 14/09/19 20:31:10 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
>>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/math-scala/target/mahout-math-scala_2.10-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-math-scala_2.10-1.0-SNAPSHOT.jar with timestamp
1411129870562
>>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/mrlegacy/target/mahout-mrlegacy-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-mrlegacy-1.0-SNAPSHOT.jar with timestamp 1411129870588
>>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/math/target/mahout-math-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-math-1.0-SNAPSHOT.jar with timestamp 1411129870612
>>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/spark/target/mahout-spark_2.10-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-spark_2.10-1.0-SNAPSHOT.jar with timestamp 1411129870618
>>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/math-scala/target/mahout-math-scala_2.10-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-math-scala_2.10-1.0-SNAPSHOT.jar with timestamp
1411129870620
>>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/mrlegacy/target/mahout-mrlegacy-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-mrlegacy-1.0-SNAPSHOT.jar with timestamp 1411129870631
>>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/math/target/mahout-math-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-math-1.0-SNAPSHOT.jar with timestamp 1411129870644
>>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/spark/target/mahout-spark_2.10-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-spark_2.10-1.0-SNAPSHOT.jar with timestamp 1411129870647
>>>>> 14/09/19 20:31:10 INFO client.AppClient$ClientActor: Connecting to master
spark://Hadoop.Master:7077...
>>>>> 14/09/19 20:31:13 INFO storage.MemoryStore: ensureFreeSpace(86126) called
with curMem=0, maxMem=2491102003
>>>>> 14/09/19 20:31:13 INFO storage.MemoryStore: Block broadcast_0 stored
as values to memory (estimated size 84.1 KB, free 2.3 GB)
>>>>> 14/09/19 20:31:13 INFO mapred.FileInputFormat: Total input paths to process
: 1
>>>>> 14/09/19 20:31:13 INFO spark.SparkContext: Starting job: collect at TextDelimitedReaderWriter.scala:74
>>>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Registering RDD 7 (distinct
at TextDelimitedReaderWriter.scala:74)
>>>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Got job 0 (collect at
TextDelimitedReaderWriter.scala:74) with 2 output partitions (allowLocal=false)
>>>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect
at TextDelimitedReaderWriter.scala:74)
>>>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Parents of final stage:
List(Stage 1)
>>>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Missing parents: List(Stage
1)
>>>>> 14/09/19 20:31:14 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[7]
at distinct at TextDelimitedReaderWriter.scala:74), which has no missing parents
>>>>> 14/09/19 20:31:14 INFO scheduler.DAGScheduler: Submitting 2 missing tasks
from Stage 1 (MapPartitionsRDD[7] at distinct at TextDelimitedReaderWriter.scala:74)
>>>>> 14/09/19 20:31:14 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0
with 2 tasks
>>>>> 14/09/19 20:31:29 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are registered and have
sufficient memory
>>>>> 14/09/19 20:31:30 INFO client.AppClient$ClientActor: Connecting to master
spark://Hadoop.Master:7077...
>>>>> 14/09/19 20:31:44 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are registered and have
sufficient memory
>>>>> 14/09/19 20:31:50 INFO client.AppClient$ClientActor: Connecting to master
spark://Hadoop.Master:7077...
>>>>> 14/09/19 20:31:59 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are registered and have
sufficient memory
>>>>> 14/09/19 20:32:10 ERROR cluster.SparkDeploySchedulerBackend: Application
has been killed. Reason: All masters are unresponsive! Giving up.
>>>>> 14/09/19 20:32:10 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0,
whose tasks have all completed, from pool
>>>>> 14/09/19 20:32:10 INFO scheduler.TaskSchedulerImpl: Cancelling stage
1
>>>>> 14/09/19 20:32:10 INFO scheduler.DAGScheduler: Failed to run collect
at TextDelimitedReaderWriter.scala:74
>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
due to stage failure: All masters are unresponsive! Giving up.
>>>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>>>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>> at scala.Option.foreach(Option.scala:236)
>>>>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>>>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null}
>>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null}
>>>>> ----------------------------------
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sep 27, 2014, at 01:05, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>>> 
>>>>>> Any luck with this?
>>>>>> 
>>>>>> If not, could you send a full stack trace and check on the cluster machines for other logs that might help?
>>>>>> 
>>>>>> 
>>>>>> On Sep 25, 2014, at 6:34 AM, Pat Ferrel <pat@occamsmachete.com>
wrote:
>>>>>> 
>>>>>> Looks like a Spark error as far as I can tell. This error is very generic and indicates that the job was not accepted for execution, so Spark may be configured wrong. This looks like a question for the Spark people.
>>>>>> 
>>>>>> My Spark sanity check:
>>>>>> 
>>>>>> 1) In the Spark UI at http://Hadoop.Master:8080 does everything look correct?
>>>>>> 2) Have you tested your Spark *cluster* with one of their examples? Have you run *any non-Mahout* code on the cluster to check that it is configured properly? (See the example after this list.)
>>>>>> 3) Are you using exactly the same Spark and Hadoop locally as on
the cluster? 
>>>>>> 4) Did you launch both local and cluster jobs from the same cluster
machine? The only difference being the master URL (local[2] vs. spark://Hadoop.Master:7077)?
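>>>>>> 
>>>>>> For (2), the bundled Pi example is a quick check; something like this (the examples jar name and path depend on your Spark build):
>>>>>> 
>>>>>>     spark-submit --class org.apache.spark.examples.SparkPi --master spark://Hadoop.Master:7077 --executor-memory 1g lib/spark-examples-*.jar 100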
>>>>>> 
>>>>>> 14/09/22 04:12:47 WARN scheduler.TaskSchedulerImpl: Initial job has
not accepted any resources; check your cluster UI to ensure that workers are registered and
have sufficient memory
>>>>>> 14/09/22 04:12:49 INFO client.AppClient$ClientActor: Connecting to
master spark://Hadoop.Master:7077...
>>>>>> 
>>>>>> 
>>>>>> On Sep 24, 2014, at 8:18 PM, pol <swallow_pulm@163.com> wrote:
>>>>>> 
>>>>>> Hi, Pat
>>>>>> 	The dataset is the same, and the data is very small, just for testing. Is this a bug?
>>>>>> 
>>>>>> 
>>>>>> On Sep 25, 2014, at 02:57, Pat Ferrel <pat.ferrel@gmail.com>
wrote:
>>>>>> 
>>>>>>> Are you using different data sets on the local and cluster?
>>>>>>> 
>>>>>>> Try increasing Spark memory with -sem; I use -sem 6g for the epinions data set.
>>>>>>> 
>>>>>>> The ID dictionaries are kept in-memory on each cluster machine
so a large number of user or item IDs will need more memory.
>>>>>>> 
>>>>>>> 
>>>>>>> On Sep 24, 2014, at 9:31 AM, pol <swallow_pulm@163.com>
wrote:
>>>>>>> 
>>>>>>> Hi, All
>>>>>>> 	
>>>>>>> 	I’m sure that launching the Spark standalone cluster is ok, but it can’t work when used for spark-itemsimilarity.
>>>>>>> 
>>>>>>> 	Launching on 'local' is ok:
>>>>>>> mahout spark-itemsimilarity -i /user/root/test/input/data.txt
-o /user/root/test/output -os -ma local[2] -f1 purchase -f2 view -ic 2 -fc 1 -sem 1g
>>>>>>> 
>>>>>>> 	but launching on a standalone cluster gives an error:
>>>>>>> mahout spark-itemsimilarity -i /user/root/test/input/data.txt
-o /user/root/test/output -os -ma spark://Hadoop.Master:7077 -f1 purchase -f2 view -ic 2 -fc
1 -sem 1g
>>>>>>> ------------
>>>>>>> 14/09/22 04:12:47 WARN scheduler.TaskSchedulerImpl: Initial job
has not accepted any resources; check your cluster UI to ensure that workers are registered
and have sufficient memory
>>>>>>> 14/09/22 04:12:49 INFO client.AppClient$ClientActor: Connecting
to master spark://Hadoop.Master:7077...
>>>>>>> 14/09/22 04:13:02 WARN scheduler.TaskSchedulerImpl: Initial job
has not accepted any resources; check your cluster UI to ensure that workers are registered
and have sufficient memory
>>>>>>> 14/09/22 04:13:09 INFO client.AppClient$ClientActor: Connecting
to master spark://Hadoop.Master:7077...
>>>>>>> 14/09/22 04:13:17 WARN scheduler.TaskSchedulerImpl: Initial job
has not accepted any resources; check your cluster UI to ensure that workers are registered
and have sufficient memory
>>>>>>> 14/09/22 04:13:29 ERROR cluster.SparkDeploySchedulerBackend:
Application has been killed. Reason: All masters are unresponsive! Giving up.
>>>>>>> 14/09/22 04:13:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet
1.0, whose tasks have all completed, from pool 
>>>>>>> 14/09/22 04:13:29 INFO scheduler.TaskSchedulerImpl: Cancelling
stage 1
>>>>>>> 14/09/22 04:13:29 INFO scheduler.DAGScheduler: Failed to run
collect at TextDelimitedReaderWriter.scala:74
>>>>>>> Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: All masters are unresponsive! Giving up.
>>>>>>> 	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>>>>>> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>>>>>> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>>>>>> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>>>>> 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>>>>>> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>>>> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>>>> 	at scala.Option.foreach(Option.scala:236)
>>>>>>> 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>>>>>> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>>>>>> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>>>>>> 	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>>>>>> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>>>>>> 	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>>>>>> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>>>>>> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>>> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>>> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>>> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>>> ------------
>>>>>>> 
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 



