mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Running Mahout on a Spark cluster
Date Tue, 03 Oct 2017 21:12:12 GMT
Thanks Trevor,

This encoding leaves the Scala version hard-coded, but it is an appreciated
clue and will get me going. There may be a way to use `%%` with this, or I can
just add the Scala version string explicitly.
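
A minimal sketch of the `%%` variant, untested against the current master. It
assumes the locally installed artifact really is named mahout-spark_2.11 with
a spark_2.1 classifier, as in your line below; the `scalaVersion` patch level
and the `Resolver.mavenLocal` entry are my own assumptions, not something
taken from the Mahout build:

```scala
// build.sbt (sketch under the assumptions stated above)

// Assumption: any 2.11.x release yields the "_2.11" artifact suffix.
scalaVersion := "2.11.11"

// Assumption: Mahout was installed into ~/.m2 by a local `mvn clean install`.
resolvers += Resolver.mavenLocal

// %% appends the Scala binary version, so this asks for "mahout-spark_2.11";
// the classifier then selects the jar ending in "-spark_2.1".
libraryDependencies +=
  "org.apache.mahout" %% "mahout-spark" % "0.13.1-SNAPSHOT" classifier "spark_2.1"
```

If that resolves, the Scala version is no longer repeated in the artifact
name; only the Spark classifier stays explicit.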

@Hoa, I plan to update that repo.


On Oct 3, 2017, at 1:26 PM, Trevor Grant <trevor.d.grant@gmail.com> wrote:

The Spark version is included via a Maven classifier.

The sbt line should be:

libraryDependencies += "org.apache.mahout" % "mahout-spark_2.11" %
"0.13.1-SNAPSHOT" classifier "spark_2.1"


On Tue, Oct 3, 2017 at 2:55 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

> I’m the aforementioned pferrel
> 
> @Hoa, thanks for that reference, I forgot I had that example. First, don’t
> use the Hadoop part of Mahout; it is not supported and will be deprecated.
> The Spark version of cooccurrence will be supported. You will find it in
> the `SimilarityAnalysis` object.
> 
> If you go back to the last release, you should be able to make
> https://github.com/pferrel/3-input-cooc work with version updates to
> Mahout 0.13.0 and its dependencies. Using the latest master of Mahout
> runs into the problems listed below.
> 
> 
> I’m having a hard time building with sbt using the mahout-spark module
> when I build the latest Mahout master with `mvn clean install`. This puts
> the mahout-spark module in the local ~/.m2 Maven cache, but the structure
> doesn’t match the path and filenames that sbt expects.
> 
> The build.sbt `libraryDependencies` line *should* IMO be:
> `"org.apache.mahout" %% "mahout-spark-2.1" % "0.13.1-SNAPSHOT"`
> 
> This is parsed by sbt to yield the path:
> org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
> 
> Unfortunately, the outcome of `mvn clean install` currently is (I think):
> org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar
> 
> I can’t find a way to make SBT parse that structure and name.
> 
> 
> On Oct 2, 2017, at 11:02 PM, Trevor Grant <trevor.d.grant@gmail.com>
> wrote:
> 
> Code pointer:
> https://github.com/rawkintrevo/cylons/tree/master/eigenfaces
> 
> However, I build Mahout (0.13.1-SNAPSHOT) locally with
> 
> mvn clean install -Pscala-2.11,spark-2.1,viennacl-omp -DskipTests
> 
> That's how maven was able to pick those up.
> 
> 
> On Fri, Sep 22, 2017 at 10:06 PM, Hoa Nguyen <hoa@insightdatascience.com>
> wrote:
> 
>> Hey all,
>> 
>> Thanks for the offers of help. I've been able to narrow down some of the
>> problems to version incompatibility and I just wanted to give an update.
>> Just to back track a bit, my initial goal was to run Mahout on a
>> distributed cluster whether that was running Hadoop Map Reduce or Spark.
>> 
>> I started out trying to get it to run on Spark, with which I have some
>> familiarity, but that didn't seem to work. While the error messages seem
>> to indicate there weren't enough resources on the workers ("WARN
>> scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
>> check your cluster UI to ensure that workers are registered and have
>> sufficient memory"), I'm pretty sure that wasn't the case, not only
>> because it's a 4-node cluster of m4.xlarges, but also because I was able
>> to run another, simpler Spark batch job on that same distributed cluster.
>> 
>> After a bit of wrangling, I was able to narrow down some of the issues.
>> It turns out I was kind of blindly using this repo,
>> https://github.com/pferrel/3-input-cooc, as a guide without fully
>> realizing that it was from several years ago and based on Mahout 0.10.0,
>> Scala 2.10 and Spark 1.1.1. That is significantly different from my
>> environment, which has Mahout 0.13.0 and Spark 2.1.1 installed, which
>> also means I have to use Scala 2.11. After modifying the build.sbt file
>> to account for those versions, I now have compile-time type mismatch
>> issues that I'm just not savvy enough to fix (see attached screenshot if
>> you're interested).
>> 
>> Anyway, the good news is that I was finally able to get Mahout code
>> running on Hadoop map-reduce, but also only after a bit of wrangling. It
>> turned out my instances were running Ubuntu 14, and apparently that
>> doesn't play well with Hadoop 2.7.4, which prevented me from running any
>> sample Mahout code (from here:
>> https://github.com/apache/mahout/tree/master/examples/bin) that relied on
>> map-reduce. Those problems went away after I installed Hadoop 2.8.1
>> instead. Now I'm able to get the shell scripts running on a distributed
>> Hadoop cluster (yay!).
>> 
>> Anyway, if anyone has more recent and working Spark Scala code that uses
>> Mahout that they can point me to, I'd appreciate it.
>> 
>> Many thanks!
>> Hoa
>> 
>> On Fri, Sep 22, 2017 at 1:09 AM, Trevor Grant <trevor.d.grant@gmail.com>
>> wrote:
>> 
>>> Hi Hoa,
>>> 
>>> A few things could be happening here; I haven't run across that specific
>>> error.
>>> 
>>> 1) Spark 2.x - Mahout 0.13.0: Mahout 0.13.0 WILL run on Spark 2.x;
>>> however, you need to build from source (not the binaries). You can do
>>> this by downloading the Mahout source or cloning the repo and building
>>> with:
>>> mvn clean install -Pspark-2.1,scala-2.11 -DskipTests
>>> 
>>> 2) Have you set up Spark with Kryo serialization? How you do this depends
>>> on whether you're in the shell/Zeppelin or using spark-submit.
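>>> 
>>> For the spark-submit case, a minimal sketch of the Kryo settings I have
>>> in mind. It assumes spark-core and mahout-spark are on the classpath and
>>> that the master URL is supplied by spark-submit; the app name is made up
>>> for illustration:
>>> 
>>> ```scala
>>> import org.apache.spark.SparkConf
>>> 
>>> // Sketch only: switch Spark to Kryo and register Mahout's classes.
>>> val conf = new SparkConf()
>>>   .setAppName("mahout-kryo-example") // hypothetical name
>>>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>>   .set("spark.kryo.registrator",
>>>     "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
>>> ```
>>> 
>>> In the shell or Zeppelin the same two properties would go into the
>>> interpreter or shell configuration rather than into code.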
>>> 
>>> However, in both of these cases it shouldn't even have run locally,
>>> AFAIK, so the fact that it did tells me you have probably gotten this
>>> far?
>>> 
>>> Assuming you've done 1 and 2, can you share some code? I'll see if I can
>>> recreate on my end.
>>> 
>>> Thanks!
>>> 
>>> tg
>>> 
>>> On Thu, Sep 21, 2017 at 9:37 PM, Hoa Nguyen <hoa@insightdatascience.com>
>>> wrote:
>>> 
>>>> I apologize in advance if this is too much of a newbie question but I'm
>>>> having a hard time running any Mahout example code in a distributed
>>>> Spark cluster. The code runs as advertised when Spark is running locally
>>>> on one machine, but the minute I point Spark to a cluster and master
>>>> URL, I can't get it to work, drawing the error: "WARN
>>>> scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
>>>> check your cluster UI to ensure that workers are registered and have
>>>> sufficient memory"
>>>> 
>>>> I know my Spark cluster is configured and working correctly because I
>>>> ran non-Mahout code and it runs on a distributed cluster fine. What am I
>>>> doing wrong? The only thing I can think of is that my Spark version is
>>>> too recent -- 2.1.1 -- for the Mahout version I'm using -- 0.13.0. Is
>>>> that it or am I doing something else wrong?
>>>> 
>>>> Thanks for any advice,
>>>> Hoa
>>>> 
>>> 
>> 
>> 
> 
> 

