mahout-user mailing list archives

From "Choon-Siang \"Jeffrey04\" Lai" <mycyber...@yahoo.com>
Subject Re: #clojure #fkmeans - Clustering of Test Data Failed
Date Tue, 13 Sep 2011 08:11:13 GMT
Btw, does the clusterdumper return the points of each cluster like the synthetic control data example?



>________________________________
>From: "Choon-Siang "Jeffrey04" Lai" <mycyberpet@yahoo.com>
>To: "user@mahout.apache.org" <user@mahout.apache.org>
>Sent: Monday, September 12, 2011 3:49 PM
>Subject: Re: #clojure #fkmeans - Clustering of Test Data Failed
>
>
>Hi Danny,
>
>
>I have read a small portion of the source code. For variation 1, an initial set of clusters will be generated using RandomSeedGenerator if none is found in the path, so I don't have to create the initial clusters myself. For variation 2, I actually have generated the initial clusters using this code:
>
>
>        (RandomSeedGenerator/buildRandom
>          hadoop_configuration
>          input_path
>          clusters_in_path
>          (int 2)
>          (new EuclideanDistanceMeasure))
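>
>
>(As a minimal sanity check -- just a hypothetical sketch using the plain Hadoop FileSystem API and the part-randomSeed file name that appears in the stack trace below -- something like this, run right after buildRandom in variation 2, would confirm that the seed clusters were actually written:)
>
>
>        ;; hypothetical check: does the seed-cluster file exist, and is it non-empty?
>        ;; hadoop_configuration and Path are the same bindings used in variation 2
>        (let [fs (org.apache.hadoop.fs.FileSystem/get hadoop_configuration)
>              seed_path (new Path "test/clusters/cluster-0/part-randomSeed")]
>          (println "seed path exists?" (.exists fs seed_path))
>          (when (.exists fs seed_path)
>            (println "seed file size:" (.getLen (.getFileStatus fs seed_path)))))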
>
>
>
>I should have also mentioned that I am running my code using mahout 0.6-snapshot :)
>
>
>Thanks for the reply anyway :)
>
>
>best wishes,
>Jeffrey04
>
>
>
>>________________________________
>>From: Danny Bickson <danny.bickson@gmail.com>
>>To: user@mahout.apache.org; Jeffrey <mycyberpet@yahoo.com>
>>Sent: Monday, September 12, 2011 3:31 PM
>>Subject: Re: #clojure #fkmeans - Clustering of Test Data Failed
>>
>>
>>Hi Jeffrey!
>>I have encountered this problem as well. The workaround is to run one iteration of k-means to create an initial cluster assignment, and then run fuzzy k-means using the output from that first k-means iteration.
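>>
>>In your Clojure setup that would look roughly like the sketch below. This is only a sketch: seed_clusters_path, kmeans_output_path and fkmeans_output_path are placeholder bindings, and it assumes the KMeansDriver.run(conf, input, clustersIn, output, measure, convergenceDelta, maxIterations, runClustering, runSequential) signature from trunk, so double-check the argument order against your 0.6-snapshot.
>>
>>    (import org.apache.mahout.clustering.kmeans.KMeansDriver)
>>
>>    ;; step 1: a single k-means iteration over the random seed clusters
>>    (KMeansDriver/run
>>      hadoop_configuration
>>      input_path
>>      seed_clusters_path          ; e.g. the Path returned by RandomSeedGenerator/buildRandom
>>      kmeans_output_path          ; placeholder output dir for the k-means pass
>>      (new EuclideanDistanceMeasure)
>>      (double 0.5)                ; convergenceDelta
>>      (int 1)                     ; maxIterations = 1
>>      true                        ; runClustering
>>      false)                      ; runSequential
>>
>>    ;; step 2: fuzzy k-means, seeded with the clusters-1 directory written by step 1
>>    (FuzzyKMeansDriver/run
>>      hadoop_configuration
>>      input_path
>>      (new Path kmeans_output_path "clusters-1")
>>      fkmeans_output_path         ; placeholder output dir for fuzzy k-means
>>      (new EuclideanDistanceMeasure)
>>      (double 0.5)                ; convergenceDelta
>>      (int 10)                    ; maxIterations
>>      (float 5.0)                 ; m (fuzziness)
>>      true                        ; runClustering
>>      false                       ; emitMostLikely
>>      (double 0.0)                ; threshold
>>      false)                      ; runSequential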
>>
>>Hope this helps, 
>>
>>Danny Bickson
>>
>>
>>On Mon, Sep 12, 2011 at 10:15 AM, Jeffrey <mycyberpet@yahoo.com> wrote:
>>
>>Hi,
>>>
>>>I have some test data consisting of a number of points, written to a sequence file using a Clojure script as follows (I am equally bad in both Java and Clojure; since I really don't like Java, I write my scripts in Clojure whenever possible).
>>>
>>>    #!./bin/clj
>>>    (ns sensei.sequence.core)
>>>
>>>    (require 'clojure.string)
>>>    (require 'clojure.java.io)
>>>
>>>    (import org.apache.hadoop.conf.Configuration)
>>>    (import org.apache.hadoop.fs.FileSystem)
>>>    (import org.apache.hadoop.fs.Path)
>>>    (import org.apache.hadoop.io.SequenceFile)
>>>    (import org.apache.hadoop.io.Text)
>>>
>>>    (import org.apache.mahout.math.VectorWritable)
>>>    (import org.apache.mahout.math.SequentialAccessSparseVector)
>>>
>>>    (with-open [reader (clojure.java.io/reader *in*)]
>>>      (let [hadoop_configuration ((fn []
>>>                                    (let [conf (new Configuration)]
>>>                                      (. conf set "fs.default.name" "hdfs://localhost:9000/")
>>>                                      conf)))
>>>            hadoop_fs (FileSystem/get hadoop_configuration)]
>>>        (reduce
>>>          (fn [writer [index value]]
>>>            (. writer append index value)
>>>            writer)
>>>          (SequenceFile/createWriter
>>>            hadoop_fs
>>>            hadoop_configuration
>>>            (new Path "test/sensei")
>>>            Text
>>>            VectorWritable)
>>>          (map
>>>            (fn [[tag row_vector]]
>>>              (let [input_index (new Text tag)
>>>                    input_vector (new VectorWritable)]
>>>                (. input_vector set row_vector)
>>>                [input_index input_vector]))
>>>            (map
>>>              (fn [[tag photo_list]]
>>>                (let [photo_map (apply hash-map photo_list)
>>>                      input_vector (new SequentialAccessSparseVector (count (vals photo_map)))]
>>>                  (loop [frequency_list (vals photo_map)]
>>>                    (if (zero? (count frequency_list))
>>>                      [tag input_vector]
>>>                      (when-not (zero? (count frequency_list))
>>>                        (. input_vector set
>>>                           (mod (count frequency_list) (count (vals photo_map)))
>>>                           (Integer/parseInt (first frequency_list)))
>>>                        (recur (rest frequency_list)))))))
>>>              (reduce
>>>                (fn [result next_line]
>>>                  (let [[tag photo frequency] (clojure.string/split next_line #" ")]
>>>                    (update-in result [tag]
>>>                      #(if (nil? %)
>>>                         [photo frequency]
>>>                         (conj % photo frequency)))))
>>>                {}
>>>                (line-seq reader)))))))
>>>
>>>Basically the script receives input (from stdin) in this format
>>>
>>>    tag_uri image_uri count
>>>
>>>e.g.
>>>
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/13980928@N03/6001200971 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/21207178@N07/5441742937 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/25845846@N06/3033371575 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/30366924@N08/5772100510 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/31343451@N00/5957189406 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/36662563@N00/4815218552 1
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/38583880@N00/5686968462 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/43335486@N00/5794673203 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/46857830@N03/5651576112 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/99996011@N00/5396566822 0
>>>
>>>The script then turns these into a sequence file in which each entry represents one point (10 dimensions in this example), with the key set to the tag_uri <http://flickr.com/photos/tags/ísland> and the value set to the point described by the frequency vector (0 0 0 0 0 1 0 0 0 0).
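>>>
>>>(Just as a sanity check -- a hypothetical read-back sketch, reusing the hadoop_fs and hadoop_configuration bindings from the writer script above -- something like this can be used to confirm each key/vector pair that was written:)
>>>
>>>    ;; hypothetical check: print every tag_uri key and its frequency vector
>>>    (import org.apache.hadoop.io.SequenceFile$Reader)
>>>
>>>    (let [reader (new SequenceFile$Reader hadoop_fs (new Path "test/sensei") hadoop_configuration)
>>>          key (new Text)
>>>          value (new VectorWritable)]
>>>      (while (.next reader key value)
>>>        (println (str key) "->" (str (. value get))))
>>>      (.close reader))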
>>>
>>>I then use a script (available in two different variations) to submit the data as a clustering job; however, I am getting an error that I don't know how to fix. It seems that something is wrong with the initial clusters.
>>>
>>>Script variation 1
>>>
>>>    #!./bin/clj
>>>
>>>    (ns sensei.clustering.fkmeans)
>>>
>>>    (import org.apache.hadoop.conf.Configuration)
>>>    (import org.apache.hadoop.fs.Path)
>>>
>>>    (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
>>>    (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
>>>    (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>>>
>>>    (let [hadoop_configuration ((fn []
>>>                                    (let [conf (new Configuration)]
>>>                                      (. conf set "fs.default.name" "hdfs://localhost:9000/")
>>>                                      conf)))
>>>          driver (new FuzzyKMeansDriver)]
>>>      (. driver setConf hadoop_configuration)
>>>      (. driver
>>>         run
>>>         (into-array String ["--input" "test/sensei"
>>>                             "--output" "test/clusters"
>>>                             "--clusters" "test/clusters/clusters-0"
>>>                             "--clustering"
>>>                             "--overwrite"
>>>                             "--emitMostLikely" "false"
>>>                             "--numClusters" "3"
>>>                             "--maxIter" "10"
>>>                             "--m" "5"])))
>>>
>>>Script variation 2:
>>>
>>>    #!./bin/clj
>>>
>>>    (ns sensei.clustering.fkmeans)
>>>
>>>    (import org.apache.hadoop.conf.Configuration)
>>>    (import org.apache.hadoop.fs.Path)
>>>
>>>    (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
>>>    (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
>>>    (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>>>
>>>    (let [hadoop_configuration ((fn []
>>>                                    (let [conf (new Configuration)]
>>>                                      (. conf set "fs.default.name" "hdfs://127.0.0.1:9000/")
>>>                                      conf)))
>>>          input_path (new Path "test/sensei")
>>>          output_path (new Path "test/clusters")
>>>          clusters_in_path (new Path "test/clusters/cluster-0")]
>>>      (FuzzyKMeansDriver/run
>>>        hadoop_configuration
>>>        input_path
>>>        (RandomSeedGenerator/buildRandom
>>>          hadoop_configuration
>>>          input_path
>>>          clusters_in_path
>>>          (int 2)
>>>          (new EuclideanDistanceMeasure))
>>>        output_path
>>>        (new EuclideanDistanceMeasure)
>>>        (double 0.5)   ; convergenceDelta
>>>        (int 10)       ; maxIterations
>>>        (float 5.0)    ; m (fuzziness)
>>>        true           ; runClustering
>>>        false          ; emitMostLikely
>>>        (double 0.0)   ; threshold
>>>        false))        ; runSequential
>>>
>>>I am getting the same error with both variations
>>>
>>>    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>>    SLF4J: Defaulting to no-operation (NOP) logger implementation
>>>    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>>    11/08/25 15:20:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>    11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new compressor
>>>    11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new decompressor
>>>    11/08/25 15:20:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>    11/08/25 15:20:17 INFO input.FileInputFormat: Total input paths to process : 1
>>>    11/08/25 15:20:17 INFO mapred.JobClient: Running job: job_local_0001
>>>    11/08/25 15:20:17 INFO mapred.MapTask: io.sort.mb = 100
>>>    11/08/25 15:20:17 INFO mapred.MapTask: data buffer = 79691776/99614720
>>>    11/08/25 15:20:17 INFO mapred.MapTask: record buffer = 262144/327680
>>>    11/08/25 15:20:17 WARN mapred.LocalJobRunner: job_local_0001
>>>    java.lang.IllegalStateException: No clusters found. Check your -c path.
>>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansMapper.setup(FuzzyKMeansMapper.java:62)
>>>            at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>>            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>>>            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>            at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
>>>    11/08/25 15:20:18 INFO mapred.JobClient:  map 0% reduce 0%
>>>    11/08/25 15:20:18 INFO mapred.JobClient: Job complete: job_local_0001
>>>    11/08/25 15:20:18 INFO mapred.JobClient: Counters: 0
>>>    Exception in thread "main" java.lang.RuntimeException: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>>>            at clojure.lang.Util.runtimeException(Util.java:153)
>>>            at clojure.lang.Compiler.eval(Compiler.java:6417)
>>>            at clojure.lang.Compiler.load(Compiler.java:6843)
>>>            at clojure.lang.Compiler.loadFile(Compiler.java:6804)
>>>            at clojure.main$load_script.invoke(main.clj:282)
>>>            at clojure.main$script_opt.invoke(main.clj:342)
>>>            at clojure.main$main.doInvoke(main.clj:426)
>>>            at clojure.lang.RestFn.invoke(RestFn.java:436)
>>>            at clojure.lang.Var.invoke(Var.java:409)
>>>            at clojure.lang.AFn.applyToHelper(AFn.java:167)
>>>            at clojure.lang.Var.applyTo(Var.java:518)
>>>            at clojure.main.main(main.java:37)
>>>    Caused by: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runIteration(FuzzyKMeansDriver.java:252)
>>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersMR(FuzzyKMeansDriver.java:421)
>>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:345)
>>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>>>            at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>>>            at clojure.lang.Compiler.eval(Compiler.java:6406)
>>>            ... 10 more
>>>
>>>Notice there is a runSequential flag in the 2nd variation; if I set it to true, I get this instead:
>>>
>>>    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>>    SLF4J: Defaulting to no-operation (NOP) logger implementation
>>>    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>>    11/09/07 14:32:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>    11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new compressor
>>>    11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new decompressor
>>>    Exception in thread "main" java.lang.IllegalStateException: Clusters is empty!
>>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersSeq(FuzzyKMeansDriver.java:361)
>>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:343)
>>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>>>            at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>>>            at clojure.lang.Compiler.eval(Compiler.java:6465)
>>>            at clojure.lang.Compiler.load(Compiler.java:6902)
>>>            at clojure.lang.Compiler.loadFile(Compiler.java:6863)
>>>            at clojure.main$load_script.invoke(main.clj:282)
>>>            at clojure.main$script_opt.invoke(main.clj:342)
>>>            at clojure.main$main.doInvoke(main.clj:426)
>>>            at clojure.lang.RestFn.invoke(RestFn.java:436)
>>>            at clojure.lang.Var.invoke(Var.java:409)
>>>            at clojure.lang.AFn.applyToHelper(AFn.java:167)
>>>            at clojure.lang.Var.applyTo(Var.java:518)
>>>            at clojure.main.main(main.java:37)
>>>
>>>Now, if I cluster the data using the CLI tool, it will complete without error
>>>
>>>    $ bin/mahout fkmeans --input test/sensei --output test/clusters --clusters test/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>>
>>>However, even with the --clustering option, I am not seeing any points in the cluster dump generated with this command:
>>>
>>>    $ ./bin/mahout clusterdump --seqFileDir test/clusters/clusters-1 --pointsDir test/clusters/clusteredPoints --output sensei.txt
>>>
>>>And yeah, the command completed without any error too.
>>>
>>>... I've been stuck with this problem over and over again for months, and I still can't get the clustering done properly :(
>>>
>>>Best wishes,
>>>Jeffrey04
>>
>>
>>
>
>