mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Clustering from DB
Date Sat, 11 Jul 2009 02:24:39 GMT
Hmm, that might be a mistake on my part when trying to account for how
Hadoop 0.20 now resolves globs.  I somewhat blindly applied "/*" where
needed, but I think it is likely worth revisiting here, where a
specific file is needed?
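
To spell out where the doubled path comes from, here is a minimal
standalone sketch in plain Java (no Hadoop involved). The class and
method names are hypothetical, made up for illustration; only the
string concatenations are taken from the snippets quoted below.

```java
public class ConvergencePathSketch {

    // Hypothetical helper: mirrors how runJob() names each iteration's output directory.
    static String clustersOut(String output, int iteration) {
        return output + "/clusters-" + iteration;
    }

    // Path as currently built: the caller appends "/part-00000",
    // then isConverged() appends "/*" on top of that.
    static String currentReaderPath(String clustersOut) {
        String filePath = clustersOut + "/part-00000";
        return filePath + "/*";
    }

    // Path with the trailing glob dropped, so it names the part file itself.
    // This is one possible fix, not necessarily the one that gets committed.
    static String proposedReaderPath(String clustersOut) {
        return clustersOut + "/part-00000";
    }

    public static void main(String[] args) {
        String out = clustersOut("output", 0);
        System.out.println(currentReaderPath(out));   // output/clusters-0/part-00000/*
        System.out.println(proposedReaderPath(out));  // output/clusters-0/part-00000
    }
}
```

Whether the reader should open the part file directly or the glob
should move up to the directory level (output/clusters-0/*) is still
an open question; the sketch only shows that the two suffixes are
being stacked.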

-Grant

On Jul 10, 2009, at 3:08 PM, nfantone wrote:

> This error is still bugging me. The exception:
>
> WARNING: java.io.FileNotFoundException: File
> output/clusters-0/part-00000/* does not exist.
> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
> does not exist.
>
> occurs first at:
>
> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>
> which corresponds to:
>
>  private static boolean isConverged(String filePath, JobConf conf, FileSystem fs)
>      throws IOException {
>    Path outPart = new Path(filePath + "/*");
>    SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart, conf);  <-- THIS
>    ...
>  }
>
> where isConverged() is called in this fashion:
>
> return isConverged(clustersOut + "/part-00000", conf, fs);
>
> by runIteration(), which is previously invoked by runJob() like:
>
>     String clustersOut = output + "/clusters-" + iteration;
>     converged = runIteration(input, clustersIn, clustersOut, measureClass,
>         delta, numReduceTasks, iteration);
>
> Consequently, assuming it's the first iteration and the output folder
> has been named "output" by the user, the SequenceFile.Reader receives
> "output/clusters-0/part-00000/*" as a path, which does not exist. I
> believe the path should end in "part-00000" and the + "/*" should be
> removed... although someone, evidently, thought otherwise.
>
> Any feedback?
>
> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nfantone@gmail.com> wrote:
>> I was using Canopy to create input clusters, but the error appeared
>> while running kMeans (if I run kMeans' job only with previously
>> created clusters from Canopy placed in output/canopies as initial
>> clusters, it still fails). I noticed no other problems. I was using
>> revision 790979 before updating.  Strangely, there were no changes in
>> the job and drivers class from that revision. svn diff shows that the
>> only classes that changed in org.apache.mahout.clustering.kmeans
>> package were KMeansInfo.java and RandomSeedGenerator.java
>>
>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<jdog@windwardsolutions.com> wrote:
>>> Hum, no, it's looking for the output of the first iteration. Were there
>>> other errors? What was the last revision you were running? It does look like
>>> something got horked, as it should be looking for output/clusters-0/*. Can
>>> you diff the job and driver class to see what changed?
>>>
>>> Jeff
>>>
>>> nfantone wrote:
>>>>
>>>> Fellows, today I updated to revision 791558 and while running kMeans I
>>>> got the following exception:
>>>>
>>>> WARNING: java.io.FileNotFoundException: File
>>>> output/clusters-0/part-00000/* does not exist.
>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>> does not exist.
>>>>
>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>> thrown before the update and, to me, its message is not quite clear.
>>>> It seems as if it's looking for any file inside a "part-00000" directory,
>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are the default
>>>> names for output files.
>>>>
>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>
>>>>
>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nfantone@gmail.com> wrote:
>>>>
>>>>>
>>>>> Thanks for the feedback, Jeff.
>>>>>
>>>>>
>>>>>>
>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>> sequence file format, but the Key is never used. To my knowledge, there
>>>>>> is no requirement to assign identifiers to the input points*. Users are
>>>>>> free to associate an arbitrary name field with each vector - also label
>>>>>> mappings may be assigned - but these are not manipulated by KMeans or
>>>>>> any of the other clustering applications. The name field is now used as
>>>>>> a vector identifier by the KMeansClusterMapper - if it is non-null - in
>>>>>> the output step only.
>>>>>>
>>>>>
>>>>> The keys may not be used internally, but externally they can prove to
>>>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>>>> a user's historical behavior. Being able to collect the output
>>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>>> for instance, retrieve user information using data directly from an
>>>>> HDFS file's field.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

