mahout-user mailing list archives

From: Allen McIntosh <amcint...@appcomsci.com>
Subject: Re: seq2sparse dropping tokens
Date: Fri, 29 May 2015 19:38:13 GMT
On 05/29/2015 03:13 PM, Suneel Marthi wrote:
> Allen, could you please file a JIRA for this?

Sure.  Do you have any idea what it is?

On the other question I had: after getting a few hours of sleep I was
able to formulate the right Google query :-) and was pointed to
http://jayaniwithanawasam.blogspot.com, which led me to TopicModel and
gave me a running start on the coding.

However, I ran into a tiny problem.  TopicModel seems to expect to read
an existing model either from a single file or from several files passed
in via varargs.  Since the model is now spread out over several files,
it would save some trauma if the documentation warned about this.
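
In case it helps anyone else who lands on this thread, this is roughly
the shape of the code I ended up with.  It is only a minimal sketch:
the eta, alpha, thread count and model weight values are placeholders,
and the exact TopicModel constructor signature may differ between
releases, so check the 0.10.0 javadoc before copying it.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.lda.cvb.TopicModel;

public class LoadCvbModel {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // The model directory contains several part files, so collect them
    // all and pass the whole list to the varargs constructor.
    List<Path> parts = new ArrayList<Path>();
    for (FileStatus status : fs.globStatus(new Path(args[0], "part-*"))) {
      parts.add(status.getPath());
    }

    String[] dictionary = readDictionary(conf, new Path(args[1]));

    // eta, alpha, numTrainingThreads and modelWeight are placeholders;
    // use the values the model was actually trained with.
    TopicModel model = new TopicModel(conf, 0.1, 0.1, dictionary, 1, 1.0,
        parts.toArray(new Path[parts.size()]));

    // model can then be used to infer topic weights for a new document
    // vector.
  }

  // seq2sparse writes the dictionary as <Text term, IntWritable termId>
  // pairs, one entry per term.
  private static String[] readDictionary(Configuration conf, Path dictPath)
      throws IOException {
    List<String> terms = new ArrayList<String>();
    Text term = new Text();
    IntWritable id = new IntWritable();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(dictPath));
    try {
      while (reader.next(term, id)) {
        while (terms.size() <= id.get()) {
          terms.add(null);
        }
        terms.set(id.get(), term.toString());
      }
    } finally {
      reader.close();
    }
    return terms.toArray(new String[terms.size()]);
  }
}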

>
> On Fri, May 29, 2015 at 8:58 AM, Allen McIntosh <amcintosh@appcomsci.com>
> wrote:
>
>> This shows up with Mahout 0.10.0 (the distribution archive) and Hadoop
>> 2.2.0
>>
>> When I run seq2sparse on a document containing the following tokens:
>>
>> cash cash equival cash cash equival consist highli liquid instrument
>> commerci paper time deposit other monei market instrument which origin
>> matur three month less aggreg cash balanc bank reclassifi neg balanc
>> consist mainli unclear check account payabl neg balanc reclassifi
>> account payabl decemb
>>
>> the tokens mainli, check and unclear are dropped on the floor (they do
>> not appear in the dictionary file).  The issue persists if I change the
>> analyzer to SimpleAnalyzer (-a
>> org.apache.lucene.analysis.core.SimpleAnalyzer).  I can understand an
>> English analyzer doing something like this, but it seems a little
>> strange that it would happen with SimpleAnalyzer.  (I wonder if it is a
>> coincidence that these tokens appear consecutively in the input.)
>>
>> What I am trying to do:  The standard analyzers don't do enough, and I
>> have no access to the client's cluster to preload a custom analyzer.
>> Processing the text before stuffing it into the initial sequence file
>> seemed to be the cleanest alternative, since there doesn't seem to be
>> any way to add a custom jar when using a stock Mahout app.
>>
>> Why dropped or mangled tokens matter, other than as missing information:
>>   Ultimately what I need to do is calculate topic weights for an
>> arbitrary chunk of text.  (See next post.)  If I can't get the tokens
>> right, I don't think I can do this.
>>
>>
>>
>>
>
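
For what it's worth, running the offending text through SimpleAnalyzer
directly (outside of seq2sparse) is a quick way to tell whether the
analyzer itself is eating the tokens or whether they disappear later in
the dictionary/vectorization step.  A minimal sketch against the Lucene
4.x line that ships with Mahout 0.10 (the Version argument may not be
needed, or may differ, on other Lucene releases):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenCheck {
  public static void main(String[] args) throws Exception {
    String text = "aggreg cash balanc bank reclassifi neg balanc "
        + "consist mainli unclear check account payabl";

    // Lucene 4.x constructor; newer releases drop the Version argument.
    Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_46);
    TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      // If mainli/unclear/check show up here, the loss happens downstream
      // of the analyzer, e.g. in the dictionary/vectorization step.
      System.out.println(term.toString());
    }
    stream.end();
    stream.close();
  }
}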
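The preprocessing route mentioned above would then just mean writing the
already-processed text into the input sequence file yourself, instead of
running seqdirectory.  A minimal sketch of that; the document id and the
preprocess() method are placeholders for whatever the application needs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WritePreprocessedDocs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path(args[0]);

    // seq2sparse expects <Text docId, Text documentText> pairs as input.
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class));
    try {
      String docId = "/doc-0001";                   // placeholder id
      String raw = "Cash and cash equivalents ...";  // placeholder text
      String processed = preprocess(raw);
      writer.append(new Text(docId), new Text(processed));
    } finally {
      writer.close();
    }
  }

  private static String preprocess(String text) {
    // Placeholder for the custom analysis the stock analyzers don't cover.
    return text.toLowerCase();
  }
}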

