mahout-user mailing list archives

From Andrew Musselman <>
Subject Re: How to convert a unique text value to a unique long value for a large data set
Date Thu, 18 Apr 2019 23:45:27 GMT
Ramu, sorry for the belated response but if you're still interested you may
want to try the new version of item similarity, which is described some in
this article:


On Thu, Sep 20, 2018 at 5:10 AM Ramu Ramaiah <> wrote:

> Hi,
> I am using the Apache Mahout's
> with the input options
> 1. --booleanData
> 2. --similarityClassname SIMILARITY_LOGLIKELIHOOD
> The log-likelihood similarity algorithm expects numeric input. However, I
> have textual data. One thing I did was write a trivial
> standalone Java program to convert each unique text value to a unique long
> value, which does the following.
> 1. Maintain a Map such that the key is the unique text value and the value is
> the unique long value: Map<String, Long>.
> 2. Before inserting a key, look it up in the Map. If the key exists, do not
> create a new Long value. If the key does not exist, increment the counter
> value and insert it into the Map.
> However, for large data sets, this may have a limitation since the map size
> grows with the number of unique text values.
> There are a couple of ways to do this:
> 1. Create a database table with a uniqueness constraint on the text value (a
> primary key). Query the table before inserting a new long value. I am
> guessing this may be slow.
> 2. Whatever hashing algorithm I choose, there's a possibility of
> collision, and there's no guarantee of a unique long value for a given
> unique text value.
> Are there any better ways to solve this for a large data set?
> Thanks,
> Ramu
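
For reference, the in-memory approach Ramu describes in steps 1 and 2 might look roughly like the sketch below. The class and method names (TextIdMapper, idFor) and the sample rows are hypothetical, made up for illustration; nothing here is part of Mahout. It keeps one Map<String, Long> per ID space, so memory still grows with the number of unique text values, which is exactly the limitation raised above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a Mahout class.
final class TextIdMapper {
    // Maps each unique text value to the long ID assigned to it.
    private final Map<String, Long> ids = new HashMap<>();
    // Next unused ID; this sketch assumes IDs start at 0.
    private long nextId = 0L;

    // Returns the existing ID for a text value, or assigns the next counter value.
    long idFor(String text) {
        Long existing = ids.get(text);
        if (existing != null) {
            return existing;
        }
        long id = nextId++;
        ids.put(text, id);
        return id;
    }

    // Number of unique text values seen so far.
    int size() {
        return ids.size();
    }

    public static void main(String[] args) {
        // Separate mappers for users and items, since each is its own ID space.
        TextIdMapper users = new TextIdMapper();
        TextIdMapper items = new TextIdMapper();
        // Made-up sample rows of (user, item) text pairs.
        String[][] rows = { {"alice", "book-a"}, {"bob", "book-a"}, {"alice", "book-b"} };
        for (String[] row : rows) {
            // Emit numeric "userID,itemID" lines suitable as boolean preference input.
            System.out.println(users.idFor(row[0]) + "," + items.idFor(row[1]));
        }
    }
}
```

The same text value always maps back to the same long, and no two text values share an ID, so there is no collision risk; the cost is holding every unique string in memory for the duration of the conversion.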
