mahout-user mailing list archives

From Ramu Ramaiah
Subject Fwd: How to convert a unique text value to a unique long value for a large data set
Date Thu, 20 Sep 2018 12:09:48 GMT
I am using the Apache Mahout's
with the input options

1. --booleanData
2. --similarityClassname SIMILARITY_LOGLIKELIHOOD

The log-likelihood similarity algorithm expects numeric input. However, my
data is textual. One thing I tried was writing a trivial standalone Java
program that converts each unique text value to a unique long value. It
does the following.

1. Maintain a Map<String, Long> in which the key is the unique text value
and the value is the unique long value assigned to it.
2. Before inserting a key, look it up in the Map. If the key exists, reuse
its long value and do not create a new one. If the key does not exist,
increment the counter and insert the key with the new counter value.
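
A minimal sketch of the two steps above, assuming a single-threaded
program. The class name TextIdMapper and the method idFor are hypothetical
names chosen for illustration, not part of Mahout.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the class and method names are illustrative only.
class TextIdMapper {
    // Key: unique text value. Value: the long value assigned to it.
    private final Map<String, Long> ids = new HashMap<>();
    private long counter = 0L;

    // Returns the existing long for a known text value; otherwise
    // increments the counter and stores the new mapping.
    long idFor(String text) {
        Long existing = ids.get(text);
        if (existing != null) {
            return existing;
        }
        counter++;
        ids.put(text, counter);
        return counter;
    }

    // Number of distinct text values seen so far.
    int size() {
        return ids.size();
    }

    public static void main(String[] args) {
        TextIdMapper mapper = new TextIdMapper();
        for (String user : new String[] {"alice", "bob", "alice"}) {
            System.out.println(user + " -> " + mapper.idFor(user));
        }
    }
}
```

Every distinct string stays in the map for the lifetime of the program, so
memory use grows with the number of unique text values.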

However, for large data sets this approach is limited by memory, since the
map grows with the number of unique text values.

There are a couple of other ways to do this, each with a drawback:

1. Create a database table with a uniqueness constraint (a primary key) on
the text value. Query the table before inserting a new long value. I am
guessing this may be slow.
2. Use a hashing algorithm. Whichever one I choose, there is a possibility
of collision, so there is no guarantee of a unique long value for a given
unique text value.
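
To illustrate the collision concern in option 2, here is a small sketch. It
assumes one possible scheme, taking the first 8 bytes of a SHA-256 digest
as the long value; the class name HashedId and the method hashToLong are
hypothetical. A 64-bit hash makes collisions far less likely than Java's
32-bit String.hashCode(), but it still cannot guarantee uniqueness.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; the class and method names are illustrative only.
class HashedId {
    // Derives a long from the first 8 bytes of the SHA-256 digest of the
    // text. Deterministic and needs no lookup table, but two different
    // strings can in principle map to the same long.
    static long hashToLong(String text) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(text.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are a well-known 32-bit String.hashCode() collision.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // prints "true"
        System.out.println(hashToLong("Aa") + " " + hashToLong("BB"));
    }
}
```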

Are there any better ways to solve this for a large data set?

