mahout-user mailing list archives

From Andrew Musselman <andrew.mussel...@gmail.com>
Subject Re: How to convert a unique text value to a unique long value for a large data set
Date Thu, 18 Apr 2019 23:45:27 GMT
Ramu, sorry for the belated response, but if you're still interested you may
want to try the new version of item similarity, which is described in
this article:
https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Best
Andrew

On Thu, Sep 20, 2018 at 5:10 AM Ramu Ramaiah <ramu.ramaiah@gmail.com> wrote:

> Hi,
> I am using Apache Mahout's
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> with the input options
>
> 1. --booleanData
> 2. --similarityClassname SIMILARITY_LOGLIKELIHOOD
>
> The log-likelihood similarity algorithm expects numeric input. However, I
> have textual data. One thing I did was to write a trivial
> standalone Java program to convert each unique text value to a unique long
> value, which does the following.
>
> 1. Maintain a Map<String, Long> whose key is the unique text value and
> whose value is the unique long value.
> 2. Before inserting a key, look it up in the Map. If the key exists, do not
> create a new long value. If the key does not exist, increment the counter
> and insert the key with the new counter value into the Map.
>
> However, for large data sets this approach is limited, since the map size
> grows with the number of unique text values.
>
> There are a couple of other ways to do this:
>
> 1. Create a database table with a uniqueness constraint on the text value
> (a primary key). Query the table before inserting a new long value. I am
> guessing this may be slow.
> 2. Use a hashing algorithm. Whatever hashing algorithm I choose, there is a
> possibility of collision, so there is no guarantee of a unique long value
> for a given unique text value.
>
> Are there any better ways to solve this for a large data set?
>
> Thanks,
> Ramu
>
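
For reference, the in-memory Map approach described in the quoted message can be sketched roughly as below. This is only a minimal illustration, not code from the thread: the class name IdDictionary and the method idFor are hypothetical, IDs are assumed to start at 0, and the whole dictionary is assumed to fit in one JVM's heap, which is exactly the limitation raised in the question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: assigns each distinct text value a sequential long ID.
final class IdDictionary {
    private final Map<String, Long> ids = new HashMap<>();
    private long next = 0L; // assumption: IDs start at 0 and increase by 1

    // Returns the existing ID for a known value, or assigns the next free ID.
    long idFor(String text) {
        Long existing = ids.get(text);
        if (existing != null) {
            return existing;
        }
        long id = next++;
        ids.put(text, id);
        return id;
    }

    // Number of distinct text values seen so far.
    int size() {
        return ids.size();
    }
}
```

Because IDs come from a counter rather than a hash, two different text values can never share an ID; the cost is that memory grows with the number of distinct values.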
