mahout-user mailing list archives

From shaikhah otaibi <>
Subject item-item similarity
Date Fri, 24 Apr 2015 16:02:12 GMT

I have a problem finding similarities between items in Mahout: most of
the similarity values are "NaN".

In my work, I want to calculate the similarity between research papers that
users bookmark in their libraries. The users are connected through three
different implicit social networks based on their bookmarking behavior in
CiteULike. I want to show which social network connects the most similar
users (user-based), or show that connected users share similar information
(item-based). So I need to compute the similarities between users. Since
there are no explicit ratings, I tried LogLikelihood and Tanimoto in Mahout,
but I am getting lots of NaN values. I tried both the user-based and the
item-based approach. I am not sure my code is 100% correct, since I am new
to Mahout. In particular, for the item-based case I am not sure whether
Mahout builds the inverted (item-item) matrix itself.

I tried to build the model using:

DataModel model = new FileDataModel(new File("FILENAME.csv"));

Then I calculate the similarity using:

ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);

Then I used the list of paper IDs to go through the matrix and print the
similarities. For instance, if I have paper IDs 1, 2, 3, 4, etc., I print
the similarity between paper 1 and paper 2 as:

System.out.println("item similarity:" + similarity.itemSimilarity(1, 2));

When I checked the NaN values, it seems that I get NaN whenever a paper is
not bookmarked at least twice in the dataset.
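
To make this observation concrete, here is a minimal sketch in plain Java
(no Mahout dependency) of the Tanimoto coefficient computed over the sets of
users who bookmarked two papers. The class and method names are hypothetical,
and returning NaN for an empty intersection is only an assumption meant to
mirror the behavior I seem to be seeing, not a statement about how Mahout
implements it.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Mahout. */
final class TanimotoSketch {
    private TanimotoSketch() {}

    /**
     * Tanimoto (Jaccard) coefficient of two sets of user IDs:
     * size of the intersection divided by size of the union.
     * Assumption: an empty intersection is reported as NaN ("undefined")
     * rather than 0.0, to imitate the output described in this message.
     */
    static double tanimoto(Set<Long> usersA, Set<Long> usersB) {
        // Users who bookmarked both papers.
        Set<Long> common = new HashSet<>(usersA);
        common.retainAll(usersB);
        if (common.isEmpty()) {
            return Double.NaN;
        }
        // |A union B| = |A| + |B| - |A intersect B|
        int unionSize = usersA.size() + usersB.size() - common.size();
        return (double) common.size() / unionSize;
    }
}
```

Under this sketch, a paper bookmarked by only one user can share users with
another paper only if that single user bookmarked both, so most of its pairs
would come out as NaN.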

In the user-based case I used the following:

DataModel model = new FileDataModel(new File("FILENAME.csv"));

UserSimilarity similarity = new LogLikelihoodSimilarity(model);

UserSimilarity jaccsimilarity = new TanimotoCoefficientSimilarity(model);

UserNeighborhood neighborhood = new NearestNUserNeighborhood(5, similarity,

Then from the list of userids, I tried to print the similarities between
users who are connected using the social network as follows:


Could you please help me understand why I am getting so many NaN values,
and how I can deal with them in order to compare the average similarities
of the three social networks? Should I replace them with zero?
(Mathematically, if the intersection is zero, both the Tanimoto coefficient
and log-likelihood should give zero.)
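
The two ways of averaging that I am weighing can be sketched in plain Java
as follows. This is only an illustration: the class and method names are
hypothetical, nothing here comes from Mahout, and which policy is appropriate
is exactly what I am asking about.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Mahout. */
final class NaNAverageSketch {
    private NaNAverageSketch() {}

    /** Mean over the defined (non-NaN) similarities only; NaN if none are defined. */
    static double meanSkippingNaN(double[] sims) {
        return Arrays.stream(sims)
                .filter(s -> !Double.isNaN(s))
                .average()
                .orElse(Double.NaN);
    }

    /** Mean after replacing every NaN similarity with 0.0; NaN for an empty array. */
    static double meanNaNAsZero(double[] sims) {
        return Arrays.stream(sims)
                .map(s -> Double.isNaN(s) ? 0.0 : s)
                .average()
                .orElse(Double.NaN);
    }
}
```

For the similarities {0.5, NaN, 0.25, NaN}, skipping NaN averages over two
values while replacing NaN with zero averages over four, so a network with
many undefined pairs would score much lower under the second policy.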


