mahout-user mailing list archives

From Sean Owen <>
Subject Re: new user
Date Mon, 19 Oct 2009 15:47:47 GMT
You got the "100K" data set, which is quite different for some reason.
Make sure you nab the 1M data set and the instructions will make sense.

The target directory should exist in the tarball, since it exists in
SVN, but oops maybe it doesn't for some reason. In any event you can
just create it.

Yes the underlying FileDataModel is pretty flexible. The javadoc
should cover it pretty well -- tab or comma separated, needs the first
three fields to be user ID, item ID, pref value (if applicable).
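
As a rough illustration of that line format (not Mahout's own parsing
code), here is a minimal sketch. PrefLineParser and Pref are hypothetical
names, and the sample line in main is made up. It assumes only what is
said above: fields separated by tab or comma, user ID first, item ID
second, an optional pref value third, anything after that ignored.

```java
// Illustrative sketch only: this is NOT Mahout's FileDataModel code.
// PrefLineParser and Pref are hypothetical names used for this example.
final class PrefLineParser {

    /** One parsed line: user ID, item ID, and an optional preference value. */
    static final class Pref {
        final long userId;
        final long itemId;
        final Float value; // null when the line carries no preference value

        Pref(long userId, long itemId, Float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Splits on tab or comma and reads the first three fields; later fields are ignored. */
    static Pref parse(String line) {
        String[] fields = line.trim().split("[\t,]");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected at least user ID and item ID: " + line);
        }
        long userId = Long.parseLong(fields[0].trim());
        long itemId = Long.parseLong(fields[1].trim());
        Float value = null;
        if (fields.length > 2 && !fields[2].trim().isEmpty()) {
            value = Float.valueOf(fields[2].trim());
        }
        return new Pref(userId, itemId, value);
    }

    public static void main(String[] args) {
        // Made-up sample line: user, item, rating, timestamp (timestamp is ignored).
        Pref p = parse("1\t42\t4.5\t881250949");
        System.out.println(p.userId + " " + p.itemId + " " + p.value);
    }
}
```

A line with only two fields, such as "7,9", parses with a null value,
which corresponds to the "if applicable" case above.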

It will read the '' file just fine. However, the example code that
this tutorial references uses a custom implementation, since the
1M and 10M data set files use a strange format that needs
something customized. You could easily dig in to the code and swap in
FileDataModel for GroupLensDataModel if you want to use the 100K data
set.
The other data is pretty domain-specific and is not directly relevant
to a recommender engine. So no, there is nothing that would do anything
with 'u.item', for instance. However, it would be pretty easy to write,
for example, a custom ItemSimilarity implementation that reads this
and deduces some notion of similarity from genre. You could then plug
that in to a GenericItemBasedRecommender for a fast, and perhaps quite
effective, recommender.
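
To make the genre idea concrete, here is a minimal sketch of one possible
"notion of similarity": the Jaccard overlap of the 19 genre flags at the
end of each 'u.item' record. GenreSimilarity is a hypothetical helper, not
a Mahout class, and Jaccard is just one choice among many. The field
delimiter is passed in as a regex because it should be checked against
the actual file. Wiring the result into an ItemSimilarity implementation
is left out.

```java
// Illustrative sketch only: GenreSimilarity is a hypothetical helper,
// not a Mahout class, and Jaccard is just one possible similarity.
final class GenreSimilarity {

    // The u.item description says the last 19 fields are genre flags.
    static final int GENRE_COUNT = 19;

    /** Extracts the trailing 19 genre flags (0 or 1) from one u.item record. */
    static boolean[] genreFlags(String record, String delimiterRegex) {
        // String.split drops trailing empty fields, so a trailing delimiter is harmless.
        String[] fields = record.split(delimiterRegex);
        if (fields.length < GENRE_COUNT) {
            throw new IllegalArgumentException("Too few fields: " + record);
        }
        boolean[] flags = new boolean[GENRE_COUNT];
        int offset = fields.length - GENRE_COUNT;
        for (int i = 0; i < GENRE_COUNT; i++) {
            flags[i] = "1".equals(fields[offset + i].trim());
        }
        return flags;
    }

    /** Jaccard similarity: genres shared by both movies over genres present in either. */
    static double jaccard(boolean[] a, boolean[] b) {
        int shared = 0;
        int either = 0;
        for (int i = 0; i < GENRE_COUNT; i++) {
            if (a[i] && b[i]) {
                shared++;
            }
            if (a[i] || b[i]) {
                either++;
            }
        }
        return either == 0 ? 0.0 : (double) shared / either;
    }

    public static void main(String[] args) {
        // Made-up flag vectors: index 1 = Action, 2 = Adventure, 5 = Comedy.
        boolean[] first = new boolean[GENRE_COUNT];
        boolean[] second = new boolean[GENRE_COUNT];
        first[1] = true;
        first[2] = true;
        second[1] = true;
        second[5] = true;
        System.out.println(jaccard(first, second));
    }
}
```

Two movies with no genre flags set at all get a similarity of 0.0 here,
which is an arbitrary choice for the sketch.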

Ah perhaps this will be an example in the book ... :)


On Mon, Oct 19, 2009 at 4:27 PM, Brian Wolf <> wrote:
> Hi,
> I discovered and downloaded mahout today. Maybe it's just giddiness, but can
> you help me,
> this is from the tutorial
> "
>   1. Download the "1 Million MovieLens Dataset" from
>   2. Unpack the archive and copy   ->movies.dat<-   and    ->ratings.dat<-
>     to
>   trunk/taste-web/src/main/resources/org/apache/mahout/cf/taste/example/grouplens
> under
>   the Mahout distribution directory.
> "
> I downloaded the MovieLens data set, there is no "movies.dat or
> ratings.dat". Are the correct files and u.item?
> I haven't found any documentation on file formats, there are other things
> confusing to new users, such as when I built the downloaded gz file with
> maven following the instructions, the directory structure was only partly
> built, however, when I checked out with svn, the full directory structure
> was built.
> Can Taste incorporate other data files, like the ones listed below, as
> well, i.e. demographic data, etc.? Where can I find documentation about data
> file formats accepted by taste, or do I need to dig into the code?
> Thank you,
> Brian Wolf
> developer
> gOgO deVelopment, ltd
> Sedona, AZ
>     -- The full u data set, 100000 ratings by 943 users on 1682
> items.
>              Each user has rated at least 20 movies.  Users and items are
>              numbered consecutively from 1.  The data is randomly
>              ordered. This is a tab separated list of
>         user id | item id | rating | timestamp.
>              The time stamps are unix seconds since 1/1/1970 UTC
>     -- The number of users, items, and ratings in the u data set.
> u.item     -- Information about the items (movies); this is a tab separated
>              list of
>              movie id | movie title | release date | video release date |
>              IMDb URL | unknown | Action | Adventure | Animation |
>              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
>              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
>              Thriller | War | Western |
>              The last 19 fields are the genres, a 1 indicates the movie
>              is of that genre, a 0 indicates it is not; movies can be in
>              several genres at once.
>              The movie ids are the ones used in the data set.
> u.genre    -- A list of the genres.
> u.user     -- Demographic information about the users; this is a tab
>              separated list of
>              user id | age | gender | occupation | zip code
>              The user ids are the ones used in the data set.
> u.occupation -- A list of the occupations.
