mahout-user mailing list archives

From Aurora Skarra-Gallagher <aur...@yahoo-inc.com>
Subject Re: Running Taste Web example without the webserver
Date Thu, 23 Jul 2009 23:09:24 GMT
Hi,

Thank you for responding. My spam filter was "out to get me" and misclassified your responses.

I will investigate the Hadoop integration piece, specifically RecommenderJob. Currently, the
Hadoop grid I'm working with is using 0.18.3. Will that pose a problem? I noticed some threads
about versions of Hadoop less than 0.19 not working.

We are looking at starting with 70M users and scaling up to 500M eventually. It is hard for
me to estimate the number of items. We could be starting out with 100, but as these items
are entities that we extract, there could be tens of thousands eventually. I would guess that
most users would have fewer than 100 of these.
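
As a rough back-of-envelope sketch of what these figures imply for the number of user-item preferences: the counts below come from the numbers above (taking 100 items per user as an upper bound), while the bytes-per-preference value is purely an assumption for illustration, not a measured figure.

```python
# Back-of-envelope upper bound on user-item preferences.
# BYTES_PER_PREFERENCE is a hypothetical value chosen for illustration only.

BYTES_PER_PREFERENCE = 16  # assumption: IDs plus a value plus some overhead


def max_preferences(users: int, max_items_per_user: int) -> int:
    """Upper bound on preferences if every user had the maximum item count."""
    return users * max_items_per_user


def approx_gib(preferences: int,
               bytes_per_preference: int = BYTES_PER_PREFERENCE) -> float:
    """Approximate raw size in GiB under the assumed per-preference cost."""
    return preferences * bytes_per_preference / 2**30


if __name__ == "__main__":
    for users in (70_000_000, 500_000_000):
        prefs = max_preferences(users, 100)
        print(f"{users:,} users -> at most {prefs:,} preferences, "
              f"roughly {approx_gib(prefs):,.0f} GiB "
              f"at {BYTES_PER_PREFERENCE} bytes each")
```

Since most users would have fewer than 100 items, the real counts should be lower than these bounds.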

Does that help? I would be interested in your input on the algorithms and also being a guinea
pig for the code you're developing, if it makes sense.

-Aurora


On 7/23/09 12:43 AM, "Sean Owen" <srowen@gmail.com> wrote:

Aurora, did you see my last reply on the list?

On Wed, Jul 22, 2009 at 9:29 AM, Sean Owen<srowen@gmail.com> wrote:
> Yes, there are a few components here -- a few different purposes. All
> are built around the core library, which isn't specific to Hadoop or an
> HTTP server, but you've seen some of the components that adapt the
> core to these contexts. There are also components that can evaluate or
> load test the code.
>
> The only piece you are interested in then is really the Hadoop
> integration -- see org.apache.mahout.cf.taste.hadoop. There you will
> find RecommenderJob which should be able to launch a
> pseudo-distributed recommender job. I say pseudo since these
> algorithms are not in general distributable, but one can of course
> run n instances of a recommender to compute 1/nth of all
> recommendations each. That is nice, though it means, say, that the amount
> of RAM the jobs consume is still limited by the size of each machine.
>
> I just recently rewrote this package to be compatible with Hadoop
> 0.20's new APIs. I do not know that it works, and I have some reason to
> believe there are bugs in the API that will prevent it from working.
> So this piece is currently in flux.
>
> If you want to experiment and be a guinea pig for this latest
> revision, I can provide close support to work through the bugs on both
> sides. Or we can talk about your requirements a bit more to figure out
> whether this is feasible, what the best algorithm is, and whether you need
> Hadoop.
>
> How big is 'massive'? Could you reveal how many users, items, and
> user-item preferences, to an order of magnitude? What is generally the
> nature of the input data you have, and what recommendations do you want out?
>
> On Wed, Jul 22, 2009 at 12:12 AM, Aurora
> Skarra-Gallagher<aurora@yahoo-inc.com> wrote:
>> Hi,
>>
>> I apologize if I've misunderstood the purpose of the Taste component of Mahout. Our
>> goal was to take a recommendation framework and use our own recommendation algorithm within
>> it. We need to process a massive amount of data, and wanted it to be done on our Hadoop grid.
>> I thought that Taste was the right fit for the job. I'm not interested in the HTTP service.
>> I'm interested in the recommendation framework, particularly from a back-end batch perspective.
>> Does that help clarify? Thanks for helping me sort through this.
>>
>> -Aurora
>>
>>
>> On 7/21/09 3:02 PM, "Sean Owen" <srowen@gmail.com> wrote:
>>
>> Hmm, lots going on here, it's confusing.
>>
>> Are you trying to run this on Hadoop intentionally? I ask because the web
>> app example is not intended to run on Hadoop. It's a component
>> intended to serve recommendations over HTTP in real time. It also
>> appears you are running an evaluation rather than a web app serving
>> requests. I realize you're trying to run this without Jetty, but
>> that's kind of like trying to run a web app without a web server.
>>
>> I think you'd have to clarify what you are trying to do, and then what
>> you are doing right now, to begin to assist.
>>
>> On Tue, Jul 21, 2009 at 9:20 PM, Aurora
>> Skarra-Gallagher<aurora@yahoo-inc.com> wrote:
>>> Hi,
>>>
>>> I'm trying to run the Taste web example without using Jetty. Our gateways aren't
>>> meant to be used as webservers. By poking around, I found that the following command worked:
>>> hadoop --config ~/hod-clusters/test jar /x/mahout-current/examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner
>>>
>>> The output is:
>>> 09/07/21 19:59:21 INFO file.FileDataModel: Creating FileDataModel for file /tmp/ratings.txt
>>> 09/07/21 19:59:21 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.9 of GroupLensDataModel
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Reading file info...
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 100000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 200000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 300000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 400000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 500000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 600000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 700000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 800000 lines
>>> 09/07/21 19:59:23 INFO file.FileDataModel: Processed 900000 lines
>>> 09/07/21 19:59:23 INFO file.FileDataModel: Processed 1000000 lines
>>> 09/07/21 19:59:23 INFO file.FileDataModel: Read lines: 1000209
>>> 09/07/21 19:59:30 INFO slopeone.MemoryDiffStorage: Building average diffs...
>>> 09/07/21 19:59:42 INFO eval.AbstractDifferenceRecommenderEvaluator: Evaluation result: 0.7035965559003973
>>> 09/07/21 19:59:42 INFO grouplens.GroupLensRecommenderEvaluatorRunner: 0.7035965559003973
>>>
>>> The job appears to write data to /tmp/ratings.txt and /tmp/movies.txt. I'm not
>>> sure if this is the correct way to run this example. I have a few questions:
>>>
>>>  1.  Is the output file /tmp/ratings.txt? If so, how do I interpret it?
>>>  2.  What does the Evaluation result mean?
>>>  3.  Is it even running on HDFS?
>>>  4.  Is it a map-reduce job?
>>>
>>> Any pointers on how to run this as a standalone job would be helpful.
>>>
>>> Thanks,
>>> Aurora
>>>
>>
>>
>
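
The "pseudo-distributed" idea described in the quoted reply (run n instances of a recommender, each computing 1/nth of all recommendations) can be sketched roughly as below. This is only an illustration in Python, not Mahout's RecommenderJob: the names `slice_for_worker`, `run_worker`, and `toy_recommend` are hypothetical, and splitting users by ID modulo n is an assumed partitioning scheme.

```python
from typing import Callable, Dict, Iterable, List


def slice_for_worker(user_ids: Iterable[int],
                     worker_index: int,
                     num_workers: int) -> List[int]:
    """Return the users assigned to one worker (assumed scheme: ID modulo n)."""
    return [u for u in user_ids if u % num_workers == worker_index]


def run_worker(user_ids: Iterable[int],
               worker_index: int,
               num_workers: int,
               recommend: Callable[[int], List[int]]) -> Dict[int, List[int]]:
    """Compute recommendations only for this worker's share of the users."""
    mine = slice_for_worker(user_ids, worker_index, num_workers)
    return {u: recommend(u) for u in mine}


def toy_recommend(user_id: int) -> List[int]:
    """Hypothetical stand-in for a real recommender; returns dummy item IDs."""
    return [user_id + 1, user_id + 2]


if __name__ == "__main__":
    all_users = range(10)
    n = 3
    # Each iteration stands in for an independent job on its own machine.
    for i in range(n):
        print(i, run_worker(all_users, i, n, toy_recommend))
```

The recommendation work is divided, but each instance still runs a complete recommender on a single machine, which is why memory use stays bounded by the size of each machine.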

