mahout-user mailing list archives

From: Sebastian Schelter <...@apache.org>
Subject: Re: Mahout beginner questions...
Date: Thu, 05 Apr 2012 07:47:09 GMT
You don't have to hold the rating matrix in memory. When computing
recommendations for a user, fetch all his ratings from some datastore
(database, key-value-store, memcache...) with a single query and use the
item similarities that are held in-memory to compute the recommendations.
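
To make that concrete, here is a rough sketch (my own illustration, not a
Mahout class) of the weighted-sum scoring step, assuming the user's ratings
were fetched with one query and an in-memory ItemSimilarity (for example a
GenericItemSimilarity) is available:

import java.util.Map;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class OnlineScorer {

  // userRatings: itemID -> rating, fetched from the datastore in a single query
  // similarity:  precomputed item-item similarities held in memory
  public static double score(long candidateItemID,
                             Map<Long, Float> userRatings,
                             ItemSimilarity similarity) throws TasteException {
    double weightedSum = 0.0;
    double totalWeight = 0.0;
    for (Map.Entry<Long, Float> rating : userRatings.entrySet()) {
      double sim = similarity.itemSimilarity(candidateItemID, rating.getKey());
      if (!Double.isNaN(sim)) {
        weightedSum += sim * rating.getValue();
        totalWeight += Math.abs(sim);
      }
    }
    return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
  }
}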

--sebastian

On 05.04.2012 09:44, Razon, Oren wrote:
> Thanks for the answer, but still...
> I will still need to keep the rating matrix in memory so that I can
> combine the ratings a user gave to items with the item similarities.
> 
> -----Original Message-----
> From: Sebastian Schelter [mailto:ssc@apache.org] 
> Sent: Thursday, April 05, 2012 10:34
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
> 
> Hi Oren,
> 
> If you use an item-based approach, it's sufficient to use the top-k
> similar items per item (with k somewhere between 25 and 100). That means
> the data to hold in memory is num_items * k data points.
> 
> While this is a theoretical limitation, it should not be a problem in
> practical scenarios, as you can easily fit several hundred million such
> data points in a few gigabytes of RAM.
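> 
> As a rough illustration (my numbers, not a measurement): 1,000,000 items
> with k = 100 gives 100 million item-item pairs, and at a few dozen bytes
> per pair (two item IDs, a double and some object overhead) that is on the
> order of a few gigabytes. A sketch of wrapping such precomputed pairs,
> assuming they have already been read from wherever you precomputed them:
> 
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
> import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
> 
> public class TopKSimilarities {
>   public static ItemSimilarity load() {
>     // In reality these pairs come from your precomputed top-k output
>     // (num_items * k entries in total); two are hardcoded as placeholders.
>     List<GenericItemSimilarity.ItemItemSimilarity> pairs =
>         new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
>     pairs.add(new GenericItemSimilarity.ItemItemSimilarity(12L, 34L, 0.83));
>     pairs.add(new GenericItemSimilarity.ItemItemSimilarity(12L, 56L, 0.71));
>     return new GenericItemSimilarity(pairs);
>   }
> }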
> 
> --sebastian
> 
> 
> On 05.04.2012 09:27, Razon, Oren wrote:
>> OK, so here is the point I'm still not getting.
>>
>> The architecture we are talking about pushes the heavy computation to
>> offline work; for that I could use the Hadoop part.
>> Besides that, there is an online part, which retrieves recommendations
>> from the pre-computed results, or even does some more computation online
>> to adjust the recommendations to the current user context.
>> But as you said for the JDBC connector, in order to serve recommendations
>> fast, the online recommender needs to have all the pre-computed results
>> in memory. So isn't that a limitation on scaling up? It means that as my
>> recommender service grows, I will need more and more memory to hold it
>> all in memory in the online part...
>> Am I wrong here?  
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:srowen@gmail.com] 
>> Sent: Thursday, March 22, 2012 17:57
>> To: user@mahout.apache.org
>> Subject: Re: Mahout beginner questions...
>>
>> A distributed and a non-distributed recommender are really quite
>> separate. They perform the same task in quite different ways. I don't
>> think you would mix them per se.
>>
>> Depends on what you mean by a model-based recommender... I would call
>> the matrix-factorization-based and clustering-based approaches
>> "model-based" in the sense that they assume the existence of some
>> underlying structure and discover it. There are no Bayesian-style
>> approaches in the code.
>>
>> They scale in different ways; I am not sure they are unilaterally a
>> solution to scale, no. I do agree in general that these have good
>> scaling properties for real-world use cases, like the
>> matrix-factorization approaches.
>>
>>
>> A "real" scalable architecture would have a real-time component and a
>> big distributed computation component. Mahout has elements of both and
>> can be the basis for piecing that together, but it's not a question of
>> strapping together the distributed and non-distributed implementation.
>> It's a bit harder than that.
>>
>>
>> I am actually quite close to being ready to show off something in this
>> area -- I have been working separately on a more complete rec system
>> that has both the real-time element but integrated directly with a
>> distributed element to handle the large-scale computation. I think
>> this is typical of big data architectures. You have (at least) a
>> real-time distributed "Serving Layer" and a big distributed batch
>> "Computation Layer". More on this in about... 2 weeks.
>>
>>
>> On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <oren.razon@intel.com> wrote:
>>> Hi Sean,
>>> Thanks for your fast response. I really appreciate the quality of your
>>> book ("Mahout in Action") and the support you give in forums like this.
>>> Just to clarify my second question...
>>> I want to build a recommender framework that will support different use
>>> cases. So my intention is to have both a distributed and a non-distributed
>>> solution in one framework. The question is: is it a good design to put
>>> them both on the same machine (one of the machines in the Hadoop cluster)?
>>>
>>> BTW... another question: it seems that a good solution to recommender
>>> scalability would be to use model-based recommenders.
>>> That said, I wonder why there are so few model-based recommenders,
>>> especially considering that Mahout already contains several data mining
>>> models?
>>>
>>>
>>> -----Original Message-----
>>> From: Sean Owen [mailto:srowen@gmail.com]
>>> Sent: Thursday, March 22, 2012 13:51
>>> To: user@mahout.apache.org
>>> Subject: Re: Mahout beginner questions...
>>>
>>> 1. These are the JDBC-related classes. For example see
>>> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/
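>>>
>>> (For illustration only, assuming a MySQL DataSource and the default
>>> taste_preferences table layout, wiring it up looks roughly like this:)
>>>
>>> import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;
>>> import org.apache.mahout.cf.taste.common.TasteException;
>>> import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
>>> import org.apache.mahout.cf.taste.model.DataModel;
>>>
>>> public class JdbcModelExample {
>>>   public static DataModel create() throws TasteException {
>>>     MysqlDataSource dataSource = new MysqlDataSource();
>>>     dataSource.setServerName("localhost");    // placeholder connection details
>>>     dataSource.setDatabaseName("recommender");
>>>     dataSource.setUser("mahout");
>>>     dataSource.setPassword("secret");
>>>     // Expects the default taste_preferences (user_id, item_id, preference) table
>>>     return new MySQLJDBCDataModel(dataSource);
>>>   }
>>> }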
>>>
>>> 2. The distributed and non-distributed code are quite separate. At
>>> this scale I don't think you can use the non-distributed code to a
>>> meaningful degree. For example you could pre-compute item-item
>>> similarities over this data and use a non-distributed item-based
>>> recommender but you probably have enough items that this will strain
>>> memory. You would probably be looking at pre-computing recommendations
>>> in batch.
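>>>
>>> (As a sketch of that precomputation step, the distributed ItemSimilarityJob
>>> can be driven roughly like this; the paths, the similarity measure and the
>>> top-k cap are placeholders:)
>>>
>>> import org.apache.hadoop.util.ToolRunner;
>>> import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;
>>>
>>> public class PrecomputeSimilarities {
>>>   public static void main(String[] args) throws Exception {
>>>     // Input: userID,itemID,rating triples on HDFS; output: item-item similarity pairs
>>>     ToolRunner.run(new ItemSimilarityJob(), new String[] {
>>>         "--input", "/path/to/ratings",
>>>         "--output", "/path/to/similarities",
>>>         "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
>>>         "--maxSimilaritiesPerItem", "50"
>>>     });
>>>   }
>>> }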
>>>
>>> 3. I don't think Netezza will help much here. It's still not fast
>>> enough at this scale to use with a real-time recommender (nothing is).
>>> If it's just a place you store data to feed into Hadoop it's not
>>> adding value. All the JDBC-related integrations ultimately load data
>>> into memory and that's out of the question with 500M data points.
>>>
>>> I'd also suggest you have a think about whether you "really" have 500M
>>> data points. Often you can know that most of the data is noise or not
>>> useful, and can get useful recommendations on a fraction of the data
>>> (maybe 5M). That makes a lot of things easier.
>>>
>>> On Thu, Mar 22, 2012 at 11:35 AM, Razon, Oren <oren.razon@intel.com> wrote:
>>>> Hi,
>>>> As a data mining developer who needs to build a recommender engine POC
>>>> (proof of concept) to support several future use cases, I've found the
>>>> Mahout framework an appealing place to start. But as I'm new to Mahout
>>>> and Hadoop in general, I have a couple of questions...
>>>>
>>>> 1.      In "Mahout in Action", under section 3.2.5 (Database-based data),
>>>> it says: "...Several classes in Mahout's recommender implementation will
>>>> attempt to push computations into the database for performance...". I've
>>>> looked in the documentation and inside the code itself, but didn't find
>>>> any reference to which computations are pushed into the DB. Could you
>>>> please explain what can be done inside the DB?
>>>> 2.      My future use will include use cases with small-to-medium data
>>>> volumes (where I guess the non-distributed algorithms will do the job),
>>>> but also use cases that involve huge amounts of data (over 500,000,000
>>>> ratings). From my understanding this is where the distributed code should
>>>> come in handy. My question here is: since I will need to use both the
>>>> distributed & the non-distributed code, how could I build a good design here?
>>>>      Should I build two different solutions on different machines? Could
>>>> I do part of the job distributed (for example the similarity calculation)
>>>> and have the output used by the non-distributed code? Is it a BKM? Also,
>>>> if I deploy the entire Mahout code on a Hadoop environment, what does that
>>>> mean for the non-distributed code, will it all run as a separate Java
>>>> process on the name node?
>>>> 3.      As for now, besides the Hadoop cluster we are building, we have
>>>> some powerful SQL machines (a Netezza appliance) that can handle big
>>>> (structured) data and offer good integration with 3rd-party analytics
>>>> providers and Java development, but don't include a rich recommender
>>>> framework like Mahout. I'm trying to understand how I could utilize both
>>>> solutions (Netezza & Mahout) to handle big-data recommender system use
>>>> cases. I thought maybe to move the data into Netezza, do all the data
>>>> manipulation and transformation there, and in the end prepare a file
>>>> containing the classic data model structure needed by Mahout. But can you
>>>> think of a better solution / architecture? Maybe keeping the data only
>>>> inside Netezza and extracting it to Mahout using JDBC when needed? I will
>>>> be glad to hear your ideas :)
>>>>
>>>> Thanks,
>>>> Oren

