mahout-user mailing list archives

From David Starina <david.star...@gmail.com>
Subject Re: LDA - help me understand
Date Thu, 10 Mar 2016 16:55:29 GMT
About the last question: it probably has something to do with setting the
max iterations and max iterations per document to the same value ... What
is the "number of iterations per document" really doing?

--David

On Thu, Mar 10, 2016 at 5:39 PM, David Starina <david.starina@gmail.com>
wrote:

> There is one more weird thing I cannot understand ...
>
> When running only one iteration of LDA, the iteration took 88 seconds.
> When running 20 iterations with exactly the same code, on the same
> documents, with the same parameters, it took 8683 seconds - that is 434
> seconds per iteration. Is there something I don't understand about this
> algorithm? Why would a single iteration take that much longer just because
> more iterations are run?
>
> --David
>
> On Thu, Mar 10, 2016 at 2:24 PM, David Starina <david.starina@gmail.com>
> wrote:
>
>> How does the memory requirement grow with the number of topics? A little
>> experimentation shows me that the number of documents doesn't matter as
>> much as the number of topics ... Does the memory requirement grow
>> exponentially with the number of topics?
>>
>> --David
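A back-of-the-envelope check suggests linear, not exponential, growth. My assumption (not confirmed against the Mahout source) is that each mapper holds the topic-term model as a dense numTerms x numTopics matrix of doubles, in a read copy and a write copy:

```java
public class LdaModelMemoryEstimate {
    public static void main(String[] args) {
        int numTerms = 60_000;      // vocabulary size of my corpus
        int bytesPerDouble = 8;
        int modelCopies = 2;        // assumption: one read + one write model per mapper

        for (int numTopics : new int[] {20, 100}) {
            // Dense topic-term model: numTerms x numTopics doubles, per copy.
            long bytes = (long) numTerms * numTopics * bytesPerDouble * modelCopies;
            System.out.printf("%d topics -> ~%d MB per mapper%n",
                    numTopics, bytes / (1024 * 1024));
        }
    }
}
```

Under that assumption, 20 topics needs on the order of 18 MB for the model copies and 100 topics about 91 MB - five times as much, i.e. linear in the topic count - so the out-of-memory at 100 topics would be the model plus per-document working state no longer fitting in a 1 GB container, rather than any exponential blow-up.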
>>
>> On Thu, Mar 10, 2016 at 11:43 AM, David Starina <david.starina@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I realize MapReduce algorithms are not the "hot new stuff" anymore, but
>>> I am playing around with LDA. I am having some memory problems - can you
>>> suggest how to set the parameters to make this work?
>>>
>>> I am running on a virtual cluster on my laptop - two nodes with 3 GB of
>>> memory each - just to prepare before I try this on a physical cluster with
>>> a much larger data set. I am using a data set of 500 documents, averaging
>>> around 120 kB each, with roughly 60,000 terms. Running this with 20 topics
>>> works fine - but with 100 topics, I run out of memory (on the mappers).
>>> Can you suggest how to set the parameters so the job runs more mappers,
>>> each consuming less memory?
>>>
>>> The error I get:
>>> Task Id : attempt_1457214584155_0074_m_000000_1, Status : FAILED
>>> Container [pid=26283,containerID=container_1457214584155_0074_01_000003]
>>> is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB
>>> physical memory used; 1.7 GB of 2.1 GB virtual memory used. Killing
>>> container.
>>>
>>> These are the parameters I set for CVB0Driver:
>>>
>>> static int numTopics = 100;
>>> static double doc_topic_smoothening = 0.5;
>>> static double term_topic_smoothening = 0.5;
>>>
>>> static int maxIter = 3;
>>> static int iteration_block_size = 10;
>>> static double convergenceDelta = 0;
>>> static float testFraction = 0.0f;
>>> static int numTrainThreads = 4;
>>> static int numUpdateThreads = 1;
>>> static int maxItersPerDoc = 3;
>>> static int numReduceTasks = 10;
>>> static boolean backfillPerplexity = false;
>>>
>>> Any suggestions? Should I enlarge the container size on Hadoop, or can I
>>> fix this with the LDA parameters?
>>>
>>> Cheers,
>>> David
>>>
>>>
>>
>
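On my own container-size question above: one way to enlarge the mapper containers on the Hadoop side is to set the map memory properties on the job Configuration before handing it to the driver. A sketch, assuming Hadoop 2.x / YARN property names; the values are illustrative and have to fit within what the 3 GB nodes can actually hold:

```java
import org.apache.hadoop.conf.Configuration;

public class ContainerSizeConfig {
    public static Configuration withBiggerMappers() {
        Configuration conf = new Configuration();
        // Assumption: Hadoop 2.x / YARN property names.
        conf.set("mapreduce.map.memory.mb", "2048");       // 2 GB YARN container per map task
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");  // JVM heap at ~80% of the container
        // This conf would then be passed to CVB0Driver in place of the default one.
        return conf;
    }
}
```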
