mahout-user mailing list archives

From David Starina <david.star...@gmail.com>
Subject Re: LDA - help me understand
Date Thu, 10 Mar 2016 16:39:36 GMT
There is one more weird thing I cannot understand ...

When running only one iteration of LDA, the iteration took 88 seconds. When
running 20 iterations with exactly the same code, on the same documents, with
the same parameters, it took 8683 seconds - which is 434 seconds per
iteration. Is there something I don't understand about this algorithm? Why
would each iteration take that much longer just because more iterations are
run?

--David

On Thu, Mar 10, 2016 at 2:24 PM, David Starina <david.starina@gmail.com>
wrote:

> How does the memory requirement grow with the number of topics? A little
> experimentation shows me that the number of documents doesn't matter as
> much as the number of topics ... Does the memory requirement grow
> exponentially with the number of topics?
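For scale: to my understanding, CVB0's topic-term model is a dense numTopics x numTerms matrix of doubles that each mapper holds in memory, so the requirement should grow roughly linearly with the topic count, not exponentially. A back-of-envelope sketch, under the assumption (not a measured figure) of 8-byte doubles and two resident copies of the model (a read copy and a write copy):

```java
public class LdaMemoryEstimate {
    // Assumption (not stated in the thread): the CVB0 mapper holds the
    // topic-term model as a dense numTopics x numTerms matrix of 8-byte
    // doubles, with `copies` full copies resident during training.
    static long modelBytes(int numTopics, int numTerms, int copies) {
        return (long) numTopics * numTerms * 8L * copies;
    }

    public static void main(String[] args) {
        // 60,000 terms, as in the corpus described in this thread.
        for (int k : new int[] {20, 100}) {
            long mb = modelBytes(k, 60_000, 2) / (1024L * 1024L);
            System.out.println(k + " topics: ~" + mb + " MB");
        }
    }
}
```

On this estimate, 100 topics needs five times the model memory of 20 topics - a large jump in absolute terms inside a 1 GB container, but still linear growth.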
>
> --David
>
> On Thu, Mar 10, 2016 at 11:43 AM, David Starina <david.starina@gmail.com>
> wrote:
>
>> Hi,
>>
>> I realize MapReduce algorithms are not the "hot new stuff" anymore, but I
>> am playing around with LDA. I am having some memory problems - can you
>> suggest how to set the parameters to make this work?
>>
>> I am running on a virtual cluster on my laptop - two nodes with 3 GB of
>> memory each - just to prepare before I try this on a physical cluster with
>> a much larger data set. I am using a data set of 500 documents, averaging
>> around 120 kB each, with roughly 60,000 terms. Running this with 20 topics
>> works ok - but with 100 topics, I run out of memory (on the mappers). Can
>> you suggest how to set the parameters so that the job runs more mappers,
>> each consuming less memory?
>>
>> The error I get:
>>
>> Task Id : attempt_1457214584155_0074_m_000000_1, Status : FAILED
>> Container [pid=26283,containerID=container_1457214584155_0074_01_000003]
>> is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB
>> physical memory used; 1.7 GB of 2.1 GB virtual memory used. Killing
>> container.
>>
>> These are the parameters I set for CVB0Driver:
>>
>> static int numTopics = 100;
>> static double doc_topic_smoothening = 0.5;
>> static double term_topic_smoothening = 0.5;
>>
>> static int maxIter = 3;
>> static int iteration_block_size = 10;
>> static double convergenceDelta = 0;
>> static float testFraction = 0.0f;
>> static int numTrainThreads = 4;
>> static int numUpdateThreads = 1;
>> static int maxItersPerDoc = 3;
>> static int numReduceTasks = 10;
>> static boolean backfillPerplexity = false;
>>
>> Any suggestions? Should I enlarge the container size on Hadoop, or can I
>> fix this with the LDA parameters?
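One way to enlarge the containers, sketched here under the assumption of a Hadoop 2.x / YARN cluster, is to raise the map container size and its JVM heap together. The property names below are standard Hadoop 2.x; the values are only examples sized for a 3 GB node, and they can also be passed per job as -D flags instead of editing the site file:

```xml
<!-- mapred-site.xml fragment: example values only -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>  <!-- YARN container cap for each map task -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>  <!-- JVM heap, roughly 80% of the container -->
</property>
```

The "1.0 GB of 1 GB physical memory used" line in the error above corresponds to the default 1024 MB container, so doubling it (heap included) is the usual first step before tuning the LDA parameters themselves.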
>>
>> Cheers,
>> David
>>
>>
>
