mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: LDA/CVB Performance
Date Thu, 13 Jun 2013 16:50:19 GMT
On Thu, Jun 13, 2013 at 9:43 AM, Andy Schlaikjer <andrew.schlaikjer@gmail.com> wrote:

> Hi Alan,
>
> On Thu, Jun 13, 2013 at 8:54 AM, Alan Gardner <gardner@pythian.com> wrote:
>
> > The weirdest behaviour I'm seeing is that the multithreaded training Map
> > task only utilizes one core on an eight-core node. I'm not sure if this is
> > configurable in the JVM parameters or the job config. In the meantime I've
> > set the input split very small, so that I can run 8 parallel 1-thread
> > training mappers per node. Should I be configuring this differently?
> >
>
> At my office it's generally frowned upon to run MR tasks which attempt to
> use many cores on a multicore machine, because our cluster configuration
> forces the number of map/reduce slots to sum to the number of cores. If
> multiple multi-threaded task attempts run on the same node, CPU load may
> spike and negatively affect performance of all task attempts on the node.
>
>
> > I also wanted to check in and verify that the performance I'm seeing is
> > typical:
> >
> > - on a six-node cluster (48 map slots, 8 cores per node) running full tilt,
> > each iteration takes about 7 hours. I assume the problem is just that our
> > cluster is far too small, and that the performance will scale if I make the
> > splits even smaller and distribute the job across more nodes.
> >
>
> How many input splits are generated for your input doc-term matrix? In each
> task attempt, how many rows are processed? Make sure input is balanced
> across all map tasks.
>
>
> > - with an 8GB heap size I can't exceed about 200 topics before running out
> > of heap space. I tried making the Map input smaller, but that didn't seem
> > to help. Can someone describe how memory usage scales per mapper in terms
> > of topics, documents and terms?
> >
>
> The tasks need memory proportional to num topics x num terms. Do you have a
> full 8 GB heap for each task slot?
>

Andy, note that he said he's running with a 1.6M-term dictionary. That's
going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the term-topic matrices.
Still not hitting 8GB, but getting closer.
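That back-of-the-envelope number can be checked directly. Below is a minimal
sketch of the arithmetic, assuming two resident copies (read and write) of a
numTopics x numTerms matrix of doubles; the class and method names are just
for illustration, not Mahout API:

```java
// Rough heap estimate for the in-memory topic model during CVB training.
// Assumption: two numTopics x numTerms matrices of 8-byte doubles are held
// in memory at once (a read model and a write model).
public class CvbHeapEstimate {

    static double estimateGB(int numTopics, long numTerms) {
        long bytes = 2L * numTopics * numTerms * 8L; // 2 matrices of doubles
        return bytes / 1e9;
    }

    public static void main(String[] args) {
        // Alan's numbers: 200 topics, 1.6M-term dictionary
        System.out.printf("%.2f GB%n", estimateGB(200, 1_600_000L)); // 5.12 GB
    }
}
```

Note that the estimate scales linearly in both topics and dictionary size, so
either doubling the topic count or halving the dictionary moves the heap
requirement by the same factor.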

Do you really need 1.6M terms? With only 500k documents, you're probably
using a lot of terms which only occur 1-3 times throughout the corpus. If
you take terms which occur at least 5 times, you'll probably drop your dict
size by an order of magnitude, without much loss of usefulness.
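In practice Mahout's seq2sparse vectorizer exposes a --minSupport option for
exactly this kind of cutoff. The idea itself is just frequency counting; a
minimal sketch (class and method names are hypothetical, not Mahout API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of minimum-support pruning: count how often each term
// occurs across the corpus and drop terms below the cutoff.
public class DictPruner {

    static Map<String, Long> prune(List<String> corpusTokens, long minSupport) {
        Map<String, Long> counts = new HashMap<>();
        for (String t : corpusTokens) {
            counts.merge(t, 1L, Long::sum); // accumulate corpus-wide frequency
        }
        // keep only terms seen at least minSupport times
        counts.values().removeIf(c -> c < minSupport);
        return counts;
    }
}
```

With a minSupport of 5 on a 500k-document corpus, the long tail of 1-3
occurrence terms falls away, shrinking both the dictionary and the
term-topic matrices it sizes.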


>
> Cheers,
> Andy
>
> Twitter, Inc.
>



-- 

  -jake
