mahout-user mailing list archives

From Alan Gardner <gard...@pythian.com>
Subject LDA/CVB Performance
Date Thu, 13 Jun 2013 15:54:45 GMT
I'm doing a POC of LDA in Mahout on a dataset of about 500,000 documents
with 1.6 million unique terms (document length is highly variable, up to a
few thousand unique terms per document).

The weirdest behaviour I'm seeing is that the multithreaded training Map
task only utilizes one core on an eight-core node. I'm not sure whether this
is configurable in the JVM parameters or the job config. In the meantime
I've set the input split very small, so that I can run 8 parallel
single-threaded training mappers per node. Should I be configuring this
differently?

I also wanted to check in and verify that the performance I'm seeing is
typical:

- on a six-node cluster (48 map slots, 8 cores per node) running full tilt,
each iteration takes about 7 hours. I assume the problem is just that our
cluster is far too small, and that the performance will scale if I make the
splits even smaller and distribute the job across more nodes.

- with an 8 GB heap I can't exceed about 200 topics before running out of
heap space. I tried making the Map input smaller, but that didn't help. Can
someone describe how memory usage scales per mapper in terms of topics,
documents and terms?
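
One rough way to reason about the heap limit is a back-of-envelope
estimate. The sketch below assumes each mapper holds the model as dense
topic-by-term matrices of 8-byte doubles, and that it may keep more than
one copy (for example one to read from and one to update). Both points are
assumptions for illustration and are not confirmed anywhere in this
message. The function names are hypothetical.

```python
# Back-of-envelope heap estimate for a dense topic-term model.
# Assumptions (not from the original message): the model is stored as dense
# matrices of 8-byte doubles, and each mapper keeps `copies` of it in memory.

BYTES_PER_DOUBLE = 8
GIB = 1024 ** 3


def dense_model_bytes(num_topics, num_terms, copies=1):
    """Bytes needed for `copies` dense num_topics x num_terms double matrices."""
    return num_topics * num_terms * BYTES_PER_DOUBLE * copies


def max_topics(heap_bytes, num_terms, copies=1):
    """Largest topic count whose dense model(s) fit in heap_bytes, ignoring all other overhead."""
    return heap_bytes // (num_terms * BYTES_PER_DOUBLE * copies)


NUM_TERMS = 1_600_000  # unique terms, from the message

if __name__ == "__main__":
    for copies in (1, 2):
        size = dense_model_bytes(200, NUM_TERMS, copies)
        print(f"{copies} model copy/copies at 200 topics: {size / GIB:.2f} GiB")
    print("upper bound on topics in an 8 GiB heap, 2 copies:",
          max_topics(8 * GIB, NUM_TERMS, copies=2))
```

Under these assumptions the model size grows with topics x terms and does
not depend on the number of documents in the split, which would be
consistent with a smaller Map input not reducing heap usage. The estimate
ignores JVM overhead and any per-document working memory.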

Thanks
-- 
Alan Gardner
Solutions Architect - CTO Office

gardner@pythian.com | LinkedIn:
http://www.linkedin.com/profile/view?id=65508699 |
@alanctgardner<https://twitter.com/alanctgardner>
Tel: +1 613 565 8696 x1218
Mobile: +1 613 897 5655
