mahout-user mailing list archives

From Matt Molek <mpmo...@gmail.com>
Subject Re: Using multiple reducers with rowsimilarity job
Date Mon, 05 Nov 2012 16:22:17 GMT
Having found a few mentions of running rowsimilarity with multiple
reducers, I assume it's ok.

I'm having a problem with the RowSimilarityJob-CooccurrencesMapper-Reducer
job, though. I'm running over a data set of ~5 million entries x ~3 million
boolean features, where each entry has no more than 10 non-zeros. With 256
mappers, ~95% of them finish within 10 minutes. The last 5% get stuck at
arbitrary levels of completeness, like 44.47%, and just sit there for hours,
spilling more and more output without the completeness counter ever
increasing. Eventually, after as long as 8 hours, they jump to 100%, merge
their output, and finish.

It's usually the early map tasks that have trouble. Right now all tasks are
done except mappers 0-4, which are stuck at various states of
completeness.

Is there something about the ordering of the output of the
RowSimilarityJob-VectorNormMapper-Reducer job that would consistently cause
the early map tasks of the RowSimilarityJob-CooccurrencesMapper-Reducer job
to take forever? Is there any tuning I can do to distribute this load more
evenly, so that 5% of my mappers don't slow the whole job down so badly?
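
One way to check whether skew in the feature columns could explain the
stragglers is sketched below. It rests on an assumption that is not confirmed
anywhere in this thread: that the cooccurrence step does work roughly
proportional to the number of entry pairs sharing a feature, i.e. about
n*(n-1)/2 for a feature with n non-zeros. The function names and the toy data
are hypothetical and only illustrate the counting.

```python
from collections import Counter


def pairs_per_feature(rows):
    """Count, for each feature id, how many pairs of entries share it.

    rows: iterable of iterables of feature ids (the non-zero boolean
    features of each entry). Assumes pairwise work is n*(n-1)/2 for a
    feature that is set in n entries.
    """
    occurrences = Counter()
    for features in rows:
        # set() guards against a feature id listed twice in one entry
        occurrences.update(set(features))
    return Counter({f: n * (n - 1) // 2 for f, n in occurrences.items()})


def top_share(pair_counts, k=1):
    """Fraction of all pairs that come from the k heaviest features."""
    total = sum(pair_counts.values())
    if total == 0:
        return 0.0
    heaviest = sum(count for _, count in pair_counts.most_common(k))
    return heaviest / total


# Toy data: feature 0 is set in four of the five entries.
rows = [[0, 1], [0, 2], [0, 3], [0, 1], [4]]
pair_counts = pairs_per_feature(rows)
share = top_share(pair_counts, k=1)
```

If a handful of features account for most of the pairs on the real data, the
few map tasks that receive those features would be expected to run far longer
than the rest, which would match the pattern described above.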
