mahout-user mailing list archives

From "Sandra Clover" <>
Subject Re: Document size rules of thumb
Date Wed, 07 Oct 2009 16:37:43 GMT
Hi Robin,


    Thanks for the response. To answer your questions:


0. The setup is Mahout 0.1 & Hadoop 0.19.2 - I think I am using a
branch version. I am currently trying to install the trunk version.

1. The data I am trying to classify is from scientific papers -
essentially the abstract title, text and keywords of the paper -
example below.

2. No data source is under 300 characters

3. I am training using the Mahout naive Bayes classifier and am getting
a low misclassification rate, something like 1.67% - I’m quite happy
with that…

4. After I have trained the model, Robin, I use the Mahout naive Bayes
classify() method to classify new (unseen) data (with the classification
already known) - this is where I start to get problems - I get very poor
classification rates for new data: something like 82% incorrectly
classified.


To Summarise: I get very good results in training and very poor results
with new data.
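
The comparison described above (error on the training data versus error
on unseen data) can be sketched as follows. This is only an illustration
in Python, not Mahout code: `split_documents`, `error_rate` and the
`classify` callable are hypothetical names, and documents are assumed to
be (label, text) pairs.

```python
import random


def split_documents(docs, held_out_fraction=0.2, seed=0):
    """Shuffle (label, text) pairs and split them into training and held-out sets."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    # Number of documents kept aside and never shown to the trainer.
    n_held_out = int(len(shuffled) * held_out_fraction)
    cut = len(shuffled) - n_held_out
    return shuffled[:cut], shuffled[cut:]


def error_rate(classify, docs):
    """Fraction of (label, text) pairs for which classify(text) != label."""
    if not docs:
        raise ValueError("no documents to evaluate")
    wrong = sum(1 for label, text in docs if classify(text) != label)
    return wrong / len(docs)
```

Calling error_rate once with the training set and once with the held-out
set gives the two figures being compared here; a large gap between them
means the model does not generalise beyond the documents it was trained
on.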


I have posted on this before and it was suggested to me that I use the
trunk version. I am still working on that and will let you know if this
is successful and clears up this problem - it's tricky as there are many
jars missing after I downloaded it. Could be a bit smoother IMHO. Will
persevere. Any hints/comments here to help?


In the meantime (as I work on that) I was wondering could it be
something to do with the data itself? Perhaps I should use more papers
per file or increase the data per paper in the files? Any comments on


PS: Thank you for the fix on the “priority queue implementation of
hadoop” problem (which was addressed in another post) Robin. Perhaps
this fix will address the high error rates for the new data? Or perhaps
the trunk version will - am not sure. Still working on the
installation… Would appreciate your comments though on any/all of the




Example of data below (the class in this case is War):

War [Characteristics of war wound infection] War wounds are the most
complex type of non-targeted injuries due to uncontrolled tissue damage
of varied and multifold localizations, exposing sterile body areas to
contamination with a huge amount of bacteria. Wound contamination is
caused by both the host microflora and exogenous agents from the
environment (bullets, cloth fragments, dust, dirt, water) due to
destruction of the host protective barriers. War wounds are the
consequence of destructive effects of various types of projectiles, which
result in massive tissue devitalization, hematomas, and compromised
circulation with tissue ischemia or anoxia. This environment is highly
favorable for proliferation of bacteria and their invasion in the
surrounding tissue over a relatively short period of time. War wounds are
associated with a high risk of local and systemic infection. The
infection will develop unless a timely combined treatment is undertaken,
including surgical intervention within 6 hours of wounding and antibiotic
therapy administered immediately or at latest in 3 hours of wound
infliction. Time is a crucial factor in this type of targeted combined
treatment consisting of surgical debridement, appropriate empirical
antimicrobial therapy, and specific antitetanic prophylaxis. Apart from
exposure factors, there are a number of predisposing factors that favor
the development of polymicrobial aerobic-anaerobic infection. These are
shock, pain, blood loss, hypoxia, hematomas, type and amount of
traumatized tissue, age, and comorbidity factors in the wounded. The
determinants that define the spectrum of etiologic agents in contaminated
war wounds are: wound type, body region involved, time interval between
wounding and primary surgical treatment, climate factors, season,
geographical area, hygienic conditions, and patient habits. The etiologic
agents of infection include gram-positive aerobic cocci, i. e.
Staphylococcus spp, Streptococcus spp and Enterococcus spp, which belong
to the physiological flora of the human skin and mucosa; gram-negative
facultative aerobic rods; members of the family Enterobacteriacea
(Escherichia coil, Proteus mirabilis, Klebsiella pneumoniae, Enterobacter
cloacae), which predominate in the physiological flora of the intestines,
transitory flora of the skin and environment; gram-negative bacteria, i.
e. Pseudomonas aeruginosa, Serratia marcescens, Acinetobacter
calcoaceticus - A. baumanii complex; environmental bacteria associated
with humid environment and dust; anaerobic gram-positive sporogeneous
rods Clostridium spp, gram-negative asporogeneous rods Bacteroides spp
and gram-positive anaerobic cocci; Peptostreptococcus spp and Peptococcus
spp. The latter usually colonize the intestine, primarily the colon, and
the skin, while clostridium spores are also found in the environment.
Early empirical antibiotic therapy is used instead of standard antibiotic
prophylaxis. Empirical antimicrobial therapy is administered to prevent
the development of systemic infection, gas gangrene, necrotizing
infection of soft tissue, intoxication and death. The choice of
antibiotics is determined by the presumed infective agents and
localization of the wound. It is used in all types of war wounds over
5-7-10 days. The characteristics of antibiotics used in war wounds are
the following: broad spectrum of activity, ability to penetrate deep into
the tissue, low toxicity, long half-life, easy storage and application,
and cost effectiveness. The use of antibiotics is not a substitution for
surgical treatment. The expected incidence of infection, according to
literature data, is 35%-40%. If the time elapsed until surgical
debridement exceeds 12 hours, or the administration of antibiotics
exceeds 6 hours of wound infliction, primary infection of the war wound
occurs (early infection) in more than 50% of cases. The keys for the
prevention of infection are prompt and thorough surgical exploration of
the wound, administration of antibiotics and antitetanic prophylaxis,
awareness of the probable pathogens with respect to localization of the
wound, and optimal choice of antibiotics and length of their
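
A minimal sketch of how a document like the one above could be read and
checked against the 300-character minimum mentioned in point 2. It
assumes, based only on this example, that the first whitespace-separated
token is the class label and the rest is the text; the function names
are hypothetical and this is plain Python, not the Mahout input reader.

```python
def parse_labelled_document(line):
    """Split '<label> <text>' into a (label, text) pair."""
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        raise ValueError("expected '<label> <text>'")
    label, text = parts
    return label, text


def too_short(docs, min_chars=300):
    """Return the (label, text) pairs whose text has fewer than min_chars characters."""
    return [(label, text) for label, text in docs if len(text) < min_chars]
```

For the example above, parse_labelled_document would give the label
"War" and the remainder of the line as the text.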

  ----- Original Message -----
  From: "Robin Anil"
  Subject: Re: Document size rules of thumb
  Date: Wed, 7 Oct 2009 18:00:58 +0530

  Hi Sandra, Could you explain your setup, what kind of a dataset it
  Mahout Naive Bayes/CBayes (not Bayesian network) classifier is built
  text articles or documents in mind. The characteristics might change
  if the
  document you wish to classify is 140 char sms or twitter
  affect much though). Could you tell me what kind of results are you
  then by looking at the data and the scores generated we can see what
  to tune

  On Wed, Oct 7, 2009 at 4:58 PM, Sandra Clover wrote:

  > Hi, Just wondering do you have any nice rules-of-thumb or any other
  > guides (characteristics) as to the minimum size of the documents
  > used in training the complementary Bayesian network? I would
  > appreciate any
  > comments/views/opinions/rules-of-thumb/experiences that you may be
  > to offer on good characteristics of the documents that go into
  > (particularly when you have a large number of categories to
  > classify)... Thanking you, Sandra.
