jakarta-oro-user mailing list archives

From "Kevin Markey" <kmar...@silvercreeksystems.com>
Subject RE: Is this a bug with oro?
Date Thu, 02 Apr 2009 13:50:03 GMT
I've taken the liberty of reconstructing your WHOLE regex.  Here it is...

(([a-zA-Z0-9_\\-]+)(\\s*=\\s*(\"(.*?)\"|'(.*?)'|([^'\">\\s]+)))?)

As predicted, one of your main groups is optional.  (I can't recall ALL the rules for Oro's
numbering of nested parentheses.  Like Perl, it's dynamic, and depends on the existence of
which groups are recognized.  I avoid such complexities.  You should, too.  See below.)

I suggest you insert this in Perl and run your input through it.  Print out everything that
the whole thing recognizes.  Also print and label each group.  It will show that different
numbers of groups are recognized and that YOUR expectations of which group is which are NOT
what IT expects!!!  Your expression is too complicated for you (or me) to debug.  There are
up to 7 capturing groups here!!!
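
A minimal sketch of that labelled dump, in Java rather than Perl. It assumes java.util.regex instead of ORO so that it runs without extra jars; the class name GroupDump and the sample input are made up for illustration. In java.util.regex, group numbers are fixed by the order of the opening parentheses, and a group that took no part in the match comes back as null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDump {
    // The reconstructed expression from this thread, unchanged.
    static final Pattern NVPS = Pattern.compile(
        "(([a-zA-Z0-9_\\-]+)(\\s*=\\s*(\"(.*?)\"|'(.*?)'|([^'\">\\s]+)))?)");

    /** Returns one labelled line per group (0 through 7) for every match in the input. */
    static List<String> dump(String input) {
        List<String> lines = new ArrayList<>();
        Matcher m = NVPS.matcher(input);
        while (m.find()) {
            for (int g = 0; g <= m.groupCount(); g++) {
                // group(g) is null when that group did not participate in this match
                lines.add("group " + g + " = " + m.group(g));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical attribute string: quoted, bare, single-quoted and plain values.
        dump("name=\"q\" checked value='x' size=10").forEach(System.out::println);
    }
}
```

For the bare attribute "checked", groups 0 through 2 all hold "checked" and groups 3 through 7 are null, because the optional "= value" part never matched. Any code that reads one of those groups without a null check will fail on such input.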

Suggestion.  Use a NONCAPTURING GROUP, i.e., (?:pattern), wherever you don't need to capture
or where the group is optional.  And make each group you intend to capture (and number)
REQUIRED.  If there are two overall patterns you need to test, test them independently.
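
One possible way to apply this suggestion, sketched with java.util.regex rather than ORO so it runs standalone (an assumption, as are the names AttributePairs, PAIR, parse and stripQuotes). The pattern keeps only two capturing groups, name and raw value, and both are required; the leading boundary test uses (?:...) so it does not take a number. Quotes are stripped in code instead of with extra nested groups.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributePairs {
    // Group 1 = attribute name, group 2 = raw value including any quotes.
    // Both groups must match for the pattern to match at all.
    static final Pattern PAIR = Pattern.compile(
        "(?:^|\\s)([a-zA-Z0-9_\\-]+)\\s*=\\s*(\".*?\"|'.*?'|[^'\">\\s]+)");

    /** Collects name=value pairs in order; bare attributes without a value are skipped. */
    static Map<String, String> parse(String attrs) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(attrs);
        while (m.find()) {
            out.put(m.group(1), stripQuotes(m.group(2)));
        }
        return out;
    }

    /** Removes one pair of matching single or double quotes, if present. */
    static String stripQuotes(String v) {
        if (v.length() >= 2) {
            char first = v.charAt(0);
            char last = v.charAt(v.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                return v.substring(1, v.length() - 1);
            }
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(parse("type=\"hidden\" name='sid' value=42 checked"));
    }
}
```

Bare attributes such as "checked" are deliberately not matched here. If they matter, a second, separate pattern with a single required group for the name can be run independently, in line with the advice above.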

2nd suggestion.  Use an open source HTML parser instead.  They've already solved this problem.

Final conclusion:  There ain't a bug in Oro.  The bug is in your logic.

Enjoy.
Kevin

-----Original Message-----
From: Balaji [mailto:balaji.prabakaran@listertechnologies.com]
Sent: Thu 4/2/2009 4:14 AM
To: Kevin Markey; oro-user@jakarta.apache.org
Subject: RE: Is this a bug with oro?
 
Hi Kevin,
 
Apologies for leaving out the diTag pattern. Here it is:
 
 private static String start  = "<" ;
 private static String tagNames =
"(form|input\\s+|head|/?select\\s+|option\\s+|textarea\\s+" + 
 "|checkboxgroup\\s+|radiogroup|/?optintrue){1}" ;
 private static String anything = "([^>]*)" ;
 private static String end   = "[/]*>" ;
 private static Pattern  diTag;
 private static String attribute = "[a-zA-Z0-9_\\-]+" ;
 private static String optWS  = "\\s*" ;
 private static String dquoted  = "\"(.*?)\"" ;
 private static String squoted  = "'(.*?)'" ;
 private static String plain  = "([^'\">\\s]+)" ;
 private static Pattern  nvps ;
 private static PatternMatcher primaryMatcher = new Perl5Matcher() ;
 private static PatternCompiler compiler = new Perl5Compiler() ;

   diTag = compiler.compile( start + tagNames + anything + end ,
     Perl5Compiler.CASE_INSENSITIVE_MASK ) ;

   nvps = compiler.compile( "((" + attribute + ")" + "(" + optWS
     + "=" + optWS + "(" + dquoted
     + "|" + squoted + "|" + plain
     + "))?)" ); 

 
The different failure scenarios that you mentioned should fail
consistently (for the same input), correct?
In this case, for the same input, the NPE occurs only occasionally. Here the
input is an HTML file read over HTTP. Do you think the NPE can occur when
the HTML is not available for some reason (network issue, etc.)?
 
Thanks,
Balaji Prabhakaran 
  _____  

From: Kevin Markey [mailto:kmarkey@silvercreeksystems.com] 
Sent: Tuesday, March 31, 2009 11:31 PM
To: ORO Users List; oro-user@jakarta.apache.org;
balaji.prabakaran@listertechnologies.com
Cc: Kevin Markey
Subject: RE: Is this a bug with oro?



One more thing to do for your diagnostics.  Do these so you can identify
where in __setLastMatchResult() you fail.

- Get the source, recompile the jar with debugging information so you get
the line number.
- Turn off any obfuscation.

Also provide the diTag pattern that is used when this fails.  (I don't see
it defined in your snippet.)  That is key. 

Still, I have a hunch...  The regex apparently has 2 groups.  I predict your
pattern allows a match **without** matching the groups.  As a result,
__originalInput is reset to null at the conclusion of __setLastMatchResult()
after matching the 1st group, setting off the NPE on the next iteration of your
WHILE loop, or the __beginGroupOffset or __endGroupOffset or
__endMatchOffsets arrays might be null.  I'm not totally familiar with the
source code, but I've used it for several years, and these are the things
that typically fail.

B.t.w., 2.0.6 and 2.0.8 are not substantially different in these regards.

So, make sure that BOTH groups are required in your regex.

Kevin

-----Original Message-----
From: Balaji [mailto:balaji.prabakaran@listertechnologies.com]
Sent: Tue 3/31/2009 9:09 AM
To: oro-user@jakarta.apache.org
Subject: RE: Is this a bug with oro?

Hi Kevin,

Thanks a lot for your reply. I highly appreciate your help. Here are the
required details.

The version is 2.0.8

The context is this: I am trying to read an HTML file over HTTP and parse the
values of some hidden attributes in the HTML form.

Here is the code. The exception occurs at the line marked below. It occurs
randomly and is not reproducible at will.
The string passed to contains() is never null, and the result is always checked
for true before calling getMatch(). Please check whether I am missing something.

****************** class that contains the code that throws the exception ************
public class Parser
{

 private static Pattern  diTag;
 private static PatternMatcher primaryMatcher = new Perl5Matcher() ;
 private static PatternCompiler compiler = new Perl5Compiler() ;

 public static void initialize(){
  .
  .
  .
 }
 public Parser( StringBuffer input)
 {
  this.input = input ;
 }
 public Vector parse()
 {
  Vector returnValue=null;
  PatternMatcherInput patternMatcherInput = new PatternMatcherInput(input.toString());
  int previous = 0 ;
  while(primaryMatcher.contains(patternMatcherInput,diTag))
  {
   MatchResult result = primaryMatcher.getMatch();  //exception is thrown here....
   String dataString = input.substring(previous,patternMatcherInput.getMatchBeginOffset());
   String tag = result.group(1);
   String inputS = result.group(2);
   try
   {
    returnValue=processDITag( tag.toUpperCase(),inputS ) ;
    previous = patternMatcherInput.getCurrentOffset() ;
   }
   catch(NotHandledException nh)
   {
    previous = patternMatcherInput.getMatchBeginOffset() ;
   }
  }
  return returnValue;
 }

 public Vector processDITag( String tag, String inputString ) throws NotHandledException
 {
  .
  .
  .
 }
}

 
****************** code that calls the method in the above class ******************
  // reads the data from an HTML file over HTTP
  diHTML = readInputFile(queryParametersBean.getSurveyName());
 
      
  if(diHTML.length()==0)
  {
   LogWriter.info(CLASS_NAME,"loadPageEvent(HttpServletRequest req)","The file name is not available" + sHtmlPath);
   sFileName=ConfigBean.getProperty(sSerPathFileName); // replace with exact file name
   sFileName=sFilePath + sFileName;
   queryParametersBean.setSurveyName(sFileName);
   diHTML = readInputFile(queryParametersBean.getSurveyName());
   LogWriter.info(CLASS_NAME,"loadPageEvent(HttpServletRequest req)","The file name from config file" + sFileName);
  }
  if(diHTML.length()==0) {
   LogWriter.info(CLASS_NAME,"loadPageEvent(HttpServletRequest req)","The file name is not in akamai server");
  }
  else {
   if(!( queryParametersBean.getEmail() != null &&
         queryParametersBean.getEmail().length() != 0 &&
         (ProcessorSupport.validateEmailAddress(queryParametersBean.getEmail())==false) &&
         diHTML.length() !=0))
         { 
    LogWriter.info(CLASS_NAME,"loadPageEvent(HttpServletRequest req)","queryParametersBean track page load " + queryParametersBean.getEmail());
    System.out.println("inside load event");
    Parser myParser = new Parser(diHTML, queryParameters) ;
    Vector resultString=myParser.parse();
    Iterator itrelements=resultString.iterator();
    .
    .
    .
        }
    }
*************************************************************************************************************

Thanks,
Balaji Prabhakaran  _____ 

From: Kevin Markey [mailto:kmarkey@silvercreeksystems.com]
Sent: Tuesday, March 31, 2009 6:48 PM
To: ORO Users List; oro-user@jakarta.apache.org;
balaji.prabakaran@listertechnologies.com
Subject: RE: Is this a bug with oro?



Some context and code in which this fails and data with which this fails
would help.
Also the version you are using would help.

However, inspecting 2.0.6 code (which is the most handy on the machine I'm
on -- I suspect other code is similar),
there is only one place in __setLastMatchResult() where you can get a NPE.
__lastMatchResult is non-null.  OpCode is non-null.  However,
__originalInput MIGHT be null.  Hence you can get a NPE where the
__originalInput.length is tested.  Check in your code whether the string passed
to contains() is null, and always check that the result is true.

E.g.,

private PatternCompiler m_compiler = new Perl5Compiler();
private PatternMatcher m_matcher = new Perl5Matcher();
private Pattern m_commentRegex = m_compiler.compile ( "#" );

/** Extract comment from string. */
public String findComment ( String s )
{
   if ( s == null ) return null;
   if ( m_matcher.contains ( s, m_commentRegex ) )
   {
      MatchResult result = m_matcher.getMatch();
      String comment = s.substring ( result.endOffset(0) );
      return comment;
   }
   return null;
}

Enjoy.
Kevin Markey

-----Original Message-----
From: Balaji [mailto:balaji.prabakaran@listertechnologies.com]
Sent: Tue 3/31/2009 6:22 AM
To: oro-user@jakarta.apache.org
Subject: Is this a bug with oro?

Hello,

I occasionally get the exception below. The call to getMatch() is causing a
NullPointerException.

Caused by: java.lang.NullPointerException
    at org.apache.oro.text.regex.Perl5Matcher.__setLastMatchResult(Unknown
Source)
    at org.apache.oro.text.regex.Perl5Matcher.getMatch(Unknown Source)

Here is what the API documentation says,
A MatchResult instance containing the pattern match found by the last call
to any one of the matches() or contains() methods. If no match was found by
the last call, returns null.

I believe this is a bug. Can you guys, please confirm?
If so, is there a fix or a workaround for this bug?

Any help will be greatly appreciated.

Thanks,
Balaji Prabhakaran








