nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-133) ParserFactory does not work as expected
Date Wed, 07 Dec 2005 21:33:09 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359645 ] 

Chris A. Mattmann commented on NUTCH-133:
-----------------------------------------

Hi Stefan,

  Thanks for your reply. I actually like a lot of your proposed changes having to do with
the MimeType cleansing and the detection mechanism. I think that you should continue to work
with Jerome on that as he seems to be the guy who is the MimeType guru on the project.

With regards to your comment about "..First, I only rewrote classes where it was absolutely
necessary...", I respectfully disagree with this assertion. Take the TestParserFactory test
case class. Your patch almost completely removes its methods, however, all of the methods
in the TestParserFactory class that is currently committed in Nutch are in my mind extremely
relevant tests to ensure proper operation of the ParserFactory. Continuing on, your patch
completely removes the ParsePluginsReader class and replaces with with a class, "ParsePluginConf",
that does the exact same thing (i.e., reading the parse-plugins.xml file). Your changes in
the ParseUtil class are equally as perplexing. You remove the public final static Parse parse(Content
content) throws ParseException method, and replace it with a method public static Parse parse(Content
content) throws ParseException that seemingly does the exact same thing as the previous method?
The same is true for the public final static Parse parseByParserId(String parserId, Content
content) in the same class, it is replaced in your patch with seemingly the exact same method
minus the "final" declaration? 

These are the things I was talking about. This patch file, rather than just updating the current
sources with the true bits and pieces of your new contribution in places, completely removes
methods, files, etc., and then replaces them with much of the same code, at least in the ParseUtil,
ParsePluginsReader, etc. classes that I have looked at. 

This is not to say that the rest of your suggestions and bugs aren't worthwhile, nor your
suggestion to use the extensionId rather than the pluginId in the parse-plugins.xml file.
These ideas are definitely interesting and worth exploring amongst the whole group working
on these issues.

Thanks,
  Chris


> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt
>
> Marcel Schnippe detect a set of problems until working with different content and parser
types, we worked together to identify the problem source.
> From our point of view this described problems could be the source for many other problems
daily described in the mailing lists.
> Find a conclusion of the problems below.
> Problem:
> Some servers returns mixed case but correct header keys like 'Content-type' or 'content-Length'
 in the http response header.
> That's why for example a get("Content-Type") fails and a page is detected as zip using
the magic content type detection mechanism. 
> Also we note that this a common reason why pdf parsing fails since Content-Length does
return the correct value. 
> Sample:
> returns "text/HTML" or "application/PDF" or Content-length
> or this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First just write only lower case keys into the properties and later convert all keys
that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> use lower cases here and for all keys used to query header properties (also content-length)
change:  String key = line.substring(0, colonIndex); to  String key = line.substring(0, colonIndex)
.toLowerCase();
> Problem:
> MimeTypes based discovery (magic and url based) is only done in case the content type
was not delivered by the web server, this happens not that often, mostly this was a problem
with mixed case keys in the header.
> see:
>  public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>           contentType = type.getName();
>       } else {
>           contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content-type information as it is from the webserver and move the content type
discovering from Protocol plugins to the Component where the parsing is done - to the ParseFactory.
> Than just create a list of parsers for the content type returned by the server and the
custom detected content type. In the end we can iterate over all parser until we got a successfully
parsed status.
> Problem:
> Content will be parsed also if the protocol reports a exception and has a non successful
status, in such a case the content is new byte[0] in any case.
> Solution:
> Fetcher.java, line 243.
> Change:   if (!Fetcher.this.parsing ) { .. to 
>  if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>        // TODO we may should not write out here emthy parse text and parse date, i suggest
give outputpage a parameter parsed true / false
>           outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                 content, new ParseText(""),
>                 new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0],
new Properties()));
>         return null;
>       }
> Problem:
> Actually the configuration of parser is done based on plugin id's, but one plugin can
have several extentions, so  normally a plugin can provide several parser, but this is no
limited just wrong values are used in the configuration process. 
> Solution:
> Change plugin id to  extension id in the parser configuration file and also change this
code in the parser factory to use extension id's everywhere.
> Problem:
> there is not a clear differentiation between content type and mime type. 
> I'm notice that some plugins call metaData.get("Content-Type) or content.getContentType();
> Actually in theory this can return different values, since the content type could be
detected by the MimesTypes util and is not the same as delivered in the http response header.
> As mentioned actually content type is only detected by the MimeTypes util in case the
header does not contains any content type informations or had problems with mixed case keys.
> Solution:
> Take the content type property out of the meta data and clearly restrict the access of
this meta data into the own getter method.
> Problem:
> Most protocol plugins  checking if content type is null only in this case the MimeTypes
util is used. Since my patch move the mime type detection to the parser factory - where from
my point of view - is the right place, it is now unneccary code we can remove from the protocol
plugins. I never found a case where no content type was returned just mixed case keys was
used. 
> Solution. 
> Remove this detection code, since it is now in the parser factory.
> I didn't change this since more code I  change, I guess there is a  less chance to get
the patch into the sources, I suggest we open a low priority issue and once we change the
plugins we can remove it.
> Problem:
> This is not a problem, but a 'code smells' (Martin Fowler) There are empty test methods
in TestMimeType
>   /** Test of <code>getExtensions</code> method. */
>     public void testGetExtensions() {
>     }
> Solution:
> Implement these tests or remove the test methods.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


Mime
View raw message