nutch-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann
Date Thu, 15 Sep 2005 23:27:52 GMT
Hi Otis,

 Point taken. Since both convey the same information, I think it's okay to
support both, but by default we could, say, list the initial plugins in
parse-plugins.xml without the "order=" attribute and rely on their position
in the file. Fair enough?
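
To make that concrete, here is a minimal sketch (Python, purely
illustrative, not Nutch code) of how a loader could accept both forms. The
function name plugin_order and the tie-breaking rule are assumptions of
mine; the element and attribute names follow Otis's example quoted below. A
plugin with "order=" uses that rank, a plugin without it falls back to its
position in the file, and ties go to whichever plugin appears first.

```python
import xml.etree.ElementTree as ET


def plugin_order(mime_type_xml):
    """Return plugin ids for one <mimeType> element, most preferred first.

    Hypothetical helper: 'order' is optional, and document position is
    used both as the default rank and as the tie-breaker.
    """
    element = ET.fromstring(mime_type_xml)
    ranked = []
    # findall() yields direct children in document order.
    for position, plugin in enumerate(element.findall("plugin"), start=1):
        explicit = plugin.get("order")
        rank = int(explicit) if explicit is not None else position
        ranked.append((rank, position, plugin.get("id")))
    # Sort by rank first, then by position in the file.
    ranked.sort()
    return [plugin_id for _, _, plugin_id in ranked]


without_order = """
<mimeType name="*">
    <plugin id="parse-text"/>
    <plugin id="another-one-default-parser"/>
</mimeType>
"""
print(plugin_order(without_order))
```

With this rule, the equal-order case from the proposal no longer needs
special handling: two plugins that both say order="1" are simply tried in
the order they were written.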

Cheers,
  Chris



On 9/15/05 3:23 PM, "ogjunk-nutch@yahoo.com" <ogjunk-nutch@yahoo.com> wrote:

> Well, you have to tell users about order="N" somewhere in the docs.
> Instead of telling them about order="N", tell them that the order in
> XML matters.  Either case requires education, and the latter one
> requires less typing and avoids the case described in the proposal.
> 
> Otis
> 
> --- Sébastien LE CALLONNEC <slc_ie@yahoo.ie> wrote:
> 
>> Hi Otis,
>> 
>> 
>> This issue arose during our discussion for this proposal, and my
>> feeling was that the XML specification doesn't state that the order
>> is significant in an XML file.  I therefore read the spec again, and
>> indeed didn't find anything on that subject...
>> 
>> I think it is somewhat reasonable to consider that a parser _might_
>> return the elements in a different order, though, as I mentioned to
>> Chris & Jerome, that would be quite unheard of and, to be honest,
>> rather irritating.
>> 
>> What do you think?
>> 
>> 
>> Regards,
>> Sebastien.
>> 
>> 
>> 
>>> Quick comment about order="N" and the paragraph that describes how
>>> to deal with cases where people mess things up and enter multiple
>>> plugins for the same content type and the same order:
>>> 
>>> - Why is the order attribute even needed?  It looks like a redundant
>>> piece of information - why not derive order from the order of
>>> plugin definitions in the XML file?
>>> 
>>> For instance:
>>> Instead of this:
>>> 
>>>   <mimeType name="*">
>>>       <plugin id="parse-text" order="1"/>
>>>       <plugin id="another-one-default-parser" order="2"/>
>>>      ....
>>>   </mimeType>
>>> 
>>> We have this:
>>> 
>>>   <mimeType name="*">
>>>       <plugin id="parse-text"/>
>>>       <plugin id="another-one-default-parser"/>
>>>      ....
>>>   </mimeType>
>>> 
>>> parse-text first, another-one-default-parser second.  Less typing,
>>> and we avoid the case of equal ordering altogether.
>>> 
>>> Otis
>>> 
>>> 
>>> --- Apache Wiki <wikidiffs@apache.org> wrote:
>>> 
>>>> Dear Wiki user,
>>>> 
>>>> You have subscribed to a wiki page or wiki category on "Nutch Wiki"
>>>> for change notification.
>>>> 
>>>> The following page has been changed by ChrisMattmann:
>>>> http://wiki.apache.org/nutch/ParserFactoryImprovementProposal
>>>> 
>>>> The comment on the change is:
>>>> Initial Draft of ParserFactoryImprovementProposal
>>>> 
>>>> New page:
>>>> = Parser Factory Improvement Proposal =
>>>> 
>>>> 
>>>> == Summary of Issue ==
>>>> Currently Nutch provides a plugin mechanism wherein plugins register
>>>> certain metadata about themselves, including their id, classname,
>>>> and so forth. In particular, the set of parsing plugins register
>>>> which contentTypes and file suffixes they can support with a
>>>> PluginRepository.
>>>> 
>>>> One "adopted practice" in current Nutch parsing plugins
>>>> (committed in Subversion, e.g., see parse-pdf, parse-rss, etc.) has
>>>> also been to verify that the content type passed to it during a
>>>> fetch is indeed one of the contentTypes that it supports (be it
>>>> application/xml, or application/pdf, etc.). This practice is
>>>> cumbersome for a few reasons:
>>>> 
>>>>  *Any updates to supported content types for a parsing plugin will
>>>> require a recompilation of the plugin code
>>>>  *Checking for "hard coded" content types within the parsing
>>>> plugin is a duplication of information that already exists in the
>>>> plugin's descriptor file, plugin.xml
>>>>  *By the time that content gets to a parsing plugin (e.g., the
>>>> parsing plugin is returned by the ParserFactory, and provided
>>>> content during a fetch), the ParserFactory should have already
>>>> ensured that the appropriate plugin is getting called for a
>>>> particular contentType.
>>>> 
>>>> In addition to this problem is the fact that several parsing
>>>> plugins may all support many of the same content types. For
>>>> instance, the parse-js plugin may be the only well suited parsing
>>>> plugin for javascript, but perhaps it may also provide a good
>>>> enough heuristic parser for plain text as well, and so it may
>>>> support both types. However, there may be a parsing plugin for text
>>>> (which there is!), parse-text, whose primary purpose is to parse
>>>> plain text as well.
>>>> 
>>>> == Suggested Remedy ==
>>>> To deal with ensuring the desired parsing plugin is called for the
>>>> appropriate content type, and to in effect "kill two birds with
>>>> one stone", we propose that there be a parsing plugin preference
>>>> list for each content type that Nutch knows how to handle, i.e.,
>>>> each content type available via the mimeType system. Therefore,
>>>> during a fetch, once the appropriate mimeType has been determined
>>>> for content, and the ParserFactory is tasked with returning a
>>>> parsing plugin, the ParserFactory should consult a preference list
>>>> for that contentType, allowing it to determine which plugin has the
>>>> highest preference for the contentType. That parsing plugin should
>>>> be returned via the ParserFactory to the fetcher. If there is any
>>>> problem using the initial returned parsing plugin for a particular
>>>> contentType (i.e., if a ParseException is thrown during the parse,
>>>> or a null ParseStatus is returned), then the ParserFactory should
>>>> be called again, this time asking for the "next highest ranked"
>>>> plugin for that contentType. Such a process should repeat until the
>>>> parse is successful.
>>>> 
>>>> We propose that the "plugin preference list" should be a separate
>>>> file that lives in $NUTCH_HOME/conf called "parse-plugins.xml".
>>>> The format of the file (full DTD to be developed during coding)
>>>> should be something like: {{{
>>>> 
>>>> <parse-plugins>
>>>>   <default pluginname="parse-text"/>
>>>>   <fileType name="powerpoint">
>>>>    <mimeTypes>
>>>>     <mimeType name="application/pdf" />
>>>>     <mimeType name="application/x-pdf" />
>>>>     ...
>>>>    </mimeTypes>
>>>> 
>>>>    <plugins>
>>>> 
>>>>       <plugin name="parse-pdf" order="1"/>
>>>>       <plugin name="parse-pdf-worse" order="2"/>
>>>>      ...
>>>>    </plugins>
>>>>   </fileType>
>>>>     ...
>>>> </parse-plugins>
>>>> 
>>>> }}}
>>>> 
>>>> 
>>>> One of the main impacts of having a file like parse-plugins.xml is
>>>> that pathSuffix="" should no longer be part of the plugin.xml
>>>> descriptor. We propose to move that out of plugin.xml and into the
>>>> mime-types.xml file.
>>>> 
>>>> == Architectural Impact ==
>>>> 
>>>> === Components ===
>>>>  *Fetcher
>>>>  *PluginSystem
>>>>  *ParserFactory
>>>> 
>>>> === Impact on current releases of Nutch ===
>>>> 
>>>> ''Incompatibilities''
>>>> 
>>>> Moving the contentType and pathSuffix out of the plugin.xml file
>>>> would create an updated version of the plugin.xml descriptor schema
>>>> for each plugin. To lessen the effect on previous and near-term
>>>> releases of Nutch, this information could be left as an option in
>>>> the plugin.xml schema, but marked as "deprecated" to let people
>>>> know that this functionality isn't part of the parse plugin
>>>> identification process anymore, but is left in the schema so as not
>>>> to create incompatibilities with the plugin.xml files that people
>>>> have already written. However, ultimately in future releases of
>>>> Nutch, we propose that the contentType and pathSuffix attributes
>>>> should be removed from the plugin.xml schema.
>>>> 
>>>> Other than the plugin.xml file schema change, this capability
>>>> addition will simply control the order in which parsing plugins get
>>>> called during fetching activities. It won't directly impact the
>>>> segments stored, or the webapp, or any of the main components of
>>>> Nutch.
>>>> 
>>>> ''Issues''
>>>> 
>>>> The proposed new capabilities should be first tested on local
>>>> systems, and if successful, uploaded to JIRA, and verified against
>>>> the latest SVNs.
>>>> Unit tests should be written to verify appropriate plugin parsing
>>>> order.
>>>> Users will need to be notified in the Nutch tutorial and
>>>> instruction lists about how to set up the parsing plugin
>>>> preferences prior to performing a fetch.
>>>> 
>>>> == Personnel ==
>>>> 
>>>>  *Jerome Charron
>>>>  *Sébastien Le Callonnec
>>>>  *Chris A. Mattmann
>>>> 
>>>> == Timeframe ==
>>>> 
>>> 
>> === message truncated ===

______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 



