nutch-dev mailing list archives

From "Chris Mattmann" <chris.mattm...@jpl.nasa.gov>
Subject RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
Date Fri, 06 Jan 2006 04:32:25 GMT
Hi Folks,
 
  I tried to remove the 5 duplicate copies of the comment, but I can't find a
button in JIRA for removing comments. Maybe a Nutch administrator can do
it? Anyway, the dang thing is running so slowly right now that it may just
have to wait until the server stops returning 503 Service Unavailable
messages. Sorry again...

Cheers,
  Chris


______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -----Original Message-----
> From: chris.mattmann@jpl.nasa.gov [mailto:chris.mattmann@jpl.nasa.gov]
> Sent: Thursday, January 05, 2006 8:28 PM
> To: nutch-dev@lucene.apache.org
> Subject: RE: [jira] Commented: (NUTCH-139) Standard metadata property
> names in the ParseData metadata
> 
> Guys,
> 
>  My apologies for the spamming comments -- I tried to submit my comment
> through JIRA one time and it kept giving me service unavailable. So I
> resubmitted like 5 times; on the fifth time it finally went through -- but I
> guess the other comments went through too. I'll try to remove them right
> away.
> 
>  Sorry again.
> 
> Cheers,
>   Chris
> 
> 
> > -----Original Message-----
> > From: Doug Cutting (JIRA) [mailto:jira@apache.org]
> > Sent: Thursday, January 05, 2006 8:04 PM
> > To: nutch-dev@incubator.apache.org
> > Subject: [jira] Commented: (NUTCH-139) Standard metadata property names in
> > the ParseData metadata
> >
> >     [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361922 ]
> >
> > Doug Cutting commented on NUTCH-139:
> > ------------------------------------
> >
> > One more thing.  Content length should also not need to be stored in the
> > metadata as an x-nutch value.  The content length is simply the length of
> > the Content's data.  The protocol may have truncated the content, in which
> > case perhaps we need an x-nutch-truncated-content metadata property or
> > something, but we should not be overwriting the HTTP "Content-Length"
> > header, nor should we trust that it reflects the length of the data
> > actually fetched.
> >
> >
> > > Standard metadata property names in the ParseData metadata
> > > ----------------------------------------------------------
> > >
> > >          Key: NUTCH-139
> > >          URL: http://issues.apache.org/jira/browse/NUTCH-139
> > >      Project: Nutch
> > >         Type: Improvement
> > >   Components: fetcher
> > >     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
> > >  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, although bug is independent of environment
> > >     Reporter: Chris A. Mattmann
> > >     Assignee: Chris A. Mattmann
> > >     Priority: Minor
> > >      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
> > >  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
> > >
> > > Currently, people are free to name their string-based properties
> > > anything they want, so "Content-type", "content-TyPe", and
> > > "CONTENT_TYPE" can all have the same meaning. I believe Stefan G.
> > > proposed a solution in which all property names are converted to lower
> > > case, but that really only fixes half the problem (recognizing that
> > > "CONTENT_TYPE", "conTeNT_TyPE", and all the other case permutations are
> > > the same). What if I named it "Content     Type" or "ContentType"?
> > >  I propose correcting this by creating a standard set of named Strings
> > > in the ParseData class that the protocol framework and the parsing
> > > framework could use to identify common properties such as
> > > "Content-type", "Creator", "Language", etc.
> > >  The properties would be defined at the top of the ParseData class,
> > > something like:
> > >  public class ParseData {
> > >      ....
> > >      public static final String CONTENT_TYPE = "content-type";
> > >      public static final String CREATOR = "creator";
> > >      ....
> > >  }
> > > In this fashion, users would at least know the names of the standard
> > > properties that they can obtain from the ParseData, for example by
> > > calling ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the
> > > content type, or
> > > ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") to set
> > > it. Of course, this wouldn't preclude users from doing what they are
> > > currently doing; it would just provide a standard way of obtaining some
> > > of the more common, critical metadata without poring over the code base
> > > to figure out what they are named.
> > > I'll contribute a patch that addresses this issue near the end of this
> > > week or the beginning of next week.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > If you think it was sent incorrectly contact one of the administrators:
> >    http://issues.apache.org/jira/secure/Administrators.jspa
> > -
> > For more information on JIRA, see:
> >    http://www.atlassian.com/software/jira
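
The two ideas in the quoted issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from any of the attached patches: a plain java.util.Map stands in for the ParseData metadata, and the class name StandardMetadataNames and the helpers lowerCaseOnly and buildMetadata are hypothetical. The constants CONTENT_TYPE and CREATOR follow the issue description, and the x-nutch-truncated-content name is the one floated in Doug's comment.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical holder class; the issue proposes putting these constants in ParseData.
class StandardMetadataNames {
    // Shared property names, as suggested in the issue description.
    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";
    // Property name floated in Doug's comment; its use here is an assumption.
    public static final String TRUNCATED_CONTENT = "x-nutch-truncated-content";

    // The lower-casing approach: it only collapses case variants.
    public static String lowerCaseOnly(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    // Builds a metadata map using the shared constants as keys.
    // The content length is deliberately not stored; see main().
    public static Map<String, String> buildMetadata(String contentType, boolean truncated) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_TYPE, contentType);
        if (truncated) {
            metadata.put(TRUNCATED_CONTENT, "true");
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Lower-casing alone still leaves "content_type" and "content-type" as different keys.
        System.out.println(lowerCaseOnly("CONTENT_TYPE").equals(lowerCaseOnly("Content-Type")));

        // Writers and readers agree on the key by sharing the constant.
        Map<String, String> metadata = buildMetadata("text/xml", false);
        System.out.println(metadata.get(CONTENT_TYPE));

        // The length comes from the fetched bytes, not from a stored header value.
        byte[] data = "<doc/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(data.length);
    }
}
```

With shared constants, the writer and the reader of a property cannot drift apart on spelling, which lower-casing alone does not guarantee. Deriving the length from the data itself leaves any original HTTP "Content-Length" header untouched.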
