nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: What's the status of Nutch-GUI?
Date Mon, 20 Nov 2006 23:29:47 GMT
Hi Armel,

On 11/20/06 1:44 PM, "Armel T. Nene" <armel.nene@idna-solutions.com> wrote:

> Hi Chris,
> 
> I am trying to extend parse-xml to enable the creation of lucene fields
> straight from an xml file. For example, a database table that has been parse
> as an XML file should be stored in the index with the relevant fields, i.e.
> customer name, address and so on. This file will not have a namespace
> associated with it and should not be stored as "xmlcontent" in the database.
> Currently, parse-xml looks for known fields in the document and stores the
> associated values with the field name. I have added an extra conditions as
> if the known fields are not present in the current document, the element or
> node in the document should be the new field stored in the index with their
> value.

I think that this is fine.
> 
> Therefore, when parse-xml receives an xml document with no namespace
> available, it will parse the document and store it element name as new field
> in the index and the element associated value.
> 
> Let me know if I am on the right track because I know I don't have to write
> a separate plugin for this feature but just extending ( or modifying)
> parse-xml.

I think that parse-xml will support what you are talking about. In terms of
the "check" that you are doing to see if a field exists or not before adding
another value for it in the index, as I understood Lucene, I believe that
you could just omit this check and add the field regardless. If you add
multiple values for the same field in a Document, e.g:

<snip>
Document doc = new Document();

doc.add(new Field("fieldname", "fieldvalue", ...));
doc.add(new Field("fieldname", "fieldvalue2",...));

</snip>

Both the values "fieldvalue" and "fieldvalue2" will both get stored in the
index for the key "fieldname". So, if I understand you correctly (which I
may not ;) ), then I think you can omit the check that you are talking about
above and just go with adding the same field name 2x.

HTH,
  Chris

> 
> Cheers,
> 
> Armel
> 
> 
> -----Original Message-----
> From: Chris Mattmann [mailto:chris.mattmann@jpl.nasa.gov]
> Sent: 20 November 2006 18:40
> To: nutch-dev@lucene.apache.org
> Subject: Re: What's the status of Nutch-GUI?
> 
> Hi Sami and Scott,
> 
>  This is on my TO-DO list as one of the items that I will begin working on
> getting into the sources as a committer. Additionally, I plan on integrating
> and testing the parse-xml plugin into the source tree. As soon as I get my
> Apache account and SVN access, I will start working on this.
> 
> Thanks!
> 
> Cheers,
>   Chris
> 
> 
> 
> On 11/20/06 9:24 AM, "Sami Siren" <ssiren@gmail.com> wrote:
> 
>> scott green wrote:
>>> Hi
>>> 
>>> Is nutch-gui dead? why i cannot find any source in svn repo?
>> 
>> Unfortunately the sources for the admin gui never got into svn. It would
>> be great if someone could pick it up and bring it up to date to get it
>> integrated.
>> 
>> --
>>   Sami Siren
>> 
> 
> 
> 
> 

______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Mime
View raw message