tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-7) Lius Lite remove all lucene dependencies from Lius and use Nutch office parsers
Date Wed, 13 Jun 2007 14:06:27 GMT

    [ https://issues.apache.org/jira/browse/TIKA-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504246
] 

Chris A. Mattmann commented on TIKA-7:
--------------------------------------

Jukka,

 Thanks for taking the lead on this. I think it's important to note that patches to
Tika should follow the kind of standard laid out here, e.g., use the org.apache.tika namespace,
make sure that unit tests and resources are placed in the right locations, etc.
I am traveling right now, but I will take a look at this patch as soon as I get back to Los
Angeles.

 One question I have is whether we have standardized on the following issues (I know they were discussed
at the BoF at ApacheCon, as I've seen conversation about it on the dev list; however,
I wasn't there :) ):

1. standardization of Parser interface?
2. control flow of Tika parsers (e.g., similar to Bertrand's email http://www.nabble.com/-RT--Tika-framework-usage-scenario-tf3913308.html)
3. major features that we want for 0.1 release

 I think that these questions need to be answered before we move forward with more code development.
I realize that I've been out of the loop for a while; however, I'm starting to have
some time now to get back into it :) So, let's discuss. Here are my proposals for
issues 1-3 above:

1. I like Bertrand's idea of a pipeline-based Tika framework. I think that the "ContentFilter"
that he proposes is essentially the Parser interface that we are talking about. Immediate
questions that come to mind are:
  a. Could the ContentFilter be run in single-filter mode, e.g., from the command line? I
think that a use case for Tika should be that all parsers are executable in some fashion (even
if only for testing) from the command line. The parsed content should be returned as some
form of Metadata object, which the user can inspect for the parsed information. Perhaps other
information should be returned as well, but that's what came to mind off the top of my head.
  b. Would this pipeline model still support the use cases for Nutch and the other initial projects
that we were targeting as customers of Tika? Nutch's parse plugins are currently more like single-content
parsing plugins; however, I think they could still be handled by this pipeline framework.
I just want to get everyone else's opinion on it.
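
 To make questions a. and b. a bit more concrete, here is a minimal sketch of what a
pipeline of filters could look like. Everything in it is an assumption for discussion only:
the ContentFilter signature, the Metadata class and the two example filters are hypothetical
and are not taken from Bertrand's proposal or from any existing Tika code. It uses Java 8
lambdas for brevity.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PipelineSketch {

    /** Hypothetical stand-in for the "Metadata object" discussed above. */
    static final class Metadata {
        private final Map<String, String> values = new LinkedHashMap<>();
        void set(String name, String value) { values.put(name, value); }
        String get(String name) { return values.get(name); }
        @Override public String toString() { return values.toString(); }
    }

    /** Hypothetical contract: inspect the content, record metadata, pass content on. */
    interface ContentFilter {
        String filter(String content, Metadata metadata);
    }

    /** Runs the filters in order; a one-element list is "single-filter mode". */
    static Metadata run(List<ContentFilter> filters, String content) {
        Metadata metadata = new Metadata();
        String current = content;
        for (ContentFilter f : filters) {
            current = f.filter(current, metadata);
        }
        metadata.set("content", current);
        return metadata;
    }

    /** Example filter: replaces markup tags with spaces and collapses whitespace. */
    static final ContentFilter STRIP_TAGS = (content, md) ->
            content.replaceAll("<[^>]*>", " ").trim().replaceAll("\\s+", " ");

    /** Example filter: records a word count and passes the content through unchanged. */
    static final ContentFilter COUNT_WORDS = (content, md) -> {
        int words = content.isEmpty() ? 0 : content.split("\\s+").length;
        md.set("wordCount", String.valueOf(words));
        return content;
    };

    /** Command-line entry point: the first argument, if any, is the content to parse. */
    public static void main(String[] args) {
        String input = args.length > 0 ? args[0] : "<p>Hello <b>Tika</b></p>";
        System.out.println(run(Arrays.asList(STRIP_TAGS, COUNT_WORDS), input));
    }
}
```

 In this shape a Nutch-style single-content parser would simply be a pipeline with one
filter, which is why I think the two models may not conflict.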

2. See my questions in #1 above
3. I think that we should plan to have the following features in the 0.1 release of Tika:
   a. Basic parsing capability. +1 for using pipelining, but we need to standardize the interfaces
for it and talk about the architecture
   b. Content Type identification (e.g., MimeType identification)
   c. Basic metadata extraction capabilities
   d. Parsing for a limited set of known content types, e.g., HTML and PDF
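
 As a rough illustration of item b., content type identification for the two types in
item d. could start from the leading bytes of a document. This is only a sketch under
my own assumptions: the class and method names are hypothetical, and real detection would
need many more signatures plus file-name and declared-type hints.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class MimeSniffSketch {

    /** Guesses a MIME type from the first bytes of a document (hypothetical helper). */
    static String detect(byte[] head) {
        // Decode at most 256 bytes as ISO-8859-1 so every byte maps to one char.
        int length = Math.min(head.length, 256);
        String prefix = new String(head, 0, length, StandardCharsets.ISO_8859_1);

        // PDF files begin with the "%PDF-" marker.
        if (prefix.startsWith("%PDF-")) {
            return "application/pdf";
        }

        // Very rough HTML check: a doctype or an <html> root after optional whitespace.
        String lower = prefix.trim().toLowerCase(Locale.ROOT);
        if (lower.startsWith("<!doctype html") || lower.startsWith("<html")) {
            return "text/html";
        }

        // Generic fallback when nothing matches.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "<html><body>test</body></html>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(sample));
    }
}
```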


What does everyone else think?

> Lius Lite remove all lucene dependencies from Lius  and use Nutch office parsers
> --------------------------------------------------------------------------------
>
>                 Key: TIKA-7
>                 URL: https://issues.apache.org/jira/browse/TIKA-7
>             Project: Tika
>          Issue Type: New Feature
>          Components: general
>         Environment: Java 1.5
>            Reporter: Rida Benjelloun
>         Attachments: liuslite.patch, liusLite.zip
>
>
> Hi,
> This is a work in progress of Lius. The release removes all Lucene dependencies and uses
Nutch Office parsers because they are based on Apache POI.
> Lius Lite offers 4 ways for content extraction:
> - Document fulltext extraction
> - XPath extraction
> - Regex extraction
> - Document metadata extraction (not implemented for all parsers)
> Lius Lite uses an XML config file to configure the parsers and the information to extract.
 Please see config.xml in the config folder.
> See also Junit tests.
> Here is an example  of XML parsing :
> 1- XML Config
> 		<parser name="text-xml" class="liuslite.parser.xml.XMLParser">			
> 				<namespace>http://purl.org/dc/elements/1.1/</namespace>
> 				<mime>application/xml</mime>
> 				<extract>
> 					<content name="title" xpathSelect="//dc:title"/>
> 					<content name="subject" xpathSelect="//dc:subject"/>
> 					<content name="creator" xpathSelect="//dc:creator"/>
> 					<content name="description" xpathSelect="//dc:description"/>
> 					<content name="publisher" xpathSelect="//dc:publisher"/>
> 					<content name="contributor" xpathSelect="//dc:contributor"/>
> 					<content name="type" xpathSelect="//dc:type"/>
> 					<content name="format" xpathSelect="//dc:format"/>
> 					<content name="identifier" xpathSelect="//dc:identifier"/>
> 					<content name="language" xpathSelect="//dc:language"/>
> 					<content name="rights" xpathSelect="//dc:rights"/>
> 					<content name="outLinks">
> 						<regexSelect>
> 							<![CDATA[
> 								([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
> 							]]>
> 						</regexSelect>
> 					</content>
> 				</extract>			
> 		</parser>
> 2- XML Document
> <oaidc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/">
> 	<dc:title>Archimède et Lius</dc:title>
> 	<dc:creator>Rida Benjelloun</dc:creator>
> 	<dc:subject>Java</dc:subject>
> 	<dc:subject>XML</dc:subject>
> 	<dc:subject>XSLT</dc:subject>
> 	<dc:subject>JDOM</dc:subject>
> 	<dc:subject>Indexation</dc:subject>
> 	<dc:description>Framework d'indexation des documents XML, HTML, PDF etc.. </dc:description>
> 	<dc:publisher>Doculibre</dc:publisher>
> 	<dc:identifier>http://www.apache.org</dc:identifier>
> 	<dc:date>2000-12</dc:date>
> 	<dc:type>test</dc:type>
> 	<dc:format>application/msword</dc:format>
> 	<dc:language>Fr</dc:language>
> 	<dc:rights>Non restreint</dc:rights>	
> </oaidc:dc>
> 3- Java Code 
> LiusConfig lc = LiusConfig.getInstance(configPathString);
> LiusLogger.setLoggerConfigFile(log4jPathString);
> File testFile = new File("test.xml");
> try {
>     // Pick the parser configured for the file's MIME type
>     Parser parser = ParserFactory.getParser(testFile, lc);
>     String fullText = parser.getContentStr();
>
>     // Single-valued content
>     Content title = parser.getContent("title");
>     String titleStr = title.getValue();
>
>     // Multi-valued content
>     Content subject = parser.getContent("subject");
>     String[] subjects = subject.getValues();
>     // etc ...
>
>     // Or, to get all configured contents at once:
>     List<Content> contents = parser.getContents();
> } catch (MimeInfoException e) {
>     e.printStackTrace();
> } catch (IOException e) {
>     e.printStackTrace();
> } catch (LiusException e) {
>     e.printStackTrace();
> }
> best regards
> Rida Benjelloun
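
For readers who want to try the XPath part of the quoted example without the Lius Lite
classes, the following is a minimal sketch using only the standard JAXP APIs (javax.xml.parsers
and javax.xml.xpath). It is not how Lius Lite implements xpathSelect; the class name
DcXPathSketch and the select helper are hypothetical, and only the "dc" prefix from the
config above is bound.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DcXPathSketch {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Maps the "dc" prefix used in the XPath expressions to the Dublin Core namespace. */
    static final NamespaceContext DC_CONTEXT = new NamespaceContext() {
        public String getNamespaceURI(String prefix) {
            return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return DC_NS.equals(uri) ? "dc" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            return DC_NS.equals(uri)
                    ? Collections.singletonList("dc").iterator()
                    : Collections.<String>emptyList().iterator();
        }
    };

    /** Returns the text of every node matched by the expression (hypothetical helper). */
    static List<String> select(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed XPath steps to match
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(DC_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<oaidc:dc xmlns:dc=\"" + DC_NS + "\""
                + " xmlns:oaidc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\">"
                + "<dc:title>Lius Lite</dc:title>"
                + "<dc:subject>Java</dc:subject><dc:subject>XML</dc:subject>"
                + "</oaidc:dc>";
        System.out.println(select(xml, "//dc:title"));
        System.out.println(select(xml, "//dc:subject"));
    }
}
```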

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

