The following page has been changed by Astroknight:
http://wiki.apache.org/jackrabbit/TextExtractorExamples
New page:
##language:en
== Examples for writing your own TextExtractors ==
=== Add Mime Types ===
Extract "org\apache\jackrabbit\server\io\mimetypes.properties" from jackrabbit-jcr-server-*.jar
and add it to your web project's classes folder, then add the mime types that your text extractor
classes declare.
{{{
...
mht=message/rfc822
msg=application/msoutlook
csv=text/plain
}}}
=== Obtain Mime Type ===
To obtain the mime type for a file path, use {{{MimeResolver}}} where possible. Keep a single
shared instance, because the constructor reads the mimetypes.properties file.
{{{
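// org.apache.jackrabbit.server.io.MimeResolver; one shared instance for the whole application
public static final MimeResolver mimeResolver = new MimeResolver();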
...
String contentType = mimeResolver.getMimeType(filePath);
}}}
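To make the extension-to-type mapping concrete, here is a minimal sketch that uses only the JDK. It is not how {{{MimeResolver}}} is implemented; the class name `ExtensionMimeLookup` and the `application/octet-stream` fallback are assumptions made for illustration. It parses entries in the same `extension=mime/type` format as mimetypes.properties and looks up the type by the file path's extension.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

/** Hypothetical stand-in for illustration only; not Jackrabbit's MimeResolver. */
class ExtensionMimeLookup {

    // Assumed fallback for paths with no extension or an unmapped extension.
    private static final String DEFAULT_TYPE = "application/octet-stream";

    private final Properties types = new Properties();

    /** Parses entries written as "extension=mime/type", one per line. */
    ExtensionMimeLookup(String propertiesText) {
        try {
            types.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the type mapped to the path's extension, or the fallback. */
    String getMimeType(String filePath) {
        int dot = filePath.lastIndexOf('.');
        if (dot < 0 || dot == filePath.length() - 1) {
            return DEFAULT_TYPE;
        }
        // Extensions are matched case-insensitively in this sketch.
        String extension = filePath.substring(dot + 1).toLowerCase(Locale.ROOT);
        return types.getProperty(extension, DEFAULT_TYPE);
    }

    public static void main(String[] args) {
        ExtensionMimeLookup lookup = new ExtensionMimeLookup(
                "mht=message/rfc822\nmsg=application/msoutlook\ncsv=text/plain\n");
        System.out.println(lookup.getMimeType("reports/summary.CSV")); // text/plain
        System.out.println(lookup.getMimeType("notes.unknown"));       // application/octet-stream
    }
}
```

As with {{{MimeResolver}}}, the mapping is parsed once in the constructor, so one instance can serve every lookup.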
=== MS PowerPoint ===
To support text extraction from MS PowerPoint files, the code below uses Apache POI's HSLF
component.
{{{
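// Imports assume Jackrabbit 1.x text extractors, POI 3.x HSLF and Commons Lang 2.x.
import java.io.CharArrayReader;
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;

import org.apache.commons.lang.StringUtils;
import org.apache.jackrabbit.extractor.AbstractTextExtractor;
import org.apache.poi.hslf.HSLFSlideShow;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.model.TextRun;
import org.apache.poi.hslf.usermodel.SlideShow;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;

/**
 * Text extractor for Microsoft PowerPoint presentations.
 */
public class MsPowerPointTextExtractor extends AbstractTextExtractor {

    /**
     * Force loading of dependent class.
     */
    static {
        POIFSReader.class.getName();
    }

    /**
     * Creates a new <code>MsPowerPointTextExtractor</code> instance.
     */
    public MsPowerPointTextExtractor() {
        super(new String[]{"application/vnd.ms-powerpoint",
                           "application/mspowerpoint"});
    }

    //-------------------------------------------------------< TextExtractor >

    /**
     * {@inheritDoc}
     */
    public Reader extractText(InputStream stream,
                              String type,
                              String encoding) throws IOException {
        try {
            CharArrayWriter writer = new CharArrayWriter();
            SlideShow slideShow = new SlideShow(new HSLFSlideShow(stream));
            Slide[] slides = slideShow.getSlides();
            for (int i = 0; i < slides.length; i++) {
                Slide slide = slides[i];
                /* Optional: write the slide title ahead of the slide's text runs */
                String title = slide.getTitle();
                if (StringUtils.isNotEmpty(title)) {
                    writer.append(title + " ");
                }
                // Append the text of every text run, separated by spaces
                TextRun[] textRuns = slide.getTextRuns();
                for (int j = 0; j < textRuns.length; j++) {
                    writer.append(textRuns[j].getText() + " ");
                }
            }
            return new CharArrayReader(writer.toCharArray());
        } finally {
            // The extractor is responsible for closing the stream
            stream.close();
        }
    }
}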
}}}
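When trying out an extractor, it can help to drain the returned {{{Reader}}} into a String and look at the text. The following is a small JDK-only sketch; `ReaderDump` and `readAll` are hypothetical names, and the sample text in `main` stands in for real extractor output.

```java
import java.io.CharArrayReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical helper for inspecting the Reader an extractor returns. */
class ReaderDump {

    /** Reads the reader to exhaustion, closes it, and returns the text. */
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader in = reader) {
            int count;
            while ((count = in.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the Reader returned by extractText(...)
        Reader extracted = new CharArrayReader("Quarterly Review Sales grew ".toCharArray());
        System.out.println(readAll(extracted).trim());
    }
}
```

With the extractor above, the call would presumably look like `ReaderDump.readAll(new MsPowerPointTextExtractor().extractText(stream, "application/vnd.ms-powerpoint", null))`, where `stream` is an open InputStream on a .ppt file.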