tika-dev mailing list archives

From "Nick Burch (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-695) Custom properties on xlsx, docx, pptx
Date Thu, 12 Jan 2012 15:01:43 GMT

    [ https://issues.apache.org/jira/browse/TIKA-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185001#comment-13185001 ]

Nick Burch commented on TIKA-695:
---------------------------------

Thanks for the sample files. Based on them, I've added support for all the common custom property
types (to both Tika and POI), and added unit tests for custom properties on both OLE2 and
OOXML files.

As of r1230576, custom properties from OOXML files are being correctly extracted.

The only parts still unsupported are Vectors/Arrays (where a property can have multiple values),
and the byte-based blobs/streams. I don't think we're likely to be able to do much with the
byte-based ones, but the vectors/arrays could be worth adding later. If you're able
to create files with these custom properties, please open a new enhancement for it!
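
For illustration only, here is a rough sketch of the split described above: scalar property types are kept, while vectors/arrays and blobs/streams are skipped. It is not the Tika or POI code. It reads a docProps/custom.xml fragment with the JDK's DOM parser instead of POI, and the class name, method name and sample document are all made up for the example. The "custom:" prefix and the list of scalar types are taken from the code proposed in the issue.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; NOT how Tika or POI implement this.
final class CustomPropsSketch {
    // Namespaces used by docProps/custom.xml in OOXML packages.
    static final String CP_NS =
        "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties";
    static final String VT_NS =
        "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes";

    // Made-up sample: three scalar properties and one vector property.
    static final String FMTID = "{D5CDD505-2E9C-101B-9397-08002B2CF9AE}";
    static final String SAMPLE =
        "<Properties xmlns=\"" + CP_NS + "\" xmlns:vt=\"" + VT_NS + "\">"
        + prop(2, "Client", "<vt:lpwstr>ACME</vt:lpwstr>")
        + prop(3, "Revision", "<vt:i4>7</vt:i4>")
        + prop(4, "Approved", "<vt:bool>true</vt:bool>")
        + prop(5, "Tags", "<vt:vector size=\"1\" baseType=\"lpwstr\">"
            + "<vt:lpwstr>a</vt:lpwstr></vt:vector>")
        + "</Properties>";

    private static String prop(int pid, String name, String typedValue) {
        return "<property fmtid=\"" + FMTID + "\" pid=\"" + pid
            + "\" name=\"" + name + "\">" + typedValue + "</property>";
    }

    /** Returns "custom:<name>" -> text value for scalar properties only. */
    static Map<String, String> readCustomProperties(String xml) {
        Map<String, String> out = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList props = doc.getElementsByTagNameNS(CP_NS, "property");
            for (int i = 0; i < props.getLength(); i++) {
                Element property = (Element) props.item(i);
                Element typed = firstChildElement(property);
                if (typed == null || !VT_NS.equals(typed.getNamespaceURI())) {
                    continue;
                }
                switch (typed.getLocalName()) {
                    case "lpwstr": case "lpstr": case "i4": case "int":
                    case "bool": case "decimal": case "filetime": case "date":
                        // Scalar type: keep the raw text, prefixed as in the issue.
                        out.put("custom:" + property.getAttribute("name"),
                                typed.getTextContent().trim());
                        break;
                    default:
                        // vector, array, blob, stream, ...: skipped in this sketch.
                        break;
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse custom properties", e);
        }
        return out;
    }

    // First child that is an element, ignoring whitespace text nodes.
    private static Element firstChildElement(Element parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(readCustomProperties(SAMPLE));
    }
}
```

With the sample above, the "Tags" property is dropped because its value is a vt:vector, while the three scalar properties come through with the "custom:" prefix. Values are left as raw text here; dates and booleans are not converted.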
                
> Custom properties on xlsx, docx, pptx
> -------------------------------------
>
>                 Key: TIKA-695
>                 URL: https://issues.apache.org/jira/browse/TIKA-695
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10, 1.0
>         Environment: All OS
>            Reporter: Etienne Jouvin
>            Priority: Minor
>             Fix For: 1.1
>
>
> The parser for Office X files (xlsx, docx, pptx) does not extract custom properties.
> In class MetadataExtractor, method extract, only core and extended properties are retrieved.
> I added something like this:
> extractMetadata(extractor.getCustomProperties(), metadata);
> {quote}
> 	/**
> 	 * Add this method to read custom properties on document.
> 	 * 
> 	 * @param properties All custom properties.
> 	 * @param metadata Metadata to complete with read properties.
> 	 */
> 	private void extractMetadata(CustomProperties properties, Metadata metadata) {
> 		org.openxmlformats.schemas.officeDocument.x2006.customProperties.CTProperties propsHolder = properties.getUnderlyingProperties();
> 		String value = null;
> 		DateUtils dateUtils = DateUtils.getInstance();
> 		BigDecimal bigDecimal;
> 		for (CTProperty property : propsHolder.getPropertyList()) {
> 			/* Parse each property */
> 			if (property.isSetLpwstr()) {
> 				value = property.getLpwstr();
> 			} else if (property.isSetFiletime()) {
> 				value = dateUtils.convertDate(property.getFiletime(), null);
> 			} else if (property.isSetDate()) {
> 				value = dateUtils.convertDate(property.getDate(), null);
> 			} else if (property.isSetDecimal()) {
> 				bigDecimal = property.getDecimal();
> 				value = null == bigDecimal ? null : bigDecimal.toString();
> 			} else if (property.isSetBool()) {
> 				value = BooleanUtils.toStringTrueFalse(property.getBool());
> 			} else if (property.isSetInt()) {
> 				value = Integer.toString(property.getInt());
> 			} else if (property.isSetLpstr()) {
> 				value = property.getLpstr();
> 			} else if (property.isSetI4()) {
> 				/* Number in Excel for example.... Why i4 ? Ask microsoft. */
> 				value = Integer.toString(property.getI4());
> 			} else {
> 				/* For other type, do nothing. */
> 				continue;
> 			}
> 			/* Add the custom prefix, as done in old office format. */
> 			addProperty(metadata, "custom:" + property.getName(), value);
> 		}
> 	}
> {quote}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
