tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2449) Enabling extraction of standard references from text
Date Thu, 14 Sep 2017 00:36:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165536#comment-16165536 ]

ASF GitHub Bot commented on TIKA-2449:

giuseppetotaro closed pull request #204: TIKA-2449: Enabling extraction of standard references
from text
URL: https://github.com/apache/tika/pull/204
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> Enabling extraction of standard references from text
> ----------------------------------------------------
>                 Key: TIKA-2449
>                 URL: https://issues.apache.org/jira/browse/TIKA-2449
>             Project: Tika
>          Issue Type: Improvement
>          Components: handler
>            Reporter: Giuseppe Totaro
>              Labels: handler
>         Attachments: flowchart_standards_extraction.png, flowchart_standards_extraction_v02.png,
SOW-TacCOM.pdf, standards_extraction.patch
> Apache Tika currently provides many _ContentHandler_ implementations that help to de-obfuscate
information from text. For instance, the {{PhoneExtractingContentHandler}} is used to extract
phone numbers while parsing.
> This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a new ContentHandler
that relies on regular expressions in order to identify and extract standard references from
text.
> Basically, a standard reference is just a reference to a norm/convention/requirement
(i.e., a standard) released by a standard organization. This work is mainly focused on identifying
and extracting the references to the standards already cited within a given document (e.g.,
SOW/PWS), so that the references can be stored and provided to the user as additional metadata
when the {{StandardsExtractingContentHandler}} is used.
> In addition to the patch, the first version of the {{StandardsExtractingContentHandler}}
along with an example class to easily execute the handler is available on [GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler].
The following sections describe in more detail how the {{StandardsExtractingContentHandler}} has been
> h1. Background
> From a technical perspective, a standard reference is a string that is usually composed
of two parts: 
> # the name of the standard organization; 
> # the alphanumeric identifier of the standard within the organization. 
> Specifically, the first part can include the acronym or the full name of the standard
organization or even both, and the second part can include an alphanumeric string, possibly
containing one or more separation symbols (e.g., "-", "_", ".") depending on the format adopted
by the organization, representing the identifier of the standard within the organization.
> Furthermore, the standard references are usually reported within the "Applicable Documents"
or "References" section of a SOW, and they can be cited also within sections that include
in the header the word "standard", "requirement", "guideline", or "compliance".
> Consequently, the citation of standard references within a SOW/PWS document can be summarized
by the following rules:
> * *RULE #1*: standard references are usually reported within the section named "Applicable
Documents" or "References".
> * *RULE #2*: standard references can be cited also within sections including the word
"compliance" or another semantically-equivalent word in their name.
> * *RULE #3*: a standard reference is composed of two parts:
> ** Name of the standard organization (acronym, full name, or both).
> ** Alphanumeric identifier of the standard within the organization.
> * *RULE #4*: The name of the standard organization includes the acronym or the full name
or both. The name must belong to the set of standard organizations {{S = O U V}}, where {{O}}
represents the set of open standard organizations (e.g., ANSI) and {{V}} represents the set
of vendor-specific standard organizations (e.g., Motorola).
> * *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be used between
the name of the standard organization and the alphanumeric identifier.
> * *RULE #6*: The alphanumeric identifier of the standard is composed of alphabetic and
numeric characters, possibly split in two or more parts by a separation symbol (e.g., "-",
"_", ".").
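> As a minimal sketch of RULE #4 and RULE #6 (not the code from the patch), the check below builds the set {{S = O U V}} from two small sample sets and validates an identifier with a simplified pattern. The class name {{ReferenceRulesSketch}}, the sample set members other than ANSI and Motorola, and the identifier pattern are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: illustrates RULE #4 and RULE #6 only.
final class ReferenceRulesSketch {
    // O: open standard organizations (sample members, assumed for the example).
    static final Set<String> OPEN = new HashSet<>(Arrays.asList("ANSI", "ISO"));
    // V: vendor-specific standard organizations (sample member from the rule text).
    static final Set<String> VENDOR = new HashSet<>(Arrays.asList("Motorola"));
    // Simplified take on RULE #6: alphanumeric parts joined by "-", "_" or ".".
    static final Pattern IDENTIFIER =
            Pattern.compile("[A-Za-z0-9]+([-_.][A-Za-z0-9]+)*");

    // S = O U V
    static Set<String> organizations() {
        Set<String> all = new HashSet<>(OPEN);
        all.addAll(VENDOR);
        return all;
    }

    // True when the organization belongs to S and the identifier has the expected shape.
    static boolean followsRules(String organization, String identifier) {
        return organizations().contains(organization)
                && IDENTIFIER.matcher(identifier).matches();
    }

    public static void main(String[] args) {
        System.out.println(followsRules("ANSI", "C63.4"));
        System.out.println(followsRules("ACME", "123"));
    }
}
```

> The identifier strings used here are illustrative inputs, not references taken from the attached SOW.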
> On the basis of the above rules, here are some examples of formats used for reporting
standard references within a SOW/PWS:
> Moreover, some standards are sometimes released by two standard organizations. In this
case, the standard reference can be reported as follows:
> h1. Regular Expressions
> The {{StandardsExtractingContentHandler}} uses a helper class named {{StandardsText}}
that relies on Java regular expressions and provides some methods to identify headers and
standard references, and determine the score of the references found within the given text.
> Here are the main regular expressions used within the {{StandardsText}} class:
> * *REGEX_HEADER*: regular expression to match only uppercase headers.
>   {code}
>   (\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,}
>   {code}
> * *REGEX_APPLICABLE_DOCUMENTS*: regular expression to match the header of "APPLICABLE
DOCUMENTS" and equivalent sections.
>   {code}
>   {code}
> * *REGEX_FALLBACK*: regular expression to match a string that is supposed to be a standard
reference.
>   {code}
>   \(?(?<mainOrganization>[A-Z]\w+)\)?((\s?(?<separator>\/)\s?)(\w+\s)*\(?(?<secondOrganization>[A-Z]\w+)\)?)?(\s(Publication|Standard))?(-|\s)?(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\.)?[0-9]{2,}))((-|_|\.)?[A-Z0-9]+)*)
>   {code}
> * *REGEX_STANDARD*: regular expression to match the standard organization within a string
potentially representing a standard reference.
>   This regular expression is obtained by using a helper class named {{StandardOrganizations}}
that provides a list of the most important standard organizations reported on [Wikipedia|https://en.wikipedia.org/wiki/List_of_technical_standard_organisations].
Basically, the list is composed of International standard organizations, Regional standard
organizations, and American and British among Nationally-based standard organizations. Other
lists of standard organizations are reported on [OpenStandards|http://www.openstandards.net/viewOSnet2C.jsp?showModuleName=Organizations]
and [IBR Standards Portal|https://ibr.ansi.org/Standards/].
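> The following sketch shows how the REGEX_HEADER and REGEX_FALLBACK expressions quoted above behave when compiled with {{java.util.regex}}. The patterns are copied from this description; the class name {{StandardsRegexSketch}}, the helper methods, and the sample input strings are assumptions for illustration and are not taken from {{StandardsText}}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Tika.
final class StandardsRegexSketch {
    // REGEX_HEADER as quoted in the description, escaped for a Java string.
    static final Pattern HEADER = Pattern.compile(
            "(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}");

    // REGEX_FALLBACK as quoted in the description, split only for readability.
    static final Pattern FALLBACK = Pattern.compile(
            "\\(?(?<mainOrganization>[A-Z]\\w+)\\)?"
            + "((\\s?(?<separator>\\/)\\s?)(\\w+\\s)*\\(?(?<secondOrganization>[A-Z]\\w+)\\)?)?"
            + "(\\s(Publication|Standard))?(-|\\s)?"
            + "(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\\.)?[0-9]{2,}))((-|_|\\.)?[A-Z0-9]+)*)");

    // True when the line contains a numbered, uppercase header.
    static boolean isHeader(String line) {
        return HEADER.matcher(line).find();
    }

    // Returns "main|second|identifier" for the first candidate reference, or null.
    // A missing second organization is rendered as the string "null".
    static String describe(String text) {
        Matcher m = FALLBACK.matcher(text);
        if (!m.find()) {
            return null;
        }
        return m.group("mainOrganization") + "|"
                + m.group("secondOrganization") + "|"
                + m.group("identifier");
    }

    public static void main(String[] args) {
        System.out.println(isHeader("2.0 APPLICABLE DOCUMENTS"));
        System.out.println(describe("ANSI/TIA-603"));
    }
}
```

> With these inputs, {{describe("ANSI/TIA-603")}} yields {{ANSI|TIA|603}} and {{describe("ISO 9001")}} yields {{ISO|null|9001}}. Note that the {{{5,}}} quantifier in REGEX_HEADER means a header needs at least five uppercase letters, so a line such as "1. TEST" is not treated as a header.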
> h1. How To Use The Standards Extraction Capability
> The standard references identification performed by using the {{StandardsExtractingContentHandler}}
is based on the following steps (see also the [flow chart|^flowchart_standards_extraction.png]
in attachment):
> # searches for headers;
> # searches for patterns that are supposed to be standard references (basically, every
string mostly composed of uppercase letters followed by alphanumeric characters);
> # each potential standard reference starts with score equal to 0.25;
> # increases by 0.50 the score of references which include the name of a known standard
organization;
> # increases by 0.25 the score of references which have been found within "Applicable
Documents" and equivalent sections;
> # returns the standard references along with scores;
> # adds the standard references as additional metadata.
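> The scoring steps above can be sketched as follows. This is an illustration under stated assumptions, not the logic of {{StandardsText}}: the class name {{StandardScoreSketch}}, the method signature, and the sample list of known organizations are hypothetical; only the values 0.25, 0.50 and 0.25 come from the steps listed here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical scoring helper, not part of Tika.
final class StandardScoreSketch {
    // Sample of known standard organizations (assumed for the example).
    static final Set<String> KNOWN_ORGANIZATIONS =
            new HashSet<>(Arrays.asList("ANSI", "ISO", "IEEE"));

    // Computes the score of a candidate reference from the two signals in the steps.
    static double score(String organization, boolean inApplicableDocuments) {
        double score = 0.25; // every potential reference starts at 0.25
        if (KNOWN_ORGANIZATIONS.contains(organization)) {
            score += 0.50;   // known standard organization
        }
        if (inApplicableDocuments) {
            score += 0.25;   // found within "Applicable Documents" or an equivalent section
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("ISO", true));
        System.out.println(score("ACME", false));
    }
}
```

> Under this scheme a reference can only reach 1.0 when both signals are present, and a candidate whose organization is unknown never exceeds 0.50.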
> The unit test is implemented within the *{{StandardsExtractingContentHandlerTest}}* class
and extracts the standard references from a SoW downloaded from the [FOIA Library|https://foiarr.cbp.gov/streamingWord.asp?i=607].
This [SoW|^SOW-TacCOM.pdf] is also provided as PDF in attachment.
> The *{{StandardsExtractionExample}}* is a class to demonstrate how to use the {{StandardsExtractingContentHandler}}
to get a list of the standard references from every file in a directory.
> The [patch|^standards_extraction.patch] in attachment includes all the changes to add
the support for standards extraction. 

This message was sent by Atlassian JIRA
