tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2317) Add alert that string was truncated before counting tokens
Date Thu, 06 Apr 2017 17:09:41 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959319#comment-15959319 ]

Hudson commented on TIKA-2317:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #189 (See [https://builds.apache.org/job/tika-2.x-windows/189/])
TIKA-2317 warn user if max content length is hit; allow for easier (tallison: rev 67a5e91b2a4157ee06f924280b0b828819c88223)
* (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/EvalConsumerBuilder.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/AbstractProfiler.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/ExtractComparer.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/ExtractProfilerBuilder.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/ExtractComparerBuilder.java
* (edit) tika-eval/src/main/resources/log4j.properties
* (edit) tika-eval/src/test/java/org/apache/tika/eval/TikaEvalCLITest.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerManager.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerDeserializer.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogReader.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/XMLErrorLogUpdater.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/reports/Report.java
* (edit) tika-eval/src/main/resources/tika-eval-profiler-config.xml
* (edit) tika-eval/src/main/java/org/apache/tika/eval/db/MimeBuffer.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/io/DBWriter.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/db/Cols.java
* (edit) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
* (edit) tika-eval/src/main/resources/lucene-analyzers.json
* (edit) tika-eval/src/main/java/org/apache/tika/eval/io/ExtractReader.java
* (edit) tika-eval/src/main/resources/tika-eval-comparison-config.xml
* (edit) tika-eval/src/test/java/org/apache/tika/eval/SimpleComparerTest.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/reports/ResultsReporter.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/db/AbstractBufferTest.java
* (edit) tika-eval/src/main/resources/profile-reports.xml
* (edit) tika-eval/src/main/resources/comparison-reports.xml
* (edit) tika-eval/src/main/java/org/apache/tika/eval/ExtractProfiler.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/db/JDBCUtil.java


> Add alert that string was truncated before counting tokens
> ----------------------------------------------------------
>
>                 Key: TIKA-2317
>                 URL: https://issues.apache.org/jira/browse/TIKA-2317
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-eval
>            Reporter: Tim Allison
>            Priority: Trivial
>
> As a memory safety feature, there's a hard limit on the length of the string that is
> processed by the token counter.  We should alert the user when the string is truncated
> because comparisons can be misleading in the case that extractA packs more words into the
> first 1000000 characters than does extractB even though there are actually more tokens in
> extractB.
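
The effect described in the issue can be illustrated with a minimal sketch. This is not the tika-eval implementation: the class name TruncationSketch, the helpers cap and countTokens, and the whitespace tokenizer are all hypothetical, and the limit is shrunk from 1000000 to 11 characters so the example stays readable.

```java
public class TruncationSketch {

    /** Hypothetical holder: the (possibly shortened) text plus a truncation flag. */
    static final class Capped {
        final String text;
        final boolean truncated;

        Capped(String text, boolean truncated) {
            this.text = text;
            this.truncated = truncated;
        }
    }

    /** Cuts s to at most maxChars characters and records whether anything was dropped. */
    static Capped cap(String s, int maxChars) {
        if (s.length() <= maxChars) {
            return new Capped(s, false);
        }
        // Note: a plain substring may cut through the middle of a token.
        return new Capped(s.substring(0, maxChars), true);
    }

    /** Naive whitespace token count; an assumption, not the analyzer tika-eval uses. */
    static int countTokens(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        int limit = 11; // stand-in for the real hard limit
        String extractA = "a b c d e f";                              // 6 tokens
        String extractB = "alpha beta gamma delta epsilon zeta eta";  // 7 tokens

        for (String extract : new String[] {extractA, extractB}) {
            Capped capped = cap(extract, limit);
            if (capped.truncated) {
                // The alert the issue asks for: tell the user the count is partial.
                System.err.println("WARN: text truncated to " + limit
                        + " chars before counting tokens");
            }
            System.out.println(countTokens(capped.text) + " tokens counted, "
                    + countTokens(extract) + " tokens in full text");
        }
    }
}
```

With these inputs, extractA fits under the limit and reports all 6 tokens, while extractB is cut to "alpha beta " and reports only 2 of its 7 tokens. Without the warning, a comparison would wrongly suggest that extractA contains more tokens.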



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
