tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2100) Html Parser does not keep the html tag attributes
Date Tue, 29 May 2018 18:25:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494009#comment-16494009 ]

ASF GitHub Bot commented on TIKA-2100:
--------------------------------------

tballison closed pull request #238: TIKA-2100 extract content language from html lang attribute
URL: https://github.com/apache/tika/pull/238
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java b/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
index 4742339fa..a2008208f 100644
--- a/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
+++ b/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
@@ -138,7 +138,12 @@ private void lazyStartHead() throws SAXException {
             
             // Call directly, so we don't go through our startElement(), which will
             // ignore these elements.
-            super.startElement(XHTML, "html", "html", EMPTY_ATTRIBUTES);
+            AttributesImpl htmlAttrs = new AttributesImpl();
+            String lang = metadata.get(Metadata.CONTENT_LANGUAGE);
+            if (lang != null) {
+                htmlAttrs.addAttribute("", "lang", "lang", "CDATA", lang);
+            }
+            super.startElement(XHTML, "html", "html", htmlAttrs);
             newline();
             super.startElement(XHTML, "head", "head", EMPTY_ATTRIBUTES);
             newline();
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
index a803f7699..18a1025d7 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
@@ -119,6 +119,9 @@ public void startElement(
             String uri, String local, String name, Attributes atts)
             throws SAXException {
 
+        if ("HTML".equals(name) && atts.getValue("lang") != null) {
+            metadata.set(Metadata.CONTENT_LANGUAGE, atts.getValue("lang"));
+        }
         if ("SCRIPT".equals(name)) {
             scriptLevel++;
         }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
index def25d16b..cf3bd78a5 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
@@ -873,6 +873,25 @@ public void testNewlineAndIndent() throws Exception {
         assertTrue(Pattern.matches("\tone\n\n", result));
     }
 
+    /**
+     * Test case for Tika-2100
+     * @see <a href="https://issues.apache.org/jira/browse/TIKA-2100">TIKA-2100</a>
+     */
+    @Test
+    public void testHtmlLanguage() throws Exception {
+        final String html = "<html lang=\"fr\"></html>";
+
+        StringWriter sw = new StringWriter();
+        Metadata metadata = new Metadata();
+        new HtmlParser().parse(
+                new ByteArrayInputStream(html.getBytes(UTF_8)),
+                makeHtmlTransformer(sw), metadata, new ParseContext());
+
+        assertEquals("fr", metadata.get(Metadata.CONTENT_LANGUAGE));
+        assertTrue("Missing HTML lang attribute",
+                Pattern.matches("(?s)<html[^>]* lang=\"fr\".*", sw.toString()));
+    }
+
     /**
      * Test case for TIKA-961
      *


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Html Parser does not keep the html tag attributes
> -------------------------------------------------
>
>                 Key: TIKA-2100
>                 URL: https://issues.apache.org/jira/browse/TIKA-2100
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> Parsing a very simple html like 
>  <!DOCTYPE html>
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1 align="left">My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html> 
> you won't be able to access the html tag's attributes (here lang="en") in the ContentHandler:
> * In the method startElement(String ns, String localName, String name, Attributes atts), atts is empty.
> * Moreover, it seems that the html tag's attributes are not passed through HtmlMapper.mapSafeAttribute either.
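
For illustration, here is a minimal sketch of what the reporter expected to be able to do: read the lang attribute of the html element inside startElement. It is an assumption-laden stand-in, not Tika code: it uses the JDK's built-in SAX parser on well-formed XHTML instead of Tika's HtmlParser, and the names HtmlLangSketch, LangHandler and extractLang are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: JDK SAX parser only, no Tika classes involved.
class HtmlLangSketch {

    /** Remembers the lang attribute seen on the html element, if any. */
    static class LangHandler extends DefaultHandler {
        String lang; // stays null when the html element has no lang attribute

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            // The default JDK parser is not namespace-aware, so qName carries
            // the element name and attributes are looked up by qualified name.
            if ("html".equalsIgnoreCase(qName) && atts.getValue("lang") != null) {
                lang = atts.getValue("lang");
            }
        }
    }

    /** Parses well-formed XHTML and returns the html lang value, or null. */
    static String extractLang(String xhtml) throws Exception {
        LangHandler handler = new LangHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return handler.lang;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html lang=\"en\"><head><title>Page Title</title></head>"
                + "<body><p>My first paragraph.</p></body></html>";
        System.out.println(extractLang(page));
    }
}
```

With Tika 1.13 the equivalent handler attached to HtmlParser receives an empty atts for the html element, which is the behaviour this issue reports.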



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
