[ https://issues.apache.org/jira/browse/TIKA-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025253#comment-14025253 ]

Hassan Akram commented on TIKA-1303:
------------------------------------

:) Thanks, guys. Do I just close this issue now? And does that mean this will make it into release 1.7 and not 1.6?

> Parsing Html page (not well formed) containing two title tags results in metadata (title) to be overwritten
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1303
>                 URL: https://issues.apache.org/jira/browse/TIKA-1303
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.2, 1.3, 1.4, 1.5
>            Reporter: Hassan Akram
>            Assignee: Ken Krugler
>            Priority: Minor
>              Labels: patch
>             Fix For: 1.7
>
>         Attachments: HtmlHandler.java, HtmlParserTest.java, TIKA-1303.patch
>
>
> While crawling the following web page, we came across a strange issue whereby the page title was not being extracted accurately:
> http://www.samsung.com/us/support/faq/FAQ00052677/61239/SM-C105AZWAATT
> This HTML page is not well formed and contains two title tags (one in head and one in body):
> e.g. "Simple Content
>
> TitleToIgnore"
> In this case, a simple fix to HtmlHandler could ensure that once the title value has been set in metadata, it is not overridden when another title tag is subsequently found.
> I am submitting a fix for this issue as a patch for review (against 1.5), hoping that it can be committed to latest.
> Could you please review and update?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
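
For readers of the archive, the "first title wins" behaviour described in the issue can be sketched roughly as below. This is not the attached TIKA-1303.patch and does not use Tika's HtmlHandler or Metadata classes. The class name FirstTitleWins, the plain Map standing in for metadata, and the sample markup are all hypothetical. The JDK's SAX parser is used only to keep the sketch self-contained, so the sample is well-formed markup rather than the original page.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for an HTML content handler; not Tika code.
final class FirstTitleWins extends DefaultHandler {
    // Plain map used here in place of a real metadata object (assumption).
    private final Map<String, String> metadata = new HashMap<>();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = true;
            title.setLength(0); // start collecting text for this title element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            title.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = false;
            // Store the title only if none has been stored yet,
            // so a later title element cannot overwrite the first one.
            metadata.putIfAbsent("title", title.toString().trim());
        }
    }

    // Parses the given markup and returns the first title found, or null.
    static String extractTitle(String markup) {
        try {
            FirstTitleWins handler = new FirstTitleWins();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), handler);
            return handler.metadata.get("title");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample with one title in head and one in body.
        String doc = "<html><head><title>Simple Content</title></head>"
                + "<body><title>TitleToIgnore</title></body></html>";
        System.out.println(extractTitle(doc)); // prints "Simple Content"
    }
}
```

The only design point is the putIfAbsent call: the second title element is still read, but its text is discarded because a value is already present.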