lucene-dev mailing list archives

From "Erick Erickson (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-10781) Remove innerText of <Script> and <Style> if present inside <Body> during indexing using DATA_WEB_MODE
Date Thu, 01 Jun 2017 04:00:07 GMT

    [ https://issues.apache.org/jira/browse/SOLR-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032405#comment-16032405 ]

Erick Erickson edited comment on SOLR-10781 at 6/1/17 3:59 AM:
---------------------------------------------------------------

Is this a Tika issue, or possibly config options for Tika? Solr just indexes what you give
it. The next user might _want_ this stuff indexed so making Solr do this automagically seems
wrong.

P.S. Please raise issues like this on the user's list first. If consensus is reached that
this is something that should go in Solr, _then_ raise a JIRA.


was (Author: erickerickson):
Is this a Tika issue, or possibly config options for Tika? Solr just indexes what you give
it. The next user might _want_ this stuff indexed so making Solr do this automagically seems
wrong.

> Remove innerText of <Script> and <Style> if present inside <Body> during indexing using DATA_WEB_MODE
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-10781
>                 URL: https://issues.apache.org/jira/browse/SOLR-10781
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - Solr Cell (Tika extraction), SimplePostTool
>    Affects Versions: 6.5.1
>         Environment: Indexing websites using URL, recursive and depth (i.e. in DATA_WEB_MODE)
>            Reporter: Jayesh Shende
>            Priority: Minor
>              Labels: beginner, easyfix, starter
>             Fix For: 6.6
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> When indexing is done using SimplePostTool, SolrJ, or any other means, with a URL as the
> data source, and the fetched HTML page contains <script> and <style> tags inside the <body>
> tag (not in the <head> tag), then after posting the document to a Solr collection using the
> "sample_techproducts_configs" configuration, the innerText (i.e. ECMAScript/JS scripts and
> CSS styles) remains part of the document text inside the "content"/"text" field of the
> indexed documents.
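
One way to act on the comment above without changing Solr is to filter the markup on the client side before the text is posted. The following is only a minimal sketch, assuming Python's standard html.parser module; the names VisibleTextExtractor and visible_text are hypothetical and are not part of Solr, Tika, or SimplePostTool.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside <script>/<style>.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML string without script and style content."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The string returned by visible_text(page_html) could then be sent to the "content"/"text" field in place of the raw extraction. Users who do want scripts and styles indexed would simply skip this step, which keeps the choice outside Solr.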



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

