lucene-solr-user mailing list archives

From Simon Blandford <simon.blandf...@bkconnect.net>
Subject Re: Metadata and HTML ending up in searchable text
Date Tue, 31 May 2016 07:38:03 GMT
Hi Alex,

That sounds similar. I am puzzled by what I am seeing because it looks 
like a major bug and I am following the docs for curl as closely as 
possible, but hardly anyone else seems to have noticed it. To me it is a 
show-stopper.

If I convert the docs to txt with html2text first then I can sort-of 
live with the results, although I'd rather not have the metadata in the 
document, but at least the main text body doesn't have tag content in 
it, as it does with HTML source.
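
As a rough illustration of that pre-conversion step, here is a minimal Python sketch using only the standard library. It is not the html2text tool itself and not anything Solr or Tika provides; the class and function names are made up for the example. It keeps visible text and drops tags, attribute values (so CSS class names never reach the output) and the bodies of script and style elements.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Attribute values (class names, hrefs) are deliberately never collected.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting plain text could then be posted to /update/extract in place of the HTML file, as in the txt test case quoted below. Metadata would presumably still end up in the body, since that happens for txt input as well.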

I just want to make sure I'm not missing something really obvious before 
submitting a bug report.

Regards,
Simon


On 27/05/16 20:22, Alexandre Rafalovitch wrote:
> I think Solr's layer above Tika was merging in metadata and text all
> together without a way (that I could see) to separate them.
>
> That's all I remember of my examination of this issue when I ran into
> something similar. Not very helpful, I know.
>
> Regards,
>     Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 27 May 2016 at 23:48, Simon Blandford <simon.blandford@bkconnect.net> wrote:
>> Hi Timothy,
>>
>> Thanks for responding.
>>
>> java -jar tika-app-1.13.jar -t
>> "/home/user/Documents/library/UsingMailingLists.txt"
>> ...gives a clean result with no CSS or other nasties in the output. So it
>> looks like the latest version of tika itself is OK.
>>
>> I was basing the test case on this doc page as closely as possible,
>> including the prefix and content mapping.
>> https://wiki.apache.org/solr/ExtractingRequestHandler
>>
>> From the same page, extractFormat=text only applies when extractOnly is
>> true, which just shows the output from Tika without indexing the document.
>> Running it in "extractOnly" mode results in XML output. The difference
>> between selecting the "text" or "xml" format is that the escaped document in the
>> <response> tag is either the original HTML (xml mode) or stripped HTML (text
>> mode). Some JavaScript seems to creep into the text version (see below).
>>
>> Regards,
>> Simon
>>
>> XML mode sample (original HTML, escaped):
>> <?xml version="1.0" encoding="UTF-8"?>
>> <response>
>> <lst name="responseHeader"><int name="status">0</int><int
>> name="QTime">51</int></lst><str name="UsingMailingLists.html">&lt;?xml
>> version="1.0" encoding="UTF-8"?&gt;
>> &lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
>> &lt;head&gt;
>> &lt;link
>>              rel="stylesheet" type="text/css" charset="utf-8" media="all"
>> href="/wiki/modernized/css/common.css"/&gt;
>>          &lt;link rel="stylesheet" type="text/css" charset="utf-8"
>>              media="screen" href="/wiki/modernized/css/screen.css"/&gt;
>>          &lt;link rel="stylesheet" type="text/css" charset="utf-8"
>>              media="print" href="/wiki/modernized/css/print.css"/&gt;.......
>>
>> TEXT mode (Blank lines stripped):
>> <response>
>> <lst name="responseHeader"><int name="status">0</int><int
>> name="QTime">47</int></lst><str name="UsingMailingLists.html">
>> UsingMailingLists - Solr Wiki
>> Search:
>> &lt;!--// Initialize search form
>> var f = document.getElementById('searchform');
>> f.getElementsByTagName('label')[0].style.display = 'none';
>> var e = document.getElementById('searchinput');
>> searchChange(e);
>> searchBlur(e);
>> //--&gt;
>> Solr Wiki
>> Login
>>
>>
>>
>>
>>
>>
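
For comparing the two formats programmatically rather than by eye, a minimal Python sketch follows. The function names are hypothetical; the extractOnly and extractFormat parameters and the response layout (a <str> element named after the uploaded file) are taken from the message above. The file upload itself (the multipart POST that curl -F performs) is not shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def extract_only_url(base_url, core, extract_format="text"):
    """Build an /update/extract URL that returns Tika's output without indexing."""
    params = {"extractOnly": "true", "extractFormat": extract_format}
    return f"{base_url}/solr/{core}/update/extract?{urlencode(params)}"


def extracted_body(response_xml, stream_name):
    """Return the unescaped document text from an extractOnly XML response.

    Assumes the layout shown in the samples above: a top-level <str>
    element whose name attribute is the uploaded file name.
    """
    root = ET.fromstring(response_xml)
    for element in root.findall("str"):
        if element.get("name") == stream_name:
            return element.text or ""
    raise KeyError(stream_name)
```

The XML parser undoes the &lt; and &gt; escaping, so the returned string is what Tika actually produced, which makes leftover script fragments easy to search for.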
>> On 27/05/16 13:31, Allison, Timothy B. wrote:
>>> I'm only minimally familiar with Solr Cell, but...
>>>
>>> 1) It looks like you aren't setting extractFormat=text.  According to
>>> [0]...the default is xhtml which will include a bunch of the metadata.
>>> 2) is there an attr_* dynamic field in your index with type="ignored"?
>>> This would strip out the attr_ fields so they wouldn't even be indexed...if
>>> you don't want them.
>>>
>>> As for the HTML file, it looks like Tika is failing to strip out the style
>>> section.  Try running the file alone with tika-app: java -jar tika-app.jar
>>> -t inputfile.html.  If you find the noise there, please open an
>>> issue on our JIRA: https://issues.apache.org/jira/browse/tika
>>>
>>>
>>> [0]
>>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>>>
>>>
>>> -----Original Message-----
>>> From: Simon Blandford [mailto:simon.blandford@bkconnect.net]
>>> Sent: Thursday, May 26, 2016 9:49 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Metadata and HTML ending up in searchable text
>>>
>>> Hi,
>>>
>>> I am using Solr 6.0 on Ubuntu 14.04.
>>>
>>> I am ending up with loads of junk in the text body.
>>>
>>> The JSON output of a search result shows the indexed text starting
>>> with...
>>> body_txt_en: " stream_size 36499 X-Parsed-By
>>> org.apache.tika.parser.DefaultParser X-Parsed-By...."
>>>
>>> And then once it gets to the actual text I get CSS class names appearing
>>> that were in <p> or <div> tags etc.
>>> e.g. "....the power of calibre3 silence calibre2 and....", where
>>> "calibre3" etc are the CSS class names.
>>>
>>> All this junk is searchable and is polluting the index.
>>>
>>> I would like to index _only_ the actual content I am interested in
>>> searching for.
>>>
>>> Steps to reproduce:
>>>
>>> 1) Solr installed by untarring the Solr tgz in /opt.
>>>
>>> 2) Core created by typing "bin/solr create -c mycore"
>>>
>>> 3) Solr started with bin/solr start
>>>
>>> 4) TXT document indexed using the following command:
>>> curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
>>>   -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"
>>>
>>> 5) HTML document indexed using the following command:
>>> curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
>>>   -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"
>>>
>>> 6) Query using URL:
>>> http://localhost:8983/solr/mycore/select?q=especially&wt=json
>>>
>>> Result:
>>>
>>> For the txt file, I get the following JSON for the document...
>>>
>>> {
>>>        id: "doc1",
>>>        attr_stream_size: [
>>>            "8107"
>>>        ],
>>>        attr_x_parsed_by: [
>>>            "org.apache.tika.parser.DefaultParser",
>>>            "org.apache.tika.parser.txt.TXTParser"
>>>        ],
>>>        attr_stream_content_type: [
>>>            "text/plain"
>>>        ],
>>>        attr_stream_name: [
>>>            "UsingMailingLists.txt"
>>>        ],
>>>        attr_stream_source_info: [
>>>            "content/UsingMailingLists.txt"
>>>        ],
>>>        attr_content_encoding: [
>>>            "ISO-8859-1"
>>>        ],
>>>        attr_content_type: [
>>>            "text/plain; charset=ISO-8859-1"
>>>        ],
>>>        body_txt_en: " stream_size 8107 X-Parsed-By
>>> org.apache.tika.parser.DefaultParser X-Parsed-By
>>> org.apache.tika.parser.txt.TXTParser stream_content_type text/plain
>>> stream_name UsingMailingLists.txt stream_source_info
>>> content/UsingMailingLists.txt Content-Encoding ISO-8859-1 Content-Type
>>> text/plain; charset=ISO-8859-1 Search: [value ] [Titles] [Text] Solr_Wiki
>>> Login ****** UsingMailingLists ****** * FrontPage * RecentChanges...etc",
>>>        _version_: 1535398235801124900
>>> }
>>>
>>> For the HTML file,  I get the following JSON for the document...
>>>
>>> {
>>>        id: "doc2",
>>>        attr_stream_size: [
>>>            "20440"
>>>        ],
>>>        attr_x_parsed_by: [
>>>            "org.apache.tika.parser.DefaultParser",
>>>            "org.apache.tika.parser.html.HtmlParser"
>>>        ],
>>>        attr_stream_content_type: [
>>>            "text/html"
>>>        ],
>>>        attr_stream_name: [
>>>            "UsingMailingLists.html"
>>>        ],
>>>        attr_stream_source_info: [
>>>            "content/UsingMailingLists.html"
>>>        ],
>>>        attr_dc_title: [
>>>            "UsingMailingLists - Solr Wiki"
>>>        ],
>>>        attr_content_encoding: [
>>>            "UTF-8"
>>>        ],
>>>        attr_robots: [
>>>            "index,nofollow"
>>>        ],
>>>        attr_title: [
>>>            "UsingMailingLists - Solr Wiki"
>>>        ],
>>>        attr_content_type: [
>>>            "text/html; charset=utf-8"
>>>        ],
>>>        body_txt_en: " stylesheet text/css utf-8 all
>>> /wiki/modernized/css/common.css stylesheet text/css utf-8 screen
>>> /wiki/modernized/css/screen.css stylesheet text/css utf-8 print
>>> /wiki/modernized/css/print.css stylesheet text/css utf-8 projection
>>> /wiki/modernized/css/projection.css alternate Solr Wiki:
>>> UsingMailingLists
>>>
>>> /solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1
>>> application/rss+xml Start /solr/FrontPage Alternate Wiki Markup
>>> /solr/UsingMailingLists?action=raw Alternate print Print View
>>> /solr/UsingMailingLists?action=print Search /solr/FindPage Index
>>> /solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting
>>> stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser
>>> X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type
>>> text/html stream_name UsingMailingLists.html stream_source_info...etc",
>>>        _version_: 1535398408383103000
>>> }
>>>
>>>
>>>
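
To check a result document for this kind of pollution without reading the body by eye, a small Python sketch follows. The function name is hypothetical, and it assumes the document shape shown in the JSON above: attr_ fields holding lists of values and the body in body_txt_en. It only does substring matching, so a short value such as a stream size could match ordinary text by coincidence.

```python
def leaked_metadata(doc, body_field="body_txt_en", prefix="attr_"):
    """Return names of prefixed metadata fields whose values occur in the body."""
    # Collapse runs of whitespace so wrapped lines do not hide a match.
    body = " ".join(doc.get(body_field, "").split())
    leaked = []
    for name, values in doc.items():
        if not name.startswith(prefix):
            continue
        # Each metadata field is assumed to be a list, as in the JSON above.
        if any(str(value) in body for value in values):
            leaked.append(name)
    return sorted(leaked)
```

An empty list for every document would suggest that the body field holds only content, which is the outcome being asked for in this thread.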

