manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: SOLR
Date Tue, 15 Mar 2011 07:42:21 GMT
It is hard to tell what you are seeing here because you need to also
mention where you are seeing it.  But it is unlikely to be a result of
the way the POST is being done within the Solr Connector; that
connector does not perform any XML encoding, so that is not what is
failing.  As I think you have discovered, it sounds like the problem
is that somewhere deep in Solr something is going wrong and a 500
error is being returned with non-XML contents.  The Solr Connector
attempts to parse the response as XML and fails.  I've looked at the
code; when this happens, a stack trace is dumped to stdout (which is
not very helpful but is better than nothing).  Ideally, the connector
should dump the response into the log (as part of a warning), and also
write the raw response into the history (as part of the results of the
indexing attempt).  That way you would be able to see the actual error in
the crawler UI by getting a simple history.  I've opened a new ticket
(CONNECTORS-168) to capture this work.
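
As a rough illustration of the behaviour proposed above, the sketch below
tries to parse a response body as XML and, if that fails, keeps the raw text
so it can be logged or written to the history.  The class and method names
(ResponseCheck, describeResponse) are hypothetical and are not taken from the
ManifoldCF source; this is only one way the idea might look.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical helper for illustration only; not part of ManifoldCF.
public class ResponseCheck {

  /**
   * Returns "OK: <root element name>" when the body is well-formed XML,
   * otherwise a message that carries the HTTP status and the raw body.
   */
  public static String describeResponse(int httpStatus, String body) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // Rethrow parse problems instead of letting the parser print
      // "[Fatal Error] ..." to stderr.
      builder.setErrorHandler(new ErrorHandler() {
        public void warning(SAXParseException e) {}
        public void error(SAXParseException e) throws SAXParseException { throw e; }
        public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
      });
      Document doc = builder.parse(
          new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
      return "OK: " + doc.getDocumentElement().getNodeName();
    } catch (Exception e) {
      // Not XML (for example an HTML error page): keep the raw response
      // so it can go into a log warning or the history record.
      return "HTTP " + httpStatus + ", unparseable response: " + body;
    }
  }
}
```

With something along these lines, an HTML error page containing an unclosed
<HR> would show up verbatim in the returned message rather than only as a
parser stack trace.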

Other than that, I would hazard that there is nothing actually
wrong with the Solr connector at this time.  There is an
outstanding Jira ticket to port it to SolrJ (CONNECTORS-19), but based
on how unreliable Solr has been of late maybe that's not such a great
idea at the moment.  It's certainly in wide use at this time and
people have not found an actual problem with it.


Thanks,
Karl



On Mon, Mar 14, 2011 at 10:49 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>
> I just noticed:
> Currently, default for ManifoldCF is /update/extract, which corresponds to
> SOLR Cell request handler.
>
> So...
> It is EXTREMELY generic...
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
> What happens is: we submit "field" which is HTML snippet (inside RSS), and
> if that snippet is malformed... SOLR responds with error message such as
> this:
> <u>Unexpected character '-' (code 45) in external DTD subset; expected
> closing '&gt;' after ENTITY declaration  at [row,col,system-id]:
> [81,5,&quot;http://www.w3.org/TR/html4/strict.dtd&quot;]
>  from [row,col {unknown-source}]: [1,1]</u></p><p><b>description</b> <u>The
> request sent by the client was syntactically incorrect (Unexpected
> character '-' (code 45) in external DTD subset; expected closing '&gt;' after
> ENTITY declaration  at [row,col,system-id]:
> [81,5,&quot;http://www.w3.org/TR/html4/strict.dtd&quot;]
>
> And, SOLR response is malformed too, so that we have
> [Fatal Error] :7:112: The element type "HR" must be terminated by the
> matching end-tag "</HR>".
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
> error: The element type "HR" must be terminated by the matching end-tag
> "</HR>"
>
>
> two exceptions:
> 1. at SOLR because of malformed HTML such as
> <my_rss_field>&gt;bold&lt;BOLD&gt/body&lt;</my_rss_field>
> 2. at ManifoldCF, because SOLR response is malformed
>
>
> Using SOLR Cell for RSS feeds... we probably need a few types of SOLR
> Connectors, or a single type (but configurable); and it's much easier with
> a SOLRJ client... including troubleshooting... otherwise we should have unit
> tests for void writeField(OutputStream out, String fieldName, String
> fieldValue), etc.
>
>
> I want to write a new "connector" for my task, based on SOLRJ...
>
>
> -Fuad
>
>
>
>
>
> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: March-14-11 10:34 PM
> To: connectors-user@incubator.apache.org
> Subject: RE: SOLR
>
>
> It's not trunk version; I use (different) trunk versions in few production
> sites... in SOLR, path "/update" is defined in solrconfig.xml (and usually
> user will copy it from "example" schema and "may be" modify):
>
>  <requestHandler name="/update"
>                  class="solr.XmlUpdateRequestHandler">
>
>
> And which kind of "update" handler does ManifoldCF expect?!
>
> That's why I suggest using the SOLRJ API instead... I noticed a lot of
> low-level coding...
>
>
>
> What kind of SOLR protocol is expected? It is definitely not POST of XML
> content:
>
>
>  /** Write a field */
>  protected static void writeField(OutputStream out, String fieldName, String fieldValue)
>    throws IOException
>  {
>    writePreamble(out);
>    writeBoundary(out,"text/plain; charset=UTF-8",fieldName,null);
>
>    byte[] tmp = fieldValue.getBytes("UTF-8");
>    out.write(tmp, 0, tmp.length);
>    writePostamble(out);
>  }
>
>
>
> Do you expect "binary" handler on SOLR?
>  <!-- Binary Update Request Handler
>       http://wiki.apache.org/solr/javabin
>    -->
>  <requestHandler name="/update/javabin"
>                  class="solr.BinaryUpdateRequestHandler" />
>
>
>
>
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: March-14-11 7:58 PM
> To: connectors-user@incubator.apache.org
> Subject: Re: SOLR
>
> The trunk version of Solr may have changed around how the extracting update
> request handler works.  It changes daily, so there is no way I can keep up
> with it.  Maybe it would be better to go back and use a known quantity.
>
> Thanks,
> Karl
>
>
> On Mon, Mar 14, 2011 at 6:24 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>>
>> Default settings for ManifoldCF: /update/extract
>> http://localhost:8080/solr/update/extract?commit=true
>>
>> And using browser, I see SOLR responds with malformed HTML containing
>> non-closing <HR>...
>>
>> Fix:
>> Update handler:  /update
>>
>>
>> -Fuad
>>
>>
>> -----Original Message-----
>> From: Fuad Efendi [mailto:fuad@efendi.ca]
>> Sent: March-14-11 6:17 PM
>> To: connectors-user@incubator.apache.org
>> Subject: RE: SOLR
>>
>> Hi Karl,
>>
>> I verified (via browser),
>> http://localhost:8080/solr/update?commit=true
>>
>> And response from SOLR:
>> <?xml version="1.0" encoding="UTF-8"?> <response> <lst
>> name="responseHeader"><int name="status">0</int><int
>> name="QTime">15</int></lst> </response>
>>
>> The problem root is
>> org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>>
>>
>> Everything is fine except I can't understand why we have "HR" from
>> SOLR, do we have any multithreading issues? I believe I connect to
>> SOLR, port 8080 is configured via console... may be somewhere else?
>>
>> I believe default setting for "Update handler:" at Connector screen is
>> incorrect, it is /update/extract
>>
>>
>>
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:daddywri@gmail.com]
>> Sent: March-14-11 6:00 PM
>> To: connectors-user@incubator.apache.org
>> Subject: Re: SOLR
>>
>> This is because your solr setup is incorrect.  The post to "solr" is
>> returning HTML, not XML, so you are not actually communicating with
>> Solr at all.
>>
>> In order for the Solr connector to work, you need to have the solr
>> extracting update request handler present and configured.  I am told
>> that the latest release of Solr makes the jar with this code optional
>> - it's a contrib jar that you have to separately download.  If you are
>> building solr off of trunk, then this should not be a problem.
>>
>> Karl
>>
>> On Mon, Mar 14, 2011 at 5:40 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>>> This exception, XML contains encoded HTML, and it doesn't happen with
>>> standard Java 6 StAX parser:
>>>
>>> [Fatal Error] :124:120: The element type "HR" must be terminated by the matching end-tag "</HR>".
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: The element type "HR" must be terminated by the matching end-tag "</HR>".
>>>        at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>>        at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>>        at org.apache.manifoldcf.agents.output.solr.HttpPoster.getResponse(HttpPoster.java:619)
>>>        at org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>>> Caused by: org.xml.sax.SAXParseException: The element type "HR" must be terminated by the matching end-tag "</HR>".
>>>        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>>        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>>>        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
>>>        at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:365)
>>>        ... 3 more
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Fuad Efendi [mailto:fuad@efendi.ca]
>>> Sent: March-14-11 5:37 PM
>>> To: connectors-user@incubator.apache.org
>>> Subject: RE: SOLR
>>>
>>> Thank you very much Karl,
>>>
>>> And I have first problem,
>>> Starting crawler...
>>> [Fatal Error] :124:120: The element type "HR" must be terminated by the matching end-tag "</HR>".
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: The element type "HR" must be terminated by the matching end-tag "</HR>".
>>>        at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>>        at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>>
>>> I am using the RSS connector to crawl specific XML (containing
>>> XML-encoded &gt;HR&lt; and other HTML tags). It doesn't happen with the
>>> standard StAX parser (Java 6)...
>>>
>>>
>>> Regarding (2), do you mean this interface method?
>>>  /** View specification.
>>>  * This method is called in the body section of a job's view page.  Its purpose is to present the output specification information to the user.
>>>  * The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.
>>>  *@param out is the output to which any HTML should be sent.
>>>  *@param os is the current output specification for this job.
>>>  */
>>>  public void viewSpecification(IHTTPOutput out, OutputSpecification os)
>>>    throws ManifoldCFException, IOException
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:daddywri@gmail.com]
>>> Sent: March-14-11 5:21 PM
>>> To: connectors-user@incubator.apache.org
>>> Subject: Re: SOLR
>>>
>>> Hi Fuad,
>>>
>>> (1) "Arguments" are indeed optional key/value pairs, which are sent
>>> to solr as part of the URL.
>>> (2) ManifoldCF presents tabs for a job of three kinds: (a) tabs that
>>> all jobs have; (b) tabs related to the repository connector's
>>> management of the document specification information; and (c) tabs
>>> related to the output connector's output specification information.
>>> The Solr output connector's output specification information includes
>>> the metadata to solr mapping, so those tabs come from the Solr connector.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Mar 14, 2011 at 4:51 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>>>> Hi, any sample of how to use SOLR connector?
>>>>
>>>> http://incubator.apache.org/connectors/end-user-documentation.html#solroutputconnector
>>>>
>>>>
>>>>
>>>> Some questions:
>>>>
>>>>
>>>>
>>>> 1.       Argument. Is it optional key=value pairs which can be sent
>>>> to SOLR as part of HTTP GET/POST request?
>>>>
>>>> 2.       I see code for “Connector”, and I see how to configure SOLR
>>>> Output Connection. But how “Job” happens to know about <metadata> to
>>>> <solr> mapping, is it generic (without dependency on SOLR)?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Fuad
>>>
>>>
>>
>>
>
>
