manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: RSS Connector
Date Mon, 03 Jun 2013 15:12:10 GMT
I've created CONNECTORS-700 for the date parsing issue.

Karl
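
For readers following the thread, here is a minimal sketch of ISO 8601 parsing that accepts a numeric offset such as -04:00 as well as Z. It is an illustration only: it assumes Java 8 or later (java.time, which postdates this thread), it is not the connector's parser or the CONNECTORS-700 fix, and the class and method names are hypothetical. The sample timestamps are the "published" and "updated" values from the feed excerpt quoted below.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of ManifoldCF.
final class Iso8601OffsetSketch {

    /**
     * Parses an ISO 8601 timestamp carrying either "Z" or a numeric
     * offset of the form +HH:MM / -HH:MM, and returns milliseconds
     * since the epoch, or null if the text cannot be parsed.
     */
    public static Long toEpochMillis(String text) {
        if (text == null) {
            return null;
        }
        try {
            // OffsetDateTime.parse uses ISO_OFFSET_DATE_TIME by default.
            return OffsetDateTime.parse(text.trim()).toInstant().toEpochMilli();
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Values taken from the <published> and <updated> elements in the thread.
        System.out.println(toEpochMillis("2013-05-21T18:23:00.000-04:00"));
        System.out.println(toEpochMillis("2013-05-21T18:23:06.451-04:00"));
    }
}
```

Note that this default formatter expects the offset with a colon (-04:00, as the feed actually sends it); an offset written as -0400 would need a custom formatter.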



On Mon, Jun 3, 2013 at 11:04 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Stephane,
>
>
> First, you would not want to select to get dechromed content from the feed
> description field if there is no feed description field.  (In that case, by
> default the connector falls back to using the actual content from the document
> link.)
>
> Second, for this kind of feed, the connector looks for either "published"
> or "updated" and takes the latter of the two if both are found.  However,
> the ISO8601 date parser we are using is not happy with any timezone other
> than Z (zulu) at this time, but your dates have -0400 instead, and that is
> the problem.  I'll create a ticket to deal with that issue.
>
> Karl
>
>
>
> On Mon, Jun 3, 2013 at 10:48 AM, Stephane Gamard <stephane@gamard.net> wrote:
>
>> Hi Karl,
>>
>>
>> Thank you for the prompt reply. Agreed on #1, url is perfectly fine and
>> well used :). As for #2, I am still puzzled about the following. Here's an
>> excerpt from  the feed xml:
>>
>>
>>  <entry>
>>
>> <id>tag:blogger.com,1999:blog-8623074010562846957.post-6579597884362535238</id>
>>
>> <published>2013-05-21T18:23:00.000-04:00</published>
>>
>> <updated>2013-05-21T18:23:06.451-04:00</updated>
>>
>> <category scheme="http://www.blogger.com/atom/ns#" term="Lucene"/>
>>
>> <title type="text">Dynamic faceting with Lucene</title>
>>
>> <content type="html">Lucene's [...] Happy faceting!</content>
>>
>> <link rel="replies" type="application/atom+xml" href="http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default" title="Post Comments"/>
>>
>> <link rel="replies" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html#comment-form" title="0 Comments"/>
>>
>> <link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
>>
>> <link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
>>
>> <link rel="alternate" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html" title="Dynamic faceting with Lucene"/>
>>
>> <author>
>>
>> <name>Michael McCandless</name>
>>
>> <uri>https://plus.google.com/112759599082866346694</uri>
>>
>> <email>noreply@blogger.com</email>
>>
>> <gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg"/>
>>
>> </author>
>>
>> <thr:total>0</thr:total>
>>
>> </entry>
>>
>>
>> Below is the document once ingested in Solr (searched with query:
>> http://localhost:8983/lucene/select?q=id:http%3A%2F%2Fblog.mikemccandless.com%2F2013%2F05%2Fdynamic-faceting-with-lucene.html&fl=*).
>> Note that I use a catch all field (<dynamicField name="*"  type="string"
>>  indexed="true"  multiValued="true" stored="true" omitNorms="true"/>) to
>> save all submitted fields.
>>
>>
>> I have two questions that I don't understand:
>>
>> - I've selected the option "Dechromed content, if present, in
>> 'description' field"  and yet I have no description field
>>
>> - I have no pubDate or publication date field available
>>
>>
>> Here's the attached Solr output:
>>
>>
>> <response>
>> <lst name="responseHeader">
>> <int name="status">0</int>
>> <int name="QTime">1</int>
>> <lst name="params">
>> <str name="fl">*</str>
>> <str name="q">
>> id:
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> </lst>
>> </lst>
>> <result name="response" numFound="1" start="0">
>> <doc>
>> <arr name="link">
>> <str>http://blog.mikemccandless.com/favicon.ico</str>
>> <str>icon</str>
>> <str>image/x-icon</str>
>> <str>
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> <str>canonical</str>
>> <str>alternate</str>
>> <str>application/atom+xml</str>
>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>> <str>alternate</str>
>> <str>application/rss+xml</str>
>> <str>
>> http://blog.mikemccandless.com/feeds/posts/default?alt=rss
>> </str>
>> <str>service.post</str>
>> <str>application/atom+xml</str>
>> <str>
>> http://www.blogger.com/feeds/8623074010562846957/posts/default
>> </str>
>> <str>EditURI</str>
>> <str>application/rsd+xml</str>
>> <str>
>> http://www.blogger.com/rsd.g?blogID=8623074010562846957
>> </str>
>> <str>alternate</str>
>> <str>application/atom+xml</str>
>> <str>
>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>> </str>
>> <str>https://plus.google.com/112759599082866346694</str>
>> <str>publisher</str>
>> <str>text/css</str>
>> <str>stylesheet</str>
>> <str>
>> //www.blogger.com/static/v1/widgets/2159474849-widget_css_2_bundle.css
>> </str>
>> <str>text/css</str>
>> <str>stylesheet</str>
>> <str>
>> //
>> www.blogger.com/dyn-css/authorization.css?targetBlogID=8623074010562846957&zx=93c35911-ffbb-4abb-ba82-d88c30b4b1b8
>> </str>
>> </arr>
>> <arr name="meta">
>> <str>viewport</str>
>> <str>width=1100</str>
>> <str>stream_source_info</str>
>> <str>docname</str>
>> <str>stream_content_type</str>
>> <str>text/html; charset=UTF-8</str>
>> <str>stream_size</str>
>> <str>80779</str>
>> <str>Content-Encoding</str>
>> <str>UTF-8</str>
>> <str>stream_name</str>
>> <str>docname</str>
>> <str>generator</str>
>> <str>blogger</str>
>> <str>MSSmartTagsPreventParsing</str>
>> <str>true</str>
>> <str>Content-Type</str>
>> <str>text/html; charset=UTF-8</str>
>> <str>resourceName</str>
>> <str>docname</str>
>> <str>dc:title</str>
>> <str>Changing Bits: Dynamic faceting with Lucene</str>
>> </arr>
>> <arr name="false">
>> <str>rect</str>
>> <str>http://blog.mikemccandless.com/</str>
>> <str>rect</str>
>> <str>6579597884362535238</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
>> </str>
>> <str>rect</str>
>> <str>http://jirasearch.mikemccandless.com</str>
>> <str>rect</str>
>> <str>
>> http://www.elasticsearch.org/guide/reference/api/search/facets/
>> </str>
>> <str>rect</str>
>> <str>http://wiki.apache.org/solr/SolrFacetingOverview</str>
>> <str>rect</str>
>> <str>https://issues.apache.org/jira/browse/LUCENE-4795</str>
>> <str>rect</str>
>> <str>https://issues.apache.org/jira/browse/LUCENE-4965</str>
>> <str>rect</str>
>> <str>http://en.wikipedia.org/wiki/Interval_tree</str>
>> <str>rect</str>
>> <str>http://jirasearch.mikemccandless.com</str>
>> <str>rect</str>
>> <str>https://plus.google.com/112759599082866346694</str>
>> <str>author</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> <str>bookmark</str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/email-post.g?blogID=8623074010562846957&postID=6579597884362535238
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/post-edit.g?blogID=8623074010562846957&postID=6579597884362535238&from=pencil
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=email
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=blog
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=twitter
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=facebook
>> </str>
>> <str>rect</str>
>> <str>http://blog.mikemccandless.com/search/label/Lucene</str>
>> <str>tag</str>
>> <str>rect</str>
>> <str>comments</str>
>> <str>rect</str>
>> <str>comment-form</str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/comment-iframe.g?blogID=8623074010562846957&postID=6579597884362535238
>> </str>
>> <str>rect</str>
>> <str>links</str>
>> <str>rect</str>
>> <str/>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
>> </str>
>> <str>rect</str>
>> <str>http://blog.mikemccandless.com/</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>> </str>
>> <str>application/atom+xml</str>
>> <str>rect</str>
>> <str>
>>
>> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>> <str>rect</str>
>> <str>
>>
>> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>> </str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Subscribe&widgetId=Subscribe1&action=editWidget&sectionId=sidebar-right-1
>> </str>
>> <str>rect</str>
>> <str>https://plus.google.com/112759599082866346694</str>
>> <str>rect</str>
>> <str>https://plus.google.com/112759599082866346694</str>
>> <str>author</str>
>> <str>rect</str>
>> <str>https://plus.google.com/112759599082866346694</str>
>> <str>author</str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Profile&widgetId=Profile1&action=editWidget&sectionId=sidebar-right-1
>> </str>
>> <str>rect</str>
>> <str>
>> http://affiliate.manning.com/idevaffiliate.php?id=1171_147
>> </str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Image&widgetId=Image1&action=editWidget&sectionId=sidebar-right-1
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/search?updated-min=2013-01-01T00:00:00-05:00&updated-max=2014-01-01T00:00:00-05:00&max-results=5
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013_05_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013_02_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013_01_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/search?updated-min=2012-01-01T00:00:00-05:00&updated-max=2013-01-01T00:00:00-05:00&max-results=16
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_12_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_11_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_09_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_08_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_07_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_05_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_04_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_03_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_01_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/search?updated-min=2011-01-01T00:00:00-05:00&updated-max=2012-01-01T00:00:00-05:00&max-results=20
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_11_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_10_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_09_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_06_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_05_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_04_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_03_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_02_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_01_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/search?updated-min=2010-01-01T00:00:00-05:00&updated-max=2011-01-01T00:00:00-05:00&max-results=43
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_12_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_11_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_10_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_09_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_08_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_07_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_06_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_05_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_04_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_03_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_02_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/search?updated-min=2009-01-01T00:00:00-05:00&updated-max=2010-01-01T00:00:00-05:00&max-results=18
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_12_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_11_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_10_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_09_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_08_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_07_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1
>> </str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Followers&widgetId=Followers1&action=editWidget&sectionId=sidebar-right-1
>> </str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=FollowByEmail&widgetId=FollowByEmail1&action=editWidget&sectionId=sidebar-right-3
>> </str>
>> <str>rect</str>
>> <str>http://www.blogger.com</str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3
>> </str>
>> </arr>
>> <arr name="img">
>> <str/>
>> <str>13</str>
>> <str>http://img1.blogblog.com/img/icon18_email.gif</str>
>> <str>18</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img2.blogblog.com/img/icon18_edit_allbkg.gif
>> </str>
>> <str>18</str>
>> <str>
>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>> </str>
>> <str/>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
>> </str>
>> <str/>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
>> </str>
>> <str/>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
>> </str>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str/>
>> <str>
>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>> </str>
>> <str/>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str>
>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>> </str>
>> <str/>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
>> </str>
>> <str/>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
>> </str>
>> <str/>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
>> </str>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str/>
>> <str>
>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>> </str>
>> <str/>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str>My Photo</str>
>> <str>80</str>
>> <str>
>> //
>> lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg
>> </str>
>> <str>80</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str/>
>> <str>187</str>
>> <str>
>>
>> http://1.bp.blogspot.com/-QWxIn-kN_Yg/TZH0g4Vm66I/AAAAAAAAAG0/2jsjFLP9voQ/s250/LuceneInAction2.jpg
>> </str>
>> <str>150</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> </arr>
>> <arr name="iframe">
>> <str>0</str>
>> <str>auto</str>
>> <str>410</str>
>> <str>comment-editor</str>
>> <str/>
>> <str>100%</str>
>> </arr>
>> <str name="filename">docname</str>
>> <str name="mimetype">text/html; charset=UTF-8</str>
>> <arr name="source">
>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>> </arr>
>> <arr name="category">
>> <str>Lucene</str>
>> </arr>
>> <str name="id">
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> <arr name="source_type">
>> <str>rss</str>
>> </arr>
>> <arr name="title">
>> <str>Dynamic faceting with Lucene</str>
>> </arr>
>> <arr name="title_search">
>> <str>Dynamic faceting with Lucene</str>
>> </arr>
>> <arr name="viewport">
>> <str>width=1100</str>
>> </arr>
>> <arr name="stream_source_info">
>> <str>docname</str>
>> </arr>
>> <arr name="stream_content_type">
>> <str>text/html; charset=UTF-8</str>
>> </arr>
>> <arr name="stream_size">
>> <str>80779</str>
>> </arr>
>> <arr name="content_encoding">
>> <str>UTF-8</str>
>> </arr>
>> <arr name="stream_name">
>> <str>docname</str>
>> </arr>
>> <arr name="generator">
>> <str>blogger</str>
>> </arr>
>> <arr name="mssmarttagspreventparsing">
>> <str>true</str>
>> </arr>
>> <arr name="content_type">
>> <str>text/html; charset=UTF-8</str>
>> </arr>
>> <arr name="resourcename">
>> <str>docname</str>
>> </arr>
>> <arr name="dc_title">
>> <str>Changing Bits: Dynamic faceting with Lucene</str>
>> </arr>
>> <arr name="content">
>> <str>
>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>> great improvements recently: sizable (nearly 4X) speedups and new features
>> like DrillSideways . The Jira issues search example showcases a number of
>> facet features. Here I'll describe two recently committed facet features:
>> sorted-set doc-values faceting, already available in 4.3, and dynamic range
>> faceting, coming in the next (4.4) release. To understand these features,
>> and why they are important, we first need a little background. Lucene's
>> facet module does most of its work at indexing time: for each indexed
>> document, it examines every facet label, each of which may be hierarchical,
>> and maps each unique label in the hierarchy to an integer id, and then
>> encodes all ids into a binary doc values field. A separate taxonomy index
>> stores this mapping, and ensures that, even across segments, the same label
>> gets the same id. At search time, faceting cost is minimal: for each
>> matched document, we visit all integer ids and aggregate counts in an
>> array, summarizing the results in the end, for example as top N facet
>> labels by count. This is in contrast to purely dynamic faceting
>> implementations like ElasticSearch 's and Solr 's, which do all work at
>> search time. Such approaches are more flexible: you need not do anything
>> special during indexing, and for every query you can pick and choose
>> exactly which facets to compute. However, the price for that flexibility is
>> slower searching, as each search must do more work for every matched
>> document. Furthermore, the impact on near-real-time reopen latency can be
>> horribly costly if top-level data-structures, such as Solr's
>> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
>> by the facet module means no extra work needs to be done on each
>> near-real-time reopen. Enough background, now on to our two new features!
>> Sorted-set doc-values faceting These features bring two dynamic
>> alternatives to the facet module, both computing facet counts from
>> previously indexed doc-values fields. The first feature, sorted-set
>> doc-values faceting (see LUCENE-4795 ), allows the application to index a
>> normal sorted-set doc-values field, for example: doc.add(new
>> SortedSetDocValuesField("foo")); doc.add(new
>> SortedSetDocValuesField("bar")); and then to compute facet counts at search
>> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
>> This feature does not use the taxonomy index, since all state is stored in
>> the doc-values, but the tradeoff is that on each near-real-time reopen, a
>> top-level data-structure is recomputed to map per-segment integer ordinals
>> to global ordinals. The good news is this should be relatively low cost
>> since it's just merge-sorting already sorted terms, and it doesn't need to
>> visit the documents (unlike UnInvertedField). At search time there is also
>> a small performance hit (~25%, depending on the query) since each
>> per-segment ord must be re-mapped to the global ord space. Likely this
>> could be improved (no time was spend optimizing). Furthermore, this feature
>> currently only works with non-hierarchical facet fields, though this should
>> be fixable (patches welcome!). Dynamic range faceting The second new
>> feature, dynamic range faceting, works on top of a numeric doc-values field
>> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
>> You create a RangeFacetRequest, providing custom ranges with their labels.
>> Each matched document is checked against all ranges and the count is
>> incremented when there is a match. The range-test is a naive simple linear
>> search, which is probably OK since there are usually only a few ranges, but
>> we could eventually upgrade this to an interval tree to get better
>> performance (patches welcome!). Likewise, this new feature does not use the
>> taxonomy index, only a numeric doc-values field. This feature is especially
>> useful with time-based fields. You can see it in action in the Jira issues
>> search example in the Updated field. Happy faceting! Posted by Michael
>> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
>> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
>> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
>> Atom Comments About Me Michael McCandless Michael loves building software;
>> he's been building search engines for more than a decade. In 1999 he
>> co-founded iPhrase Technologies, a startup providing a user-centric
>> enterprise search application, written primarily in Python and C. After IBM
>> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
>> committer in 2006 and PMC member in 2008. Michael has remained an active
>> committer, helping to push Lucene to new places in recent years. He's
>> co-author of Lucene in Action, 2nd edition. In his spare time Michael
>> enjoys building his own computers, writing software to control his house
>> (mostly in Python), encoding videos and tinkering with all sorts of other
>> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
>> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
>> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
>> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
>> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
>> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
>> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
>> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
>> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
>> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
>> template. Powered by Blogger .
>> </str>
>> </arr>
>> <arr name="content_search">
>> <str>
>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>> great improvements recently: sizable (nearly 4X) speedups and new features
>> like DrillSideways . The Jira issues search example showcases a number of
>> facet features. Here I'll describe two recently committed facet features:
>> sorted-set doc-values faceting, already available in 4.3, and dynamic range
>> faceting, coming in the next (4.4) release. To understand these features,
>> and why they are important, we first need a little background. Lucene's
>> facet module does most of its work at indexing time: for each indexed
>> document, it examines every facet label, each of which may be hierarchical,
>> and maps each unique label in the hierarchy to an integer id, and then
>> encodes all ids into a binary doc values field. A separate taxonomy index
>> stores this mapping, and ensures that, even across segments, the same label
>> gets the same id. At search time, faceting cost is minimal: for each
>> matched document, we visit all integer ids and aggregate counts in an
>> array, summarizing the results in the end, for example as top N facet
>> labels by count. This is in contrast to purely dynamic faceting
>> implementations like ElasticSearch 's and Solr 's, which do all work at
>> search time. Such approaches are more flexible: you need not do anything
>> special during indexing, and for every query you can pick and choose
>> exactly which facets to compute. However, the price for that flexibility is
>> slower searching, as each search must do more work for every matched
>> document. Furthermore, the impact on near-real-time reopen latency can be
>> horribly costly if top-level data-structures, such as Solr's
>> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
>> by the facet module means no extra work needs to be done on each
>> near-real-time reopen. Enough background, now on to our two new features!
>> Sorted-set doc-values faceting These features bring two dynamic
>> alternatives to the facet module, both computing facet counts from
>> previously indexed doc-values fields. The first feature, sorted-set
>> doc-values faceting (see LUCENE-4795 ), allows the application to index a
>> normal sorted-set doc-values field, for example: doc.add(new
>> SortedSetDocValuesField("foo")); doc.add(new
>> SortedSetDocValuesField("bar")); and then to compute facet counts at search
>> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
>> This feature does not use the taxonomy index, since all state is stored in
>> the doc-values, but the tradeoff is that on each near-real-time reopen, a
>> top-level data-structure is recomputed to map per-segment integer ordinals
>> to global ordinals. The good news is this should be relatively low cost
>> since it's just merge-sorting already sorted terms, and it doesn't need to
>> visit the documents (unlike UnInvertedField). At search time there is also
>> a small performance hit (~25%, depending on the query) since each
>> per-segment ord must be re-mapped to the global ord space. Likely this
>> could be improved (no time was spend optimizing). Furthermore, this feature
>> currently only works with non-hierarchical facet fields, though this should
>> be fixable (patches welcome!). Dynamic range faceting The second new
>> feature, dynamic range faceting, works on top of a numeric doc-values field
>> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
>> You create a RangeFacetRequest, providing custom ranges with their labels.
>> Each matched document is checked against all ranges and the count is
>> incremented when there is a match. The range-test is a naive simple linear
>> search, which is probably OK since there are usually only a few ranges, but
>> we could eventually upgrade this to an interval tree to get better
>> performance (patches welcome!). Likewise, this new feature does not use the
>> taxonomy index, only a numeric doc-values field. This feature is especially
>> useful with time-based fields. You can see it in action in the Jira issues
>> search example in the Updated field. Happy faceting! Posted by Michael
>> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
>> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
>> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
>> Atom Comments About Me Michael McCandless Michael loves building software;
>> he's been building search engines for more than a decade. In 1999 he
>> co-founded iPhrase Technologies, a startup providing a user-centric
>> enterprise search application, written primarily in Python and C. After IBM
>> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
>> committer in 2006 and PMC member in 2008. Michael has remained an active
>> committer, helping to push Lucene to new places in recent years. He's
>> co-author of Lucene in Action, 2nd edition. In his spare time Michael
>> enjoys building his own computers, writing software to control his house
>> (mostly in Python), encoding videos and tinkering with all sorts of other
>> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
>> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
>> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
>> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
>> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
>> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
>> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
>> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
>> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
>> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
>> template. Powered by Blogger .
>> </str>
>> </arr>
>> <arr name="language">
>> <str>en</str>
>> </arr>
>> <arr name="url">
>> <str>
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> </arr>
>> <arr name="snippet">
>> <str>
>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>> great improvements recently: sizable (nearly 4X) speedups and new features
>> like DrillSideways ....At search time, faceting cost is minimal: for each
>> matched document, we visit all integer ids and aggregate counts in an
>> array, summarizing the results in the end, for example as top N facet
>> labels by count....The range-test is a naive simple linear search, which is
>> probably OK since there are usually only a few ranges, but we could
>> eventually upgrade this to an interval tree to get better performance
>> (patches welcome!)....Share to Twitter Share to Facebook Labels: Lucene No
>> comments: Post a Comment Older Post Home Subscribe to: Post Comments (Atom)
>> Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael
>> McCandless Michael loves building software; he's been building search
>> engines for more than a decade....View my complete profile Blog Archive ▼
>> 2013 (5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with
>> Lucene ►  February (1) ►  January (2) ►  2012 (16) ►  December (2) ►
>> November (1) ►  September (1) ►  August (1) ►  July (3) ►  May (1) ►  April
>> (2) ►  March (3) ►  January (2) ►  2011 (20) ►  November (2) ►  October (3)
>> ►  September (1) ►  June (3) ►  May (2) ►  April (2) ►  March (4) ►
>> February (2) ►  January (1) ►  2010 (43) ►  December (1) ►  November (1) ►
>> October (4) ►  September (4) ►  August (4) ►  July (11) ►  June (7) ►  May
>> (6) ►  April (1) ►  March (1) ►  February (3) ►  2009 (18) ►  December (1)
>> ►  November (1) ►  October (1) ►  September (4) ►  August (6) ►  July (5)
>> Followers Follow by Email Simple template.
>> </str>
>> </arr>
>> <arr name="host">
>> <str>blog.mikemccandless.com</str>
>> </arr>
>> <arr name="path">
>> <str>/2013/05/dynamic-faceting-with-lucene.html</str>
>> </arr>
>> <long name="_version_">1436832383182569472</long>
>> </doc>
>> </result>
>> </response>
>>
>>
>>
>> I can see there is published and updated markup in the feed, and yet
>> neither of those fields (pubDate or publications) is present in the Solr
>> document.
>>
>>
>> On June 3, 2013 at 4:25:51 PM, Karl Wright (daddywri@gmail.com) wrote:
>>
>> Hi Stephane,
>>
>> (1) ManifoldCF always uses the URL of a document as the primary ID when
>> it indexes it.  This is the standard treatment and has been since Day 1.
>>
>> (2) For the "creation date" attribute, the RSS connector uses the date in
>> the feed, if there is one.  This is a date in ISO format, and comes out as
>> the metadata value "pubdateiso".  There is also an attribute called
>> "pubdate", in milliseconds since epoch, which is EITHER the date in the
>> feed (if present) or, if not, the date the document was fetched.
>>
>> As for your other question, "chromed" data comes from the URLs referenced
>> by the items in the feed, and "dechromed" data comes from either the
>> content or description field that's actually in the feed, whichever you
>> specify.
>>
>> All of this is described in the end-user-documentation, although I do
>> notice that "pubdateiso" is missing from the metadata listed.
>>
>>
>> http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#rssrepository
>>
>> Karl
>>
>>
>>
>> On Mon, Jun 3, 2013 at 10:13 AM, Stephane Gamard <stephane@gamard.net> wrote:
>>
>>>
>>> Hi all,
>>>
>>>
>>> I'm trying to use the RSS connector for the following feed:
>>> http://blog.mikemccandless.com/feeds/posts/default
>>>
>>> After setting the job up and ingesting documents I have 2 pending
>>> questions:
>>> - why is the connector using the URL as ID instead of the atom ID tag?
>>> - I have no creation and/or modified date in my Solr document; why is
>>> that?
>>>
>>> Overall I am a bit confused as to where the crawler gets its
>>> information (chromed vs dechromed). I've downloaded the feed and tried to
>>> find its entries in my index but could not (I could only find
>>> pages which are linked from the RSS entries).
>>>
>>> Sorry for the hassle, I'm reading over and over trying to piece it all
>>> together.
>>>
>>> Cheers,
>>>
>>> _Stephane
>>>
>>
>>
>
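
For anyone following the date discussion quoted above, here is a minimal sketch of how the feed's "published" and "updated" values (which carry a -04:00 offset rather than Z) relate to an ISO-formatted value and a milliseconds-since-epoch value. It is not the connector's code: the class FeedDates and its methods are hypothetical names, and it assumes Java 8+ java.time rather than whatever parser ManifoldCF uses. The "take the later of the two, else fall back to fetch time" rule simply follows the description given in this thread.

```java
import java.time.OffsetDateTime;

// Hypothetical illustration only; not part of ManifoldCF.
class FeedDates {
    /** Parses an ISO 8601 date-time with any UTC offset (Z, -04:00, ...) to epoch milliseconds. */
    static long toEpochMillis(String isoDate) {
        return OffsetDateTime.parse(isoDate).toInstant().toEpochMilli();
    }

    /** Renders the same instant normalized to UTC (zulu) in ISO format. */
    static String toUtcIso(String isoDate) {
        return OffsetDateTime.parse(isoDate).toInstant().toString();
    }

    /**
     * Picks the later of published/updated when present; otherwise falls back
     * to the time the document was fetched, as described in the thread.
     */
    static long pubDate(String published, String updated, long fetchTimeMillis) {
        if (published == null && updated == null) return fetchTimeMillis;
        if (published == null) return toEpochMillis(updated);
        if (updated == null) return toEpochMillis(published);
        return Math.max(toEpochMillis(published), toEpochMillis(updated));
    }

    public static void main(String[] args) {
        // Values copied from the feed excerpt earlier in this thread.
        String published = "2013-05-21T18:23:00.000-04:00";
        String updated = "2013-05-21T18:23:06.451-04:00";
        System.out.println(toUtcIso(updated));
        System.out.println(pubDate(published, updated, System.currentTimeMillis()));
    }
}
```

Note that OffsetDateTime.parse expects the offset in the colon form that appears in the feed (-04:00); a compact form such as -0400 would need a custom formatter.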
