hbase-user mailing list archives

From Andrew Look <al...@shopzilla.com>
Subject Re: Stack Overflow?
Date Wed, 02 Mar 2011 20:30:30 GMT
Is there any easy way to export this data from sematext / stack overflow?
Or is web crawling/scraping the way to go here?

This is a good use case for Mahout; I've been looking for a problem to play
around with in Mahout :)
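
To make the clustering idea from Friso's message (quoted below) concrete, here
is a minimal sketch in plain Python. It is not Mahout and does not use any
Mahout API: it swaps in a simple TF-IDF plus greedy cosine-threshold grouping
as a stand-in. The function names, the 0.3 threshold, and the rule of dropping
every line that starts with ">" are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def strip_quotes(body):
    """Drop lines quoted from earlier messages (lines starting with '>')."""
    return "\n".join(
        line for line in body.splitlines() if not line.lstrip().startswith(">")
    )


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (term -> weight dict) per document."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF so that no term gets a zero or negative weight.
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def cluster_messages(bodies, threshold=0.3):
    """Greedy single-pass clustering; returns lists of message indices."""
    vectors = tfidf_vectors([strip_quotes(b) for b in bodies])
    clusters = []  # the first index in each cluster acts as its seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Sorting the result with sorted(cluster_messages(bodies), key=len, reverse=True)
would put the largest groups first, which is roughly the "clusters with most
threads" heuristic for spotting FAQ candidates. A real attempt would likely
need better tokenization, stop-word handling, and a proper clustering
algorithm.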


On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fvanvollenhoven@xebia.com>
wrote:

> You could try using Apache Mahout to at least cluster the messages into groups
> of similar ones based on text features. That should be doable. Given the
> groups, you could manually extract questions (the clusters with most threads
> could be the most frequently asked). Also, if you manage to get this to work
> nicely, it could be a nice tool for other projects as well. Would be a fun
> exercise anyways...
> 
> I am starting to toy with Mahout for another pet project. Once I get more
> comfortable with it, I might be able to take this on (not a promise).
> 
> I think automatic question extraction is quite an ambitious goal.
> 
> Friso
> 
> 
> 
> On 1 mrt 2011, at 19:12, Stack wrote:
> 
>> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> <otis_gospodnetic@yahoo.com> wrote:
>>>> Do you have something in mind?  Could we be making better use of the
>>>> sematext summaries?
>>> 
>>> Hm... we already index HBase and other Digests on search-hadoop.com.
>>> I was thinking more along the lines of mining the ML archives and doing
>>> automatic Q&A extraction.
>>> I don't know how difficult it would be.  Maybe the input would be too noisy
>>> (people don't ask proper questions, answers are not full sentences, quote
>>> characters prefixing lines from old messages add a layer of complexity...),
>>> but that's what I thought you might have meant.
>>> 
>> 
>> That'd be a nice addition to the docs.  Our FAQ is in need of
>> updating.  This would be a nice undertaking if someone was up for
>> taking it on.
>> St.Ack
> 
> 

