mahout-user mailing list archives

From Ken Krugler
Subject Input on PTD dataset results
Date Mon, 26 Apr 2010 17:49:29 GMT
Hi all,

I'm looking for input on two questions about the raw data files from  
the Public Terabyte Dataset project:

1. Target file size. What's the biggest file size that people would  
want to handle?

E.g. we could generate 1000 chunks of 1GB each, or 100 chunks of 10GB.

2. Is there any value in grouping data in a specific way within files?

E.g. we could try to ensure that all data from the same domain goes  
into the same file.

But that might result in individual data files having more skew, and  
thus make it harder to get useful results from processing a subset of  
the data.
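
To make the skew concern concrete, here is a minimal Python sketch. It is not part of the PTD pipeline: the URLs are made up, the helper names (assign_chunk, chunk_sizes, skew) are hypothetical, and using CRC32 as the hash is just an assumption for illustration. It compares assigning records to files by domain against assigning them by full URL.

```python
import zlib
from urllib.parse import urlparse


def assign_chunk(key: str, num_chunks: int) -> int:
    """Map a key to a chunk index using a stable hash (CRC32, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_chunks


def chunk_sizes(urls, num_chunks, by_domain):
    """Count how many records land in each chunk.

    by_domain=True keys on the host name, so every page from one domain
    ends up in the same chunk; by_domain=False keys on the full URL.
    """
    sizes = [0] * num_chunks
    for url in urls:
        key = urlparse(url).netloc if by_domain else url
        sizes[assign_chunk(key, num_chunks)] += 1
    return sizes


def skew(sizes):
    """Largest chunk divided by the mean chunk size (1.0 means perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Hypothetical crawl: one large domain plus many single-page domains.
urls = [f"http://big.example.com/page/{i}" for i in range(900)]
urls += [f"http://site{i}.example.org/" for i in range(100)]

domain_sizes = chunk_sizes(urls, num_chunks=4, by_domain=True)
url_sizes = chunk_sizes(urls, num_chunks=4, by_domain=False)

print("by domain:", domain_sizes, "skew:", round(skew(domain_sizes), 2))
print("by URL:   ", url_sizes, "skew:", round(skew(url_sizes), 2))
```

With domain grouping, the chunk that receives the large domain holds at least 900 of the 1000 records, so a job run against any single file sees a very unrepresentative sample.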


-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
