spark-user mailing list archives

From <Saif.A.Ell...@wellsfargo.com>
Subject Spark groupby and agg inconsistent and missing data
Date Thu, 22 Oct 2015 15:27:29 GMT
Hello everyone,

I am running some analytics experiments on a 4-server standalone cluster from the Spark shell,
mostly groupBy and aggregation queries over a huge database.

I am grouping by 6 columns and returning various aggregated results in a DataFrame. Most of
the groupBy columns are StringType and the rest are LongType.

The data source is a DataFrame loaded from a split JSON file. Once the data is persisted, the
result is consistent. But if I release it from memory and reload the data, the same groupBy
returns different results, with some data missing.

Could I be missing something? This is rather serious for my analytics, and I am not sure how
to properly diagnose the situation.
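
One possible way to narrow this down would be to compare the grouped output of two runs key
by key, to see which groups disappear and which only change value. Below is a minimal sketch
in plain Python, assuming each run's grouped result is small enough to bring to the driver as
a list of dicts. The column names (k1..k6, total) and the helper names are hypothetical
placeholders, not taken from the actual data.

```python
# Hypothetical group-by column names; substitute the six real ones.
GROUP_COLS = ("k1", "k2", "k3", "k4", "k5", "k6")


def index_by_key(result_rows, group_cols=GROUP_COLS):
    """Map each group key (tuple of group-by values) to its aggregate columns."""
    indexed = {}
    for row in result_rows:
        key = tuple(row[c] for c in group_cols)
        indexed[key] = {c: v for c, v in row.items() if c not in group_cols}
    return indexed


def diff_runs(first_rows, second_rows, group_cols=GROUP_COLS):
    """Compare two grouped results and report how they differ.

    missing: group keys present in the first run but absent from the second
    extra:   group keys present only in the second run
    changed: group keys present in both runs with different aggregate values
    """
    first = index_by_key(first_rows, group_cols)
    second = index_by_key(second_rows, group_cols)
    shared = first.keys() & second.keys()
    return {
        "missing": first.keys() - second.keys(),
        "extra": second.keys() - first.keys(),
        "changed": {k for k in shared if first[k] != second[k]},
    }
```

If whole groups go missing, that would point at input rows being lost on reload. If the same
groups come back with smaller aggregates, that would suggest only part of the input is read.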

Thanks,
Saif

