spark-user mailing list archives

From Hamish Whittal <ham...@cloud-fundis.co.za>
Subject Accumulators and other important metrics for your job
Date Thu, 27 May 2021 17:03:36 GMT
Hi folks,

I have a problematic dataset I'm working with and am trying to find ways of
"debugging" the data.

For example, the simplest thing I would like to do is to know how many
rows of data I've read, and compare that to a simple count of the lines in
the file.

I could do:
   df.count()

but this seems clunky (and expensive) for something that should be easy to
keep track of. I then thought accumulators might be the solution, but it
seems I would need at least a second pass through the data just to
"addInPlace" the line total — at which point I might as well do the count.

I would also expect that if I hit a row without the relevant data, I should
be able to tally that too. Say, a record without the requisite primary key.

I note too that accumulators are only numeric tallies, but what if I want
to keep track of every file read? Say my directory has 100k files or some
such; I want to confirm, by filename, that each file was actually read.
Accumulators won't help me there, since I want to keep the filenames
rather than just a count of files read. That way I might, for example, be
able to work out that file X was missed because it was corrupt.

Has anyone got some advice on handling this sort of stuff?

Thanks in advance.
