Hey,
                                        
I'm currently working with CSV data and I've come across an unusual case. In Databricks, if I use the following code:

GA_pages = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true', comment='#', mode='DROPMALFORMED').load(ga_sessions_path)

display(GA_pages)

OUTPUT:
+------------------+--------+--------+--------------+---------+-----------+-------------+---------------------+------------+----------+--------------------------+
|User-Defined Value| Date|Sessions|% New Sessions|New Users|Bounce Rate|Pages/Session|Avg. Session Duration|Transactions| Revenue|E-commerce Conversion Rate|
+------------------+--------+--------+--------------+---------+-----------+-------------+---------------------+------------+----------+--------------------------+
| (not set)|20190508| 1,987| 60.09%| 1,194| 34.88%| 5.25| 00:03:34| 18| £2,205.22| 0.91%|
| (not set)|20190528| 1,910| 62.20%| 1,188| 36.07%| 4.53| 00:02:47| 9| £1,045.56| 0.47%|
| (not set)|20190509| 1,723| 62.45%| 1,076| 34.07%| 5.39| 00:03:49| 13| £2,264.27| 0.75%|
| (not set)|20190530| 1,682| 64.98%| 1,093| 31.87%| 5.39| 00:03:39| 23| £3,292.06| 1.37%|
| (not set)|20190527| 1,677| 63.33%| 1,062| 35.24%| 4.52| 00:03:02| 11| £2,261.22| 0.66%|
| (not set)|20190511| 1,643| 62.08%| 1,020| 39.81%| 5.16| 00:03:06| 14| £2,412.03| 0.85%|
| (not set)|20190529| 1,623| 61.24%| 994| 35.55%| 4.87| 00:03:05| 16| £3,213.60| 0.99%|
| (not set)|20190513| 1,622| 68.13%| 1,105| 34.77%| 4.81| 00:03:07| 8| £1,660.61| 0.49%|
| (not set)|20190505| 1,526| 59.90%| 914| 35.91%| 6.01| 00:03:48| 15| £3,022.73| 0.98%|
| (not set)|20190504| 1,515| 61.12%| 926| 38.81%| 5.58| 00:03:26| 16| £2,238.47| 1.06%|
| (not set)|20190514| 1,513| 59.35%| 898| 34.10%| 4.98| 00:03:12| 14| £3,055.11| 0.93%|
| (not set)|20190526| 1,509| 69.05%| 1,042| 32.94%| 5.01| 00:03:17| 9| £2,346.37| 0.60%|
| (not set)|20190501| 1,504| 53.46%| 804| 36.90%| 4.91| 00:03:13| 14| £2,606.60| 0.93%|
| (not set)|20190506| 1,502| 63.18%| 949| 36.55%| 4.97| 00:03:19| 16| £2,419.02| 1.07%|
| (not set)|20190517| 1,444| 58.86%| 850| 33.73%| 5.15| 00:03:13| 13| £3,650.09| 0.90%|
| (not set)|20190521| 1,428| 58.40%| 834| 38.24%| 4.99| 00:03:37| 12| £1,718.84| 0.84%|
| (not set)|20190502| 1,398| 58.44%| 817| 37.55%| 5.42| 00:03:40| 14| £2,835.89| 1.00%|
| (not set)|20190503| 1,385| 56.03%| 776| 37.83%| 4.98| 00:03:06| 4| £763.65| 0.29%|
| (not set)|20190507| 1,366| 59.66%| 815| 38.80%| 5.36| 00:03:30| 16| £2,094.68| 1.17%|
| (not set)|20190510| 1,357| 63.23%| 858| 34.49%| 5.3| 00:03:31| 10| £1,432.15| 0.74%|
| (not set)|20190601| 1,355| 56.75%| 769| 36.38%| 5.31| 00:03:18| 20| £5,260.78| 1.48%|
| (not set)|20190531| 1,332| 60.66%| 808| 36.86%| 5.09| 00:03:26| 13| £2,006.91| 0.98%|
| (not set)|20190602| 1,332| 60.96%| 812| 37.16%| 5.08| 00:03:29| 17| £3,419.28| 1.28%|
| (not set)|20190603| 1,329| 57.49%| 764| 36.19%| 5.11| 00:03:22| 11| £2,882.18| 0.83%|
| (not set)|20190515| 1,299| 59.12%| 768| 35.95%| 5.58| 00:03:57| 3| £118.00| 0.23%|
| (not set)|20190604| 1,291| 60.42%| 780| 36.48%| 4.78| 00:03:02| 9| £1,826.35| 0.70%|
| (not set)|20190520| 1,274| 61.54%| 784| 35.32%| 5.44| 00:03:33| 14| £2,845.98| 1.10%|
| (not set)|20190522| 1,256| 63.54%| 798| 37.18%| 4.71| 00:03:06| 13| £1,551.32| 1.04%|
| (not set)|20190605| 1,254| 58.21%| 730| 36.04%| 4.85| 00:03:14| 10| £1,676.51| 0.80%|
| (not set)|20190512| 1,244| 64.63%| 804| 36.82%| 5.44| 00:03:15| 12| £1,823.22| 0.96%|
| (not set)|20190518| 1,201| 62.03%| 745| 36.47%| 5.46| 00:03:32| 12| £1,765.69| 1.00%|
| (not set)|20190516| 1,177| 60.92%| 717| 33.98%| 5.39| 00:03:51| 16| £2,851.27| 1.36%|
| (not set)|20190524| 1,137| 61.39%| 698| 40.55%| 4.76| 00:02:52| 10| £1,562.80| 0.88%|
| (not set)|20190525| 1,110| 63.87%| 709| 34.14%| 5.83| 00:03:46| 15| £2,520.85| 1.35%|
| (not set)|20190523| 1,105| 61.27%| 677| 37.56%| 4.66| 00:03:22| 10| £2,472.89| 0.90%|
| (not set)|20190519| 1,081| 64.20%| 694| 32.01%| 5.93| 00:03:35| 10| £2,341.90| 0.93%|
| null| null| 51,091| 61.21%| 31,272| 35.99%| 5.15| 00:03:22| 460|£83,464.11| 0.90%|
+------------------+--------+--------+--------------+---------+-----------+-------------+---------------------+------------+----------+--------------------------+

The code behaves as expected and the malformed data is dropped.

However, if I apply transformations to the data before displaying it, the malformed data reappears:

GA_pages = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true', comment='#', mode='DROPMALFORMED').load(ga_sessions_path)

import pyspark.sql.functions as fn

GA_pages = GA_pages.withColumn('date', fn.col('Date'))\
  .withColumn('sessions', fn.regexp_replace('Sessions', r',', ''))\
  .withColumn('pages/session', fn.col('Pages/Session'))\
  .withColumn('pages', (fn.col('pages/session')*fn.col('sessions')))\
  .select('date', 'sessions', 'pages', 'pages/session')

display(GA_pages)  

OUTPUT:
+--------+-------------+------------------+-------------+
| date| sessions| pages|pages/session|
+--------+-------------+------------------+-------------+
|20190508| 1987| 10431.75| 5.25|
|20190528| 1910| 8652.300000000001| 4.53|
|20190509| 1723| 9286.97| 5.39|
|20190530| 1682| 9065.98| 5.39|
|20190527| 1677| 7580.039999999999| 4.52|
|20190511| 1643| 8477.880000000001| 5.16|
|20190529| 1623| 7904.01| 4.87|
|20190513| 1622| 7801.82| 4.81|
|20190505| 1526| 9171.26| 6.01|
|20190504| 1515| 8453.7| 5.58|
|20190514| 1513| 7534.740000000001| 4.98|
|20190526| 1509| 7560.089999999999| 5.01|
|20190501| 1504| 7384.64| 4.91|
|20190506| 1502| 7464.94| 4.97|
|20190517| 1444| 7436.6| 5.15|
|20190521| 1428| 7125.72| 4.99|
|20190502| 1398| 7577.16| 5.42|
|20190503| 1385| 6897.3| 4.98|
|20190507| 1366| 7321.76| 5.36|
|20190510| 1357| 7192.099999999999| 5.3|
|20190601| 1355| 7195.049999999999| 5.31|
|20190531| 1332| 6779.88| 5.09|
|20190602| 1332| 6766.56| 5.08|
|20190603| 1329|6791.1900000000005| 5.11|
|20190515| 1299| 7248.42| 5.58|
|20190604| 1291|6170.9800000000005| 4.78|
|20190520| 1274| 6930.56| 5.44|
|20190522| 1256| 5915.76| 4.71|
|20190605| 1254| 6081.9| 4.85|
|20190512| 1244| 6767.360000000001| 5.44|
|20190518| 1201| 6557.46| 5.46|
|20190516| 1177| 6344.03| 5.39|
|20190524| 1137| 5412.12| 4.76|
|20190525| 1110| 6471.3| 5.83|
|20190523| 1105| 5149.3| 4.66|
|20190519| 1081| 6410.33| 5.93|
| null| 51091| 263118.65| 5.15|
|Sessions|Pages/Session| null| null|
| 1,504| 4.91| null| null|
| 1,398| 5.42| null| null|
| 1,385| 4.98| null| null|
| 1,515| 5.58| null| null|
| 1,526| 6.01| null| null|
| 1,502| 4.97| null| null|
| 1,366| 5.36| null| null|
| 1,987| 5.25| null| null|
| 1,723| 5.39| null| null|
| 1,357| 5.30| null| null|
| 1,643| 5.16| null| null|
| 1,244| 5.44| null| null|
| 1,622| 4.81| null| null|
| 1,513| 4.98| null| null|
| 1,299| 5.58| null| null|
| 1,177| 5.39| null| null|
| 1,444| 5.15| null| null|
| 1,201| 5.46| null| null|
| 1,081| 5.93| null| null|
| 1,274| 5.44| null| null|
| 1,428| 4.99| null| null|
| 1,256| 4.71| null| null|
| 1,105| 4.66| null| null|
| 1,137| 4.76| null| null|
| 1,110| 5.83| null| null|
| 1,509| 5.01| null| null|
| 1,677| 4.52| null| null|
| 1,910| 4.53| null| null|
| 1,623| 4.87| null| null|
| 1,682| 5.39| null| null|
| 1,332| 5.09| null| null|
| 1,355| 5.31| null| null|
| 1,332| 5.08| null| null|
| 1,329| 5.11| null| null|
| 1,291| 4.78| null| null|
| 1,254| 4.85| null| null|
| 51,091| 5.15| null| null|
+--------+-------------+------------------+-------------+
But if I then cache the DataFrame before applying the transformations, the malformed data is ignored:

GA_pages = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true', comment='#', mode='DROPMALFORMED').load(ga_sessions_path).cache()

import pyspark.sql.functions as fn

GA_pages = GA_pages.withColumn('date', fn.col('Date'))\
  .withColumn('sessions', fn.regexp_replace('Sessions', r',', ''))\
  .withColumn('pages/session', fn.col('Pages/Session'))\
  .withColumn('pages', (fn.col('pages/session')*fn.col('sessions')))\
  .select('date', 'sessions', 'pages', 'pages/session')

display(GA_pages)

OUTPUT:
+--------+--------+------------------+-------------+
| date|sessions| pages|pages/session|
+--------+--------+------------------+-------------+
|20190508| 1987| 10431.75| 5.25|
|20190528| 1910| 8652.300000000001| 4.53|
|20190509| 1723| 9286.97| 5.39|
|20190530| 1682| 9065.98| 5.39|
|20190527| 1677| 7580.039999999999| 4.52|
|20190511| 1643| 8477.880000000001| 5.16|
|20190529| 1623| 7904.01| 4.87|
|20190513| 1622| 7801.82| 4.81|
|20190505| 1526| 9171.26| 6.01|
|20190504| 1515| 8453.7| 5.58|
|20190514| 1513| 7534.740000000001| 4.98|
|20190526| 1509| 7560.089999999999| 5.01|
|20190501| 1504| 7384.64| 4.91|
|20190506| 1502| 7464.94| 4.97|
|20190517| 1444| 7436.6| 5.15|
|20190521| 1428| 7125.72| 4.99|
|20190502| 1398| 7577.16| 5.42|
|20190503| 1385| 6897.3| 4.98|
|20190507| 1366| 7321.76| 5.36|
|20190510| 1357| 7192.099999999999| 5.3|
|20190601| 1355| 7195.049999999999| 5.31|
|20190531| 1332| 6779.88| 5.09|
|20190602| 1332| 6766.56| 5.08|
|20190603| 1329|6791.1900000000005| 5.11|
|20190515| 1299| 7248.42| 5.58|
|20190604| 1291|6170.9800000000005| 4.78|
|20190520| 1274| 6930.56| 5.44|
|20190522| 1256| 5915.76| 4.71|
|20190605| 1254| 6081.9| 4.85|
|20190512| 1244| 6767.360000000001| 5.44|
|20190518| 1201| 6557.46| 5.46|
|20190516| 1177| 6344.03| 5.39|
|20190524| 1137| 5412.12| 4.76|
|20190525| 1110| 6471.3| 5.83|
|20190523| 1105| 5149.3| 4.66|
|20190519| 1081| 6410.33| 5.93|
| null| 51091| 263118.65| 5.15|
+--------+--------+------------------+-------------+

I assume this isn't the intended behaviour of mode='DROPMALFORMED'. It appears that, with Spark's lazy execution, the DROPMALFORMED option is somehow ignored once transformations are applied before the read is materialised.

I've included the CSV file I have been using.

(It's worth noting that, for this error to be reproducible once .cache() has been used, the Databricks cluster must be restarted. For some reason the cache doesn't appear to be released by "Clear State and Results".)

Regards,
-- 

Conor Begley   |   QUERYCLICK
Software Developer Intern

e: conor.begley@queryclick.com
dd: +44 (0)131 516 5251
ed: +44 (0)131 556 7078 // lon: +44 (0)207 183 0344
