From Mich Talebzadeh <>
Subject Using Lambda function to generate random data in PySpark throws not defined error
Date Fri, 11 Dec 2020 15:08:00 GMT

This used to work but not anymore.

I have a file that contains these functions:

import random
import string
import math

def randomString(length):
    letters = string.ascii_letters
    result_str = ''.join(random.choice(letters) for i in range(length))
    return result_str

def clustered(x,numRows):
    return math.floor(x -1)/numRows

def scattered(x,numRows):
    return abs((x -1 % numRows))* 1.0

def randomised(seed,numRows):
    return abs(random.randint(0, numRows) % numRows) * 1.0

def padString(x,chars,length):
    n = int(math.log10(x) + 1)
    result_str = ''.join(random.choice(chars) for i in range(length-n)) + str(x)
    return result_str

def padSingleChar(chars,length):
    result_str = ''.join(chars for i in range(length))
    return result_str

def println(lst):
    for ll in lst:
        # loop body is missing in the post as archived; printing each element is an assumption
        print(ll)
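
A side note on clustered and scattered, unrelated to the NameError: in Python, % binds tighter than binary minus, and math.floor(x - 1) floors only the numerator. The sketch below compares the two functions exactly as posted with grouped variants. The names clustered_grouped and scattered_grouped are hypothetical and only there for comparison; whether the grouped form is what was intended is an assumption.

```python
import math

def clustered(x, numRows):
    # As posted: floor applies to (x - 1) only, then true division follows.
    return math.floor(x - 1) / numRows

def clustered_grouped(x, numRows):
    # Hypothetical variant: floor applied to the whole quotient.
    return math.floor((x - 1) / numRows)

def scattered(x, numRows):
    # As posted: parsed as x - (1 % numRows), because % binds tighter than -.
    return abs((x - 1 % numRows)) * 1.0

def scattered_grouped(x, numRows):
    # Hypothetical variant: modulo applied to (x - 1).
    return abs((x - 1) % numRows) * 1.0

# Compare the posted and grouped forms for one sample input.
print(clustered(250, 100), clustered_grouped(250, 100))
print(scattered(250, 100), scattered_grouped(250, 100))
```

If the posted behaviour is the intended one, nothing needs to change; the comparison only makes the parse order visible.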

Now in the main().py module I import this file as follows:

import UsedFunctions as uf

Then I try the following

import UsedFunctions as uf

 numRows = 100000   ## do in increment of 100K rows
 rdd = sc.parallelize(Range). \
           map(lambda x: (x, uf.clustered(x, numRows), \
                             uf.scattered(x,10000), \
                             uf.randomised(x,10000), \
                             uf.randomString(50), \
                             uf.padString(x," ",50), \

The problem is that it now throws an error for numRows, as shown below:

  File "C:/Users/admin/PycharmProjects/pythonProject2/pilot/src/", line 101, in <lambda>
    map(lambda x: (x, uf.clustered(x, numRows), \
NameError: name 'numRows' is not defined

I don't know why this error is occurring!

I would appreciate any ideas.
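
One possible cause, offered only as a guess since the code around line 101 is not shown: a lambda looks up numRows when it is called, searching the enclosing function scope and then the module globals. A name assigned in a class body is not visible that way. The sketch below reproduces the same NameError with the built-in map instead of sc.parallelize, so it runs without Spark. The class Generator and its methods are hypothetical names, not taken from the original module.

```python
class Generator:
    numRows = 5  # class-body name: nested lambdas cannot see it by bare name

    def broken(self):
        # The lambda searches local, enclosing-function and global scopes.
        # Class scope is skipped, so calling the lambda raises NameError.
        return list(map(lambda x: x / numRows, range(1, 4)))

    def fixed(self):
        # Copy the value into the enclosing function scope first.
        numRows = self.numRows
        return list(map(lambda x: x / numRows, range(1, 4)))


gen = Generator()
try:
    gen.broken()
except NameError as exc:
    print("broken:", exc)
print("fixed:", gen.fixed())
```

If numRows in the real module is assigned in a class body, or in a different function from the one holding the lambda, moving the assignment next to the map call (or passing the value in explicitly) would be the first thing to try.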



