Python script to extract key value pairs from CSV file

The python script below reads a CSV value into key value pairs into both a list and dictionary data structure.

import csv

def readKeyValues(inputCsvFile,keyColumn,ValueColumn):
	#input file reader
	infile = open(inputFileName, "r")
	read = csv.reader(infile)
	headers = next(read) # skip header
	
	returnDictionary={}
	returnList=[]
	
	#for each row
	for row in read:
		key   = row[keyColumn]
		value = row[ValueColumn]
		
		#Add to dictionary 
		#note will overwrite and store single occurrences
		returnDictionary[key] = value
		
		#Add to list (note, will store multiple occurrences)
		returnList.append([key,value])
		
#\readKeyValues()


#default values
inputCsvFile = 'infile.csv'
keyColumn    = 0
ValueColumn  = 1

readKeyValues(inputCsvFile,keyColumn,ValueColumn)

Another python script (shown below) was developed to scan csv files in order to extract the key value pairs that had the nearest date to some given value. This is based on the python script shown above.

The solution needed to be developed so that it was scalable to large files. A sample of the algorithm developed is shown below to illustrate how the solution works. This uses a buffered reading of the input file, as to not load the entire file into memory. Loading the entire file would not scale for the large patient data files which this script was needed for.

The iteration of the algorithm is driven by the number of lines in the input file. As to have the capacity to scale to files with a large number of lines, the algorithm stores just the values that are needed to determine the nearest key value.

# init
returnDataWantedRows=[]
returnDate=None 
currentDateAbs = sys.maxint

for row in read:
	currentDate = row[columnDate]	

	if not(currentDate=='' or 
	       currentDate=="" or 
	       currentDate==None):

		currentDate = datetime.datetime.strptime(
				currentDate, '%d/%m/%Y')
	
		tmp = currentDate - dateWanted
		tmp = abs(tmp.days)

		if tmp < currentDateAbs:
			currentDateAbs = tmp

			returnDataWantedRows=[]

			for item in columnDataWanted:
				returnDataWantedRows.append(row[item])

			returnDate = row[columnDate]

The full script was used to parse large patient data files in CSV format, to extract the date and value of certain clinical measurements recorded. For example, the key=>value pair might be stored as patient_id,date,value in the CSV file.

This script was used along with the another processing script (Linux AWK command) in order to provides insight and support patient analysis for a clinical trial.

A course that I recently completed on coursera was Google IT Automation with Python. This course contained some really interesting topics on:

Task automation
Manage IT resources at scale
Using Python to Interact with the Operating System
Automating Real-World Tasks with Python

If you are interested in advance your career with a certification from Google you can find out more about the course on coursera.org.

I would also recommend the algorithms specialization from Stanford University if you want to learn how to create efficient and scalable code. This course taught me skills and principals that made me much better at designing my code to be faster and more efficient.

If you would like any help or have any questions reach out to me.