//Python script to extract key value pairs from CSV file

Python script to extract key value pairs from CSV file

The python script below reads a CSV value into key value pairs into both a list and dictionary data structure.

import csv

def readKeyValues(inputCsvFile,keyColumn,ValueColumn):
	#input file reader
	infile = open(inputFileName, "r")
	read = csv.reader(infile)
	headers = next(read) # skip header
	
	returnDictionary={}
	returnList=[]
	
	#for each row
	for row in read:
		key   = row[keyColumn]
		value = row[ValueColumn]
		
		#Add to dictionary (note, will overwrite and only store single occurrences)
		returnDictionary[key] = value
		
		#Add to list (note, will store multiple occurrences)
		returnList.append([key,value])
		
#\readKeyValues()

#default values
inputCsvFile = 'infile.csv'
keyColumn    = 0
ValueColumn  = 1

readKeyValues(inputCsvFile,keyColumn,ValueColumn)

Another python script (shown below) was developed to scan csv files in order to extract the key value pairs that had the nearest date to some given value. This is based on the python script shown above.

The solution needed to be developed so that it was scaleable to large files. A sample of the algorithm developed is shown below to illustrate how the solution works. This uses a buffered reading of the input file, as to not load the entire file into memory. Loading the entire file would not scale for the large patient data files which this script was needed for.

The iteration of the algorithm is driven by the number of lines in the input file. As to have the capacity to scale to files with a large number of lines, the algorithm stores just the values that are needed to determine the nearest key value.

# init
returnDataWantedRows=[]
returnDate=None 
currentDateAbs = sys.maxint

for row in read:
	currentDate = row[columnDate]		

	if not(currentDate=='' or currentDate=="" or currentDate==None):			
		currentDate = datetime.datetime.strptime(currentDate, '%d/%m/%Y')
	
		tmp = currentDate - dateWanted
		tmp = abs(tmp.days)

		if tmp < currentDateAbs:
			currentDateAbs = tmp

			returnDataWantedRows=[]

			for item in columnDataWanted:
				returnDataWantedRows.append(row[item])

			returnDate = row[columnDate]

The full script was used to parse large patient data files in CSV format, to extract the date and value of certain clinical measurements recorded. For example, the key=>value pair might be stored as patient_id,date,value in the CSV file. This script was used along with the another processing script (Linux AWK command) in order to provides insight and support patient analysis for a clinical trial.