A python script was developed to scan csv files in order to extract the key value pairs that has the nearest date to some given value.
The solution needed to be developed so that it was scaleable to large files. A sample of the algorithm developed is shown below to illustrate how the solution works. This uses a buffered reading of the input file, as to not load the entire file into memory. Loading the entire file would not scale for the large patient data files this script was needed for.
The iteration of the algorithm is driven by the number of lines in the input file. As to have the capacity to scale to files with a large number of lines, the algorithm stores just the values that are needed to determine the nearest key value.
currentDateAbs = sys.maxint
for row in read:
currentDate = row[columnDate]
if not(currentDate=='' or currentDate=="" or currentDate==None):
currentDate = datetime.datetime.strptime(currentDate, '%d/%m/%Y')
tmp = currentDate - dateWanted
tmp = abs(tmp.days)
if tmp < currentDateAbs: currentDateAbs = tmp returnDataWantedRows= for item in columnDataWanted: returnDataWantedRows.append(row[item]) returnDate = row[columnDate]
The full script was used to parse large patient data files in CSV format, such that to extract the date and value of certain clinical measurements recorded. For example, the key=>value pair might be stored as patient_id,date,value in the CSV file. This script was used along with the another processing script (Linux AWK command) in order to provides insight and support patient analysis for a clinical trial.