Fast Filtering of Large CSV Files Using Python
This post describes a Python script developed to filter large CSV files against a given list of strings.
Large CSV exports from our Data Warehouse contain the clinical data to be analysed alongside retinal images.
These files need to be filtered on various criteria; in this case, a row is kept when a chosen column contains one of the strings from the given list.
The Python script below provides a fast method to filter these files.
import csv

def filterFile(inputFileName, outputFileName,
               filterCriteriaFileName, columnToFilter):
    # input file reader (newline="" is the documented way to open CSV files)
    infile = open(inputFileName, "r", newline="")
    read = csv.reader(infile)
    headers = next(read)  # header row

    # output file writer
    outfile = open(outputFileName, "w", newline="")
    write = csv.writer(outfile)
    write.writerow(headers)  # write headers

    # filtering criteria: one string per row, taken from the first column
    inFilterfile = open(filterCriteriaFileName, "r", newline="")
    filterCriteriaList = [row[0] for row in csv.reader(inFilterfile) if row]
    inFilterfile.close()

    # keep each row whose target column contains any of the criteria
    for row in read:
        if any(criterion in row[columnToFilter]
               for criterion in filterCriteriaList):
            write.writerow(row)

    infile.close()
    outfile.close()
#/filterFile()

filterFile('InputFile.csv', 'OutputFile.csv', 'filterCriteria.csv', 4)
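A note on performance: the script above does a substring test against every criterion for every row. If the criteria are exact values rather than substrings, membership testing against a set is much faster, since each row then costs a single O(1) lookup instead of a scan over the whole criteria list. The sketch below illustrates this variant; the function name, the sample data, and the use of in-memory file objects are illustrative assumptions, not part of the original script.

```python
import csv
import io

def filter_file_exact(in_csv, out_csv, criteria, column):
    """Keep rows whose `column` exactly matches one of `criteria` (a set)."""
    reader = csv.reader(in_csv)
    writer = csv.writer(out_csv)
    writer.writerow(next(reader))      # copy the header row through
    for row in reader:
        if row[column] in criteria:    # O(1) set membership test
            writer.writerow(row)

# Small in-memory demo with made-up data (column 2 is "eye")
src = io.StringIO("id,name,eye\n1,a,left\n2,b,right\n3,c,left\n")
dst = io.StringIO()
filter_file_exact(src, dst, {"left"}, 2)
print(dst.getvalue())
```

Because the function takes file-like objects, the same code works unchanged on real files opened with `newline=""`.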