Fast Filtering of Large CSV Files Using Python

This post describes a Python script developed to filter large CSV files based on a given list of strings.

Our Data Warehouse produces large CSV exports containing the clinical data to be analysed alongside retinal images.

These large CSV files need to be filtered on various criteria. In this case, a row is kept when a chosen column contains a value from a given list of strings.
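To make the criterion concrete, here is a minimal sketch of the matching rule on a few in-memory rows. The helper name `row_matches` and the sample data are made up for illustration and are not part of the script. Note that the test is a substring match, so a criterion string only needs to appear somewhere inside the column value.

```python
def row_matches(row, column_index, criteria):
    """Return True if the chosen column contains any of the criteria strings."""
    # Substring test: "glaucoma" matches "suspected glaucoma".
    return any(term in row[column_index] for term in criteria)


# Hypothetical sample data, for illustration only.
criteria = ["diabetic", "glaucoma"]
rows = [
    ["P001", "diabetic retinopathy"],
    ["P002", "healthy"],
    ["P003", "suspected glaucoma"],
]

# Keep only the rows whose second column (index 1) matches a criterion.
kept = [row for row in rows if row_matches(row, 1, criteria)]
print(kept)
```

If exact matches are needed instead, comparing with `==` (or looking the value up in a `set` of criteria) would be the usual alternative.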

The following Python script provides a fast way to filter these files. It reads the input one row at a time rather than loading the whole file into memory.

import csv

def filterFile(inputFileName, outputFileName, filterCriteriaFileName, columnToFilter):

	#input file reader
	infile = open(inputFileName, "r", newline="")
	read = csv.reader(infile)
	headers = next(read) # header

	#output file writer
	outfile = open(outputFileName, "w", newline="")  # newline="" avoids blank rows on Windows
	write = csv.writer(outfile)

	write.writerow(headers) # write headers

	#Filtering Criteria
	inFilterfile = open(filterCriteriaFileName, "r")
	filterCriteriaList = list(csv.reader(inFilterfile))

	#for each row
	for row in read:
		if any(x[0] in row[columnToFilter] for x in filterCriteriaList):