Fast Filtering of large CSV files using Python

This post describes a Python script developed to filter large CSV files against a given list of strings.

The problem is that large CSV exports from our Data Warehouse contain the clinical data to be analysed alongside retinal images.

These large CSV files need to be filtered on various criteria; in this case, we keep each row whose value in a given column contains one of the strings in a given list.

This Python script provides a fast way to filter these files.

import csv

def filterFile(inputFileName, outputFileName,
		filterCriteriaFileName, columnToFilter):

	# Filtering criteria: one string per row, taken from the first column
	with open(filterCriteriaFileName, "r") as inFilterFile:
		filterCriteriaList = [row[0] for row in csv.reader(inFilterFile) if row]

	# 'with' closes both files even if an error occurs;
	# newline="" prevents blank lines in the output on Windows
	with open(inputFileName, "r", newline="") as infile, \
			open(outputFileName, "w", newline="") as outfile:
		reader = csv.reader(infile)
		writer = csv.writer(outfile)

		writer.writerow(next(reader))  # copy the header row

		# Keep a row if its target column contains any criterion string
		for row in reader:
			if any(criterion in row[columnToFilter]
					for criterion in filterCriteriaList):
				writer.writerow(row)

filterFile('InputFile.csv', 'OutputFile.csv', 'filterCriteria.csv', 4)
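If your criteria are exact values rather than substrings, a set lookup avoids scanning the whole criteria list for every row and scales much better with many criteria. A minimal sketch of that variant (the function name and file names here are illustrative, not part of the script above):

```python
import csv

def filterFileExact(inputFileName, outputFileName,
                    filterCriteriaFileName, columnToFilter):
    # Build a set of criteria for O(1) membership tests
    with open(filterCriteriaFileName, "r") as f:
        criteria = {row[0] for row in csv.reader(f) if row}

    with open(inputFileName, "r", newline="") as infile, \
            open(outputFileName, "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        writer.writerow(next(reader))  # copy the header row
        for row in reader:
            if row[columnToFilter] in criteria:  # exact match only
                writer.writerow(row)
```

Note the trade-off: the set version only matches whole cell values, while the substring version above will also match criteria embedded inside longer strings.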

If you would like to see more Python scripts I have written, you can search by the tag python. If you have any questions about this script, or about your wider project, please get in touch.

What can you do to improve your life as a programmer?

My article provides some ideas!