This post describes a Python script for filtering large CSV files against a given list of strings.
Large CSV exports from our Data Warehouse contain the clinical data to be analysed alongside retinal images.
These files need to be filtered on various criteria. In this case, a row is kept when a chosen column contains a value from a given list of strings.
This Python script provides a fast way to filter these files: it reads the input one row at a time, so the whole file never has to be loaded into memory.
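import csv

def filterFile(inputFileName, outputFileName, filterCriteriaFileName, columnToFilter):
    # Filtering criteria: csv.reader yields each row as a list, so flatten
    # the rows into one list of strings (empty cells are skipped, because an
    # empty string is a substring of every value and would match every row)
    with open(filterCriteriaFileName, "r", newline="") as inFilterfile:
        filterCriteriaList = [x for row in csv.reader(inFilterfile) for x in row if x]

    with open(inputFileName, "r", newline="") as infile, \
         open(outputFileName, "w", newline="") as outfile:
        # input file reader
        read = csv.reader(infile)
        headers = next(read)  # header

        # output file writer
        write = csv.writer(outfile)
        write.writerow(headers)  # write headers

        # for each row: keep it if the column contains any of the filter strings
        for row in read:
            if any(x in row[columnToFilter] for x in filterCriteriaList):
                write.writerow(row)
# /filterFile()

# columnToFilter is a zero-based index, so 4 selects the fifth column
filterFile('InputFile.csv', 'OutputFile.csv', 'filterCriteria.csv', 4)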
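The script keeps a row when the chosen column contains any of the filter strings as a substring, so a filter string also matches longer values that happen to contain it. The following is a minimal sketch of that rule on in-memory data, not part of the original script. The helper name `row_matches`, the column names and the sample values are all made up for illustration.

```python
import csv
import io

def row_matches(row, column, criteria):
    """Return True if row[column] contains any of the criteria as a substring."""
    return any(c in row[column] for c in criteria)

# Hypothetical sample data, invented purely for illustration
sample = io.StringIO(
    "id,site,code\n"
    "1,north,A1\n"
    "2,south,A10\n"
    "3,north,B2\n"
)
criteria = ["A1"]

reader = csv.reader(sample)
header = next(reader)  # skip the header row
rows = list(reader)

# Substring rule (what the script does): "A1" also matches "A10"
kept = [row for row in rows if row_matches(row, 2, criteria)]

# Exact-match alternative: set membership only matches whole values
criteria_set = set(criteria)
exact = [row for row in rows if row[2] in criteria_set]

print(kept)
print(exact)
```

If whole-value matching is what is actually wanted, the set-based test is probably the better choice. It avoids accidental partial matches, and a set lookup does not have to scan the full criteria list for every row.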