AWK script for data extraction from large CSV files

A script was developed which can be used to extract data from large CSV files. The script was developed primarily for use in the situation of processing large CSV files containing data on patients. The script was needed in this particular environment because:

The files are too large to open in standard spreadsheet applications (many millions of rows)
The task is repetitive such that we want to automate

The script shown below is a small sample script to show how it was used. The script is run under the linux OS using the AWK application

awk -F, '$1=="123" { print $1,$2,$10,$45,$46 }' LargeFile.csv > Result.csv

The script above will search a given csv file ("-F,"), in this case “LargeFile.csv”. If the id of the row matches the specified id (" $1=="123" "). This means the id is contained in column number 1 of the “LargeFile.csv” and the id values equals “123”. The specified columns of the row ("{ print $1,$2,$10,$45,$46 }") is output to the Result.csv file, in this case, column 1, which is the id column, along with columns 2,10,45 and 46.

Example

LargeFile.csv

Id_Col	Val_1	Val_2	Val_3	Val_4	Val_5 …
098	a26	b55	f656	d468	2ak	…
128	a78	b95	g756	r898	3aj	…
098	a86	b85	c856	d478	5ah	…
123	a35	b38	d956	b478	3ed	…

The output Result.csv would look something like this if we wanted only columns 1,2 and 5

awk -F, '$1=="123" { print $1,$2,$5 }' LargeFile.csv > Result.csv

Id_Col	Val_1	Val_5
123	a35	3ed

You can see how this short script can be very valuable when dealing with large CSV files containing many millions of rows of data which cannot be open using spreadsheet applications and tasks need to be performed routinely.