AWK script for data extraction from large CSV files

A script was developed to extract data from large CSV files. It was written primarily for processing large CSV files containing patient data, and was needed in this environment because:

  • The files are too large to open in standard spreadsheet applications (many millions of rows)
  • The task is repetitive, so we want to automate it

The command below is a small sample showing how the script was used. It is run on Linux using AWK:

awk -F, '$1=="123" { print $1,$2,$10,$45,$46 }' LargeFile.csv > Result.csv

The script above reads the given CSV file, in this case “LargeFile.csv”, splitting each row into columns on commas ("-F,"). The condition (" $1=="123" ") selects the rows whose id, held in column 1 of “LargeFile.csv”, equals “123”. For each matching row, the specified columns ("{ print $1,$2,$10,$45,$46 }") are written to the Result.csv file: column 1, which is the id column, along with columns 2, 10, 45 and 46. Note that print separates the output fields with a single space by default, so Result.csv is space-separated unless the output field separator (OFS) is set to a comma.
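As a minimal sketch of a variant that keeps the output comma-separated and also keeps the header row, the command below sets OFS with -v and adds NR==1 to the condition. The file names sample.csv and result.csv and their contents are made up for illustration and are not part of the original script.

```shell
# Build a tiny stand-in for the large file (hypothetical contents).
printf '%s\n' \
  'Id_Col,Val_1,Val_2,Val_3,Val_4,Val_5' \
  '098,a26,b55,f656,d468,2ak' \
  '123,a35,b38,d956,b478,3ed' > sample.csv

# -F, splits input on commas; -v OFS=, joins printed fields with commas.
# NR==1 keeps the header row; $1=="123" keeps the rows with the wanted id.
awk -F, -v OFS=, 'NR==1 || $1=="123" { print $1,$2,$5 }' sample.csv > result.csv

cat result.csv
```

Setting OFS only affects what print writes, so the matching logic is the same as in the original command.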

Example

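LargeFile.csv (columns are shown separated by spaces for readability; in the file itself they are separated by commas)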

Id_Col Val_1 Val_2 Val_3 Val_4 Val_5 …  
098 a26 b55 f656 d468 2ak
128 a78 b95 g756 r898 3aj
098 a86 b85 c856 d478 5ah
123 a35 b38 d956 b478 3ed

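The output Result.csv would look like this if we wanted only columns 1, 2 and 5:

awk -F, '$1=="123" { print $1,$2,$5 }' LargeFile.csv > Result.csv

123 a35 b478

Column 5 is Val_4, because the id column counts as column 1. The header row is not written to Result.csv, because its first field is "Id_Col" rather than "123".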

You can see how this short script can be very valuable when dealing with large CSV files that contain many millions of rows, cannot be opened in spreadsheet applications, and need the same extraction performed routinely.
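For routine use, the id and file can be passed as parameters instead of being edited into the command each time. The following is a minimal sketch, assuming a hypothetical helper function named extract_rows and a made-up sample file sample.csv; neither is part of the original script.

```shell
# extract_rows ID FILE: print columns 1, 2 and 5 of rows whose first column equals ID.
# (extract_rows is a hypothetical helper name used for illustration.)
extract_rows() {
    # -v passes the shell value into awk as the variable "id".
    # Appending "" makes it a string, so the comparison is textual
    # and an id such as "098" is not treated as the number 98.
    awk -F, -v id="$1" '$1 == id "" { print $1,$2,$5 }' "$2"
}

# Small made-up sample with the same column layout as the example.
printf '%s\n' \
  'Id_Col,Val_1,Val_2,Val_3,Val_4,Val_5' \
  '098,a26,b55,f656,d468,2ak' \
  '123,a35,b38,d956,b478,3ed' > sample.csv

extract_rows 123 sample.csv
extract_rows 098 sample.csv
```

The function can then be called from a loop or a scheduled job, redirecting its output to a result file as in the original command.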
