//AWK script for data extraction from large CSV files

AWK script for data extraction from large CSV files

A script was developed which can be used to extract data from large CSV files. The script was developed primarily for use in the situation of processing large CSV files containing data on patients. The script was needed in this particular environment because:

  • The files are too large to open in standard spreadsheet applications (many millions of rows)
  • The task is repetitive such that we want to automate

The script shown below is a small sample script to show how it was used.
The script is run under the linux OS using the AWK application

awk -F, '$1=="123" { print $1,$2,$10,$45,$46 }' LargeFile.csv > Result.csv

The script above will search a given csv file ("-F,"), in this case “LargeFile.csv”. If the id of the row matches the specified id (" $1=="123" "). This means the id is contained in column number 1 of the “LargeFile.csv” and the id values equals “123”. The specified columns of the row ("{ print $1,$2,$10,$45,$46 }") is output to the Result.csv file, in this case, column 1, which is the id column, along with columns 2,10,45 and 46.

Example

LargeFile.csv

Id_Col, Val_1, Val_2, Val_3, Val_4, Val_5 ...
098,    a26    b55    f656   d468   2ak   ...
128,    a78    b95    g756   r898   3aj   ...
098,    a86    b85    c856   d478   5ah   ...
123,    a35    b38    d956   b478   3ed   ...

The output Result.csv would look something like this if we wanted only columns 1,2 and 5

awk -F, '$1=="123" { print $1,$2,$5 }' LargeFile.csv > Result.csv

Id_Col, Val_1, Val_5 
123,    a35    3ed   

You can see how this short script can be very valuable when dealing with large CSV files containing many millions of rows of data which cannot be open using spreadsheet applications and tasks need to be performed routinely.