AWK script for data extraction from large CSV files
A script was developed which can be used to extract data from large CSV files. The script was developed primarily for use in the situation of processing large CSV files containing data on patients. The script was needed in this particular environment because:
- The files are too large to open in standard spreadsheet applications (many millions of rows)
- The task is repetitive such that we want to automate
The script shown below is a small sample script to show how it was used. The script is run under the linux OS using the AWK application
awk -F, '$1=="123" { print $1,$2,$10,$45,$46 }' LargeFile.csv > Result.csv
The script above will search a given csv file ("-F,"
), in this case “LargeFile.csv”. If the id of the row matches the specified id (" $1=="123" "
). This means the id is contained in column number 1 of the “LargeFile.csv” and the id values equals “123”. The specified columns of the row ("{ print $1,$2,$10,$45,$46 }"
) is output to the Result.csv file, in this case, column 1, which is the id column, along with columns 2,10,45 and 46.
Example
LargeFile.csv
Id_Col | Val_1 | Val_2 | Val_3 | Val_4 | Val_5 … | |
---|---|---|---|---|---|---|
098 | a26 | b55 | f656 | d468 | 2ak | … |
128 | a78 | b95 | g756 | r898 | 3aj | … |
098 | a86 | b85 | c856 | d478 | 5ah | … |
123 | a35 | b38 | d956 | b478 | 3ed | … |
The output Result.csv would look something like this if we wanted only columns 1,2 and 5
awk -F, '$1=="123" { print $1,$2,$5 }' LargeFile.csv > Result.csv
Id_Col | Val_1 | Val_5 |
---|---|---|
123 | a35 | 3ed |
You can see how this short script can be very valuable when dealing with large CSV files containing many millions of rows of data which cannot be open using spreadsheet applications and tasks need to be performed routinely.