awk – Filtering a file from another file

This was one of the first tips that I learned about awk.

It all started when I had to manage a 700k SNP file. In one step of the process I had to filter this file based on a list of individuals. I remember that I tried to do it with R, but it did not work as I expected (my PC froze on every attempt). So I went for another tool.

I found the solution with awk, using the following command:

# Filtering SNP.txt by the first column based on the first column of ID.txt

$ awk 'FNR==NR {arr[$1]; next} ($1 in arr) {print $0}' ID.txt SNP.txt > filteredSNP.txt

In short, FNR==NR is true only while awk is reading the first file (ID.txt), so in the first part ({arr[$1]; next}) awk stores the values of the first column of ID.txt as keys of “arr” and moves on to the next line. For every line of SNP.txt, it then checks whether the first column is one of the keys in “arr” (($1 in arr)). Finally, when there is a match, it prints the row of SNP.txt ({print $0}).
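To try the command on something small, here is a minimal sketch with made-up data. The file names ids.txt and snps.txt and their contents are only assumptions for illustration; they are not the real files from my analysis.

```shell
# Hypothetical sample data, created in a scratch directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'ind2\nind4\n' > ids.txt
printf 'ind1 AA AG\nind2 AG GG\nind3 AA AA\nind4 GG AG\n' > snps.txt

# First file: remember each ID as an array key.
# Second file: print the rows whose first column was remembered.
kept=$(awk 'FNR==NR { ids[$1]; next } ($1 in ids) { print $0 }' ids.txt snps.txt)
echo "$kept"
# ind2 AG GG
# ind4 GG AG
```

Note that the rows come out in the order of the second file (snps.txt), not in the order of the ID list.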

Here you will find more information about what the command does, although the discussion in the link focuses on FNR==NR.
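A quick way to see what FNR==NR means is to print both counters for two small files. This is just a sketch; the file names and contents are hypothetical.

```shell
# Hypothetical two-line files in a scratch directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'id1\nid2\n' > ids.txt
printf 'id1 A\nid3 B\n' > data.txt

# NR counts lines across all input files; FNR restarts at 1 for each file.
counters=$(awk '{ print FILENAME, NR, FNR }' ids.txt data.txt)
echo "$counters"
# ids.txt 1 1
# ids.txt 2 2
# data.txt 3 1
# data.txt 4 2
```

The two counters are equal only on the lines of the first file, which is why FNR==NR selects it. One caveat: if the first file is empty, FNR==NR is also true for the second file.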

Some examples may help to make this clearer.

# Just to prepare two files for the example.  
$ echo "a 1
a 2
a 1
a 2
b 1
b 2
b 1
b 2
c 3" > A.txt 
$ echo "a b 1 2" > B.txt

# Examples of the behavior of the command 
$ awk 'FNR==NR { a[$1]; next } ($1 in a) {print $0}' B.txt A.txt
a 1
a 2
a 1
a 2
$ awk 'FNR==NR { a[$2]; next } ($1 in a) {print $0}' B.txt A.txt
b 1
b 2
b 1
b 2
$ awk 'FNR==NR { a[$3]; next } ($2 in a) {print $0}' B.txt A.txt
a 1
a 1
b 1
b 1
$ awk 'FNR==NR { a[$4]; next } ($2 in a) {print $0}' B.txt A.txt
a 2
a 2
b 2
b 2
$ awk 'FNR==NR { a[$4]; next } ($2 in a) {print $1}' B.txt A.txt
a
a
b
b
$ awk 'FNR==NR { a[$4]; next } ($2 in a) {print $2}' B.txt A.txt
2
2
2
2
$ awk 'FNR==NR { a[$4]; next } ($2==3) {print $0}' B.txt A.txt
c 3
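
The same pattern can be turned around by negating the condition, which keeps the rows that are not in the list. This is a small sketch that recreates A.txt and B.txt from the examples in a scratch directory; the variable name rest is only for illustration.

```shell
# Recreate the example files in a scratch directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'a 1\na 2\na 1\na 2\nb 1\nb 2\nb 1\nb 2\nc 3\n' > A.txt
echo "a b 1 2" > B.txt

# Keep the rows of A.txt whose first column is NOT the first field of B.txt.
rest=$(awk 'FNR==NR { a[$1]; next } !($1 in a) { print $0 }' B.txt A.txt)
echo "$rest"
# b 1
# b 2
# b 1
# b 2
# c 3
```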
