This was one of the first tips that I learned about awk.
It all started when I had to manage a 700k SNP file. In one step of the process I had to filter this file based on a list of individuals. I remember that I tried to do it with R, but it did not work as I expected (my PC froze on every attempt). So, I went for another tool.
I found the solution with awk, using the following command:
# Filtering SNP.txt by the first column based on the first column of ID.txt
$ awk 'FNR==NR {arr[$1]; next} ($1 in arr) {print $0}' ID.txt SNP.txt > filteredSNP.txt
In short, the first part (FNR==NR {arr[$1]; next}) runs only while awk is reading ID.txt, and it stores the values of the first column of ID.txt as keys of “arr”. Then, for each row of SNP.txt, awk checks whether the first column is among the keys of “arr” (($1 in arr)). Finally, when there is a match, it prints the row of SNP.txt ({print $0}).
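The same command can be written over several lines with comments, which may make the two parts easier to follow. This is only a sketch: the file names ids_demo.txt, snp_demo.txt and filtered_demo.txt and their contents are made up for illustration and are not the real data.

```shell
# Hypothetical sample files, for illustration only.
printf 'ind1\nind3\n' > ids_demo.txt
printf 'ind1 A G\nind2 C T\nind3 G G\n' > snp_demo.txt

awk '
    # While reading the first file, FNR (line number within the current file)
    # equals NR (line number across all files).
    FNR == NR {
        arr[$1]   # referencing arr[$1] creates the key; no value is needed
        next      # skip the remaining rules for this line
    }
    # Reached only for the second file: keep the row if its first column
    # is one of the stored keys.
    $1 in arr { print $0 }
' ids_demo.txt snp_demo.txt > filtered_demo.txt

cat filtered_demo.txt
```

With these sample files, only the rows for ind1 and ind3 should end up in filtered_demo.txt. Note that the idiom assumes the first file is not empty; otherwise FNR==NR would also be true for the lines of the second file.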
Here you will find more information about what the command does, although the discussion in the link focuses on FNR==NR.
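To see why FNR==NR is true only for the first file, it may help to print both variables for every line. The files first_demo.txt and second_demo.txt below are hypothetical and exist only for this sketch.

```shell
# Hypothetical sample files, for illustration only.
printf 'x\ny\n' > first_demo.txt
printf 'p\nq\nr\n' > second_demo.txt

# For every input line, print the file name, NR (counts lines across all files),
# FNR (restarts at 1 for each file) and the line itself.
awk '{ print FILENAME, NR, FNR, $0 }' first_demo.txt second_demo.txt > nr_fnr_demo.txt

cat nr_fnr_demo.txt
```

For the two lines of first_demo.txt both numbers are the same (1 1 and 2 2). In second_demo.txt NR keeps growing (3, 4, 5) while FNR starts again at 1, so the condition FNR==NR is no longer true.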
Maybe some examples will help to understand it.
# Just to prepare two files for the example.
$ echo "a 1
a 2
a 1
a 2
b 1
b 2
b 1
b 2
c 3" > A.txt
$ echo "a b 1 2" > B.txt

# Examples of the behavior of the command
$ awk 'FNR==NR { a[$1]; next } ($1 in a) {print $0}' B.txt A.txt
a 1
a 2
a 1
a 2

$ awk 'FNR==NR { a[$2]; next } ($1 in a) {print $0}' B.txt A.txt
b 1
b 2
b 1
b 2

$ awk 'FNR==NR { a[$3]; next } ($2 in a) {print $0}' B.txt A.txt
a 1
a 1
b 1
b 1

$ awk 'FNR==NR { a[$4]; next } ($2 in a) {print $0}' B.txt A.txt
a 2
a 2
b 2
b 2

$ awk 'FNR==NR { a[$4]; next } ($2 in a) {print $1}' B.txt A.txt
a
a
b
b

$ awk 'FNR==NR { a[$4]; next } ($2 in a) {print $2}' B.txt A.txt
2
2
2
2

$ awk 'FNR==NR { a[$4]; next } ($2==3) {print $0}' B.txt A.txt
c 3