This post is for people who are interested in cross-validation or who need, for any reason, to create random groups to perform some analysis.
Some years ago, I had to do cross-validation and received from Ignacio Aguilar a simple and efficient tip using awk.
To create random groups in a data set you can use the following in a bash script:
awk -v seed=$RANDOM 'BEGIN{srand(seed);}{ print $0, 1 + int(rand()*4) }' yourdatafile
The main function in the command is rand(). This function returns a random number between 0 and 1; the number can be 0 but will never be 1. First, we define a variable seed with $RANDOM (an internal bash variable) to generate a random starting point; if you want to reproduce the same groups later, replace the seed with a constant. srand(seed) sets the starting point for rand(). In the last part, we print all the data plus one extra column with (in this case) numbers from 1 to 4.
Let’s look at this part of the code in detail:
1 + int(rand()*4)
As I said above, rand() returns a number between 0 and 1 that can be 0 but never 1. So, to obtain a group number we need some extra manipulation. The first step is to multiply the random number by the number of groups we want. In this case, I want 4 groups, so I multiply by 4: rand()*4. If you wanted 10 groups, it would be rand()*10, and so on. After that we truncate the number to an integer with int(rand()*4), and since the result can be 0 we just add 1: 1 + int(rand()*4). This last step is not strictly needed; it just makes the groups start from 1.
So if the generated number is, for example, 0.02, then 0.02*4 = 0.08, int(0.08) = 0, and 1 + 0 = 1. If rand() generates 0.85, then 0.85*4 = 3.4, int(3.4) = 3, and 1 + 3 = 4.
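The worked arithmetic above can be checked directly with awk itself:

```shell
# Map a couple of fixed "rand()" values to their group numbers,
# exactly as in the worked example above.
awk 'BEGIN {
  n = split("0.02 0.85", v, " ")
  for (i = 1; i <= n; i++)
    print v[i], "->", 1 + int(v[i] * 4)
}'
# prints:
# 0.02 -> 1
# 0.85 -> 4
```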
In the end, you just filter the data by group and run the analysis you want, leaving out one group at a time.
Examples:
# groups from 0 to 3 using a constant as seed
seq 1 1000 | awk 'BEGIN{srand(1);}{ print $0, int(rand()*4) }' | head
seq 1 1000 | awk 'BEGIN{srand(1);}{ print $0, int(rand()*4) }'| awk '{print $2}' | sort | uniq -c
# the same groups but from 1 to 4
seq 1 1000 | awk 'BEGIN{srand(1);}{ print $0, 1+ int(rand()*4) }' | head
seq 1 1000 | awk 'BEGIN{srand(1);}{ print $0, 1 + int(rand()*4) }'| awk '{print $2}' | sort | uniq -c
# with a random seed
seq 1 1000 | awk -v seed=$RANDOM 'BEGIN{srand(seed);}{ print $0, 1+ int(rand()*4) }'| awk '{print $2}' | sort | uniq -c
# suppose you have to do a cross-validation
seq 1 1000 > mydata
awk -v seed=$RANDOM 'BEGIN{srand(seed);}{ print $0, 1+ int(rand()*4) }' mydata > group.dat
for i in $(seq 1 4)
do
awk -v group=$i '$2!=group' group.dat > training.dat
wc -l training.dat
# here you can run blupf90 (for example) with the training.dat data set and collect the solutions for the validation group:
# blupf90 renf90.par
# awk -v group=$i '$2==group' group.dat > val.dat
# awk '$2==(the animal effect)' solutions > sol.anim
# awk 'FNR==NR {a[$1]; next} ($3 in a)' val.dat sol.anim >> sol.val
done
With this tip, you can generate random groups of almost the same size. Since the assignment is random, the groups will not have exactly the same size. If you need groups of identical size, there are other options.
To obtain groups of equal size we can repeatedly take samples of the data, excluding the already sampled rows each time. For example:
seq 1 1000 > id
seq 0 0.001 0.999 > phen
paste -d " " id phen > mydata
echo "" > myfilter
for runs in $(seq 1 4)
do
awk 'FNR==NR { a[$1]; next } !($1 in a)' myfilter mydata | shuf -n 250 > tmpv
awk 'FNR==NR { a[$1]; next } {if ($1 in a) print $1, "NA"; else print $1, $2}' tmpv mydata > tmp
awk '$2!="NA"' tmp > training.dat
wc -l training.dat
### Perform the analysis with the training data
# here we add the already sampled data to the filter, so in the next round they will be excluded
cat myfilter tmpv > tmp1
mv tmp1 myfilter
done
shuf shuffles the data randomly, and with -n 250 we keep only 250 rows, so we get a random sample of the data. And, because each round filters out the rows already sampled, every loop iteration yields a different random group of the same size.
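If all you need is equal-sized groups, a simpler alternative sketch (not part of the original tip) is to shuffle the rows once with shuf and assign the groups round-robin by row number:

```shell
# Shuffle once, then assign groups 1..4 cyclically by row number.
# Groups are exactly equal whenever the row count is a multiple of
# the number of groups (here 1000 rows -> 4 groups of 250).
seq 1 1000 | shuf | awk '{ print $0, (NR - 1) % 4 + 1 }' > group.dat

# check the group sizes: each group should appear exactly 250 times
awk '{ print $2 }' group.dat | sort | uniq -c
```

Because the group label depends only on the (shuffled) row position, no filtering loop is needed; the trade-off is a single pass instead of the repeated sampling shown above.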
You can also create distant groups based on relationships with kmeans() in R, but I will not show that here, because the idea of this post is to show random approaches.