This post is for people who are interested in cross-validation or who need, for any reason, to create random groups to perform some analysis.
Some years ago, I had to do cross-validation and received from Ignacio Aguilar a simple and efficient tip using awk.
To create random groups in a data set you can use the following in a bash script:
awk -v seed=$RANDOM 'BEGIN{srand(seed);}{ print $0, 1 + int(rand()*4) }' yourdatafile
The main function in the command is rand(). This function returns a random number between 0 and 1; the number can be 0 but will never be 1. First, we define a variable seed with $RANDOM (an internal bash variable) to generate a random starting point; if you want to reproduce the same groups later, replace the seed with a constant. srand(seed) sets the starting point for rand(). In the last part, we print all the data plus one extra column with (in this case) numbers from 1 to 4.
Let’s look at this part of the code in detail:
1 + int(rand()*4)
As I said above, rand() returns a number between 0 and 1 that can be 0 but never 1. So, to obtain a group number we need some extra manipulation. The first step is to multiply the random number by the number of groups we want. In this case, I want 4 groups, so I multiply by 4: rand()*4. If you wanted 10 groups, it would be rand()*10, and so on. After that we truncate the number to an integer with int(rand()*4), and since the result can be 0 we just add 1: 1 + int(rand()*4). This last step is not strictly needed; it just makes the groups start from 1.
So if the generated number is, for example, 0.02, then 0.02*4 = 0.08, int(0.08) = 0, and 1 + 0 = 1. If rand() generates 0.85, then 0.85*4 = 3.4, int(3.4) = 3, and 1 + 3 = 4.
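The worked arithmetic above can be checked directly with awk itself:

```shell
# Map a couple of fixed "rand()" values to their group numbers,
# exactly as in the worked example above.
awk 'BEGIN {
  n = split("0.02 0.85", v, " ")
  for (i = 1; i <= n; i++)
    print v[i], "->", 1 + int(v[i] * 4)
}'
# prints:
# 0.02 -> 1
# 0.85 -> 4
```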
In the end, you just filter the data by group and run the analysis you want, leaving out one group at a time.
Examples:
# groups from 0 to 3 using a constant as seed
seq 1 1000 | awk 'BEGIN{srand(1);}{ print $0, int(rand()*4) }' | head
seq 1 1000 | awk 'BEGIN{srand(1);}{ print $0, int(rand()*4) }'| awk '{print $2}' | sort | uniq -c
# the same groups but from 1 to 4
seq 1 1000 | awk 'BEGIN{srand(1);}{ print $0, 1+ int(rand()*4) }' | head
seq 1 1000 | awk 'BEGIN{srand(1);}{ print $0, 1 + int(rand()*4) }'| awk '{print $2}' | sort | uniq -c
# with a random seed
seq 1 1000 | awk -v seed=$RANDOM 'BEGIN{srand(seed);}{ print $0, 1+ int(rand()*4) }'| awk '{print $2}' | sort | uniq -c
# suppose you have to do a cross-validation
seq 1 1000 > mydata
awk -v seed=$RANDOM 'BEGIN{srand(seed);}{ print $0, 1+ int(rand()*4) }' mydata > group.dat
for i in $(seq 1 4)
do
awk -v group=$i '$2!=group' group.dat > training.dat
wc -l training.dat
# here you can run blupf90 (for example) with the training.dat data set and collect the solutions for the validation group:
# blupf90 renf90.par
# awk -v group=$i '$2==group' group.dat > val.dat
# awk '$2==(the animal effect)' solutions > sol.anim
# awk 'FNR==NR {a[$1]; next} ($3 in a)' val.dat sol.anim >> sol.val
done
With this tip, you can generate random groups of almost the same size. Since the assignment is random, the groups will not have exactly the same size. If you need groups of identical size, there are other options.
To obtain groups of equal size we can repeatedly take samples of the data, excluding the already sampled rows each time. For example:
seq 1 1000 > id
seq 0 0.001 0.999 > phen
paste -d " " id phen > mydata
echo "" > myfilter
for runs in $(seq 1 4)
do
awk 'FNR==NR { a[$1]; next } !($1 in a)' myfilter mydata | shuf -n 250 > tmpv
awk 'FNR==NR { a[$1]; next } {if ($1 in a) print $1, "NA"; else print $1, $2}' tmpv mydata > tmp
awk '$2!="NA"' tmp > training.dat
wc -l training.dat
### Perform the analysis with the training data
# here we add the already sampled data to the filter, so in the next round they will be excluded
cat myfilter tmpv > tmp1
mv tmp1 myfilter
done
shuf shuffles the data randomly, and with -n 250 we keep only 250 rows, so we get a random sample of the data. And, because each round filters out the rows already sampled, every loop iteration yields a different random group of the same size.
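If all you need is equal-sized groups, a simpler alternative sketch (not part of the original tip) is to shuffle the rows once with shuf and assign the groups round-robin by row number:

```shell
# Shuffle once, then assign groups 1..4 cyclically by row number.
# Groups are exactly equal whenever the row count is a multiple of
# the number of groups (here 1000 rows -> 4 groups of 250).
seq 1 1000 | shuf | awk '{ print $0, (NR - 1) % 4 + 1 }' > group.dat

# check the group sizes: each group should appear exactly 250 times
awk '{ print $2 }' group.dat | sort | uniq -c
```

Because the group label depends only on the (shuffled) row position, no filtering loop is needed; the trade-off is a single pass instead of the repeated sampling shown above.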
You can also create distant groups based on relationships with kmeans() in R, but I will not show that here, because the idea of this post is to show random approaches.