R has a special type of object called a factor. Put simply, a factor is a categorical variable, and it is very important in statistical modeling (for more information, see the R help page ?factor, or you can find a good review here).
It is a very efficient way to store character values when many of them are repeated, because it stores each distinct label once and encodes the values numerically.
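As a rough illustration of the storage point, the sketch below builds a long character vector with only three distinct values and compares its size with the equivalent factor. The vector length and the names ch and f are assumptions made for this example, and the exact byte counts depend on the platform (on a typical 64-bit build the factor should come out at roughly half the size).

```r
# Hypothetical example: one million values drawn from three labels
set.seed(1)
ch <- sample(c("d", "e", "v"), 1e6, replace = TRUE)
f <- factor(ch)

# The factor keeps one integer code per element and each label only once
typeof(f)
nlevels(f)

# Compare the memory footprint of the two representations
object.size(ch)
object.size(f)
```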
Let’s look at two simple examples of factors.
Using character values:
> z=factor(c("v","d","v","e","v","d","e","d","v","e"))
> z
[1] v d v e v d e d v e
Levels: d e v
> str(z)
Factor w/ 3 levels "d","e","v": 3 1 3 2 3 1 2 1 3 2
> table(z)
z
d e v
3 3 4
Using numeric values:
> set.seed(1)
> x=factor(sample(c(6000:6003),10,replace = TRUE))
> x
[1] 6000 6003 6002 6000 6001 6000 6002 6002 6001 6001
Levels: 6000 6001 6002 6003
> str(x)
Factor w/ 4 levels "6000","6001",..: 1 4 3 1 2 1 3 3 2 2
> table(x)
x
6000 6001 6002 6003
3 3 3 1
As you can see, printing a factor directly shows its values and levels, but str() (which inspects the structure of an object) reveals that R internally uses integers to represent the different levels:
> str(z)
Factor w/ 3 levels "d","e","v": 3 1 3 2 3 1 2 1 3 2
> str(x)
Factor w/ 4 levels "6000","6001",..: 1 4 3 1 2 1 3 3 2 2
For z, d=1, e=2 and v=3; for x, 6000=1, 6001=2, 6002=3 and 6003=4.
Sometimes we want to perform operations on the values of a factor. For example, suppose we want the mean of x, which computed by hand is:
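The mapping between labels and codes can also be inspected directly. The short sketch below rebuilds z and uses levels() and as.integer(), both base R functions, to pull the two parts of the factor apart and put them back together.

```r
z <- factor(c("v", "d", "v", "e", "v", "d", "e", "d", "v", "e"))

# The labels; by default they are the sorted unique values
levels(z)

# The integer code stored for each element
as.integer(z)

# Looking up each code in the labels recovers the original values
levels(z)[as.integer(z)]
```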
> (6000 + 6003 + 6002 + 6000 + 6001 + 6000 + 6002 + 6002 + 6001 + 6001)/10
[1] 6001.2
We can try to use mean() with x as a factor:
> mean(x)
[1] NA
Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA
The problem is that mean() needs a numeric (or logical) object. We can try to use as.numeric():
> mean(as.numeric(x))
[1] 2.2
But as.numeric() converts the internal integer codes, not the labels, so we get the mean of the codes (2.2) instead of the mean of the original values.
There are two ways (as far as I know) to fix the problem.
The first is to convert the factor to character before converting it to numeric:
> mean(as.numeric(as.character(x)))
[1] 6001.2
The second, slightly more efficient, is to convert only the levels to numeric with levels() and then index the result by the factor:
> mean(as.numeric(levels(x))[x])
[1] 6001.2
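If this conversion is needed often, it can be wrapped in a small function. The sketch below is only one possible way to do it: factor_to_numeric is a hypothetical name, not a base R function, and it assumes that every level is the text form of a number (levels that are not numbers would become NA with a warning). It converts each distinct level once and then indexes by the factor, the form recommended in the ?factor help page.

```r
# Hypothetical helper (not part of base R): recover the numeric values
# of a factor whose labels are numbers
factor_to_numeric <- function(f) {
  stopifnot(is.factor(f))
  # Convert each distinct level once, then index by the integer codes
  as.numeric(levels(f))[f]
}

# Same values as the x used above, written out explicitly
x <- factor(c(6000, 6003, 6002, 6000, 6001, 6000, 6002, 6002, 6001, 6001))
mean(factor_to_numeric(x))
```

The result should match the mean computed by hand earlier, 6001.2.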