Dealing with missing data when subsetting valuesrstats ·
Working with data frames can be tricky at first. For example it seems to be very logical especially for a not really experienced R users to manage the rows subsettings by using square brackets such like this: example_df[column_1 == “A”, ] . Actually It works well but only that cases when there is no missing value in the data frame. In this little introductory post I am going to show the proper way how to deal with missing value in similar situations.
So first of all let me give you a small example which shows the problem that I’ve already mentioned. When I try to subset observations in the mtcars dataset the result is fine as we expected:
Now lets try the same thing in that case when there is missing data on the data frame object:
It seems as though there is a problem with the output. It is because R doesn’t understand how to deal with NA’s (unless we tell the instruction) when there is a logical condition. Cells which contains missing values is not smaller or greater than e.g 20 in the “Ozone” column. It is just a missing value that’s why R attach this rows to the selected ones. This results usually unexpected outputs.
You can tell to R that there are missing values on the dataset so needs to be care about it. The is.na() function would be the quick fix for this situations.
Honestly I do not really recommend this way to handle the problem because it easily becomes chaotic when there are many logical conditions.
One better solution:
It is much better to use which() function because in that case you don’t need to use is.na() anymore inside square bracket. The code becomes also simpler than the previous one and result remains the same of course:
Almost the best solution:
One of the biggest advantage of the subset() function is that it do not needs to always reference the name of the data frame inside square bracket. This so called feature makes our code even simpler:
My personal recommendation:
There are popular packages which are specialized for data manipulation. Actually it is worth to learn the use of these packages. Usually these are faster than the basic functions and also there are many useful features. Just check both of if and pick one which fits better for you.
The dplyr solution:
The data.table solution:
There are many ways how to avoid unexpected output when you need to subset rows in data frame object which has missing values. All of the solutions I’ve showed handle the problem in some way but as a programmer I would recommend that you have to always care about clean code so in this situation pick the simplest code which results the same output. So use the dplyr or data.table package or if you don’t prefer to use packages for some reasons function subsets() would also works well.