Chapter 4 Missing values
4.1 Missing Values Patterns Plots
For this section, we did the missing value analysis on the finance table we found on wikipedia.
Missing Values Patterns Plots for Counts
Missing Values Patterns Plots for Percent
4.2 Reasoning and Intepretation
We want to discover the reason behind the missing values, so we first print out the missing patterns of the finance table.
## Year Revenue NetIncome TotalAssets Employees
## Min. :2000 Min. : 5363 Min. : -25 Min. : 6021 Min. : 14800
## 1st Qu.:2005 1st Qu.: 13931 1st Qu.: 1328 1st Qu.: 11516 1st Qu.: 33725
## Median :2010 Median : 65225 Median :14013 Median : 75183 Median : 76550
## Mean :2010 Mean :111160 Mean :23818 Mean :142555 Mean : 77388
## 3rd Qu.:2015 3rd Qu.:215639 3rd Qu.:45687 3rd Qu.:290345 3rd Qu.:117750
## Max. :2020 Max. :274515 Max. :59531 Max. :375319 Max. :147000
## NA's :5
## Year Revenue NetIncome TotalAssets Employees
## 1 FALSE FALSE FALSE FALSE TRUE
## 2 FALSE FALSE FALSE FALSE TRUE
## 3 FALSE FALSE FALSE FALSE TRUE
## 4 FALSE FALSE FALSE FALSE TRUE
## 5 FALSE FALSE FALSE FALSE TRUE
## 6 FALSE FALSE FALSE FALSE FALSE
## 7 FALSE FALSE FALSE FALSE FALSE
## 8 FALSE FALSE FALSE FALSE FALSE
## 9 FALSE FALSE FALSE FALSE FALSE
## 10 FALSE FALSE FALSE FALSE FALSE
## 11 FALSE FALSE FALSE FALSE FALSE
## 12 FALSE FALSE FALSE FALSE FALSE
## 13 FALSE FALSE FALSE FALSE FALSE
## 14 FALSE FALSE FALSE FALSE FALSE
## 15 FALSE FALSE FALSE FALSE FALSE
## 16 FALSE FALSE FALSE FALSE FALSE
## 17 FALSE FALSE FALSE FALSE FALSE
## 18 FALSE FALSE FALSE FALSE FALSE
## 19 FALSE FALSE FALSE FALSE FALSE
## 20 FALSE FALSE FALSE FALSE FALSE
## 21 FALSE FALSE FALSE FALSE FALSE
As we can see from the missing pattern table, the missing values of employees occur in the first five rows. Since our year column is in chronological order, we know that the five missing rows in Employees column is positively correlated with the year column. Therefore, we want to print out the first five missing rows and the next few employees rows without missing values. By comparing the values of these rows, we may discover the reason behind the missing values in Employees column in the first five years.
## # A tibble: 6 × 5
## Year Revenue NetIncome TotalAssets Employees
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000 7983 786 6803 NA
## 2 2001 5363 -25 6021 NA
## 3 2002 5742 65 6298 NA
## 4 2003 6207 69 6815 NA
## 5 2004 8279 274 8050 NA
## 6 2005 13931 1328 11516 14800
After we apply the plot_missing
function on the finance table from Wikipedia, we find that Apple’s finance table only has two missing patterns, and the missing values are all occurred in the employee’s column. Also, by looking at the top plot of the missing graph, we find out that there are five rows of missing values in the finance table. The year of these missing values is from 2000 to 2004, which is very early, so the employee’s data could be missing. This could be one of the main reasons.
From 2004 to 2005, we can see that the change is more significant compared to the changes in previous years. Thus, we think another possible reason could be that from 2000 - 2004, the number of employees does not matter too much; while from 2004-2005, the company’s rapid development, more information is needed. So since then, the number of employees started to be collected.