Forecast Hungarian higher education data in R

I had to forecast the future number of students enrolled in Hungarian higher education. I did the whole job in R, of course.

In Hungary the most important higher education statistics are published every year, so you can easily access the data here. I chose the most typical group of students: those who enrolled directly after secondary education.
I wanted to use just a simple method, so I chose Holt’s exponential smoothing (here is a basic description of how to forecast in R, btw). I recommend using the forecast package in similar cases.
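
To give an idea of what such a forecast looks like in code, here is a minimal sketch. It is not the code behind the animation: it uses HoltWinters() from base R with gamma = FALSE (which reduces it to Holt's method with level and trend only) instead of the forecast package, and the yearly counts are invented for illustration.

```r
# Synthetic yearly counts, invented for illustration (NOT real enrollment data)
set.seed(1)
enrolled <- ts(round(60000 - 1500 * (0:14) + rnorm(15, sd = 1200)), start = 2002)

# Holt's exponential smoothing: level + trend, no seasonal component
fit <- HoltWinters(enrolled, gamma = FALSE)

# Forecast five years ahead with 95% prediction intervals
fc <- predict(fit, n.ahead = 5, prediction.interval = TRUE, level = 0.95)
print(fc)
```

The forecast package offers a holt() function for the same method, with its own plotting helpers.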

The output is a GIF animation which was made with the package gganimate:

The forecast is quite OK according to the correlogram and the Ljung-Box test, but because of the small number of observations the assumption of normally distributed forecast errors does not seem to be fully met.
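
These checks can be sketched in base R as follows. The series is again synthetic, and the Shapiro-Wilk test is my assumption for how the normality of the errors could be checked; the post itself does not say which check was used.

```r
# Synthetic yearly counts, invented for illustration (NOT real enrollment data)
set.seed(1)
enrolled <- ts(round(60000 - 1500 * (0:14) + rnorm(15, sd = 1200)), start = 2002)
fit <- HoltWinters(enrolled, gamma = FALSE)

# In-sample one-step-ahead forecast errors
errors <- residuals(fit)

# Correlogram values (set plot = TRUE to draw the chart)
acf_values <- acf(errors, lag.max = 8, plot = FALSE)
print(acf_values)

# Ljung-Box test: a large p-value gives no evidence of autocorrelation
lb <- Box.test(errors, lag = 8, type = "Ljung-Box")
print(lb)

# Shapiro-Wilk test of normality; it has little power on so few observations
sw <- shapiro.test(as.numeric(errors))
print(sw)
```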

According to the results a decrease is likely, which is sad because on most tertiary education statistics Hungary already performs worse than the OECD average.

Bump Charts in R

Recently I found this guy who creates beautiful charts in Tableau. I especially like this Bump Chart style visualization. I wondered whether it would be easy to reproduce in R, so I gave it a try.

I used the Hungarian first name database which I already showed in the previous post. I uploaded it to data.world so you can download the whole database. A Bump Chart is, in other words, just a simple line chart with a minimal tweak, but this kind of plot is useful for visualizing ranks. Here is my implementation in R:

This visualization shows how the top 10 male first names of 2016 moved in the yearly ranking between 2000 and 2016. Some names were not in the top 10 throughout the selected period, which is why there is a 10+ line at the bottom. You can highlight any name by clicking on it, or you can select one from the drop-down list.

I would also like to publish the code to help reproduce my work. I used Shiny, so there are two separate files.

server.r

library(openxlsx)
library(ggplot2)
library(plotly)
library(ggthemr)
library(crosstalk)
library(shiny)
library(scales)
 
#-------------------------------------
# LOAD THE DATABASE
#-------------------------------------
 
database = read.xlsx("Hungarian_first_and_middle_name_db_1954_2016.xlsx", startRow = 1, colNames = TRUE)
 
##### filter the years

db = database[database$YEAR >= 2000,]
 
##### top10 names in 2016
 
top10_name = db$NAME_MALE[db$YEAR == 2016 & db$RANK <= 10]
 
#-------------------------------------
# SHINY APP
#-------------------------------------
 
shinyServer(
  function(input, output) {

    # Prepare the yearly ranks of the 2016 top10 male names
    data <- reactive({
      db = database[database$YEAR >= 2000,]

      # one row per YEAR x NAME_MALE pair; Freq holds the rank
      # (0 means the name is missing from the database in that year)
      db = as.data.frame(xtabs(RANK ~ YEAR + NAME_MALE, data = db))

      ###### select only the top10
      top10_name = db$NAME_MALE[db$YEAR == 2016 & db$Freq <= 10 & db$Freq > 0]
      db = db[db$NAME_MALE %in% top10_name,]

      ###### override all the values which are greater than 10
      db$Freq[db$Freq > 10] = 11

      db = cbind(db, label = db$Freq)
      db$label[db$label == 11] = "10+"

      db
    })

    output$plot <- renderPlotly({
      pdf(NULL) # null graphics device, avoids the 'Rplots.pdf' error on a server
      db = data()

      db$YEAR = as.numeric(as.character(db$YEAR))

      sd <- SharedData$new(db, ~NAME_MALE, group = "Choose the first name You want to highlight")
      gg = ggplot(sd, aes(x = YEAR, y = Freq, colour = NAME_MALE, text = NAME_MALE)) +
        geom_point(size = 8) +
        geom_line(size = 1.1) +
        geom_text(aes(label = paste0("#", label)), color = "white", size = 3.5) +
        # a single reversed y scale so that rank 1 is on the top; a separate
        # scale_y_continuous() would simply be replaced by scale_y_reverse()
        scale_y_reverse("", breaks = 1:11, labels = c(1:10, "10+")) +
        scale_x_continuous("", breaks = seq(2000, 2016, 1)) +
        guides(colour = guide_legend(override.aes = list(size = 1))) +
        theme(legend.position = "none",
              axis.title.y = element_blank(),
              axis.text.y = element_blank(),
              axis.ticks.y = element_blank(),
              panel.background = element_rect(fill = '#34495e'),
              panel.grid.major = element_blank())

      # make the plot interactive: click to highlight, drop-down to select
      gg <- ggplotly(gg, tooltip = c("text")) %>%
        highlight(on = "plotly_click", persistent = FALSE, selectize = TRUE)

      gg
    })
  }
)

ui.r

library(shiny)
library(plotly) # provides plotlyOutput()

shinyUI(fluidPage(
  plotlyOutput("plot", height = "700px", width = "100%")
))
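
If you want to try the data preparation step without Shiny, here is a small sketch on a toy table. The names and ranks are invented and only the column names follow the database; it shows how the ranks are reshaped and how everything below the tenth place is collapsed into the 10+ line.

```r
# Toy ranking table with invented names and ranks (column names as in the database)
ranks <- data.frame(
  YEAR      = rep(2014:2016, each = 3),
  NAME_MALE = rep(c("NameA", "NameB", "NameC"), times = 3),
  RANK      = c(1, 2, 12,   1, 3, 11,   1, 2, 3)
)

# One row per YEAR x NAME_MALE pair, the rank ends up in the Freq column
long <- as.data.frame(xtabs(RANK ~ YEAR + NAME_MALE, data = ranks))

# Collapse every rank below the tenth place into a single "10+" level
long$Freq[long$Freq > 10] <- 11
long$label <- ifelse(long$Freq == 11, "10+", as.character(long$Freq))

print(long)
```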

Hungarian version of FiveThirtyEight age estimator

I tried to reproduce the FiveThirtyEight age estimator with Hungarian data. The methodology is much the same. Unfortunately there is no database as accurate as the one the US version uses. In Hungary only the 100 most common names have been collected and published here. The biggest problem is that before 2000 only date ranges were published, so I assumed a uniform distribution within those years. However, I found an up-to-date life table thanks to the WHO.

I tried to create something similar to this in R. I used Shiny and Plotly for the visualization. Here is the result:

The dark grey line shows the number of births with the selected name in each year, while the blue area is the number of births adjusted by the life table of Hungary, i.e. the estimated number of people still alive. The vertical red line shows the median birth year of the living people with the selected name.
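
The calculation behind the chart can be sketched like this. All numbers are invented, and the survival curve is a toy formula standing in for the real WHO life table, so treat it only as an illustration of the steps: spread the range totals uniformly over the years, multiply by the probability of being alive, then take the median birth year of the survivors.

```r
# Births published only as totals for year ranges (invented numbers)
ranges <- data.frame(from = c(1960, 1970), to = c(1969, 1979),
                     births = c(5000, 8000))

# Assume a uniform distribution of births within each range
yearly <- do.call(rbind, lapply(seq_len(nrow(ranges)), function(i) {
  years <- ranges$from[i]:ranges$to[i]
  data.frame(year = years, births = ranges$births[i] / length(years))
}))

# Probability of being alive in the reference year
# (toy curve, NOT the real life table)
ref_year <- 2016
age <- ref_year - yearly$year
yearly$p_alive <- pmax(0, 1 - (age / 100)^3)

# Estimated number of people still alive per birth year
yearly$alive <- yearly$births * yearly$p_alive

# Median birth year of the living: first year where the cumulative share reaches 50%
cum_share <- cumsum(yearly$alive) / sum(yearly$alive)
median_year <- yearly$year[which(cum_share >= 0.5)[1]]
print(median_year)
```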

Life without Shiny #1: Tableau

I wondered about the competitors of RStudio Shiny. First of all, I really like Shiny and I am also a big fan of R. This is just an experiment for me and a way to broaden my knowledge, because I haven’t really tried anything besides Shiny.

In this little series I will introduce some well-known alternatives. I will try to reproduce this dashboard, which was made in R, and recreate all of its functionality in the chosen program. I chose this dashboard because

  • it is simple
  • its data source is free and I already know it
  • it has some popular chart types like a line chart, a pie chart and a map (there is also a table on the dashboard)
  • it has some basic required functions like drop-down filters and single and multiple plot selectors

I chose Tableau first because it is the market-leading BI reporting tool. Unfortunately Tableau is not free and open source (unlike R), but luckily there is a free version called Tableau Public. It is mainly for educational use and for journalists. Of course it has many restrictions, for example you can’t make private dashboards or charts, but it is perfect for just trying it out.

I had never used this program before, so first I took an online course called Tableau 10 for Data Scientists. I recommend it, btw. The course takes around 2 hours. After it I could build the dashboard quite easily. It took 2-3 hours to create it:

I am beginning to like this program because it is really simple and user friendly, and there is a big user community behind it. For example, when I got stuck I could find a solution after a 5-minute Google search. Simplicity is its greatest strength (for example, you can easily create maps by drag and drop because the countries are automatically recognised by their names, and if the recognition fails you can easily make the correction manually), so I am very curious how it copes with more advanced requirements. You can also transform the data and create new functions and variables within the program; it has its own language. There are several ways to publish the output. For example, you can embed the dashboard in your website (like I did) after publishing it to the Tableau Public server. I miss this kind of feature when I use Shiny.

It is also possible to use R within Tableau. This is the next topic I would like to explore. I already understand the basics; now I would like to learn its limitations compared to Shiny.

Strange bug when You use Plotly on Your own server

In the past I ran my Shiny apps only locally and on shinyapps.io, but now I’ve got a VPS. When I migrated my previous apps to the VPS I found a strange bug. Whenever my code contained a ggplot-based Plotly plot, I got an error message: “An error has occurred. Check your logs or contact the app author for clarification.”

I checked the log file which said:

Warning: Error in : cannot open file 'Rplots.pdf'
Stack trace (innermost first):
    88: 
    87: grid.Call
    86: convertUnit
    85: convert
    84: unitConvert
    83: %||%
    82: gg2list
    81: ggplotly.ggplot
    80: ggplotly
    79: ggplotly [/srv/shiny-server/survey_from_gf/server.R#136]
    78: func
    77: origRenderFunc
    76: output$plotlyBar
     1: runApp

Fortunately I found this blog post, which says I only need to add pdf(NULL) to my code to solve the problem.

So if you have the same issue, you only need to put it at the start of the renderPlotly function like this:

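library(shiny)
library(ggplot2)
library(plotly)

shinyServer(
  function(input, output) {
    output$plotlyPlot <- renderPlotly({
      pdf(NULL) # open a null PDF device so that no 'Rplots.pdf' is written

      # store the ggplot object so that it can be passed to ggplotly()
      p <- ggplot(mtcars, aes(factor(cyl))) +
        geom_bar()

      ggplotly(p)
    })
  })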
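
Why does this help? As far as I understand it, ggplotly needs an open graphics device to measure sizes, and when no device is open R tries to create the default 'Rplots.pdf' in the working directory, which the server user may not be allowed to write to. With file = NULL the pdf device creates no file at all but can still be queried. Here is a small sketch outside of Shiny that shows this behaviour (the directory name is just a temporary one made up for the demo):

```r
# Work in an empty temporary directory so that any created file would be visible
work_dir <- tempfile("null_device_demo")
dir.create(work_dir)
old_wd <- setwd(work_dir)

pdf(NULL)                                      # null PDF device, no file behind it
plot.new()
w <- strwidth("Hungary", units = "inches")    # measuring text still works
invisible(dev.off())

files_created <- list.files(work_dir)
setwd(old_wd)

print(w)
print(files_created)                           # nothing was written to disk
```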

Write to me if you know a more elegant way to fix this bug.

Free data sources from data.world

Sometimes it’s hard to find a good data source for side projects, especially when you want to use survey data. I found an interesting site called data.world. More than a thousand databases are freely available there after registration.

It has a cool feature: you can export the data directly to R, so you do not need to download it to your local drive.

However, this function is sometimes tricky to use because the file format is not always .csv as it assumes. In that case you can of course use the link of the data source instead.

Here is a minimal example of how you can use it:

library(openxlsx)
library(googleVis)
 
# set working directory where You want to download the database
# setwd("C:/Users/yourName/Desktop")
 
download.file("https://query.data.world/s/9k1dnvrr5ykop5r89vnhwb7na", "database.xlsx", mode="wb")
 
# load the data with the openxlsx package
db = read.xlsx("database.xlsx", startRow = 1, colNames = TRUE)
 
# aggregate the county level data to state level (STNAME is the state name)
db = aggregate(db[,"POP2010"], by=list(db[,"STNAME"]), FUN=sum, na.rm = TRUE)
 
# plot it with googleVis package
GeoStates <- gvisGeoChart(db, "Group.1", "x",
                          options=list(region="US", 
                                       displayMode="regions", 
                                       resolution="provinces",
                                       width=600, height=400))
plot(GeoStates)
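
If you wonder where the "Group.1" and "x" column names in the chart call come from: they are the default names aggregate() produces when it gets a plain vector and an unnamed grouping list. A tiny sketch with invented numbers (only the column names follow the dataset above):

```r
# Toy county level table with invented population numbers
db <- data.frame(
  STNAME  = c("Texas", "Texas", "Ohio"),
  POP2010 = c(100, 250, 75)
)

# Same call pattern as above: the result columns are named Group.1 and x
by_state <- aggregate(db[, "POP2010"], by = list(db[, "STNAME"]), FUN = sum, na.rm = TRUE)
print(by_state)

# The formula interface keeps the original column names instead
by_state_named <- aggregate(POP2010 ~ STNAME, data = db, FUN = sum)
print(by_state_named)
```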

Where to find and how to use NUTS2 level maps in R

There are many places to find maps that work well in R, and country level maps are especially easy to find.

TIP1: GADM database + basic plot function

For example, GADM is an awesome site where you can freely download any country map in SpatialPolygonsDataFrame format. Most countries have multiple levels.

Here is a simple example (I will use the map of Hungary throughout):

library(sp) # provides the plot method for SpatialPolygonsDataFrame objects

#load the data
download.file("http://biogeo.ucdavis.edu/data/gadm2.8/rds/HUN_adm1.rds",
              "HUN_adm1.rds", mode = "wb")
countries = readRDS("HUN_adm1.rds")
 
#check the structure of the data
countries@data
 
#simply plot it with random color
plot(countries, col = colorRampPalette(c("white", "red"))(nrow(countries@data)))

TIP2: googleVis

There is also a cool interactive web-based solution thanks to Google (and of course the creators of the googleVis package), which also supports most countries:

library(googleVis)
library(sp)
 
#create random database with county codes 
countyName = c("HU-BU",
                "HU-BK",
                "HU-BA",
                "HU-BE",
                "HU-BZ",
                "HU-CS",
                "HU-FE",
                "HU-GS",
                "HU-HB",
                "HU-HE",
                "HU-JN",
                "HU-KE",
                "HU-NO",
                "HU-PE",
                "HU-SO",
                "HU-SZ",
                "HU-TO",
                "HU-VA",
                "HU-VE",
                "HU-ZA")
randomData = runif(length(countyName),0,100)
exampleData <- data.frame(countyName, randomData)
 
GeoMaps <- gvisGeoChart(exampleData, "countyName", "randomData",
                          options=list(region="HU", 
                                       displayMode="regions", 
                                       resolution="provinces",
                                       width=600, height=400))
plot(GeoMaps)

TIP3: Eurostat geodata + basic plot function

It was really hard for me to find NUTS2 level country maps, but finally I came across the geodata of Eurostat. I recommend using the 1:1 million scale if you want to plot countries.

library(rgdal)
 
#download the file
temp <- tempfile(fileext = ".zip")
download.file("http://ec.europa.eu/eurostat/cache/GISCO/geodatafiles/NUTS_2013_01M_SH.zip", temp)
unzip(temp)
 
#load the data and filter it to Hungary and NUTS2 level
EU_NUTS = readOGR(dsn = "./NUTS_2013_01M_SH/data", layer = "NUTS_RG_01M_2013")
map_nuts2 <- subset(EU_NUTS, STAT_LEVL_ == 2) # set NUTS level
country <- substring(as.character(map_nuts2$NUTS_ID), 1, 2)
map <- c("HU") # limit it to Hungary
map_nuts2a <- map_nuts2[country %in% map,]
 
#plot it
plot(map_nuts2a, col = colorRampPalette(c("white", "red"))(nrow(map_nuts2a@data)))
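
The filtering step relies on two things: the STAT_LEVL_ column holds the NUTS level, and every NUTS_ID starts with the two-letter country code. Here is the same logic on a plain data frame, so you can follow it without downloading the shapefile. The table is a hand-made stand-in for the attribute table, not the real Eurostat data:

```r
# Hand-made stand-in for the attribute table of the shapefile
nuts <- data.frame(
  NUTS_ID    = c("HU", "HU1", "HU10", "HU21", "AT1", "AT11"),
  STAT_LEVL_ = c(0, 1, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Keep only the NUTS2 level rows
nuts2 <- subset(nuts, STAT_LEVL_ == 2)

# The first two characters of the ID are the country code
country <- substring(as.character(nuts2$NUTS_ID), 1, 2)

# Limit it to Hungary
hun_nuts2 <- nuts2[country %in% c("HU"), ]
print(hun_nuts2)
```

To select more countries, just extend the vector of country codes, for example c("HU", "AT").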

Bonus!

When I used the geodata for my project I also used the cartography package, which is an easy-to-use map creator.
Here is a small example of how you can use it:

library(cartography)
 
plot(map_nuts2a)
 
cols <-	 carto.pal(pal1 = "green.pal", n1 = nrow(map_nuts2a@data)+1)
nuts2_id = map_nuts2a@data[,"NUTS_ID"]
value = runif(nrow(map_nuts2a@data),0,50)
hun_nuts2_df = data.frame(nuts2_id, value)
 
choroLayer(spdf = map_nuts2a, # SpatialPolygonsDataFrame of the regions
           df = hun_nuts2_df, # target data frame 
           var = "value", # target value
           breaks = c(0,5,10,15,20,25,30,35,100), # list of breaks
           col = cols, # colors 
           border = "white", # color of the polygons borders
           lwd = 2, # width of the borders
           legend.pos = "right", # position of the legend
           legend.title.txt = "",
           legend.values.rnd = 2, # number of decimal in the legend values
           add = TRUE) # add the layer to the current plot
 
labelLayer(spdf = map_nuts2a, # SpatialPolygonsDataFrame used to plot the labels
           df = hun_nuts2_df, # data frame containing the labels
           txt = "nuts2_id", # label field in df
           col = "black", # color of the labels
           cex = 0.9, # size of the labels
           font = 2)  # label font
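
A note on the breaks and colors: I assume the usual convention that a vector of n breaks defines n - 1 classes, so the 9 breaks above need 8 colors. The following base R sketch shows how values fall into such classes. It only illustrates the idea with random values and a made-up palette; it is not the internal code of the cartography package.

```r
breaks <- c(0, 5, 10, 15, 20, 25, 30, 35, 100)

# n breaks define n - 1 classes, so that many colors are needed
n_classes <- length(breaks) - 1
cols <- colorRampPalette(c("white", "darkgreen"))(n_classes)

# random values for 7 regions, like in the example above
set.seed(42)
value <- runif(7, 0, 50)

# index of the class each value falls into, then the matching color
class_id <- findInterval(value, breaks, rightmost.closed = TRUE)
region_col <- cols[class_id]

print(data.frame(value = round(value, 1), class_id, region_col))
```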

Please let me know if you have another source of NUTS2 level geodata that is also compatible with R, especially if there is an (interactive) JavaScript-based solution.