Forecasting in R: a People Analytics Tool

The last 6 months have, more than ever, emphasized the importance of knowing what is coming. In this article, we take a closer look at forecasting, which can be applied to a range of HR-related topics. We examine how forecasting models can be deployed in R and end with an example analysis of the rising popularity of the term “people analytics”.

The goal is to know what’s coming…

Predictions come in different shapes and sizes. There are many Supervised Machine Learning algorithms that can generate predictions of outcomes, such as flight risk, safety incidents, performance and engagement outcomes, and personnel selection. These examples represent the highly popular realm of “Predictive Analytics”. 

However, a less mainstream topic in the realm of prediction is that of “Forecasting”, often referred to as Time Series Analysis. In a nutshell, Forecasting uses a sequence of values observed over time (e.g., the closing price of a stock over 120 days) to estimate likely future values.

The main difference between predictive analytics and forecasting is best characterized by the data used. Generally, forecasting relies upon historical data, and the patterns identified therein, to predict future values. 

An HR-related example would be using historical rates of attrition in a business or geography to forecast future rates of attrition. In contrast, predictive analytics uses a variety of additional variables, such as company performance metrics, economic indicators, employment data, and so on, to predict future rates of turnover. Depending upon the use case, there is a time and a place for both approaches. 
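The contrast between the two approaches can be sketched in a few lines of base R. Everything here is an assumption made for illustration: the attrition rates and the `unemployment` driver are simulated, and simple exponential smoothing and a linear regression stand in for the many methods either approach could use.

```r
# Hypothetical data: 48 months of attrition rates (%) and one external driver.
# All numbers are simulated for illustration only.
set.seed(42)
n_months <- 48
unemployment <- 5 + cumsum(rnorm(n_months, sd = 0.1))
attrition <- 12 - 0.8 * unemployment + rnorm(n_months, sd = 0.3)

# Forecasting: only the history of the series itself is used.
# Simple exponential smoothing (no trend, no seasonality) from base R.
attrition_ts <- ts(attrition, start = c(2017, 1), frequency = 12)
ses_fit <- HoltWinters(attrition_ts, beta = FALSE, gamma = FALSE)
history_forecast <- predict(ses_fit, n.ahead = 6)

# Predictive analytics: additional variables are used to explain the outcome.
driver_fit <- lm(attrition ~ unemployment)
driver_prediction <- predict(driver_fit,
                             newdata = data.frame(unemployment = 5.5))

print(history_forecast)
print(driver_prediction)
```

The first model needs nothing but past attrition, whereas the second needs a value for the external driver before it can say anything about the future.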

In this article, we focus on forecasting and highlight a new library in the R ecosystem called ModelTime (the `modeltime` package). ModelTime enables the application of multiple forecasting models quickly and easily within a tidy framework (readers unfamiliar with R need not worry about this detail).

To illustrate how easy ModelTime is to use, we forecast the future level of interest in the domain of People Analytics using Google Trends data (code included). From there, we discuss potential applications of forecasting supply and demand in the context of HR.

Data Collection

The time-series data we will use for our example comes directly from Google Trends. Google Trends is an online tool that enables users to discover trends in search behavior within Google Search, Google News, Google Images, Google Shopping, and YouTube.

To do so, users are required to specify the following:

  1. A search term (up to four additional comparison search terms are optional), 
  2. A geography (i.e., where the Google Searches were performed),
  3. A time period, and
  4. Google source for searches (e.g., Web Search, Image Search, News Search, Google Shopping, or YouTube Search).

It is important to note that the search data returned does NOT represent actual search volume, but rather a normalized index ranging from 0 to 100. Each value represents search interest relative to the highest search interest during the selected time period. A value of 100 marks the peak popularity of the term. A value of 50 means that the term was half as popular as at its peak. A score of 0 means there was not enough data for the term at that point.
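To make the scaling concrete, here is a minimal sketch of how a 0 to 100 index of this kind behaves. The raw counts and the helper `rescale_index()` are invented for illustration; Google's actual procedure involves additional steps that this sketch does not attempt to reproduce.

```r
# Hypothetical raw monthly search counts (invented for illustration only)
raw_counts <- c(120, 300, 450, 600, 150)

# Scale a series so that its largest value becomes 100
rescale_index <- function(x) round(100 * x / max(x))

# Index over the full period: the peak month (600) maps to 100
full_period <- rescale_index(raw_counts)
print(full_period)    # 20 50 75 100 25

# Index over a shorter period that excludes the peak month:
# the same raw counts now receive different index values
short_period <- rescale_index(raw_counts[1:3])
print(short_period)   # 27 67 100
```

One practical consequence is that index values from requests with different time periods are not directly comparable, because each request is scaled to its own peak.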

# Libraries
library(gtrendsR)
library(tidymodels)
library(modeltime)
library(tidyverse)
library(timetk)
library(lubridate)
library(flextable)


# Google Trends Parameters
search_term   <- "people analytics"
location      <- "" # global
time          <- "2010-01-01 2020-08-01" # uses date format "Y-m-d Y-m-d"
gprop         <- "web"

# Google Trends Data Request
gtrends_result_list <- gtrendsR::gtrends(
    keyword = search_term,
    geo     = location,
    time    = time,
    gprop   = gprop
)

# Data Cleaning
gtrends_search_tbl <- gtrends_result_list %>%
    pluck("interest_over_time") %>%
    as_tibble() %>%
    select(date, hits) %>%
    mutate(date = ymd(date)) %>%
    rename(value = hits)

# Visualization of Google Trends Data
gtrends_search_tbl %>%
    timetk::plot_time_series(date, value)

Time series plot

We can see from the visualization that the term “people analytics” has trended upwards in Google web searches from January 2010 through August 2020. The blue trend line, fitted with a LOESS smoother (i.e., a non-parametric technique that fits a smooth curve through the data using local regressions, without assuming a single global functional form), illustrates a continual rise in interest. The raw data also indicates that search interest in “people analytics”, perhaps unsurprisingly, peaked in June of 2020.

This peak may relate to the impact of COVID-19, specifically the requirement for organizations to deliver targeted ad-hoc reporting on personnel topics during this time. Regardless, People Analytics appears to be growing in importance.
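For readers curious about the LOESS smoother mentioned above, the idea can be sketched with base R's `loess()` function. The series below is simulated rather than taken from Google Trends, and the `span` value is an arbitrary choice; this illustrates the technique and is not a reproduction of how the plot's trend line was drawn.

```r
# Simulated monthly index: upward trend plus noise (not the Google Trends data)
set.seed(123)
month_index <- 1:128
value <- 20 + 0.4 * month_index + rnorm(128, sd = 8)

# span sets the share of observations used in each local fit;
# larger values give a smoother curve
loess_fit <- loess(value ~ month_index, span = 0.75)
trend <- predict(loess_fit)

# Compare the first and last raw values with their smoothed counterparts
print(round(cbind(raw = value, smoothed = trend)[c(1, 128), ], 1))
```

The smoothed series changes far less from month to month than the raw one, which is why the overall direction of travel is easier to read from it.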

Modeling

Let’s move into some Forecasting! The process employed using ModelTime is as follows:

  1. We separate our dataset into “Training” and “Test” datasets. The Training data covers January 2010 to January 2019, while the Test data covers roughly the last 18 months (i.e., February 2019 – August 2020). A visual representation of this split is shown in the image below this list.
  2. The Training data is used to generate an 18-month forecast using several different models. In this article, we have chosen the following models: Exponential Smoothing, ARIMA, ARIMA Boost, Prophet, and Prophet Boost.
  3. The forecasts generated are then compared to the Test data (i.e., actual data) to determine the accuracy of the different models.
  4. Based on the accuracy of the different models, one or more models are then applied to the entire dataset (i.e., Jan 2010 – August 2020) to provide a forecast into 2021. 
Training test data

The R code for steps 1 through 3 is presented below.

# Train/Test
k <- 18

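# Number of whole months between the first and last observation
# (months(1) creates a one-month lubridate period)
no_of_months <- lubridate::interval(
    base::min(gtrends_search_tbl$date),
    base::max(gtrends_search_tbl$date)
) %/% months(1)

# Proportion of the data to keep for training
prop <- (no_of_months - k) / no_of_months

# Hold out the final months (set by k) from the training set so that we can determine the model accuracy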

splits <- rsample::initial_time_split(gtrends_search_tbl, prop = prop)

# visualize the training data (i.e., black line) and test data (i.e., red line)
splits %>%
    tk_time_series_cv_plan() %>%
    plot_time_series_cv_plan(date, value) 

Time series cross validation plan
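The arithmetic behind the split can be checked with base R alone. This sketch assumes one observation per month from January 2010 to August 2020, and that the split assigns the first `floor(n * prop)` rows to training, which is our understanding of `rsample::initial_time_split()`; the variable names are our own.

```r
# Monthly dates matching the article's range (values are not needed here)
dates <- seq(as.Date("2010-01-01"), as.Date("2020-08-01"), by = "month")
n_obs <- length(dates)            # 128 monthly observations

k <- 18
no_of_months <- n_obs - 1         # 127 whole months between first and last date
prop <- (no_of_months - k) / no_of_months

# Assumed split rule: the first floor(n * prop) rows form the training set
n_train <- floor(n_obs * prop)
n_test  <- n_obs - n_train

print(c(train = n_train, test = n_test))
print(dates[n_train])             # last training month
print(dates[n_train + 1])         # first test month
```

Under these assumptions the training set ends in January 2019 and the test set starts in February 2019, as described above. The hold-out therefore contains 19 monthly observations, one more than `k`, because the proportion is computed from the number of whole months rather than the number of rows.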

# Modeling

# Exponential Smoothing
model_fit_ets <- modeltime::exp_smoothing() %>%
    parsnip::set_engine(engine = "ets") %>%
    parsnip::fit(value ~ date, data = training(splits))

## frequency = 12 observations per 1 year

# ARIMA
model_fit_arima <- modeltime::arima_reg() %>%
    parsnip::set_engine("auto_arima") %>%
    parsnip::fit(
        value ~ date,
        data = training(splits))

## frequency = 12 observations per 1 year

# ARIMA Boost
model_fit_arima_boost <- modeltime::arima_boost() %>%
    parsnip::set_engine("auto_arima_xgboost") %>%
    parsnip::fit(
        value ~ date + as.numeric(date) + month(date, label = TRUE),
        data = training(splits))

## frequency = 12 observations per 1 year

# Prophet
model_fit_prophet <- modeltime::prophet_reg() %>%
    parsnip::set_engine("prophet") %>%
    parsnip::fit(
        value ~ date,
        data = training(splits))

# Prophet Boost
model_fit_prophet_boost <- modeltime::prophet_boost() %>%
    parsnip::set_engine("prophet_xgboost") %>%
    parsnip::fit(
        value ~ date + as.numeric(date) + month(date, label = TRUE),
        data = training(splits))

# Modeltime Table
model_tbl <- modeltime_table(
    model_fit_ets,
    model_fit_arima,
    model_fit_arima_boost,
    model_fit_prophet,
    model_fit_prophet_boost)

# Calibrate the model accuracy using the hold out data
calibration_tbl <- model_tbl %>%
    modeltime_calibrate(testing(splits))

calibration_tbl %>%
    modeltime_accuracy() %>%
    flextable() %>%
    bold(part = "header") %>%
    bg(bg = "#D3D3D3", part = "header") %>%
    autofit()

The table below shows the metrics derived when evaluating the accuracy of the respective models on the Test set. While it is beyond the scope of this article to explain the models and their accuracy metrics, a simple rule of thumb when reading the table is that smaller numbers indicate a better model for the error metrics (mae, mape, mase, smape, rmse), while for rsq (R-squared) a larger number is better.

Our models indicate a moderate degree of accuracy. Looking only at the “mape” (Mean Absolute Percentage Error) statistic, the best model (3 – ARIMA with XGBoost Errors) deviates from the actual data by about 10% on average, while the remaining models range from 11% to 13% error.
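As a brief illustration of what these error metrics measure, the sketch below computes MAPE, MAE, and RMSE by hand. The `actual` and `forecast` vectors are invented numbers chosen to keep the arithmetic easy to follow; they are not the article's results.

```r
# Invented hold-out values and forecasts (not the article's results)
actual   <- c(50, 60, 80, 100)
forecast <- c(55, 54, 88, 90)

errors <- actual - forecast

# Mean Absolute Percentage Error: average absolute error relative to the actual value
mape <- mean(abs(errors / actual)) * 100

# Mean Absolute Error: average absolute error in the units of the series
mae <- mean(abs(errors))

# Root Mean Squared Error: penalizes large errors more heavily than MAE
rmse <- sqrt(mean(errors^2))

print(c(mape = mape, mae = mae, rmse = rmse))   # mape 10, mae 7.25, rmse 7.5
```

Here every forecast misses its actual value by exactly 10%, so the MAPE is 10, which is the same order of error reported for the best model above.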

Model accuracy table

The graph below illustrates how the models performed relative to the actual data (i.e., our Test set).

Forecast plot

Based on these metrics, we decided to employ all five models to forecast into 2021. We then average the five forecasts for each month to create an aggregate model. Both the code and the visualizations below show that the ongoing trend for people analytics is one of increasing popularity over time.

# Refit the five models with all data to forecast 12 months (i.e. Sep 2020 – Aug 2021)
refit_tbl <- calibration_tbl %>%
    modeltime_refit(data = gtrends_search_tbl) 

forecast_tbl <- refit_tbl %>%
    modeltime_forecast(
        h = "1 year",
        actual_data = gtrends_search_tbl,
        conf_interval = 0.90) 

# Plot the five forecasts into 2021
forecast_tbl %>%
    plot_modeltime_forecast(.interactive = TRUE)

Forecast plot II

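# Model Average
# Note: averaging the interval bounds is a convenience for plotting,
# not a statistically derived interval for the combined forecast.
mean_forecast_tbl <- forecast_tbl %>%
    filter(.key != "actual") %>%
    group_by(.key, .index) %>%
    summarise(across(.value:.conf_hi, mean), .groups = "drop") %>%
    mutate(
        .model_id   = 6,
        .model_desc = "AVERAGE OF MODELS"
    )

# Visualize Model Average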
forecast_tbl %>%
    filter(.key == "actual") %>%
    bind_rows(mean_forecast_tbl) %>%
    plot_modeltime_forecast()

Forecasting in R plot III

The forecast of Google Search interest for the next 12 months continues the upward trend: the future for People Analytics seems bright!

Implications of Forecasting in HR

The above example illustrates the ease with which analysts can perform forecasting in R with time-series data to be better prepared for the future. In addition, the use of automated models (i.e., those that self-optimize) can be an excellent entry point for forecasting. Technologies such as ModelTime in R enable users to apply numerous sophisticated forecasting models to perform scenario planning within organizations. 

Scenario planning need not be something that is performed once and then shelved to collect dust while environmental conditions change. In the realm of HR, forecasting can play a crucial, yet often underutilized, part in strategic activities such as the following:

  1. Workforce Planning 
    • How many employees will the organization need to replace in the future?
    • Will the local job market or universities “produce” sufficient employees to cover an organization’s forecasted graduate/professional employee needs?
    • When opening new facilities in new markets, is the local population sufficient to support our employee requirements? 
  2. Talent Acquisition
    • How many employees are we likely to recruit in the next 2 – 4 quarters to meet business goals?
    • How many talent acquisition staff will be required in specific geographies to meet seasonal recruiting requirements (which do vary by geography!)?
  3. Outsourcing
    • Based on the historical outsourcing activities, what is the current trend and what are the financial implications associated with that requirement?
    • Will the outsourcing provider be able to cater to future demand requirements?
    • Based on turnover among outsourced roles, what are the future implications for onboarding and training needs in specific businesses/geographies?
  4. Talent management
    • Are there job profile areas where we are likely to experience talent shortfalls in the near to medium-term future?
    • What proportion of the employee population is likely to retire over the coming 2 – 5 years?
  5. Learning and development
    • Of the identified skills of the future, what are the future requirements?
  6. Financial budgeting
    • What are the future budget requirements for HR activities?
    • What is the future financial requirement associated with establishing a people analytics team? 😉

Happy Forecasting!

Acknowledgement: The authors would like to acknowledge the work of Matt Dancho, both in the development of TimeTK and ModelTime, and the Learning Labs Pro Series run by Matt, upon which this article is based.
