Here is a fun and easy way to forecast a time series. The magic trick being the addition of features through auto-calculation.

The data we-ll use can be fetched with the tidyquant package and you will need to have h2o installed as well. All of this is just a click away in RStudio.

install.packages("timetk")
install.packages("tidyquant")
library(h2o)         
library(timetk)    
library(tidyquant)

The data looks like below, you can pick up any other data set by browsing around and fetching the parameter in the URL.

df <- tq_get("UNRATE", get = "economic.data", from = "1950-01-01", to = "2018-06-01")

We'll use the data up to 2011 as training data, use five years for validation and predict the year 2018:

df_aug <- df %>%  tk_augment_timeseries_signature()
df_clean = df_aug %>% select_if(~ !is.Date(.)) %>%
  select_if(~ !any(is.na(.))) %>%
  mutate_if(is.ordered, ~ as.character(.) %>% as.factor)

train_tbl <- df_clean %>% filter(year < 2011)
valid_tbl <- df_clean %>% filter(year >= 2011 & year<=2017)
test_tbl  <- df_clean %>% filter(year >2017)


train_h2o <- as.h2o(train_tbl)
valid_h2o <- as.h2o(valid_tbl)
test_h2o  <- as.h2o(test_tbl)

The magic happens in the augmenting method. Rather than a simple time series you get a table full of features the auto-ml of H2O can use for ML. From here on you can use the full breadth of H2O:

y <- "price"
x <- setdiff(names(train_h2o), y)

automl_models_h2o <- h2o.automl(
  x = x, 
  y = y, 
  training_frame = train_h2o, 
  validation_frame = valid_h2o, 
  leaderboard_frame = test_h2o, 
  max_runtime_secs = 120, 
  stopping_metric = "deviance")


automl_leader <- automl_models_h2o@leader
pred_h2o <- h2o.predict(automl_leader, newdata = test_h2o)
h2o.performance(automl_leader, newdata = test_h2o)


error_tbl <- df %>% 
  filter(lubridate::year(date) > 2017) %>%
  add_column(pred = pred_h2o %>% as.tibble() %>% pull(predict)) %>%
  rename(actual = price) %>%
  mutate(
    error     = actual - pred,
    error_pct = error / actual
  ) 

plot(error_tbl$actual, col="green", t="l")
par(new=T)
plot(error_tbl$pred, col='orange', t="l")
par(new=F)
error_tbl

The tibble you get looks like this:

date       actual  pred     error  error_pct
  <date>      <dbl> <dbl>     <dbl>      <dbl>
1 2018-01-01    4.1  4.85 -0.754    -0.184    
2 2018-02-01    4.1  3.79  0.312     0.0762   
3 2018-03-01    4.1  4.10 -0.000150 -0.0000366
4 2018-04-01    3.9  4.59 -0.690    -0.177    
5 2018-05-01    3.8  3.97 -0.175    -0.0460   
6 2018-06-01    4    3.80  0.196     0.0490

and isn’t bad at all considering the non-orthodox approach to forecasting.

 

Tags: