We’re building a solution that uses several R scripts for data analysis and cleanup. The R scripts are tested during the integration phase, when the database is available. We would also like to test the scripts when new versions are pushed to source control, without needing a database. This is where unit tests come in.
The scripts all follow the same steps:
- setup
- read data from database
- process data
- write data to database
- report
First, we need to separate logic, flow, and parameters. The easiest way to do this is to implement a function for each of the steps listed above and call those functions from a new script. Some code speaks a thousand words:
# script with logic and flow functions (script_with_functions.R)
library(iterators)
library(foreach)
library(data.table)

setup <- function() { }
read_data <- function(db, param) { }
process_data <- function(data) { data.table(mean_price = mean(data$price)) }
write_data <- function(db, data) { }
report <- function(db, param) { }

do_work <- function(db, param) {
  setup()
  unprocessed_data <- read_data(db, param)
  iterator <- iter(unprocessed_data$data)
  foreach(row = iterator) %do% {
    processed_row <- process_data(row)
    write_data(db, processed_row)
  }
  report(db, param)
}
# script with parameters
library(RODBC)
source('script_with_functions.R')

connection <- odbcDriverConnect(connection = '...')
year <- 2016
do_work(connection, year)
odbcClose(connection)
When the parameter script runs, the functions are loaded and the outcome is the same as with the initial script. This refactoring enables us to write unit tests for the functions, using mock objects to mimic the external dependencies. Again, some code to explain what we’re talking about:
library(testthat)
library(mockery)
library(data.table)
library(tibble)
source('script_with_functions.R')
describe('process_data', {
  it('calculates_mean_price', {
    # a data table with 4 rows, each with price 10
    four_rows_price_10 <- data.table(price = rep(10, 4))
    result <- process_data(four_rows_price_10)
    expect_equal(result$mean_price, 10)
  })
})
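Logic functions are also the natural place to pin down edge cases. As a sketch (assuming the same script_with_functions.R from above is sourced, and sticking to the existing style), a single-row input should return that row’s price as the mean:

```r
library(testthat)
library(data.table)
source('script_with_functions.R')

describe('process_data', {
  it('returns the price itself for a single row', {
    # a data table with a single row with price 42
    one_row_price_42 <- data.table(price = 42)
    result <- process_data(one_row_price_42)
    # the mean of one value is that value
    expect_equal(result$mean_price, 42)
  })
})
```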
# the other functions only read or write data: no unit test needed, since they contain no logic
describe('do_work', {
  it('calls setup', {
    # create a mock object for the setup function
    fake_setup <- mock()
    # replace the setup function with the mock for calls from do_work
    stub(do_work, 'setup', fake_setup)
    # call the 'flow' function
    fake_db <- mock()
    do_work(fake_db, 2016)
    # verify setup was called once
    expect_called(fake_setup, 1)
  })
  it('calls process_data 4 times', {
    # create a mock object for the process_data function
    fake_process_data <- mock()
    # replace the process_data function with the mock for calls from do_work
    stub(do_work, 'process_data', fake_process_data)
    # let read_data return 4 sets of data to process (with 3 rows each)
    four_sets_of_data <- tibble(g = 1:4, data = list(data.table(price = 1:3)))
    stub(do_work, 'read_data', four_sets_of_data)
    # call the 'flow' function
    fake_db <- mock()
    do_work(fake_db, 2016)
    # verify process_data was called 4 times
    expect_called(fake_process_data, 4)
  })
})
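Besides counting calls, mockery can also record the arguments a mock was called with: mock_args returns one argument list per recorded call. A sketch (the test name and variables are illustrative, assuming the same sourced script) that checks write_data receives the processed row:

```r
library(testthat)
library(mockery)
library(data.table)
library(tibble)
source('script_with_functions.R')

describe('do_work', {
  it('passes the processed row to write_data', {
    # replace write_data with a mock that records its arguments
    fake_write_data <- mock()
    stub(do_work, 'write_data', fake_write_data)
    # one set of data with two rows; the mean price is 15
    one_set_of_data <- tibble(g = 1, data = list(data.table(price = c(10, 20))))
    stub(do_work, 'read_data', one_set_of_data)
    do_work(mock(), 2016)
    # mock_args returns a list with one argument list per call
    args_of_first_call <- mock_args(fake_write_data)[[1]]
    # the second argument of write_data(db, data) is the processed row
    expect_equal(args_of_first_call[[2]], data.table(mean_price = 15))
  })
})
```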
By splitting logic and flow into functions, we’re able to write unit tests that check the logic in the process_data function and the flow in the do_work function. The functions that are not unit tested all need a database to work. We could use an in-memory SQLite database, but that is a database too, so we leave those tests for the integration tests.
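For completeness: if we ever did want database-backed tests without a server, an in-memory SQLite connection via the DBI and RSQLite packages would look roughly like this. This is a sketch, not part of our setup, and note it uses the DBI interface rather than the RODBC one above, so read_data and write_data would need adapting:

```r
library(DBI)
library(RSQLite)

# an in-memory database: it exists only for the lifetime of the connection
db <- dbConnect(SQLite(), ':memory:')
dbWriteTable(db, 'prices', data.frame(price = c(10, 20, 30)))
dbGetQuery(db, 'SELECT AVG(price) AS mean_price FROM prices')
dbDisconnect(db)
```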
Running the unit tests above results in rainbows and a smiley. The complete working code is on my GitHub.
Test passed 🌈 Test passed 🌈 Test passed 😀