HAIBA and me

Introduction

So. There’s this thing called HAIBA. It is the danish Hospital-Acquired Infections database.

The database is built on the proud danish tradition of databasing everything, it works by crossreferencing the national patient registry (LPR) and the national microbiological database (MiBa). LPR provides times of admission to hospital and transfers between departments. MiBa provides samples taken and sample results from clinical microbiology labs.

The database is great, and it is not so great.

Monitoring hospital-acquired infections is obviously a great goal and the crossreferencing of two massive databases is a gargantuan task. But as a report from the Danish National audit (rigsrevisionen) details, almost no departments actually use the database to improve.

I wanted to analyse differences over time and between departments in the database.

From the database i downloaded 93 datasets - corresponding to one from every department in the region of central Denmark, it took a while. If you want to follow along you can download a csv version of the dataset i made.

library(tidyverse)
library(furrr)
plan(multiprocess)
library(brms)
library(here)
library(ebbr) #devtools::install_github("dgrtwo/ebbr")

Loading

files <- list.files(here("content", "post", "haiba", "data_uvi"), recursive = TRUE, full.names = TRUE, pattern = ".csv")

read_haiba <- function(file) {
    blank_lines <- which(grepl("^$", readLines(file)))
    meta <- read_csv(file,
                     skip = blank_lines[length(blank_lines) - 1] - 1) %>%
        mutate_all( ~ str_split(.x, ": ") %>% map_chr(2))
    colnames(meta) <- c("sidste_opdatering",
                        "rapport_fra",
                        "region",
                        "ejer",
                        "hospital", 
                        "afdeling")
    raw <- read_csv(file, 
                    skip = 3, 
                    n_max = blank_lines[2] - blank_lines[1] - 2, 
                    locale = locale(decimal_mark = ","))
    clean <- raw %>% 
        select(Year:Denominator2) %>% 
        janitor::clean_names() %>% 
        rename(week = month) %>% 
        mutate(yearweek = sprintf("%i-W%02i", year, week),
               sort_date = paste0(yearweek, "-5") %>% ISOweek::ISOweek2date(),
               date_rank = rank(sort_date)) %>% 
        filter(sort_date < lubridate::dmy(meta$sidste_opdatering)[1]) 
    cbind(clean, select(meta, hospital, afdeling))
}    

afd_type <- read_csv("typer.csv")
haiba <- future_map_dfr(files, read_haiba) %>% 
    left_join(afd_type) %>% 
    mutate_at(vars(contains("nominator")), as.integer)

Okay, so that’s loading done. Let’s do some basic checks. It appears that the web-version of HAIBA reports use the columns nominator1 and denominator1.

Exploration

head(haiba)
##   year week nominator1 incidence1 denominator1 nominator4 incidence3
## 1 2019    4          1     317.04           32          1     317.04
## 2 2019    3          0       0.00           25          0       0.00
## 3 2019    2          0       0.00           15          0       0.00
## 4 2019    1          0       0.00           12          0       0.00
## 5 2018   52          0       0.00           15          0       0.00
## 6 2018   51          0       0.00           13          0       0.00
##   denominator2 yearweek  sort_date date_rank                    hospital
## 1           32 2019-W04 2019-01-25       265 Aarhus Universitetshospital
## 2           25 2019-W03 2019-01-18       264 Aarhus Universitetshospital
## 3           15 2019-W02 2019-01-11       263 Aarhus Universitetshospital
## 4           12 2019-W01 2019-01-04       262 Aarhus Universitetshospital
## 5           15 2018-W52 2018-12-28       261 Aarhus Universitetshospital
## 6           13 2018-W51 2018-12-21       260 Aarhus Universitetshospital
##                  afdeling afd_type
## 1 Akutafdeling Akutafsnit     akut
## 2 Akutafdeling Akutafsnit     akut
## 3 Akutafdeling Akutafsnit     akut
## 4 Akutafdeling Akutafsnit     akut
## 5 Akutafdeling Akutafsnit     akut
## 6 Akutafdeling Akutafsnit     akut
mean(haiba$denominator1 < 1) %>% scales::percent()
## [1] "15.7%"
with(haiba, mean(nominator1 > denominator1)) %>% scales::percent()
## [1] "0.00401%"

16% of observations are 0 and some observations seem to show more cases than persondays, which is weird to say the least.

Let’s do a little visualisation

quarters <- data.frame(week = 1:53, quarter = c(rep(1:4, each = 13), 4))

haiba %>% 
    left_join(quarters) %>% 
    mutate(yearq = paste(year, quarter, sep = "-")) %>% 
    group_by(afd_type, yearq) %>% 
    summarise(nominator = sum(nominator1),
              denominator = sum(denominator1)) %>% 
    ungroup %>% 
    filter(denominator > 0) %>% 
    mutate(date_rank = dense_rank(yearq)) %>% #Dangerous, but seems to work
    group_by(afd_type) %>% 
    group_modify(~ add_ebb_estimate(.x, nominator, denominator)) -> afd_type_quarter
## Joining, by = "week"
ggplot(afd_type_quarter, aes(date_rank, .fitted*1e4, colour = afd_type)) + 
    geom_smooth(method = "lm", se = FALSE) +
    geom_point() + 
    facet_wrap(~afd_type, scales = "free_y") + 
    expand_limits(y = 0) +
    scale_colour_viridis_d() +
    theme(legend.position = "none")

Let’s gussy that up a bit.

ggplot(afd_type_quarter, aes(date_rank, .raw*1e4, colour = afd_type)) + 
    geom_ribbon(aes(ymin = .low*1e4, ymax = .high*1e4), fill = "gray", colour = "gray") + 
    geom_smooth(method = "lm", se = FALSE) +
    geom_point() + 
    facet_wrap(~afd_type, scales = "free_y", labeller = labeller(afd_type = Hmisc::capitalize), nrow = 2) + 
    expand_limits(y = 0) +
    scale_colour_viridis_d() +
    scale_x_continuous(breaks = c(seq(0, 20, by = 4))) +
    labs(x = "kvartal fra start", 
         y = "incidens pr 10^4 patientdøgn", 
         caption = "Data fra HAIBA\n95% Konfidensintervaller baseret på empirical bayes\nLineær regression vist uden konfidensintervaller",
         title = "Hospitalserhvervede urinvejsinfektioner i region midt") +
    theme(legend.position = "none")

There are some interesting things going on in this visualisation. Most department types seem to be have decreasing UTI incidence. But, what is going on with pediatric and acute departments?

Some insight may be gleaned from the diference between the linear regression on incidence and the grayed out ribbons. Remember, the ribbons display empirical bayes estimates, these are calculated WITHOUT knowledge of the time effects.

For acute departments there appears to be a somewhat seasonal effect, perhaps patients dont stay on the service for long enough to become cases in summer.

Pediatric departments seem more erratic, and it would seem fair to assume that what we’re seeing is merely random variation.

Modeling

If i want an actual answer to whether incidences are decreasing, there’s probably no way around some modeling work.

Seeing as we’re working on count data with large observation times, it would seem reasonable to model the outcome as poisson distributed.

I want to be able to compare individual departments on their initial incidence, their progress over time, and their incidence at the end. I could feasibly fit a normal glm model with offset = log(denominator1) and fixed effects for departments, but some departments are much smaller than others and their mere size would bias the analysis so that large departments would look closer to average, just from having less random variation.

So, what i actually want is a bayesian model with partial pooling on departments for both intercept and slope, fixed effects for department type. Let’s see if i can figure that out

halfyear <- data.frame(week = 1:53, halfyear = c(rep(1:2, each = 26), 2))

null_departments <- haiba %>% # Some departments have 0 cases in total. Thats great, but not very informative.
    group_by(afdeling) %>% 
    summarise(nominator = sum(nominator1),
              denominator = sum(denominator1)) %>% 
    filter(nominator == 0) 

haiba %>% 
    anti_join(null_departments, by = "afdeling") %>% # Potentielt kontroversielt, men i praksis nok mere data-cleaning
    left_join(halfyear) %>% 
    mutate(yearh = paste(year, halfyear, sep = "-")) %>% 
    group_by(afdeling, yearh) %>% 
    summarise(nominator = sum(nominator1),
              denominator = sum(denominator1), 
              afd_type = unique(afd_type)) %>% 
    ungroup %>% 
    filter(denominator > 0) %>% 
    mutate(date_rank = dense_rank(yearh)) -> afdeling_halfyear
## Joining, by = "week"
mm_data_afd <- afdeling_halfyear %>% 
    mutate_at(vars(contains("nominator")), as.integer) %>% 
    mutate_at(vars(contains("afd")), as.factor)

brms model

n_cores <- parallel::detectCores()
model <- brm(nominator ~ 0 + offset(log(denominator)) + date_rank + afd_type + (1 + date_rank | afdeling),
          data = mm_data_afd,
          family = "poisson",
          chains = n_cores, cores = n_cores, file = here("content", "post", "haiba", "model")) 

Model checks

Did the chains converge?

plot(model)

Seems reasonable enough.

model
##  Family: poisson 
##   Links: mu = log 
## Formula: nominator ~ 0 + offset(log(denominator)) + date_rank + afd_type + (1 + date_rank | afdeling) 
##    Data: mm_data_afd (Number of observations: 822) 
## Samples: 6 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 6000
## 
## Group-Level Effects: 
## ~afdeling (Number of levels: 80) 
##                          Estimate Est.Error l-95% CI u-95% CI Rhat
## sd(Intercept)                0.71      0.07     0.58     0.87 1.01
## sd(date_rank)                0.03      0.01     0.02     0.04 1.00
## cor(Intercept,date_rank)    -0.41      0.19    -0.73     0.02 1.00
##                          Bulk_ESS Tail_ESS
## sd(Intercept)                1420     2875
## sd(date_rank)                2149     3391
## cor(Intercept,date_rank)     2630     4044
## 
## Population-Level Effects: 
##                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## date_rank           -0.02      0.01    -0.03    -0.01 1.00     2836
## afd_typeakut        -5.54      0.33    -6.20    -4.88 1.00     3854
## afd_typeandet       -5.88      0.24    -6.35    -5.41 1.00     1969
## afd_typebørn        -6.88      0.38    -7.62    -6.15 1.00     3958
## afd_typeintensiv    -3.88      0.69    -5.22    -2.49 1.00     5173
## afd_typekirurgi     -5.86      0.12    -6.11    -5.63 1.00     1017
## afd_typekræft       -5.92      0.40    -6.70    -5.16 1.00     2634
## afd_typemedicin     -5.30      0.15    -5.59    -5.01 1.00      804
##                  Tail_ESS
## date_rank            3828
## afd_typeakut         4202
## afd_typeandet        2427
## afd_typebørn         4450
## afd_typeintensiv     4229
## afd_typekirurgi      1925
## afd_typekræft        3600
## afd_typemedicin      1294
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

No dire warnings here either. Effective sample sizes could be higher.

Hypothesis 1: Infection control is improving

Lets test the hypothesis that number of cases is falling over time

hypothesis(model, "date_rank < 0")
## Hypothesis Tests for class b:
##        Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
## 1 (date_rank) < 0    -0.02      0.01    -0.03    -0.01        299
##   Post.Prob Star
## 1         1    *
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
marginal_effects(model, "date_rank", spaghetti = TRUE, nsamples = 1e3, conditions = data.frame(denominator = 1e4))

It would appear beyond reasonable doubt that there is indeed a falling incidence of HA-UTI. What is the magnitude of this fall?

broom::tidy(model) %>% 
  filter(term == "b_date_rank") %>% 
  select(-std.error) %>% 
  mutate_at(vars(estimate:upper), ~ 1 - exp(. * (mm_data_afd$date_rank %>% range %>% diff))) %>% 
  mutate_at(vars(estimate:upper), scales::percent) %>% 
  glue::glue_data("The overall reduction from the beginning of HAIBA (2014 week 1) to the latest point in our dataset (2019 week 4) is {estimate} [95% CI: {upper}; {lower}]")
## The overall reduction from the beginning of HAIBA (2014 week 1) to the latest point in our dataset (2019 week 4) is 15.9% [95% CI: 7.25%; 23.6%]

Are any departments falling way out of line with the overall trend?

hypothesis(model, "date_rank = 0", group = "afdeling", scope = "ranef") %>% 
  .$hypothesis %>% 
  filter(Star == "*") -> differently_changed

mm_data_afd %>% 
    semi_join(differently_changed, by = c("afdeling" = "Group")) %>% 
    ggplot(aes(date_rank, nominator/(denominator/1e4), colour = afdeling)) +
    geom_point() +
    scale_colour_viridis_d(option = "A") +
    geom_smooth(method = "lm", se = FALSE) + 
    expand_limits(y = 0)
## Warning: Column `afdeling`/`Group` joining factor and character vector,
## coercing into character vector

It would appear that both departments have made massive improvements.

Hypothesis 2: Departments are (very) different

marginal_effects(model, "afd_type", conditions = data.frame(denominator = 1e4))

last_plot() + coord_flip()

Okay, so intensive care units have more cases. That’s probably not a big surprise to anyone. Many of the “cases” are probably misclassifications due to more frequent urinary sampling and the increased difficulty in ascertaining symptoms from patients in intensive care.

hypothesis(model, "Intercept = 0", group = "afdeling", scope = "ranef") %>%
    .$hypothesis %>% 
    filter(Star == "*") -> differently_started

differently_started %>% 
    mutate(Group = forcats::fct_reorder(Group, Estimate)) %>% 
    ggplot(aes(Group, exp(Estimate), colour = Estimate)) + 
    geom_ribbon(aes(ymin = exp(CI.Lower), ymax = exp(CI.Upper))) +
    geom_point() + 
    scale_colour_viridis_c(option = "C") + 
    theme(legend.position = "none") + 
    coord_flip() +
    labs(y = "Rate ratio ifht afdelingstype", x = NULL,
         caption = "Data fra HAIBA\nVist med 95% konfidensintervaller") 

A few surprises here, but overall as expected. Some of the departments seem to have been picked up due to insufficient correction in the “Other” department type.

That’s the intercept done with. But where do these departments end up? Do they come into line with the pack?

new_from_differently_started <- mm_data_afd %>% 
  semi_join(differently_started, by = c("afdeling" = "Group")) %>% 
  select(afdeling, afd_type) %>% 
  distinct() %>% 
  mutate(date_rank = 12,
         denominator = 1e4)
## Warning: Column `afdeling`/`Group` joining factor and character vector,
## coercing into character vector
quick_clean <- . %>% 
  as_tibble() %>% 
  janitor::clean_names()

predictions <- predict(model, newdata = new_from_differently_started) %>% 
  quick_clean %>% 
  bind_cols(new_from_differently_started) %>% 
  mutate(condition = "At end")

unknown_departments <- mm_data_afd %>% 
  distinct(afd_type) %>% 
  mutate(afdeling = wakefield::name(n()),
         date_rank = 12,
         denominator = 1e4)

predicted_beginnings <- predict(model, newdata = new_from_differently_started %>% mutate(date_rank = 1)) %>% 
  quick_clean() %>% 
  bind_cols(new_from_differently_started %>% select(afdeling)) %>% 
  mutate(condition = "At beginning")

predictions_for_unknown <- predict(model, newdata = unknown_departments, allow_new_levels = TRUE) %>% 
  as_tibble() %>% 
  janitor::clean_names() %>% 
  bind_cols(unknown_departments) %>% 
  mutate(condition = "Unknown department")

different_started_by_the_end <- predictions %>% 
  left_join(predictions_for_unknown, by = "afd_type") %>% 
  left_join(predicted_beginnings, by = c("afdeling.x" = "afdeling"), suffix = c("", ".z"))

bind_rows(predicted_beginnings, predictions) %>% 
  left_join(predictions_for_unknown, by = "afd_type", suffix = c("", ".y")) %>% 
  mutate(afdeling = forcats::fct_reorder(afdeling, estimate, .fun = first)) %>% 
  ggplot(aes(afdeling, estimate, colour = condition)) +
  geom_ribbon(aes(afdeling, ymin = q2_5, ymax = q97_5), data = ~ filter(.x, condition == "At end"), colour = "gray") +
  geom_point() +
  geom_point(aes(afdeling, estimate.y, colour = "Department type"), show.legend = TRUE) + 
  coord_flip() + 
  scale_colour_brewer(palette = "Set2", direction = -1) + 
  theme(legend.position = "bottom") + 
  labs(x = "Department", 
       y = "Cases pr. 10000 persondays", 
       colour = "Prediction case", 
       caption = "Data from HAIBA")
## Warning: Removed 34 rows containing missing values (geom_point).

Honorable mention to the departments without any cases

null_departments %>% arrange(desc(denominator))
## # A tibble: 13 x 3
##    afdeling                                       nominator denominator
##    <chr>                                              <int>       <int>
##  1 Livsstilscenter Brædstrup, Sengeafdeling - BRÆ         0       31597
##  2 Fokuseret Neurorehabilitering Hammel                   0       13759
##  3 Tand-, Mund- og Kæbekirurgi Sengeafdeling              0        2493
##  4 Hånd Stamafdeling Ortopædkirurgi                       0        2475
##  5 Skulder Stamafdeling Ortopædkirurgi                    0        1677
##  6 Idræt Stamafdeling Ortopædkirurgi                      0        1220
##  7 Brystkirurgi Stationær                                 0         678
##  8 Øre-, Næse- og Halssygdomme - Randers                  0         290
##  9 Infektion Stamafdeling Ortopædkirurgi                  0          28
## 10 Akutmodtagelse Q - Randers                             0          21
## 11 Øjenafdeling Dagkirurgi Stationær J                    0          16
## 12 Neurokirurgisk Dagkirurgi Stationær NK                 0          15
## 13 Kirurgisk Dagkirurgi P Stationær Afdeling              0           6
Avatar
Marc Trunjer Kusk Nielsen

I’m a medical doctor in specialty training to become a Clinical Microbiologist. I’ll post on the intersection of data science and clinical microbiology.