HAIBA and me

Introduction

So. There’s this thing called HAIBA. It is the danish Hospital-Acquired Infections database.

The database is built on the proud danish tradition of databasing everything, it works by crossreferencing the national patient registry (LPR) and the national microbiological database (MiBa). LPR provides times of admission to hospital and transfers between departments. MiBa provides samples taken and sample results from clinical microbiology labs.

The database is great, and it is not so great.

Monitoring hospital-acquired infections is obviously a great goal and the crossreferencing of two massive databases is a gargantuan task. But as a report from the Danish National audit (rigsrevisionen) details, almost no departments actually use the database to improve.

I wanted to analyse differences over time and between departments in the database.

From the database i downloaded 93 datasets - corresponding to one from every department in the region of central Denmark, it took a while. If you want to follow along you can download a csv version of the dataset i made.

library(tidyverse)
library(furrr)
plan(multisession)
library(brms)
library(here)
library(ebbr) #devtools::install_github("dgrtwo/ebbr")

Loading

files <- list.files(here("content", "post", "haiba", "2019-10-21-haiba-and-me", "data_uvi"), recursive = TRUE, full.names = TRUE, pattern = ".csv")

read_haiba <- function(file) {
    blank_lines <- which(grepl("^$", readLines(file)))
    meta <- read_csv(file,
                     skip = blank_lines[length(blank_lines) - 1] - 1,
                     show_col_types = FALSE) %>%
        mutate_all( ~ str_split(.x, ": ") %>% map_chr(2))
    colnames(meta) <- c("sidste_opdatering",
                        "rapport_fra",
                        "region",
                        "ejer",
                        "hospital", 
                        "afdeling")
    raw <- read_csv(file, 
                    skip = 3, 
                    n_max = blank_lines[2] - blank_lines[1] - 2, 
                    locale = locale(decimal_mark = ","),
                    show_col_types = FALSE)
    clean <- raw %>% 
        select(Year:Denominator2) %>% 
        janitor::clean_names() %>% 
        rename(week = month) %>% 
        mutate(yearweek = sprintf("%i-W%02i", year, week),
               sort_date = paste0(yearweek, "-5") %>% ISOweek::ISOweek2date(),
               date_rank = rank(sort_date)) %>% 
        filter(sort_date < lubridate::dmy(meta$sidste_opdatering)[1]) 
    cbind(clean, select(meta, hospital, afdeling))
}    

afd_type <- read_csv("typer.csv", show_col_types = FALSE)
haiba <- future_map_dfr(files, read_haiba) %>% 
    left_join(afd_type) %>% 
    mutate_at(vars(contains("nominator")), as.integer)

Okay, so that’s loading done. Let’s do some basic checks. It appears that the web-version of HAIBA reports use the columns nominator1 and denominator1.

Exploration

head(haiba)
##   year week nominator1 incidence1 denominator1 nominator4 incidence3
## 1 2019    4          1     317.04           32          1     317.04
## 2 2019    3          0       0.00           25          0       0.00
## 3 2019    2          0       0.00           15          0       0.00
## 4 2019    1          0       0.00           12          0       0.00
## 5 2018   52          0       0.00           15          0       0.00
## 6 2018   51          0       0.00           13          0       0.00
##   denominator2 yearweek  sort_date date_rank                    hospital
## 1           32 2019-W04 2019-01-25       265 Aarhus Universitetshospital
## 2           25 2019-W03 2019-01-18       264 Aarhus Universitetshospital
## 3           15 2019-W02 2019-01-11       263 Aarhus Universitetshospital
## 4           12 2019-W01 2019-01-04       262 Aarhus Universitetshospital
## 5           15 2018-W52 2018-12-28       261 Aarhus Universitetshospital
## 6           13 2018-W51 2018-12-21       260 Aarhus Universitetshospital
##                  afdeling afd_type
## 1 Akutafdeling Akutafsnit     akut
## 2 Akutafdeling Akutafsnit     akut
## 3 Akutafdeling Akutafsnit     akut
## 4 Akutafdeling Akutafsnit     akut
## 5 Akutafdeling Akutafsnit     akut
## 6 Akutafdeling Akutafsnit     akut
mean(haiba$denominator1 < 1) %>% scales::percent()
## [1] "16%"
with(haiba, mean(nominator1 > denominator1)) %>% scales::percent()
## [1] "0%"

16% of observations are 0 and some observations seem to show more cases than persondays, which is weird to say the least.

Let’s do a little visualisation

quarters <- data.frame(week = 1:53, quarter = c(rep(1:4, each = 13), 4))

haiba %>% 
    left_join(quarters) %>% 
    mutate(yearq = paste(year, quarter, sep = "-")) %>% 
    group_by(afd_type, yearq) %>% 
    summarise(nominator = sum(nominator1),
              denominator = sum(denominator1)) %>% 
    ungroup %>% 
    filter(denominator > 0) %>% 
    mutate(date_rank = dense_rank(yearq)) %>% #Dangerous, but seems to work
    group_by(afd_type) %>% 
    group_modify(~ add_ebb_estimate(.x, nominator, denominator)) -> afd_type_quarter
## Joining, by = "week"
## `summarise()` has grouped output by 'afd_type'. You can override using the
## `.groups` argument.
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
ggplot(afd_type_quarter, aes(date_rank, .fitted*1e4, colour = afd_type)) + 
    geom_smooth(method = "lm", se = FALSE) +
    geom_point() + 
    facet_wrap(~afd_type, scales = "free_y") + 
    expand_limits(y = 0) +
    scale_colour_viridis_d() +
    theme(legend.position = "none")
## `geom_smooth()` using formula 'y ~ x'

Let’s gussy that up a bit.

ggplot(afd_type_quarter, aes(date_rank, .raw*1e4, colour = afd_type)) + 
    geom_ribbon(aes(ymin = .low*1e4, ymax = .high*1e4), fill = "gray", colour = "gray") + 
    geom_smooth(method = "lm", se = FALSE) +
    geom_point() + 
    facet_wrap(~afd_type, scales = "free_y", labeller = labeller(afd_type = Hmisc::capitalize), nrow = 2) + 
    expand_limits(y = 0) +
    scale_colour_viridis_d() +
    scale_x_continuous(breaks = c(seq(0, 20, by = 4))) +
    labs(x = "kvartal fra start", 
         y = "incidens pr 10^4 patientdøgn", 
         caption = "Data fra HAIBA\n95% Konfidensintervaller baseret på empirical bayes\nLineær regression vist uden konfidensintervaller",
         title = "Hospitalserhvervede urinvejsinfektioner i region midt") +
    theme(legend.position = "none")
## `geom_smooth()` using formula 'y ~ x'

There are some interesting things going on in this visualisation. Most department types seem to be have decreasing UTI incidence. But, what is going on with pediatric and acute departments?

Some insight may be gleaned from the diference between the linear regression on incidence and the grayed out ribbons. Remember, the ribbons display empirical bayes estimates, these are calculated WITHOUT knowledge of the time effects.

For acute departments there appears to be a somewhat seasonal effect, perhaps patients dont stay on the service for long enough to become cases in summer.

Pediatric departments seem more erratic, and it would seem fair to assume that what we’re seeing is merely random variation.

Modeling

If i want an actual answer to whether incidences are decreasing, there’s probably no way around some modeling work.

Seeing as we’re working on count data with large observation times, it would seem reasonable to model the outcome as poisson distributed.

I want to be able to compare individual departments on their initial incidence, their progress over time, and their incidence at the end. I could feasibly fit a normal glm model with offset = log(denominator1) and fixed effects for departments, but some departments are much smaller than others and their mere size would bias the analysis so that large departments would look closer to average, just from having less random variation.

So, what i actually want is a bayesian model with partial pooling on departments for both intercept and slope, fixed effects for department type. Let’s see if i can figure that out

halfyear <- data.frame(week = 1:53, halfyear = c(rep(1:2, each = 26), 2))

null_departments <- haiba %>% # Some departments have 0 cases in total. Thats great, but not very informative.
    group_by(afdeling) %>% 
    summarise(nominator = sum(nominator1),
              denominator = sum(denominator1)) %>% 
    filter(nominator == 0) 

haiba %>% 
    anti_join(null_departments, by = "afdeling") %>% # Potentielt kontroversielt, men i praksis nok mere data-cleaning
    left_join(halfyear) %>% 
    mutate(yearh = paste(year, halfyear, sep = "-")) %>% 
    group_by(afdeling, yearh) %>% 
    summarise(nominator = sum(nominator1),
              denominator = sum(denominator1), 
              afd_type = unique(afd_type)) %>% 
    ungroup %>% 
    filter(denominator > 0) %>% 
    mutate(date_rank = dense_rank(yearh)) -> afdeling_halfyear
## Joining, by = "week"
## `summarise()` has grouped output by 'afdeling'. You can override using the
## `.groups` argument.
mm_data_afd <- afdeling_halfyear %>% 
    mutate_at(vars(contains("nominator")), as.integer) %>% 
    mutate_at(vars(contains("afd")), as.factor)

brms model

n_cores <- parallel::detectCores()
model <- brm(nominator ~ 0 + offset(log(denominator)) + date_rank + afd_type + (1 + date_rank | afdeling),
          data = mm_data_afd,
          family = "poisson",
          chains = n_cores, cores = n_cores, file = here("content", "post", "haiba", "2019-10-21-haiba-and-me", "model")) 

Model checks

Did the chains converge?

plot(model)

Seems reasonable enough.

model
##  Family: poisson 
##   Links: mu = log 
## Formula: nominator ~ 0 + offset(log(denominator)) + date_rank + afd_type + (1 + date_rank | afdeling) 
##    Data: mm_data_afd (Number of observations: 822) 
##   Draws: 6 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 6000
## 
## Group-Level Effects: 
## ~afdeling (Number of levels: 80) 
##                          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sd(Intercept)                0.71      0.07     0.58     0.87 1.01     1420
## sd(date_rank)                0.03      0.01     0.02     0.04 1.00     2150
## cor(Intercept,date_rank)    -0.41      0.19    -0.73     0.02 1.00     2630
##                          Tail_ESS
## sd(Intercept)                2875
## sd(date_rank)                3391
## cor(Intercept,date_rank)     4044
## 
## Population-Level Effects: 
##                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## date_rank           -0.02      0.01    -0.03    -0.01 1.00     2836     3828
## afd_typeakut        -5.54      0.33    -6.20    -4.88 1.00     3854     4202
## afd_typeandet       -5.88      0.24    -6.35    -5.41 1.00     1969     2427
## afd_typebørn        -6.88      0.38    -7.62    -6.15 1.00     3958     4450
## afd_typeintensiv    -3.88      0.69    -5.22    -2.49 1.00     5174     4229
## afd_typekirurgi     -5.86      0.12    -6.11    -5.63 1.00     1017     1925
## afd_typekræft       -5.92      0.40    -6.70    -5.16 1.00     2634     3600
## afd_typemedicin     -5.30      0.15    -5.59    -5.01 1.00      803     1294
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

No dire warnings here either. Effective sample sizes could be higher.

Hypothesis 1: Infection control is improving

Lets test the hypothesis that number of cases is falling over time

hypothesis(model, "date_rank < 0")
## Hypothesis Tests for class b:
##        Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob
## 1 (date_rank) < 0    -0.02      0.01    -0.03    -0.01        299         1
##   Star
## 1    *
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
marginal_effects(model, "date_rank", spaghetti = TRUE, nsamples = 1e3, conditions = data.frame(denominator = 1e4))
## Warning: Method 'marginal_effects' is deprecated. Please use
## 'conditional_effects' instead.
## Warning: Argument 'nsamples' is deprecated. Please use argument 'ndraws'
## instead.

It would appear beyond reasonable doubt that there is indeed a falling incidence of HA-UTI. What is the magnitude of this fall?

broom.mixed::tidy(model) %>% 
  filter(term == "date_rank") %>% 
  select(-std.error) %>% 
  mutate_at(vars(estimate:conf.high), ~ 1 - exp(. * (mm_data_afd$date_rank %>% range %>% diff))) %>% 
  mutate_at(vars(estimate:conf.high), scales::percent) %>% 
  glue::glue_data("The overall reduction from the beginning of HAIBA (2014 week 1) to the latest point in our dataset (2019 week 4) is {estimate} [95% CI: {conf.high}; {conf.low}]")
## Warning in tidy.brmsfit(model): some parameter names contain underscores: term
## naming may be unreliable!
## The overall reduction from the beginning of HAIBA (2014 week 1) to the latest point in our dataset (2019 week 4) is 16% [95% CI: 5%; 25%]

Are any departments falling way out of line with the overall trend?

hypothesis(model, "date_rank = 0", group = "afdeling", scope = "ranef") %>% 
  .$hypothesis %>% 
  filter(Star == "*") -> differently_changed

mm_data_afd %>% 
    semi_join(differently_changed, by = c("afdeling" = "Group")) %>% 
    ggplot(aes(date_rank, nominator/(denominator/1e4), colour = afdeling)) +
    geom_point() +
    scale_colour_viridis_d(option = "A") +
    geom_smooth(method = "lm", se = FALSE) + 
    expand_limits(y = 0)
## `geom_smooth()` using formula 'y ~ x'

It would appear that both departments have made massive improvements.

Hypothesis 2: Departments are (very) different

conditional_effects(model, "afd_type", conditions = data.frame(denominator = 1e4))

last_plot() + coord_flip()

Okay, so intensive care units have more cases. That’s probably not a big surprise to anyone. Many of the “cases” are probably misclassifications due to more frequent urinary sampling and the increased difficulty in ascertaining symptoms from patients in intensive care.

hypothesis(model, "Intercept = 0", group = "afdeling", scope = "ranef") %>%
    .$hypothesis %>% 
    filter(Star == "*") -> differently_started

differently_started %>% 
    mutate(Group = forcats::fct_reorder(Group, Estimate)) %>% 
    ggplot(aes(Group, exp(Estimate), colour = Estimate)) + 
    geom_ribbon(aes(ymin = exp(CI.Lower), ymax = exp(CI.Upper))) +
    geom_point() + 
    scale_colour_viridis_c(option = "C") + 
    theme(legend.position = "none") + 
    coord_flip() +
    labs(y = "Rate ratio ifht afdelingstype", x = NULL,
         caption = "Data fra HAIBA\nVist med 95% konfidensintervaller") 

A few surprises here, but overall as expected. Some of the departments seem to have been picked up due to insufficient correction in the “Other” department type.

That’s the intercept done with. But where do these departments end up? Do they come into line with the pack?

new_from_differently_started <- mm_data_afd %>% 
  semi_join(differently_started, by = c("afdeling" = "Group")) %>% 
  select(afdeling, afd_type) %>% 
  distinct() %>% 
  mutate(date_rank = 12,
         denominator = 1e4)

quick_clean <- . %>% 
  as_tibble() %>% 
  janitor::clean_names()

predictions <- predict(model, newdata = new_from_differently_started) %>% 
  quick_clean %>% 
  bind_cols(new_from_differently_started) %>% 
  mutate(condition = "At end")

unknown_departments <- mm_data_afd %>% 
  distinct(afd_type) %>% 
  mutate(afdeling = wakefield::name(n()),
         date_rank = 12,
         denominator = 1e4)

predicted_beginnings <- predict(model, newdata = new_from_differently_started %>% mutate(date_rank = 1)) %>% 
  quick_clean() %>% 
  bind_cols(new_from_differently_started %>% select(afdeling)) %>% 
  mutate(condition = "At beginning")

predictions_for_unknown <- predict(model, newdata = unknown_departments, allow_new_levels = TRUE) %>% 
  as_tibble() %>% 
  janitor::clean_names() %>% 
  bind_cols(unknown_departments) %>% 
  mutate(condition = "Unknown department")

different_started_by_the_end <- predictions %>% 
  left_join(predictions_for_unknown, by = "afd_type") %>% 
  left_join(predicted_beginnings, by = c("afdeling.x" = "afdeling"), suffix = c("", ".z"))

bind_rows(predicted_beginnings, predictions) %>% 
  left_join(predictions_for_unknown, by = "afd_type", suffix = c("", ".y")) %>% 
  mutate(afdeling = forcats::fct_reorder(afdeling, estimate, .fun = first)) %>% 
  ggplot(aes(afdeling, estimate, colour = condition)) +
  geom_ribbon(aes(afdeling, ymin = q2_5, ymax = q97_5), data = ~ filter(.x, condition == "At end"), colour = "gray") +
  geom_point() +
  geom_point(aes(afdeling, estimate.y, colour = "Department type"), show.legend = TRUE) + 
  coord_flip() + 
  scale_colour_brewer(palette = "Set2", direction = -1) + 
  theme(legend.position = "bottom") + 
  labs(x = "Department", 
       y = "Cases pr. 10000 persondays", 
       colour = "Prediction case", 
       caption = "Data from HAIBA")
## Warning: Removed 34 rows containing missing values (geom_point).

Honorable mention to the departments without any cases

null_departments %>% arrange(desc(denominator))
## # A tibble: 13 × 3
##    afdeling                                       nominator denominator
##    <chr>                                              <int>       <int>
##  1 Livsstilscenter Brædstrup, Sengeafdeling - BRÆ         0       31597
##  2 Fokuseret Neurorehabilitering Hammel                   0       13759
##  3 Tand-, Mund- og Kæbekirurgi Sengeafdeling              0        2493
##  4 Hånd Stamafdeling Ortopædkirurgi                       0        2475
##  5 Skulder Stamafdeling Ortopædkirurgi                    0        1677
##  6 Idræt Stamafdeling Ortopædkirurgi                      0        1220
##  7 Brystkirurgi Stationær                                 0         678
##  8 Øre-, Næse- og Halssygdomme - Randers                  0         290
##  9 Infektion Stamafdeling Ortopædkirurgi                  0          28
## 10 Akutmodtagelse Q - Randers                             0          21
## 11 Øjenafdeling Dagkirurgi Stationær J                    0          16
## 12 Neurokirurgisk Dagkirurgi Stationær NK                 0          15
## 13 Kirurgisk Dagkirurgi P Stationær Afdeling              0           6
Marc Trunjer Kusk Nielsen
Marc Trunjer Kusk Nielsen

I’m a medical doctor in specialty training to become a Clinical Microbiologist. I’ll post on the intersection of data science and clinical microbiology.