Adding Confidence to a Patient’s State

Overview

In this vignette we’ll walk through an example data preprocessing workflow. We’ll start with a fully synthetic toy dataset that consists of two elements: a visits data frame and a recordings data frame.

Our goal is to extrapolate the state recorded on a visit date to a wider time window. We’d also like to quantify the confidence of this extrapolation. We’ll use dedicated functions from the package to do both.

Create the toy data set

Let’s load the libraries we need:

library(bipolarPreprocessing)
library(dplyr)    # provides %>%, mutate(), left_join(), arrange(), filter(), between() used below
library(ggplot2)

Next, let’s create the datasets.

set.seed(123)
df <- create_toy_dataset()
visits <- df$visits
recordings <- df$recordings

Let’s take a look at the visits data frame. It has the columns phases, visit_date, patient_id and visit_id. The phases column holds information about the patient’s state on a particular day, e.g. depression.

The recordings data frame contains three call parameters (the variables x1, x2 and x3) as well as the id of the patient and the date when the call took place.

Configuration of time window extrapolation

We’d like to extrapolate the information of a patient’s state to adjacent days before and after a visit with a psychiatrist. A convenient way to do this is by creating a config file that defines time ranges around a visit date for each phase. The auto_create_phases_config function is designed to do this. By default, it creates a time window of 7 days before and 2 days after a visit.

auto_config <- auto_create_phases_config(visits, phases)
auto_config
#>        phase days_before days_after
#> 1 depression           7          2
#> 2   euthymia           7          2
#> 3      mania           7          2
#> 4      mixed           7          2

As a result, we get each phase together with its time range. In our example every phase has the same time window definition, but it can be adjusted to reflect different scenarios.
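Since the config printed above is an ordinary data frame with the columns phase, days_before and days_after, one way to adjust it is to overwrite individual rows. The sketch below rebuilds the default config by hand in base R and then changes the window for two phases. The name custom_config and the specific numbers are illustrative assumptions, and whether the downstream functions accept a hand-edited config should be checked against the package help.

```r
# Hand-built copy of the default config shown above (base R only).
custom_config <- data.frame(
  phase       = c("depression", "euthymia", "mania", "mixed"),
  days_before = 7L,
  days_after  = 2L
)

# Hypothetical adjustment: a longer window for euthymia, a shorter one for mania.
custom_config$days_before[custom_config$phase == "euthymia"] <- 14L
custom_config$days_after[custom_config$phase == "euthymia"]  <- 7L
custom_config$days_before[custom_config$phase == "mania"]    <- 3L

# Number of days covered by each window, including the visit day itself.
custom_config$window_length <-
  custom_config$days_before + custom_config$days_after + 1L
custom_config
```

The window_length column is only there to make the effect of the change visible; it is not part of the config format shown in the vignette.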

Confidence

We’re going to model how confident we are when extrapolating the patient’s actual state over the time range we defined. For example, we may assume that we are most certain about the state on day 0, i.e. the visit day, and that our confidence decreases as we move away from it. This is one possible scenario. Another is to treat the whole time window as equally certain, so that our confidence about the state is the same for every day from 7 days before to 2 days after the visit. Either way, we need a way to quantify that confidence.

The package provides a function, add_confidence, which we use to enhance our data with confidence. The confidence is expressed as a number between 0 and 1.

Here are some examples of how this function can be used. For details, please refer to its help file:

# constant:
auto_config %>% add_confidence(values = .5)
#> # A tibble: 40 × 3
#>    phase      time_point confidence
#>    <chr>           <int>      <dbl>
#>  1 depression         -7          1
#>  2 depression         -6          1
#>  3 depression         -5          1
#>  4 depression         -4          1
#>  5 depression         -3          1
#>  6 depression         -2          1
#>  7 depression         -1          1
#>  8 depression          0          1
#>  9 depression          1          1
#> 10 depression          2          1
#> # … with 30 more rows
auto_config %>% add_confidence(values = .5, normalize = FALSE)
#> # A tibble: 40 × 3
#>    phase      time_point confidence
#>    <chr>           <int>      <dbl>
#>  1 depression         -7        0.5
#>  2 depression         -6        0.5
#>  3 depression         -5        0.5
#>  4 depression         -4        0.5
#>  5 depression         -3        0.5
#>  6 depression         -2        0.5
#>  7 depression         -1        0.5
#>  8 depression          0        0.5
#>  9 depression          1        0.5
#> 10 depression          2        0.5
#> # … with 30 more rows
# steps:
auto_config %>% add_confidence(values = c(0.5, 0.5, 0.7, 0.7, 1, 1, 1, 1, 1, 1))
#> # A tibble: 40 × 3
#>    phase      time_point confidence
#>    <chr>           <int>      <dbl>
#>  1 depression         -7        0.5
#>  2 depression         -6        0.5
#>  3 depression         -5        0.7
#>  4 depression         -4        0.7
#>  5 depression         -3        1  
#>  6 depression         -2        1  
#>  7 depression         -1        1  
#>  8 depression          0        1  
#>  9 depression          1        1  
#> 10 depression          2        1  
#> # … with 30 more rows
# Gaussian density function:
gauss <- auto_config %>% add_confidence(func = dnorm)

add_confidence(values = .5) sets a constant confidence value of 0.5, but by default the function normalizes the confidence so that the maximum value is 1, which is why the first output shows 1 for every day. The second example, with normalize = FALSE, keeps the raw value of 0.5.
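To make the effect of normalization concrete, here is a minimal base R sketch. It assumes that normalizing means dividing by the maximum, which is consistent with the outputs above but is not taken from the package source; normalize_max is a hypothetical helper, not a package function.

```r
# Time points of the default window: 7 days before to 2 days after the visit.
time_point <- -7:2

# A constant raw confidence of 0.5 for every day in the window.
raw_constant <- rep(0.5, length(time_point))

# Assumed normalization rule: divide by the maximum so the largest value is 1.
normalize_max <- function(x) x / max(x)

normalized_constant <- normalize_max(raw_constant)
normalized_steps    <- normalize_max(c(0.25, 0.25, 0.5, 0.5))

normalized_constant  # every value becomes 1
normalized_steps     # relative steps are preserved: 0.5 0.5 1.0 1.0
```

Under this rule a vector whose maximum is already 1, like the one in the steps example, is left unchanged.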

The last example shows that we can also fit the time window with a function, e.g. the Gaussian density function. By default it is centered at 0 (the day of the visit).
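The shape of a Gaussian confidence curve over the default window can be previewed with base R alone. The sketch below evaluates dnorm at the time points -7 to 2 and rescales it so that the visit day gets a confidence of 1. How add_confidence evaluates and normalizes func internally is an assumption here, and the sd = 3 variant is only an illustration of a gentler decay.

```r
time_point <- -7:2

# Standard normal density (mean 0, sd 1), rescaled so the visit day has confidence 1.
conf_sd1 <- dnorm(time_point) / max(dnorm(time_point))

# A wider curve (sd = 3, an illustrative choice) decays more slowly.
conf_sd3 <- dnorm(time_point, sd = 3) / max(dnorm(time_point, sd = 3))

round(data.frame(time_point, conf_sd1, conf_sd3), 3)
```

With sd = 1 the confidence falls to roughly 0.01 three days from the visit, so most of a 7-day look-back window ends up with a confidence close to zero. This is worth keeping in mind when choosing the function.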

Next, we’ll apply this information to our original visits data.

config_with_confidence <- auto_config %>% add_confidence(func = dnorm)
extended_visits <- expand_ground_truth_period(
  d = visits,
  config = auto_config,
  phases_col = phases,
  visit_date_col = visit_date
)
extended_visits_clean <- transform_overlapping_phases(extended_visits, config_with_confidence)

Please note the use of two more functions, namely expand_ground_truth_period and transform_overlapping_phases. The first one simply adds confidence to the visits data frame. The second is useful when two visits are close to each other and their time windows overlap. See the help for details about how it’s implemented.
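To see why overlapping windows need special handling, here is a base R sketch that expands two hypothetical visits of one patient to the default window and lists the days claimed by both. It only illustrates the idea: toy_visits and its values are made up, and the sketch does not reproduce how expand_ground_truth_period or transform_overlapping_phases work internally.

```r
# Two hypothetical visits for one patient, five days apart.
toy_visits <- data.frame(
  patient_id = c(1L, 1L),
  visit_date = as.Date(c("2022-04-28", "2022-05-03")),
  phases     = c("depression", "euthymia")
)

# Expand every visit to the days from 7 before to 2 after it.
offsets <- -7:2
expanded <- do.call(rbind, lapply(seq_len(nrow(toy_visits)), function(i) {
  data.frame(
    patient_id = toy_visits$patient_id[i],
    visit_date = rep(toy_visits$visit_date[i], length(offsets)),
    phase      = toy_visits$phases[i],
    time_point = offsets,
    date       = toy_visits$visit_date[i] + offsets
  )
}))

# Days claimed by more than one visit of the same patient are the overlaps
# that have to be resolved before joining with the recordings.
day_key <- paste(expanded$patient_id, expanded$date)
overlap_dates <- sort(unique(expanded$date[duplicated(day_key)]))
overlap_dates
```

Here the two windows share the five days from 2022-04-26 to 2022-04-30, each labelled with a different phase, so some rule is needed to decide which label and confidence to keep.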

Putting it all together

Now we’re ready to incorporate our confidence into the recordings data:

# First, transform visits with expanded time window into wider form
visits_wide <- extended_visits_clean %>%
  mutate(val = 1) %>%
  tidyr::pivot_wider(names_from = phase, values_from = val, values_fill = 0)


# now join with recordings
model_df <- left_join(
  recordings,
  visits_wide,
  by = c("date", "patient_id")
) %>% arrange(patient_id, date, visit_date)

# preview the data:
model_df %>%
  filter(between(date, as.Date("2022-04-17"), as.Date("2022-05-01")), patient_id == 23) %>%
  print(n = 25)
#> # A tibble: 15 × 14
#>    date       patien…¹ recor…²      x1      x2     x3 visit_date visit…³ time_…⁴
#>    <date>        <int>   <int>   <dbl>   <dbl>  <dbl> <date>       <int>   <int>
#>  1 2022-04-17       23    1567  0.169   3.56   0.480  NA              NA      NA
#>  2 2022-04-18       23    1568  1.41    0.0704 0.619  NA              NA      NA
#>  3 2022-04-19       23    1569  1.01   -3.90   0.137  NA              NA      NA
#>  4 2022-04-20       23    1570  0.255   0.761  1.49   NA              NA      NA
#>  5 2022-04-21       23    1571  0.665   0.985  1.10   2022-04-28     116      -7
#>  6 2022-04-22       23    1572 -2.59   -0.215  0.0264 2022-04-28     116      -6
#>  7 2022-04-23       23    1573 -0.244  -1.29   0.289  2022-04-28     116      -5
#>  8 2022-04-24       23    1574 -1.11   -4.56   2.63   2022-04-28     116      -4
#>  9 2022-04-25       23    1575  1.28    0.620  0.720  2022-04-28     116      -3
#> 10 2022-04-26       23    1576  0.0161 -3.91   0.485  2022-04-28     116      -2
#> 11 2022-04-27       23    1577 -0.674   2.23   0.619  2022-04-28     116      -1
#> 12 2022-04-28       23    1578  0.417   1.73   0.174  2022-04-28     116       0
#> 13 2022-04-29       23    1579  1.82   -1.16   1.43   2022-04-28     116       1
#> 14 2022-04-30       23    1580 -0.568   4.66   0.621  2022-04-28     116       2
#> 15 2022-05-01       23    1581 -0.410   3.46   1.26   NA              NA      NA
#> # … with 5 more variables: confidence <dbl>, mania <dbl>, euthymia <dbl>,
#> #   depression <dbl>, mixed <dbl>, and abbreviated variable names ¹​patient_id,
#> #   ²​recording_id, ³​visit_id, ⁴​time_point

The resulting data frame can then be used in a modeling pipeline, e.g. for feature engineering and feature selection.
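As one possible first step in such a pipeline, recordings outside any visit window (where confidence is NA) can be dropped, and the rest filtered by confidence. The sketch below uses a small made-up stand-in for model_df, and the 0.5 threshold is a hypothetical choice rather than a recommendation from the package.

```r
# Toy stand-in for model_df with only the columns needed here (values are made up).
toy_model_df <- data.frame(
  x1         = c(0.2, 1.4, -0.7, 0.4, -0.4),
  confidence = c(NA, 0.1, 0.6, 1.0, NA),
  depression = c(NA, 1, 1, 1, NA)
)

# Keep recordings that fall inside a visit window ...
labelled <- toy_model_df[!is.na(toy_model_df$confidence), ]

# ... and, as one possible choice, only those at or above a confidence threshold.
confident <- labelled[labelled$confidence >= 0.5, ]
nrow(confident)
```

An alternative to a hard threshold would be to keep all labelled rows and pass confidence to the model as a case weight, if the chosen modeling function supports weights.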