In this vignette we’re going to show an example workflow of data preprocessing. We’ll start with a fully synthetic toy dataset that consists of two elements:
patients’ visits data - when a patient visited a psychiatrist
mobile calls recordings - voice parameters from mobile calls
Our goal is to extrapolate the state from a visit date to a wider time window. We’d also like to quantify the confidence of our extrapolation. We’re going to use dedicated functions from the package to do this.
Let’s load the libraries we need:
library(bipolarPreprocessing)
library(ggplot2)
library(dplyr) # %>%, mutate(), left_join(), arrange(), filter() and between() are used below
Next, let’s create the datasets.
set.seed(123)
df <- create_toy_dataset()
visits <- df$visits
recordings <- df$recordings
Let’s take a look at the visits data frame. It has the columns phases, visit_date, patient_id and visit_id. The phases column holds information about the patient’s state on a particular day, e.g. depression.
The recordings data frame contains three call parameters (the x1, x2 and x3 variables) as well as the id of the patient and the date when the call took place.
We’d like to extrapolate the information about a patient’s state to the adjacent days before and after a visit with a psychiatrist. A convenient way to do this is by creating a config that defines time ranges around a visit date for each phase. The auto_create_phases_config function is designed to do this. By default, it creates a time window of 7 days before and 2 days after a visit.
auto_config <- auto_create_phases_config(visits, phases)
auto_config
#> phase days_before days_after
#> 1 depression 7 2
#> 2 euthymia 7 2
#> 3 mania 7 2
#> 4 mixed 7 2
As a result we get each phase together with its time range. In our example every phase has the same time window definition, but it can be adjusted to reflect different scenarios.
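To make the meaning of the config concrete, the sketch below expands each phase into one row per day offset around the visit. It uses base R only and a hand-made copy of the config printed above; it is an illustration of the idea, not the package’s implementation.

```r
# Hand-made copy of the config shown above (values taken from the printed output)
config <- data.frame(
  phase = c("depression", "euthymia", "mania", "mixed"),
  days_before = 7,
  days_after = 2
)

# Expand every phase into one row per day offset relative to the visit (day 0)
time_points <- do.call(rbind, lapply(seq_len(nrow(config)), function(i) {
  data.frame(
    phase = config$phase[i],
    time_point = seq(-config$days_before[i], config$days_after[i])
  )
}))

nrow(time_points)      # 4 phases x 10 days per window
head(time_points, 3)
```

Each phase contributes 10 time points (from -7 to 2), which matches the 40 rows reported in the add_confidence outputs later in this vignette.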
We’re going to model how confident we are when extrapolating the patient’s actual state over the time range we defined. For example, we may assume that we are most certain about the state on day 0, i.e. the visit day, and that our confidence decreases as we move away from it. This is one possible scenario. Another is to treat the whole time window as equally certain, so that our confidence about the state is the same throughout the range from 7 days before to 2 days after the visit. Either way, we need a way to quantify this confidence.
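The two scenarios can be written down as plain numeric vectors. This is a minimal base R sketch; the linear decay formula is a hypothetical choice made for illustration and is not taken from the package.

```r
time_point <- -7:2   # default window: 7 days before to 2 days after the visit

# Scenario 1: every day in the window is equally certain
uniform <- rep(1, length(time_point))

# Scenario 2 (hypothetical formula): confidence falls off linearly
# with the distance from the visit day
decay <- 1 - abs(time_point) / (max(abs(time_point)) + 1)

data.frame(time_point, uniform, decay)
```

With this formula the visit day gets a confidence of 1, while the most distant day (7 days before) still keeps a small positive value instead of dropping to 0.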
There’s a function in our package, add_confidence, which we use to enrich our data with confidence. The confidence is expressed as a number between 0 and 1.
Here are some examples of how this function can be used; for details please refer to its help file:
# constant:
auto_config %>% add_confidence(values = .5)
#> # A tibble: 40 × 3
#> phase time_point confidence
#> <chr> <int> <dbl>
#> 1 depression -7 1
#> 2 depression -6 1
#> 3 depression -5 1
#> 4 depression -4 1
#> 5 depression -3 1
#> 6 depression -2 1
#> 7 depression -1 1
#> 8 depression 0 1
#> 9 depression 1 1
#> 10 depression 2 1
#> # … with 30 more rows
auto_config %>% add_confidence(values = .5, normalize = FALSE)
#> # A tibble: 40 × 3
#> phase time_point confidence
#> <chr> <int> <dbl>
#> 1 depression -7 0.5
#> 2 depression -6 0.5
#> 3 depression -5 0.5
#> 4 depression -4 0.5
#> 5 depression -3 0.5
#> 6 depression -2 0.5
#> 7 depression -1 0.5
#> 8 depression 0 0.5
#> 9 depression 1 0.5
#> 10 depression 2 0.5
#> # … with 30 more rows
# steps:
auto_config %>% add_confidence(values = c(0.5, 0.5, 0.7, 0.7, 1, 1, 1, 1, 1, 1))
#> # A tibble: 40 × 3
#> phase time_point confidence
#> <chr> <int> <dbl>
#> 1 depression -7 0.5
#> 2 depression -6 0.5
#> 3 depression -5 0.7
#> 4 depression -4 0.7
#> 5 depression -3 1
#> 6 depression -2 1
#> 7 depression -1 1
#> 8 depression 0 1
#> 9 depression 1 1
#> 10 depression 2 1
#> # … with 30 more rows
# Gaussian density function:
gauss <- auto_config %>% add_confidence(func = dnorm)
add_confidence(values = .5) adds a constant confidence value of 0.5, but by default the function normalizes the confidence so that the maximum value is 1. You can compare the output with the second example, where normalize = FALSE keeps the value at 0.5.
The last example shows that we can also shape the time window with a function, e.g. the Gaussian density function. By default it is centered at 0 (the day of the visit).
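As a rough illustration of what a Gaussian-shaped confidence looks like over the default window, the base R sketch below evaluates dnorm at each time point and rescales it so that the peak equals 1. Dividing by the maximum is an assumption about how such a normalization could work, not a description of what add_confidence does internally; the sd = 3 variant is likewise hypothetical.

```r
time_point <- -7:2

# Raw Gaussian density centred on the visit day (dnorm defaults: mean = 0, sd = 1)
raw <- dnorm(time_point)

# Assumed normalization: divide by the maximum so the peak equals 1
confidence <- raw / max(raw)

# A wider curve (hypothetical sd = 3) keeps more confidence further from the visit
wide_raw <- dnorm(time_point, sd = 3)
wide <- wide_raw / max(wide_raw)

round(data.frame(time_point, confidence, wide), 4)
```

With the default standard deviation of 1 the confidence is already close to 0 a few days away from the visit, so the width of the curve matters a great deal for a 10-day window.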
Next, we’ll apply this information to our original visits data.
config_with_confidence <- auto_config %>% add_confidence(func = dnorm)
extended_visits <- expand_ground_truth_period(
  d = visits,
  config = auto_config,
  phases_col = phases,
  visit_date_col = visit_date
)
extended_visits_clean <- transform_overlapping_phases(extended_visits, config_with_confidence)
Please note the use of two more functions, namely expand_ground_truth_period and transform_overlapping_phases. The first one simply adds confidence to the visits data frame. The other is useful when two visits are close to each other and their time windows overlap. See the help for details about how it’s implemented.
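To illustrate the overlap problem, the base R sketch below expands two made-up visits of one patient into daily rows and then resolves the days covered by both windows. The rule used here (keep the row with the higher confidence) is an assumption chosen for the example; the rule actually applied by transform_overlapping_phases is described in its help file and may differ. All data and column values are hypothetical.

```r
# Two hypothetical visits of the same patient, 5 days apart
visits_demo <- data.frame(
  visit_id = c(1, 2),
  phase = c("depression", "euthymia"),
  visit_date = as.Date(c("2022-04-20", "2022-04-25"))
)
offsets <- -7:2
conf <- dnorm(offsets, sd = 3) / dnorm(0, sd = 3)  # peak of 1 on the visit day

# Expand each visit into one row per day of its window
expanded <- do.call(rbind, lapply(seq_len(nrow(visits_demo)), function(i) {
  data.frame(
    visit_id = visits_demo$visit_id[i],
    phase = visits_demo$phase[i],
    date = visits_demo$visit_date[i] + offsets,
    time_point = offsets,
    confidence = conf
  )
}))

# Days that fall into both windows
overlap_dates <- unique(expanded$date[duplicated(expanded$date)])

# Assumed rule: for every date keep only the row with the highest confidence
ordered <- expanded[order(expanded$date, -expanded$confidence), ]
resolved <- ordered[!duplicated(ordered$date), ]
```

After this step every date appears at most once, which is what the later join with the recordings needs in order to avoid duplicating calls.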
Now we’re ready to incorporate our confidence into the recordings data:
# First, transform visits with expanded time window into wider form
visits_wide <- extended_visits_clean %>%
  mutate(val = 1) %>%
  tidyr::pivot_wider(names_from = phase, values_from = val, values_fill = 0)

# now join with recordings
model_df <- left_join(
  recordings,
  visits_wide,
  by = c("date", "patient_id")
) %>% arrange(patient_id, date, visit_date)
# preview the data:
model_df %>%
  filter(between(date, as.Date("2022-04-17"), as.Date("2022-05-01")), patient_id == 23) %>%
  print(n = 25)
#> # A tibble: 15 × 14
#> date patien…¹ recor…² x1 x2 x3 visit_date visit…³ time_…⁴
#> <date> <int> <int> <dbl> <dbl> <dbl> <date> <int> <int>
#> 1 2022-04-17 23 1567 0.169 3.56 0.480 NA NA NA
#> 2 2022-04-18 23 1568 1.41 0.0704 0.619 NA NA NA
#> 3 2022-04-19 23 1569 1.01 -3.90 0.137 NA NA NA
#> 4 2022-04-20 23 1570 0.255 0.761 1.49 NA NA NA
#> 5 2022-04-21 23 1571 0.665 0.985 1.10 2022-04-28 116 -7
#> 6 2022-04-22 23 1572 -2.59 -0.215 0.0264 2022-04-28 116 -6
#> 7 2022-04-23 23 1573 -0.244 -1.29 0.289 2022-04-28 116 -5
#> 8 2022-04-24 23 1574 -1.11 -4.56 2.63 2022-04-28 116 -4
#> 9 2022-04-25 23 1575 1.28 0.620 0.720 2022-04-28 116 -3
#> 10 2022-04-26 23 1576 0.0161 -3.91 0.485 2022-04-28 116 -2
#> 11 2022-04-27 23 1577 -0.674 2.23 0.619 2022-04-28 116 -1
#> 12 2022-04-28 23 1578 0.417 1.73 0.174 2022-04-28 116 0
#> 13 2022-04-29 23 1579 1.82 -1.16 1.43 2022-04-28 116 1
#> 14 2022-04-30 23 1580 -0.568 4.66 0.621 2022-04-28 116 2
#> 15 2022-05-01 23 1581 -0.410 3.46 1.26 NA NA NA
#> # … with 5 more variables: confidence <dbl>, mania <dbl>, euthymia <dbl>,
#> # depression <dbl>, mixed <dbl>, and abbreviated variable names ¹patient_id,
#> # ²recording_id, ³visit_id, ⁴time_point
The resulting data frame can be used further in a modeling pipeline, e.g. for feature engineering and feature selection.
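For readers who want to see the widening and joining steps in isolation, here is a small self-contained sketch in base R on made-up data. It mirrors the idea of the pipeline above (one indicator column per phase, then a left join on date and patient_id) but does not use the package, dplyr or tidyr; all values are hypothetical.

```r
# Hypothetical miniature versions of the two tables
recordings_demo <- data.frame(
  patient_id = 1,
  date = as.Date("2022-04-26") + 0:3,
  x1 = c(0.2, -1.1, 0.5, 0.9)
)
visits_long <- data.frame(
  patient_id = 1,
  date = as.Date("2022-04-27") + 0:1,
  phase = c("mania", "mania"),
  confidence = c(1, 0.8)
)

# One indicator column per phase (the role pivot_wider plays above)
for (p in c("depression", "euthymia", "mania", "mixed")) {
  visits_long[[p]] <- as.numeric(visits_long$phase == p)
}
visits_wide_demo <- visits_long[, setdiff(names(visits_long), "phase")]

# all.x = TRUE keeps every recording, like a left join
model_demo <- merge(
  recordings_demo, visits_wide_demo,
  by = c("date", "patient_id"), all.x = TRUE
)
model_demo <- model_demo[order(model_demo$patient_id, model_demo$date), ]
model_demo
```

Recordings that fall outside every visit window keep NA in the confidence and phase columns, just like the first and last rows of the preview above.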