Observations of trends in Electronic Health Record Data Among Patients Who Have Undergone a Stroke Evaluation
Preliminary information
Because patients who are monitored for stroke status during hospital visits are often watched closely and put through a myriad of tests, they may have exceptionally rich EHR data and be useful clinical research subjects. The purpose of this study is to highlight four interesting research questions and answer them using EHR data pulled from patients who had been evaluated for stroke.
1 Setup and Data Ingest
1.1 Initial R setup
The packages and parameters used in this code chunk are present in order to generate a legible report.
library(knitr)
library(rmdformats)
library(rmarkdown)
options(max.print="100")
::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
knitroptions(width = 70)
1.2 Loading Necessary R Packages
Each of the following packages were necessary for the analysis and/or graphical display of information used in this project.
setwd(dirname(getwd()))
source("Love-boost.R")
library(janitor)
library(Epi)
library(kableExtra)
library(GGally)
library(ggforce)
library(ggdist)
library(gghalves)
library(ggmosaic)
library(car)
library(tidyr)
library(magrittr)
library(mosaic)
library(equatiomatic)
library(simputation)
library(patchwork)
library(broom)
library(naniar)
library(tidyverse)
theme_set(theme_bw())
1.3 Data ingest
The following code is present for the purpose of ingesting the the
raw data from the STROKEDATA.csv
file. Additionally, this
code converts certain character variables used in this study into
factors. Specifically, all of the categorical variables were converted
into factors. The data for bmi
, which is a quantitative
variable, was kept in the numeric category alongside
avg_glucose_level
and age
. The id
variable is maintained as a character variable here, as it is only
present for the purpose of displaying a unique number alongside each
observation.
<- read.csv("STROKEDATA.csv") |>
stroke_raw mutate(across(where(is.character), as.factor)) |>
mutate(id = as.character(id)) |>
mutate(across(where(is.integer), as.factor)) |>
mutate(bmi = as.character(bmi)) |>
mutate(bmi = as.numeric(bmi))
Standardizing the “unknown” variables for the raw dataset
It appears that missing values are noted as “unknown” in only the
smoking_status
variable. The following code is meant to
change it to “N/A” so that it matches all other variables. A quick count
of the smoking status factors is provided to show that the name was
properly changed.
$smoking_status <- recode_factor(stroke_raw$smoking_status,
stroke_raw"Unknown" = "N/A")
|> count(smoking_status) stroke_raw
smoking_status n
1 N/A 1544
2 formerly smoked 885
3 never smoked 1892
4 smokes 789
2 Cleaning the Data
Using the code below, a new data frame called
stroke_complete_cases
was created, including only the
patients in stroke_raw
with information on all of the
variables collected. For enhanced legibility, binary categorical
variables with factors of 0 and 1 were renamed such that the factors
reflected the actual clinical observation. For example, instead of “0”
and “1” being used in the hypertension
variable, “No
Hypertension” and “Hypertension” were used respectively. Finally, a
glimpse of the stroke_complete_cases
tibble is
provided.
Because the research question for analysis E, which is explained
later, only involves employed persons, the two categories for unemployed
people (children
and never_worked
) were
omitted from the work_type
variable.
<- stroke_raw |>
stroke_complete_cases filter(complete.cases(id, age, gender,
hypertension, heart_disease,
ever_married, work_type,
Residence_type,
avg_glucose_level, bmi,
smoking_status, stroke), != "N/A", smoking_status != "N/A", gender != "Other") |>
bmi mutate(gender = fct_collapse(factor(gender) ,
"Male" = "Male",
"Female" = c("Female"))) |>
filter(
== "Govt_job" |
work_type == "Private" |
work_type == "Self-employed") |> droplevels()
work_type
$work_type <- recode_factor(stroke_complete_cases$work_type,
stroke_complete_cases"Govt_job" = "Government Job",
"Private" = "Private Sector",
"Self-employed" = "Self-Employed")
#renaming the factors
$hypertension <- recode_factor(stroke_complete_cases$hypertension,
stroke_complete_cases"0" = "No Hypertension",
"1" = "Hypertension")
$heart_disease <- recode_factor(stroke_complete_cases$heart_disease,
stroke_complete_cases"0" = "No Heart Disease",
"1" = "Heart Disease" )
$stroke <- recode_factor(stroke_complete_cases$stroke,
stroke_complete_cases"0" = "No Stroke",
"1" = "Stroke")
<- stroke_complete_cases |>
stroke_complete_cases mutate(smoking_status = fct_relevel(smoking_status, "never smoked",
"formerly smoked", "smokes"))|>
droplevels()
<- stroke_complete_cases |> as_tibble(stroke_complete_cases)
stroke_complete_cases
# Reviewing the data set
glimpse(stroke_complete_cases)
Rows: 3,343
Columns: 12
$ id <chr> "9046", "31112", "60182", "1665", "56669",…
$ gender <fct> Male, Male, Female, Female, Male, Male, Fe…
$ age <dbl> 67, 80, 49, 79, 81, 74, 69, 81, 61, 54, 79…
$ hypertension <fct> No Hypertension, No Hypertension, No Hyper…
$ heart_disease <fct> Heart Disease, Heart Disease, No Heart Dis…
$ ever_married <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes…
$ work_type <fct> Private Sector, Private Sector, Private Se…
$ Residence_type <fct> Urban, Rural, Urban, Rural, Urban, Rural, …
$ avg_glucose_level <dbl> 228.69, 105.92, 171.23, 174.12, 186.21, 70…
$ bmi <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, …
$ smoking_status <fct> formerly smoked, never smoked, smokes, nev…
$ stroke <fct> Stroke, Stroke, Stroke, Stroke, Stroke, St…
Checking to make sure that none of the values for
bmi
are unrealistically high:
The following code is present for the purpose of displaying the
number of bmi
observations above 50 (10 points above the
clinical level of morbid obesity as per the CDC) and below 10
(8 points below the clinically underweight cutoff point as per
the CDC). This allows for any abnormally large or small values in the
data set to be noted so that they can be omitted.
sum(stroke_complete_cases$bmi > 50)
[1] 58
sum(stroke_complete_cases$bmi < 10)
[1] 0
Although the lowest bmi
value in the
stroke_complete_cases
data frame appears to be in a
reasonable area for a low BMI per CDC guidance, there are a total of 58
recorded BMI values over 50, which is considerably above the widely
accepted cutoff value for morbid obesity of 40 (as per CDC guidance).
For this reason values of bmi
over 50 will be removed from
the data frame, as they are either serious anomalies or possibly
improperly measured/reported values.
Removing impossibly high values
The following code is present for the purpose of removing
bmi
values in the stroke_complete_cases
data
frame above 50.
<- stroke_complete_cases[which(stroke_complete_cases$bmi < 50),] stroke_complete_cases
2.1 Creating a data set containing only the variables to be used in this study
Because this project will only use five of the variables included in
stroke_complete_cases
there is no reason to include
information on the other variables in the analytic data frame that will
be used in this project. The following code is present to only select
the necessary variables for this study and omit the remaining
variables.
<- stroke_complete_cases |>
A1_stroke_complete_cases select(id, gender, stroke, bmi,
hypertension, smoking_status, work_type)
glimpse(A1_stroke_complete_cases)
Rows: 3,285
Columns: 7
$ id <chr> "9046", "31112", "60182", "1665", "56669", "5…
$ gender <fct> Male, Male, Female, Female, Male, Male, Femal…
$ stroke <fct> Stroke, Stroke, Stroke, Stroke, Stroke, Strok…
$ bmi <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 29.…
$ hypertension <fct> No Hypertension, No Hypertension, No Hyperten…
$ smoking_status <fct> formerly smoked, never smoked, smokes, never …
$ work_type <fct> Private Sector, Private Sector, Private Secto…
2.2 Checking for missing data on each of the selected variables
Checking the quantitative variable
The following code is present for the purpose of searching for any
extraneous missing data in the quantitative variables. Although missing
cases were previously omitted from the data set, it is important to
check before proceeding with the analyses of this study. A brief numeric
summary is also provided in order to ensure that no impossible values
are included for the bmi
variable.
|> select(bmi) |>
A1_stroke_complete_cases miss_var_summary()
# A tibble: 1 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 bmi 0 0
|>
A1_stroke_complete_cases select(bmi) |>
::inspect() mosaic
quantitative variables:
name class min Q1 median Q3 max mean sd n
1 bmi numeric 11.5 25.3 29.1 33.9 49.9 29.9972 6.377575 3285
missing
1 0
It can be seen above that, as expected, there are no missing
bmi
observations in our analytic tibble. Additionally, it
appears that no bmi
values above 50 are included, as
intended.
Checking binary categorical variables
The following code is present for the purpose of searching for any
extraneous missing data in the binary categorical variables. Also
included is a table from the inspect
function of the
mosaic
package, which displays the name, class, levels, n,
number missing, and distribution of factors for each variable.
|>
A1_stroke_complete_cases select(stroke, gender) |>
miss_var_summary()
# A tibble: 2 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 stroke 0 0
2 gender 0 0
|>
A1_stroke_complete_cases select(stroke, gender) |>
::inspect() mosaic
categorical variables:
name class levels n missing
1 stroke factor 2 3285 0
2 gender factor 2 3285 0
distribution
1 No Stroke (94.6%), Stroke (5.4%)
2 Female (60.9%), Male (39.1%)
It can be seen above that, as expected, there are no missing
stroke
, or gender
observations in our analytic
tibble. Both variables were properly classified as factors with two
understandably named levels.
Checking multicategorical variables
The following code is present for the purpose of searching for any
extraneous missing data in the multicategorical variables. Also included
is a report from the inspect
function of the
mosaic
package, which displays the name, class, levels, n,
number missing, and distribution of factors for each variable.
|>
A1_stroke_complete_cases select(smoking_status, work_type) |>
miss_var_summary()
# A tibble: 2 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 smoking_status 0 0
2 work_type 0 0
|>
A1_stroke_complete_cases select(smoking_status, work_type) |>
::inspect() mosaic
categorical variables:
name class levels n missing
1 smoking_status factor 3 3285 0
2 work_type factor 3 3285 0
distribution
1 never smoked (53.3%) ...
2 Private Sector (65.7%) ...
It can be seen above that, as expected, there are no missing
smoking_status
, or work_type
observations in
our analytic tibble. Both are classified as factors with the intended
number of levels present.
2.3 Printing The analytic tibble
The following code is present for the purpose of renaming the data set containing all of the six variables and printing it to show the final analytic tibble that will be used in all subsequent analyses.
<- A1_stroke_complete_cases
Stroke_data
Stroke_data
# A tibble: 3,285 × 7
id gender stroke bmi hypertension smoking_status work_type
<chr> <fct> <fct> <dbl> <fct> <fct> <fct>
1 9046 Male Stroke 36.6 No Hypertension formerly smoked Private …
2 31112 Male Stroke 32.5 No Hypertension never smoked Private …
3 60182 Female Stroke 34.4 No Hypertension smokes Private …
4 1665 Female Stroke 24 Hypertension never smoked Self-Emp…
5 56669 Male Stroke 29 No Hypertension formerly smoked Private …
6 53882 Male Stroke 27.4 Hypertension never smoked Private …
7 10434 Female Stroke 22.8 No Hypertension never smoked Private …
8 12109 Female Stroke 29.7 Hypertension never smoked Private …
9 12095 Female Stroke 36.8 No Hypertension smokes Governme…
10 12175 Female Stroke 27.3 No Hypertension smokes Private …
# … with 3,275 more rows
List of missing values
The following code provides one final count of missing variables to ensure that subsequent analyses include only complete cases. It should be noted that in this study, all missing data is considered to be missing completely at random (MCAR).
miss_var_summary(Stroke_data)
# A tibble: 7 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 id 0 0
2 gender 0 0
3 stroke 0 0
4 bmi 0 0
5 hypertension 0 0
6 smoking_status 0 0
7 work_type 0 0
3 Codebook and Data Description
Description of the subjects of this study
The data used for this study was taken from the electronic health records (EHRs) of 3285 persons aged 10-82 who, within their EHR, had explicitly been noted as either having suffered a stroke in their life or having not suffered a stroke in their life. All EHR data used in this study was de-identified and made public by McKinsey and Company, a consulting firm with oversight over a number of US hospital EHRs. All subjects had complete data for each of the variables listed below.
3.1 Codebook
Description of Variables
The following table contains the name of each variable in the
Stroke_data
data set used for this analysis, the type of
variable it is, and a brief description of the levels for each variable.
The variable types are denoted as such:
ID = Identification number
Quant = Quantitative variables
Binary = Two-category variables
X-cat = Multicategorical variables) where X indicates the number of levels in the variable.
Variable | Type | Description / Levels |
---|---|---|
id |
ID | A unique identification number assigned to each participant |
gender |
Binary | The gender of the subject based on recording in the patient’s EHR. “Male” indicates that the subject was male, “Female” indicates that they were female. |
stroke |
Binary | Whether or not electronic health records indicated that the individual had suffered a stroke at some point in their life. “Stroke” indicates that they suffered a stroke, “No Stroke” indicates that they did not. |
bmi |
Quant | Current body mass index of the patient as indicated by that patient’s EHR. Body mass index is calculated by dividing weight in pounds (lb) by height in inches (in) squared and multiplying that value by a conversion factor of 703. |
smoking_status |
3-Cat | The smoking status of the patient as indicated by that patient’s EHR. “never smoked” indicates that the patient has no history of smoking, “formerly smoked” indicates that they have, but no longer smoke, and “smokes” indicates that the patient currently smokes. |
work_type |
5-Cat | The category of work of the patient as indicated by that patient’s EHR. “Govt_job” indicates that the patient works for the government (public sector), “Private” indicates that the patient is employed and works for a private company (private sector), “Self-employed” indicates that the patient is self employed. |
hypertension |
Binary | The hypertensive status of the patient as indicated by that patient’s EHR. “No Hypertension” indicates that the patient does not have hypertension (high blood pressure) and “Hypertension” indicates that the patient has hypertension (high blood pressure). |
3.2 Analytic Tibble
The following code is present for the purpose of printing the analytic tibble in order to show that it is, in fact, a tibble.
Stroke_data
# A tibble: 3,285 × 7
id gender stroke bmi hypertension smoking_status work_type
<chr> <fct> <fct> <dbl> <fct> <fct> <fct>
1 9046 Male Stroke 36.6 No Hypertension formerly smoked Private …
2 31112 Male Stroke 32.5 No Hypertension never smoked Private …
3 60182 Female Stroke 34.4 No Hypertension smokes Private …
4 1665 Female Stroke 24 Hypertension never smoked Self-Emp…
5 56669 Male Stroke 29 No Hypertension formerly smoked Private …
6 53882 Male Stroke 27.4 Hypertension never smoked Private …
7 10434 Female Stroke 22.8 No Hypertension never smoked Private …
8 12109 Female Stroke 29.7 Hypertension never smoked Private …
9 12095 Female Stroke 36.8 No Hypertension smokes Governme…
10 12175 Female Stroke 27.3 No Hypertension smokes Private …
# … with 3,275 more rows
3.3 Data summary
The following code is present for the purpose of creating a descriptive summary of each of the variables used in the tibble for this study.
::describe(Stroke_data) Hmisc
Stroke_data
7 Variables 3285 Observations
----------------------------------------------------------------------
id
n missing distinct
3285 0 3285
lowest : 10056 10119 10133 10138 10139, highest: 9730 9752 9923 9926 9955
----------------------------------------------------------------------
gender
n missing distinct
3285 0 2
Value Female Male
Frequency 2000 1285
Proportion 0.609 0.391
----------------------------------------------------------------------
stroke
n missing distinct
3285 0 2
Value No Stroke Stroke
Frequency 3106 179
Proportion 0.946 0.054
----------------------------------------------------------------------
bmi
n missing distinct Info Mean Gmd .05
3285 0 318 1 30 7.144 21.00
.10 .25 .50 .75 .90 .95
22.40 25.30 29.10 33.90 39.06 42.20
lowest : 11.5 14.1 15.0 15.7 16.0, highest: 49.3 49.4 49.5 49.8 49.9
----------------------------------------------------------------------
hypertension
n missing distinct
3285 0 2
Value No Hypertension Hypertension
Frequency 2892 393
Proportion 0.88 0.12
----------------------------------------------------------------------
smoking_status
n missing distinct
3285 0 3
Value never smoked formerly smoked smokes
Frequency 1751 811 723
Proportion 0.533 0.247 0.220
----------------------------------------------------------------------
work_type
n missing distinct
3285 0 3
Value Government Job Private Sector Self-Employed
Frequency 505 2159 621
Proportion 0.154 0.657 0.189
----------------------------------------------------------------------
4 Analysis B: Comparing Two Population Means with Independent Samples
4.1 Research Question
In this analysis, the bmi
values in the
Stroke_data
tibble will be compared to the
stroke
status of the patients included in the study, with
the objective of identifying if there is any meaningful difference in
bmi
for patients who were recorded as having a stroke vs
those who were not recorded as having a stroke. This information may be
a useful hypothesis generator for researchers looking to investigate the
possible clinical appearances of patients more likely to suffer a
stroke. For this study, a 90% confidence interval will be used in all
analyses. The research question is as follows:
Is there a meaningful difference in mean BMI between patients that have been noted as having suffered a stroke in their electronic medical history and those who have not been noted as having suffered a stroke in their electronic medical history?
4.2 Describing the Data
In this analysis, two variables will be examined. One is stroke
status, which is a binary categorical variable named stroke
with the levels “No Stroke” and “Stroke”, representing whether or not
electronic medical records indicated that the individual had suffered a
stroke at some point in their life. “Stroke” indicates that they
suffered a stroke, “No Stroke” indicates that they did not.
The second variable is BMI, which is a quantitative variable describing the current body mass index of the patient as indicated by that patient’s EHR.
Numerical summary
The following code is present for the purpose of displaying a brief
numerical summary of the bmi
data between patients that
have suffered a stroke and patients who have not.
::favstats(bmi ~ stroke, data = Stroke_data) |>
mosaickbl(caption = "Numerical summary of BMI by stroke status",
digits = 2) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
stroke | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
No Stroke | 11.5 | 25.3 | 29.0 | 33.9 | 49.9 | 29.97 | 6.40 | 3106 | 0 |
Stroke | 16.9 | 26.6 | 29.7 | 33.9 | 48.9 | 30.51 | 6.02 | 179 | 0 |
It can be seen in the table above that the two stroke
classified groups have very similar values for max, mean, median, and
interquartile range. However the minimum value for the “No Stroke” group
was considerably lower than that of the “Stroke” group. It should also
be noted that there are many more patients within the “No Stroke” group
than the “Stroke” group.
Graphical assessment of Symmetry and Normality
In order to visualize the overall distribution of data in terms of
BMI, a histogram was constructed comprising the entire range for
bmi
of the Stroke_data
data set, with portions
of the histogram highlighted to show the distribution of the two
stroke
groups.
ggplot(Stroke_data, aes(x = bmi, fill = stroke, color = 2)) +
geom_histogram(bins = 75) +
guides(color = "none") +
labs(title = "Histogram of BMI Values For All Patients Assessed for Stroke",
x = "Body Mass Index (BMI)") + theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
theme(plot.background = element_rect(fill = "azure2"))
The above plot reveals a distribution that appears to be fairly
normal with some right skew. It can be seen that the normal distribution
is present in both categories of stroke
, but that the “No
Stroke” category was more pronounced. This was expected, as in the
numerical summary it was demonstrated that the “No Stroke” category had
almost 3000 more observations than the “Stroke” category.
A box and violin plot juxtaposed with normal QQ plots for
bmi
in each stroke
category is also useful for
visualizing the distribution of values in this data. The following code
displays such a plot:
<- ggplot(Stroke_data, aes(x = stroke, y = bmi, fill = stroke)) +
p1 geom_violin(alpha = 0.3) +
geom_boxplot(width = 0.3, notch = TRUE) +
guides(fill = "none") +
labs(title = "BMI by Stroke Status",
x = "EHR indication of a stroke",
y = "Body Mass Index (BMI)") +
theme(plot.background = element_rect(fill = "azure2"))
<- ggplot(Stroke_data, aes(sample = bmi, col = stroke)) +
p2 geom_qq() + geom_qq_line() +
facet_wrap(~ stroke, labeller = "label_both") +
guides(col = "none") +
theme_bw() +
labs(y = "Body Mass Index (BMI)",
title = "Normal QQ plot for BMI") +
theme(plot.background = element_rect(fill = "azure2"))
+ theme(
p1 panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
+
p2 theme(plot.background = element_rect(fill = "azure2")) +
theme(panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid"))
The two violin plots demonstrate a relatively normal distribution and
few outliers for both of the stroke
categories. The normal
QQ plots help to visualize the degree of skew seen in the data, which,
as suggested by the histogram is to the right, slightly.
In order to better visualize the distribution of bmi
for
each stroke
category and specifically to visualize the
highest density areas of population for each `bmi, a rain cloud plot ma
be useful. The following code is present for the purpose of displaying a
rain cloud plot.
ggplot(Stroke_data, aes(stroke, bmi)) +
::stat_halfeye(adjust = .5, width = .3,
ggdist.width = 0, justification = -.3,
point_colour = NA, fill = "darkturquoise") +
geom_boxplot(width = .1, outlier.shape = NA, fill = "tomato1" ,
outlier.color = "black") +
stat_summary(fun = "mean", geom = "point",
shape = 23, size = 3, fill = "white") +
theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
labs(
title = "BMI by stroke status",
subtitle =
"For Patients Who Have Stroke Evaluation Information in Their EHR",
y = "Body Mass Index (BMI)",
x="Stroke Status") +
theme(plot.background = element_rect(fill = "azure2"))
It can be seen in the rain cloud plot above, similarly to the violin
plot depicted above, that the mean and median values of bmi
are very close for both stroke
categories. It can also be
seen that the greatest number of people in each group have
bmi
values slightly above 30, which the CDC would consider
to be overweight.
Numerical Assessment of Skew
In order to determine the type of test used to figure out if there is
a statistically meaningful difference between the bmi
values of patients who had a stroke and those who did not, the normality
of the distribution of bmi
values must be assessed. While
the visualizations are useful in determining obvious skew, it is also
useful to ensure that the data is normal using a nonparametric skew
analysis. In this analysis, nonparametric skew is calculated by taking
the difference between the mean and the median, and dividing it by the
standard deviation. This particular type of skew is based off of
Pearson’s notion of median skewness, with values falling between -1 and
+1 for any distribution. Generally, when this measure of skew is used,
values greater than + 0.2 indicate fairly substantial right skew, while
values below -0.2 indicate fairly substantial left skew. If the skewness
value for bmi
is between -0.2 and +0.2, the data will be
considered to be normally distributed.
|> group_by(stroke) |>
Stroke_datasummarize(skew = round_half_up((mean(bmi) - median(bmi))/sd(bmi), 3)) |>
kbl(caption = "Assessment of Skew by Stroke Status", digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
stroke | skew |
---|---|
No Stroke | 0.151 |
Stroke | 0.134 |
Above, it can be seen that there is no substantial skew in either of
the stroke
categories for bmi
, only some mild
right skew. For this reason the data will be considered to be normally
distributed.
Assessment of population variance
The population variance is also an important measure to consider when
deciding which method to utilize to compare the means of two independent
samples. The following code is present for the purpose of displaying the
population variance in bmi
for each stroke
category.
|> group_by(stroke) |>
Stroke_data summarize(n = n(), mean = mean(bmi), variance = var(bmi)) |>
kbl(caption = "Assessment of Population Variance", digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
stroke | n | mean | variance |
---|---|---|---|
No Stroke | 3106 | 29.96784 | 40.92328 |
Stroke | 179 | 30.50670 | 36.26804 |
Although the “Stroke” and “Not Stroke” categories have similar
variances, they are still nearly 5 bmi
points away, and for
this reason they cannot be considered to be equal.
Overall Assessment of Plots and Numeric Summaries
The plots and numeric summaries depicted above were all in
agreement that the data was reasonably normally distributed with some
mild, but unsubstantial right skew in both stroke
categories. Additionally, it was observed that there was not similar
variance in BMI values between the two stroke status
categories.
Sample information
It can be deduced from the above numerical summaries and plots that
the bmi
data for the Stroke_data
set
is normally distributed among both stroke
categories. Only a small amount amount of right skew was observed in the
distribution, which was determined by a nonparametric skew analysis to
be unimportant. The presence of a normal distribution allows the means
of these independent samples to being compared accurately using an
approach based on a t-distribution. With the presence of a normal
distribution, a bootstrap re-sampling approach is not necessary to
compare the bmi
means in the two stroke
categories. Additionally, because the objective of the research question
was to compare the samples based on their mean values, they are not best
compared using a rank-based approach like that seen in a
Wilcoxon-Mann-Whitney rank-sum test. This is because this category of
tests does not generate a CI based on the means of the samples, but for
calculated pseudo-medians, which are not analogous to the means of the
samples if the samples are not symmetrically distributed.
Given that equal population variance was not able to be established for the two independent samples, a pooled t-test may not be used here. Instead, a t-test will be completed with Welch’s approach, which does not assume equal variance. Welch’s approach has two additional conditions, that the samples in each group are drawn independently of each other and that the samples in each group represent a random sample of the population of interest. Our two groups that we are comparing means for are independent samples, with no matching or pairing present and both are drawn from wide ranging de-identified EHR data, which is widely used around the United States in hospitals that evaluate patients for strokes. Thus, these samples can be considered to be random samples.
4.3 Main Analysis
The following code is present for the purpose of comparing the means
of the two stroke
categories using a
t-test . A Welch’s t-test approach will be used because
equal variance could not be established. It is important to note that
Welch’s t-test carries two additional assumptions to the normality
assumption: that the samples in each group are drawn independently of
each other and that the samples in each group represent a random sample
of the population of interest. Both of those can be considered to be
true in this case for reasons mentioned above. A summary of the 90%
confidence interval for the difference (No stroke - Stroke) of the
population means is included as well.
<- t.test(bmi ~ stroke, data = Stroke_data, conf.level = 0.90)
wtt
wtt
Welch Two Sample t-test
data: bmi by stroke
t = -1.16, df = 201.85, p-value = 0.2474
alternative hypothesis: true difference in means between group No Stroke and group Stroke is not equal to 0
90 percent confidence interval:
-1.3064782 0.2287433
sample estimates:
mean in group No Stroke mean in group Stroke
29.96784 30.50670
|>
wtt tidy() |>
select(estimate, conf.low, conf.high) |>
kbl(caption = "Welch's T-test", digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
estimate | conf.low | conf.high |
---|---|---|
-0.53887 | -1.30648 | 0.22874 |
4.4 Conclusions
From the above analysis, which uses a Welch’s T-test approach to
compare the mean bmi
values for the two categories of the
stroke
variable in the Stroke_data
data set,
two point estimates were able to be obtained. The point estimate for the
mean BMI in the “No Stroke” group was 29.96784 and the point estimate
for the mean BMI in the “Stroke” group was 30.50670. The 90% confidence
interval for the difference (No stroke - Stroke) of the population means
was found to be from -1.3064782 to 0.2287433. Because the t-tests’
resulting 90% confidence interval includes zero in its range, it can be
concluded that the true potential difference in mean BMI values between
the “No Stroke” and “Stroke” groups could be positive, negative, or
zero. Thus we can conclude that, with an alpha level of 0.10 and the
sample size used, there is not a meaningful difference in mean BMI
between patients that have been noted as having suffered a stroke in
their electronic medical history and those who have not been noted as
having suffered a stroke in their electronic medical history using the
Stroke_data
set.
Although this particular study was not able to establish a difference
in mean bmi
values among the two stroke
categories, there are a number of potential limitations to this
particular study. One of these limitations is that the type of stroke
that a patient suffered was not mentioned. Because strokes can be caused
by ischemia or hemorrhage in a multitude of different areas of the brain
and can have a wide range of etiologies, treating all strokes as the
same when considering the BMI of a patient may not accurately describe
the true relationship between stroke and BMI. Also, the number of
strokes that a patient suffered and the length of time since that
patient’s last stroke were not mentioned, each of which may have
different effects on BMI.
Future studies looking to determine whether or not there is a
difference in mean bmi
values based on stroke status may
opt to dig deeper into the electronic health records and create a data
set that includes information on the type of stroke a patient suffered,
the number of strokes they had, and how recently their most recent
stroke was. Information on how medical staff intervened with the
patient’s stroke may also be useful in determining whether or not BMI
values differ based on stroke status.
5 Analysis C: Comparing Three Means with Independent Samples
5.1 Research Question
In this analysis, the mean bmi
values in the
Stroke_data
tibble will be compared by the
smoking_status
of the patients included in the study, with
the objective of identifying if there is any meaningful difference in
bmi
for patients who were recorded in their EHR as being
either a smoker, former smoker, or person who has never smoked. The
three smoking_status
categories are all independent of one
another and none of the results are paired or matched. This information
may be a useful hypothesis generator for researchers looking to further
investigate the physiological changes among smokers, non-smokers and
former smokers. The research question is as follows:
Is there a meaningful difference in mean BMI values between patients who smoke currently, those who have smoked in the past, and those who do not smoke, among patients who have a stroke evaluation in their electronic health record?
5.2 Describing the Data
In this analysis, two variables will be examined. One is smoking
status, which is a three level categorical variable named
smoking_status
with the levels “never smoked”, “former
smoker”, and “smokes”, representing the smoking status of the patient as
indicated by that patient’s EHR. “never smoked” indicates that the
patient has no history of smoking, “formerly smoked” indicates that they
have, but no longer smoke, and “smokes” indicates that the patient
currently smokes.
The second variable is BMI, which is a quantitative variable describing the current body mass index of the patient as indicated by that patient’s EHR.
Numeric summary
The following code is present for the purpose of displaying a brief
numerical summary of the bmi
data between patients in each
of the three smoking_status
categories.
::favstats(bmi ~ smoking_status, data = Stroke_data) |>
mosaickable(caption = "BMI Numerical Summary by Smoking Category",
digits = 2) |> kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
smoking_status | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
never smoked | 11.5 | 25.0 | 28.7 | 33.40 | 49.8 | 29.69 | 6.41 | 1751 | 0 |
formerly smoked | 15.0 | 26.1 | 29.8 | 34.35 | 49.8 | 30.55 | 6.32 | 811 | 0 |
smokes | 15.7 | 25.4 | 29.0 | 34.25 | 49.9 | 30.13 | 6.32 | 723 | 0 |
It can be seen in the table above that the three
smoking_status
categories have very similar values for max,
mean, median, and interquartile range. It can also be seen that the
“formerly smoked” and “smokes” categories each have less than half of
the number of observations as the “never smoked” category.
Graphical assessment
In order to visualize the overall distribution of data in terms of
BMI, a histogram was constructed comprising the entire range for
bmi
of the Stroke_data
data set, with portions
of the histogram highlighted to show the distribution of the three
smoking_status
categories. To better individually highlight
the distributions of the three categories, an additional three
histograms were made.
<- ggplot(Stroke_data, aes(x = bmi, fill = smoking_status, color = 2)) +
p3 geom_histogram(bins = 75) +
guides(color = "none") +
labs(title = "Histogram of BMI results For All Smoking Statuses",
x = "BMI") +
theme(panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
theme(plot.background = element_rect(fill = "azure2"))
<- ggplot(Stroke_data, aes(x = bmi)) +
p4 geom_histogram(aes(fill = smoking_status), bins = 30, col = "white") +
theme_bw() +
facet_wrap(~ smoking_status) + guides(fill = "none") +
labs(title = "Individual histograms of BMI by smoking status",
y = "count",
x = "Smoking Status") +
theme(panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
theme(plot.background = element_rect(fill = "azure2"))
/p4 p3
It can be seen above that each of the three
smoking_status
categories appear to have a normal
distribution with some slight right skew. All three categories appear to
have the greatest number of people at a BMI value around 30, which is
consistent with the numeric summary.
In order to better visualize the distribution of bmi
for
each smoking_status
category and specifically to visualize
the highest density areas of population for each, a rain cloud plot may
be useful. The following code is present for the purpose of displaying a
rain cloud plot.
ggplot(Stroke_data, aes(smoking_status, bmi)) +
::stat_halfeye(adjust = .5, width = .3,
ggdist.width = 0, justification = -.3,
point_colour = NA, fill = "darkturquoise") +
geom_boxplot(width = .1, outlier.shape = NA, fill = "tomato1" ,
outlier.color = "black", notch = TRUE) +
stat_summary(fun = "mean", geom = "point",
shape = 23, size = 3, fill = "white") +
theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
labs(
title = "BMI by Smoking Status",
subtitle =
"For Patients Who Have Stroke Evaluation Information in Their EHR",
y = "Body Mass Index (BMI)",
x="Smoking Status") +
theme(plot.background = element_rect(fill = "azure2"))
It can be seen in the above rain cloud plot that the mean and median
values of bmi
are very close for all three of the
smoking_status
categories. It can also be seen that the
greatest number of people in each group have bmi
values
slightly below 30, which the CDC would consider to be in a healthy range
(meaning neither overweight or underweight). The data also appear to
have relatively similar variance in each smoking_status
category.
The distribution of data can be further investigated using a box and
violin plot juxtaposed with normal QQ plots for bmi
in each
smoking_status
category. The following code allows for
that.
<- ggplot(Stroke_data,
p1 aes(x = smoking_status, y = bmi, fill = smoking_status)) +
geom_violin(alpha = 0.3) +
geom_boxplot(width = 0.3, notch = TRUE) +
guides(fill = FALSE) +
labs(title = "BMI by Smoking Status",
x = "Smoking Status", y = "Body Mass Index") +
theme(plot.background = element_rect(fill = "azure2"))
<- ggplot(Stroke_data, aes(sample = bmi, col = smoking_status)) +
p2 geom_qq() + geom_qq_line() +
facet_wrap(~ smoking_status) +
guides(col = "none") +
theme_bw() +
labs(y = "Observed BMI values",
title = "Normal QQ Plot") +
theme(plot.background = element_rect(fill = "azure2"))
+ theme(
(p1 panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) ) + (p2 +
plot_annotation(title = "Overall title") + theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) )
The three violin plots demonstrate a relatively normal distribution
and few outliers for each of the smoking_status
categories.
The normal QQ plots help to visualize the degree of skew seen in the
data, which, as suggested by the histogram is to the right, slightly for
all three smoking_status
categories.
Numerical Assessment of Skew
A nonparametric skewness assessment was done to help establish
whether or not the data was normally distributed. In this analysis,
nonparametric skew is calculated by taking the difference between the
mean and the median, and dividing it by the standard deviation. This
particular type of skew is based off of Pearson’s notion of median
skewness, with values falling between -1 and +1 for any distribution.
Generally, when this measure of skew is used, values greater than + 0.2
indicate fairly substantial right skew, while values below -0.2 indicate
fairly substantial left skew. If the skewness value for bmi
is between -0.2 and +0.2, the data will be considered to be normally
distributed.
|> group_by(smoking_status) |>
Stroke_datasummarize(skew = round_half_up((mean(bmi) - median(bmi))/sd(bmi), 3)) |>
kbl(caption = "Assessment of Skew", digits = 2) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
smoking_status | skew |
---|---|
never smoked | 0.15 |
formerly smoked | 0.12 |
smokes | 0.18 |
It can be seen in the table above that the “smokes” category of the
smoking_status
variable demonstrated the greatest amount of
right skew of all of the three categories. However, none were above 0.2,
indicating that there was no substantial skew, only some mild right
skew. For this reason, the data will be considered to be normally
distributed.
5.3 Main Analysis
The assessment of skewness, histograms, box plots, and QQ diagrams of
bmi
data for each of the three smoking_status
categories suggested that they all have a fairly normal distribution. In
addition to normality, the samples in each of the
smoking_status
categories were independently sampled and,
as mentioned in analysis B, the data used in this study is reflective of
a random sample of the population of interest. Moreover, equal variance
is seen across all three smoking_status
categories. Given
this information, an ANOVA (analysis of variance) test, which is based
on a pooled t-test, will be used to compare the mean BMI for each of the
smoking_status
categories.
<- lm(bmi ~ smoking_status, data = Stroke_data)
Stroke_data_m1
anova(Stroke_data_m1) |>
kbl(caption= "Analysis of Variance to Compare Mean BMI by Smoking Status",
digits = 2) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
---|---|---|---|---|---|
smoking_status | 2 | 423.77 | 211.88 | 5.22 | 0.01 |
Residuals | 3282 | 133147.87 | 40.57 | NA | NA |
From the above analysis of variance (ANOVA), it appears that the null
hypothesis, which was that there was no difference in mean BMI values
between patients who smoke currently, those who have smoked in the past,
and those who do not smoke is not consistent with our data for
bmi
among smoking_status
categories. The eta
squared value is calculated as follows:
(423.77)/(423.77 + 133147.87) = 0.0032
The value for eta squared shows that smoking_status
accounts for around 0.32% of the variation in bmi
.
Tukey multiple comparisons of means
Although the ANOVA was able to elucidate whether or not the three
bmi
means across the smoking_status
categories
are different, it does not explain where the differences lie. To explore
the potential differences further, Tukey’s Honestly Significant
Differences approach to pairwise comparisons of means was used. The
following code uses this technique to create 90% confidence intervals
for the differences in mean BMI for each of the possible
smoking_status
pairs.
TukeyHSD(aov(Stroke_data$bmi ~ Stroke_data$smoking_status), conf.level = 0.90)
Tukey multiple comparisons of means
90% family-wise confidence level
Fit: aov(formula = Stroke_data$bmi ~ Stroke_data$smoking_status)
$`Stroke_data$smoking_status`
diff lwr upr p adj
formerly smoked-never smoked 0.8576613 0.3022364 1.413086 0.0043846
smokes-never smoked 0.4401911 -0.1378742 1.018256 0.2617963
smokes-formerly smoked -0.4174702 -1.0863104 0.251370 0.4057613
The 90% confidence intervals derived from the Tukey’s HSD analysis show that there is likely a considerable difference between the “formerly smoked” and “never smoked” groups, but not between any of the other possible pairs.
Plot of Tukey multiple comparisons of means
The following code is present for the purpose of plotting the 90% confidence intervals associated with the above Tukey’s HSD analysis.
For the below plot, the following shorthand was used
Abbreviation | Category of smoking_status |
---|---|
S | smokes |
NS | never smoked |
FS | formerly smoked |
par(bg = "azure2")
$smk.new <-
Stroke_datafct_recode(Stroke_data$smoking_status,
"S" = "smokes", "NS" = "never smoked",
"FS" = "formerly smoked")
<- c(5,6,4,2) + 0.1
mar.default par(mar = mar.default + c(0, 4, 0, 0))
plot(TukeyHSD(aov(Stroke_data$bmi ~ Stroke_data$smk.new),
conf.level = 0.90, las = 1))
It can be seen above that the 90% confidence intervals derived from
the Tukey’s HSD analysis of mean bmi
by smoking status
group show that there is a statistically meaningful difference between
the “formerly smoked” and “never smoked” groups, but not between any of
the other possible pairs. Given that the difference between the
“formerly smoked” and “never smoked” groups is positive, it can be
concluded that the mean bmi
value for the “formerly smoked”
group is greater than that of the “never smoked group. There is no
evidence that there is a meaningful difference between mean
bmi
values for any of the other pairs, and it cannot
explicitly concluded with the current sample size and an alpha level of
0.10 that any of the other means are meaningfully different from each
other.
5.4 Conclusions
The research question that guided the design of this analysis was “Is
there a meaningful difference in mean BMI values between patients who
smoke currently, those who have smoked in the past, and those who do not
smoke, among patients who have a stroke evaluation in their electronic
health record?”. From the ANOVA test and analysis of Tukey’s HSD for the
mean BMI values of each of the categories of
smoking_status
, it was able to be seen that there was only
a meaningful difference in BMI values between patients who have formerly
smoked and those who have never smoked. When the differences between
other pairs of smoking_status
categories were calculated,
90% confidence intervals were not able to establish an explicitly
positive or negative value and were thus considered to not be
meaningful, as it was not able to be determined which category had a
larger or smaller true mean BMI value.
Although this particular study was not able to establish a difference
in mean bmi
values among all of the
smoking_status
categories, there are a number of potential
limitations to this particular study. One of these limitations is that
the data used to report the smoking status of a patient did not specify
the type of smoking that the patient partakes in. It may be possible
that while a tobacco smoker does not have an appreciable difference in
mean BMI values based on their smoking status, that a marijuana smoker
does. Furthermore, the status of each patient as a heavy, mild and light
smoker was not specified, which may also be a modulator of the patient’s
BMI value based on their smoking status.
Future studies looking to determine whether or not there is a difference in mean BMI values among the three smoking status categories may seek to adjust for the type of smoking a person partakes in and/or their status as a heavy, mild, or light smoker. This would, however, require additional data to be collected that is not included in the data set used in this analysis.
6 Analysis D: Analyzing a 2x2 table
6.1 Research Question
For this analysis, the focus will be on the gender
and
stroke
variables, which are each two level categorical
variables. Particularly, this analysis will be considering whether or
not the gender of a patient that has been evaluated for a stroke has an
impact on whether or not they did have a stroke. In this analysis, a 90%
confidence level will be used and a Bayesian augmentation of +2 success
and +2 failures will be used. The research question is as follows:
Is there a relationship between the gender of a person (male or female) and whether or not that person has had a stroke among patients who have a stroke evaluation in their electronic health record?
6.2 Describing the Data
For this analysis two variables in the Stroke_data
data
set are considered: gender
and stroke
. Both of
these are binary categorical variables. Gender refers to the gender of
the subject based on recording in the patient’s EHR. “Male” indicates
that the subject was male, “Female” indicates that they were female.
Stroke refers to whether or not electronic health records indicated that
a patient had suffered a stroke at some point in their life. “Stroke”
indicates that they suffered a stroke, “No Stroke” indicates that they
did not.
The following code is present for the purpose of creating a two by two table to analyze the potential relationship between the two chosen variables. In this table, which is displayed in standard epidemiological format, the `gender status “Male” is considered to be the exposure of interest. The purpose of this table is to visualize the distribution of patients, and for this reason, a Bayesian augmentation was withheld. However, it will be added during the main analysis.
# Re-leveling the factors so that the tabyl will be in standard epidemiological format:
<- Stroke_data |>
Stroke_data_analysisD mutate(stroke = fct_relevel(stroke, "Stroke", "No Stroke")) |>
mutate(gender = fct_relevel(gender, "Male", "Female"))
|> tabyl(gender, stroke) |>
Stroke_data_analysisD adorn_percentages(denom = "row") |>
adorn_pct_formatting(digits = 1) |>
adorn_ns(position = "front") |>
adorn_title(col_name = "Stroke status by gender",
row_name = "Gender") |>
kbl(caption = "Stroke Status by Gender", digits = 2) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
Stroke status by gender | ||
---|---|---|
Gender | Stroke | No Stroke |
Male | 75 (5.8%) | 1210 (94.2%) |
Female | 104 (5.2%) | 1896 (94.8%) |
6.3 Main Analysis
The following code is present for the purpose of creating a
two-by-two table analysis, displaying the a point estimate and 90%
confidence interval for the probability of having a specific
stroke
status between the two gender
categories, a relative risk of stroke given that the gender is male (vs
female), and the odds ratio describing the odds of stroke use given that
the gender is Male (vs female).
A Bayesian augmentation was added to the analysis table, adding two successes and two failures in accordance with Alan Agresti’s suggestion of using (x +2)/(n +4) as a success rate estimate for categorical data analysis rather than (x))/(n)
twobytwo(75+2, 1249+2, 104+2, 1939+2,
"Male", "Female", "Stroke", "No Stroke",
conf.level = 0.90)
2 by 2 table analysis:
------------------------------------------------------
Outcome : Stroke
Comparing : Male vs. Female
Stroke No Stroke P(Stroke) 90% conf. interval
Male 77 1251 0.0580 0.0483 0.0695
Female 106 1941 0.0518 0.0443 0.0605
90% conf. interval
Relative Risk: 1.1197 0.8813 1.4225
Sample Odds Ratio: 1.1271 0.8748 1.4521
Conditional MLE Odds Ratio: 1.1270 0.8635 1.4677
Probability difference: 0.0062 -0.0068 0.0199
Exact P-value: 0.4377
Asymptotic P-value: 0.4375
------------------------------------------------------
It can be elucidated from the above table that the individual probability of being a male with a positive stroke status was around 0.058 and that the individual probability of being a female with a positive stroke status was around 0.0518, each with confidence intervals from around 0.04 to 0.06.
It can also be seen that the relative risk of having suffered a stroke given being male vs. having suffered a stroke given being male, is estimated to be 1.1197. While the point estimate indicates that there is a greater risk of suffering a stroke for males than females, the 90% confidence interval for this relative risk value is 0.8813 - 1.4225, which is not detectably different from 1 at an alpha level of 0.10. This indicates that there is not evidence to suggest that being male is either a risk factor or a protecting factor for stroke in this data set.
Additionally, an odds ratio describing the odds of having suffered a
stroke given a patient is male vs. female was found to be 1.1271 with a
90% confidence interval of 0.8748 - 1.4521. Although the odds ratio
suggests that males have a greater odds of having suffered a stroke than
females, the confidence interval suggests that the odds ratio is not
detectably different from 1 at an alpha level of 0.10. This indicates
that a larger sample size would likely be required in order to suggest
that the odds of suffering a stroke are higher or lower for males when
compared to females in the Stroke_data
data set.
Finally, the difference in probability of having suffered a stroke
given a patient is male vs. female was estimated to be 0.0062.The 90%
confidence interval for this relative risk value is -0.0068 - 0.0199,
which at an alpha level of 0.10, is not detectably different from zero.
This indicates that there is not enough evidence to suggest that there
is a difference in the probability of having suffered a stroke greater
than, less than, or equal to zero between the two gender
categories in the Stroke_data` data set.
6.4 Conclusions
The research question that helped guide the design of this analysis was “Is there a relationship between the gender of a person (male or female) and whether or not that person has had a stroke among patients who have a stroke evaluation in their electronic health record?”. Although point estimates suggested that being male was associated with a greater risk of suffering a stroke (based on the calculated risk ratio) and greater odds of suffering a stroke (based on the calculated odds ratio), the confidence intervals for all of the estimates included both positive and negative values, demonstrating that at the current sample size, a relationship between the gender of a person (male or female) and whether or not that person has had a stroke among patients who have a stroke evaluation in their electronic health record could not be determined with 90% confidence.
This analysis had a number of limitations associated with it, largely based on the quality of information provided for the stroke status of the patient. In the data analyzed, the only information about the stroke status of the patient was whether or not that patient had suffered a stroke. There is no mention of the type of stroke or care received after stroke which may be affected by the gender of the patient. Future analyses looking to find if there is a relationship between the gender of a person and whether or not that person has had a stroke may take information about the specific type of stroke and care given into account to establish a more realistic view of any relationship or lack thereof.
7 Analysis E: Analyzing a 2x3 table
7.1 Research Question
For this analysis, the focus will be on the hypertension
and work_type
variables, which are each categorical
variables. Particularly, this analysis will be considering whether or
not the type of employment of a patient that has recorded in their EHR
has an impact on whether or not they also have hypertension recorded in
their EHR. In this analysis, a 90% confidence level will be used. The
research question is as follows:
Is there a relationship between the type of employer a person has and whether or not that person also has hypertension among patients who have a stroke evaluation in their electronic health record?
For this analysis, only persons who are employed will be considered.
The reason for this is that there is immense variability among the types
of circumstances that could cause somebody to be unemployed. For this
reason, only three work_type
categories are used
(“Govt_job”, “Private”, and “Self-employed”).
7.2 Describing the Data
In this analysis, two variables are observed: the hypertension status
of a patient, a binary categorical variable titled
hypertension
, and the category of employment of a patient,
a categorical variable with three levels titled work_type
.
In this study, hypertension
represents the hypertensive
status of the patient as indicated by that patient’s EHR. “No
Hypertension” indicates that the patient does not have hypertension
(high blood pressure) and “Hypertension” indicates that the patient has
hypertension (high blood pressure).work_type
represents the
category of work of the patient as indicated by that patient’s EHR.
“Govt_job” indicates that the patient works for the government (public
sector), “Private” indicates that the patient is employed and works for
a private company (private sector), “Self-employed” indicates that the
patient is self employed.
The following code is present for the purpose of organizing a table
displaying the distribution of hypertension status among the three
categories of work_type
.
<- Stroke_data |>
Stroke_data_analysisE as_tabyl() |>
filter(complete.cases(hypertension, work_type))
$work_type <- recode_factor(Stroke_data_analysisE$work_type,
Stroke_data_analysisE"Govt_job" = "Government Job",
"Private" = "Private Sector",
"Self-employed" = "Self-Employed")
<- Stroke_data_analysisE |>
Analysis_E_table tabyl(hypertension, work_type)
|>
Stroke_data_analysisE tabyl(hypertension, work_type) |>
adorn_totals(where = c("row", "col")) |>
adorn_title( col_name = "Employer") |>
kbl( caption = "Hypertension status by Employment Type", digits = 2) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
Employer | ||||
---|---|---|---|---|
hypertension | Government Job | Private Sector | Self-Employed | Total |
No Hypertension | 444 | 1936 | 512 | 2892 |
Hypertension | 61 | 223 | 109 | 393 |
Total | 505 | 2159 | 621 | 3285 |
It can be seen in the above table that greatest proportion of
hypertensive individuals is seen in the “Self-Employed”
work_type
category, with nearly double the proportion of
hypertensive individuals than the “Government job” category and a more
than 50% greater proportion of hypertensive individuals than the
“Private Sector” category.
To further visualize the information in the above table, a bar chart
was constructed, displaying the distribution of work_type
among the two levels of hypertension
.
ggplot(Stroke_data_analysisE, aes(x = hypertension, fill = work_type)) +
geom_bar(position = "fill") +
scale_fill_manual(values=c("chartreuse4", "salmon2", "steelblue1")) +
theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
labs(
title = "Hypertension by Employment Type",
subtitle = "Based on EHR data of patients evaluated for stroke",
y = "Proportion of study population",
x="Hypertension status") +
theme(plot.background = element_rect(fill = "azure2"))
7.3 Main Analysis
It can be seen in the two by three table above that none of the cells contain zero observations and all cells have at least five observations, as expected. For this reason, each of the Cochran conditions can be considered to have been met and a Pearson chi-square test is appropriate for assessing homogeneity .
A chi-square test for the data in the above table will help to elucidate whether or not there is a relationship between the employment category of a person and their hypertensive status by determining whether or not the two variables are independent of one another. The chi-square test of independence for this analysis is a hypothesis test with the following conditions:
Null hypothesis: The employment category of a person and their hypertensive status are independent of one another.
Alternative hypothesis: There is a relationship between the employment category of a person and their hypertensive status.
The following code is present for the purpose of running a chi-square test of independence.
chisq.test(Analysis_E_table)
Pearson's Chi-squared test
data: Analysis_E_table
X-squared = 23.901, df = 2, p-value = 6.457e-06
It can be seen above that the chi-square test statistic is 23.901 on
two degrees of freedom. The chi-square test of independence yielded a
p-value of 6.457e-06. This information suggests that the difference in
hypertension status among the three employment categories of the
work_type
variable is less likely to be due to chance and
more likely to be attributed to a relationship between hypertension
status and employment type. It is important to note that additional
testing must be done to determine the true presence of a relationship,
as a small p-value can only suggest that there is a small probability
that the observed difference is due to chance, not that there is a
relationship between the two variables. Nonetheless, the results of this
chi-square test of independence encourage the null hypothesis to be
rejected.
7.4 Conclusions
The research question that guided the design of this analysis was “Is
there a relationship between the type of employer a person has and
whether or not that person also has hypertension among patients who have
a stroke evaluation in their electronic health record?” The chi-square
test of independence was able to determine that there is a high
probability that there is a relationship between the type of employer a
person has and whether or not that person also has hypertension based on
information from the hypertension
and
work_type
variables in the Stroke_data
data
set. It was not, however, able to determine the true nature of the
relationship between hypertension and work type, only that there is
likely to be a relationship present.
Future studies looking to investigate a possible relationship between the type of employer a person has and whether or not that person also has hypertension may consider utilizing a data set that includes a continuous quantitative variable to determine hypertension, such as the blood pressure in mmHg. The presence of hypertension as a binary categorical variable is a key limitation to this analysis as it prevents a detailed understanding of the relationship of hypertension and employment type from being visualized. A continuous quantitative variable may better be able to suggest the type of relationship that is present between the employment type and hypertension status of a patient and answer the research question in more detail.
Session Information
::session_info() sessioninfo
─ Session info ─────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os macOS Big Sur ... 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2022-12-19
pandoc 2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
─ Packages ─────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.2.0)
bookdown 0.30 2022-11-09 [1] CRAN (R 4.2.0)
broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)
bslib 0.4.1 2022-11-02 [1] CRAN (R 4.2.0)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.0)
car * 3.1-1 2022-10-19 [1] CRAN (R 4.2.0)
carData * 3.0-5 2022-01-06 [1] CRAN (R 4.2.0)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0)
checkmate 2.1.0 2022-04-21 [1] CRAN (R 4.2.0)
cli 3.4.1 2022-09-23 [1] CRAN (R 4.2.0)
cluster 2.1.4 2022-08-22 [1] CRAN (R 4.2.0)
cmprsk 2.2-11 2022-01-06 [1] CRAN (R 4.2.0)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0)
data.table 1.14.6 2022-11-16 [1] CRAN (R 4.2.0)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
dbplyr 2.2.1 2022-06-27 [1] CRAN (R 4.2.0)
deldir 1.0-6 2021-10-23 [1] CRAN (R 4.2.0)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.1)
distributional 0.3.1 2022-09-02 [1] CRAN (R 4.2.0)
dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
Epi * 2.47 2022-06-26 [1] CRAN (R 4.2.0)
equatiomatic * 0.3.1 2022-01-30 [1] CRAN (R 4.2.0)
etm 1.1.1 2020-09-08 [1] CRAN (R 4.2.0)
evaluate 0.18 2022-11-07 [1] CRAN (R 4.2.0)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
forcats * 0.5.2 2022-08-19 [1] CRAN (R 4.2.0)
foreign 0.8-83 2022-09-28 [1] CRAN (R 4.2.0)
Formula 1.2-4 2020-10-16 [1] CRAN (R 4.2.0)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
gargle 1.2.1 2022-09-08 [1] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
GGally * 2.1.2 2021-06-21 [1] CRAN (R 4.2.0)
ggdist * 3.2.0 2022-07-19 [1] CRAN (R 4.2.0)
ggforce * 0.4.1 2022-10-04 [1] CRAN (R 4.2.0)
ggformula * 0.10.2 2022-09-01 [1] CRAN (R 4.2.0)
gghalves * 0.1.4 2022-11-20 [1] CRAN (R 4.2.0)
ggmosaic * 0.3.4 2022-12-10 [1] Github (haleyjeppson/ggmosaic@fb42e7b)
ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)
ggrepel 0.9.2 2022-11-06 [1] CRAN (R 4.2.0)
ggridges 0.5.4 2022-09-26 [1] CRAN (R 4.2.0)
ggstance 0.3.5 2020-12-17 [1] CRAN (R 4.2.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.0)
googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.0)
gower 1.0.0 2022-02-03 [1] CRAN (R 4.2.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.0)
gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0)
haven 2.5.1 2022-08-22 [1] CRAN (R 4.2.0)
highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
Hmisc 4.7-1 2022-08-15 [1] CRAN (R 4.2.0)
hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
htmlTable 2.4.1 2022-07-07 [1] CRAN (R 4.2.0)
htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.0)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.2.0)
httpuv 1.6.6 2022-09-08 [1] CRAN (R 4.2.0)
httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
interp 1.1-3 2022-07-13 [1] CRAN (R 4.2.0)
janitor * 2.1.0 2021-01-05 [1] CRAN (R 4.2.0)
jpeg 0.1-9 2021-07-24 [1] CRAN (R 4.2.0)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.0)
jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.0)
kableExtra * 1.3.4 2021-02-20 [1] CRAN (R 4.2.0)
knitr * 1.41 2022-11-18 [1] CRAN (R 4.2.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0)
labelled 2.10.0 2022-09-14 [1] CRAN (R 4.2.0)
later 1.3.0 2021-08-18 [1] CRAN (R 4.2.0)
lattice * 0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
latticeExtra 0.6-30 2022-07-04 [1] CRAN (R 4.2.0)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.2.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
lubridate 1.9.0 2022-11-06 [1] CRAN (R 4.2.0)
magrittr * 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
MASS 7.3-58.1 2022-08-03 [1] CRAN (R 4.2.0)
Matrix * 1.4-1 2022-03-23 [1] CRAN (R 4.2.1)
mgcv 1.8-41 2022-10-21 [1] CRAN (R 4.2.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
modelr 0.1.10 2022-11-11 [1] CRAN (R 4.2.1)
mosaic * 1.8.4.2 2022-09-20 [1] CRAN (R 4.2.0)
mosaicCore 0.9.2.1 2022-09-22 [1] CRAN (R 4.2.0)
mosaicData * 0.20.3 2022-09-01 [1] CRAN (R 4.2.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
naniar * 0.6.1 2021-05-14 [1] CRAN (R 4.2.0)
nlme 3.1-160 2022-10-10 [1] CRAN (R 4.2.0)
nnet 7.3-18 2022-09-28 [1] CRAN (R 4.2.0)
numDeriv 2016.8-1.1 2019-06-06 [1] CRAN (R 4.2.0)
patchwork * 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
plotly 4.10.1 2022-11-07 [1] CRAN (R 4.2.0)
plyr 1.8.8 2022-11-11 [1] CRAN (R 4.2.1)
png 0.1-7 2013-12-03 [1] CRAN (R 4.2.0)
polyclip 1.10-4 2022-10-20 [1] CRAN (R 4.2.0)
productplots 0.1.1 2016-07-02 [1] CRAN (R 4.2.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
purrr * 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.0)
readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.0)
readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.0)
reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0)
reshape 0.8.9 2022-04-12 [1] CRAN (R 4.2.0)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)
rmarkdown * 2.18 2022-11-09 [1] CRAN (R 4.2.0)
rmdformats * 1.0.4 2022-05-17 [1] CRAN (R 4.2.0)
rpart 4.1.19 2022-10-21 [1] CRAN (R 4.2.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
rvest 1.0.3 2022-08-19 [1] CRAN (R 4.2.0)
sass 0.4.2 2022-07-16 [1] CRAN (R 4.2.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
shiny 1.7.3 2022-10-25 [1] CRAN (R 4.2.0)
simputation * 0.2.8 2022-06-16 [1] CRAN (R 4.2.0)
snakecase 0.11.0 2019-05-25 [1] CRAN (R 4.2.0)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
survival 3.4-0 2022-08-09 [1] CRAN (R 4.2.0)
svglite 2.1.0 2022-02-03 [1] CRAN (R 4.2.0)
systemfonts 1.0.4 2022-02-11 [1] CRAN (R 4.2.0)
tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
tidyr * 1.2.1 2022-09-08 [1] CRAN (R 4.2.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.0)
timechange 0.1.1 2022-11-04 [1] CRAN (R 4.2.0)
tweenr 2.0.2 2022-09-06 [1] CRAN (R 4.2.0)
tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
vctrs 0.5.1 2022-11-16 [1] CRAN (R 4.2.0)
viridisLite 0.4.1 2022-08-22 [1] CRAN (R 4.2.0)
visdat 0.5.3 2019-02-15 [1] CRAN (R 4.2.0)
webshot 0.5.4 2022-09-26 [1] CRAN (R 4.2.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
xfun 0.35 2022-11-16 [1] CRAN (R 4.2.0)
xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.0)
yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.0)
zoo 1.8-11 2022-09-17 [1] CRAN (R 4.2.0)
[1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
────────────────────────────────────────────────────────────────────