data analysis. The student teams were supported by a group of
mentors—a blend of data scientists, clinicians, health
professionals, and government health informatics specialists
from a local health district in Sydney, Australia [49]. The data
scientists and the panel of authors of the Health Gym project
(ie, Kuo et al [21]) elaborated on the technical aspects and
navigated the participants through the intricacies of data
analysis, including the assumptions we made to use the data
(eg, time 0 corresponded to the date of ART initiation, the
laboratory tests occurred before modifications in therapy).
Meanwhile, clinicians and health professionals provided their
expertise to guide students toward meaningful research questions
(eg, discussing VL and CD4 count monitoring, drug-drug
interactions, and metabolic toxicity [50]). Government health
informaticians, experienced in electronic medical records and
real-world population health application and impact, evaluated
the usefulness of the students’ findings.
This collaborative effort facilitated a comprehensive learning
experience, encompassing the development of analytical models,
data visualization, and effective communication of research
outcomes. Using our synthetic data sets, participants gained
valuable insights into working with data sets that emulate
real-world health scenarios, thereby providing a bridge between
theoretical academia and practical execution.
We summarize the findings of the 2 participating teams below.
Detailed reports for Team 1 and Team 2 can be found in Section
D and Section E of Multimedia Appendix 1, respectively. In
addition, the associated code for both teams can be found in
Section A of Multimedia Appendix 1.
Findings of Team 1
Team 1 investigated the effectiveness of medications,
categorized by antiretroviral class, in achieving HIV
suppression. Using survival analysis, they assessed the time
from the initiation of ART to the first occurrence of viral
suppression, defined as VL below 1000 copies/mL [36]. They
also assessed the time to CD4 cell count exceeding 500
cells/mm³ [51], which indicates a healthy immunological status.
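The time-to-event setup described above can be illustrated with a minimal sketch. It is not Team 1's code: the function names, the monthly list layout, and the toy viral loads are assumptions made for illustration, and a plain Kaplan-Meier estimate stands in for a full survival analysis. Patients who never reach the threshold are treated as right-censored at their last observed month.

```python
from collections import Counter

VL_THRESHOLD = 1000.0  # copies/mL, the suppression cut-off quoted in the text


def time_to_first_suppression(vl_by_month, threshold=VL_THRESHOLD):
    """Return (duration, event_observed) for one patient.

    vl_by_month lists viral loads by month since ART initiation (month 0 first).
    If VL never drops below the threshold, the patient is right-censored
    at the last observed month.
    """
    for month, vl in enumerate(vl_by_month):
        if vl < threshold:
            return month, True
    return len(vl_by_month) - 1, False


def kaplan_meier(durations, events):
    """Kaplan-Meier estimate of the probability of not yet being suppressed.

    Returns a list of (time, survival_probability), one entry per event time.
    """
    event_counts = Counter(d for d, e in zip(durations, events) if e)
    curve, survival = [], 1.0
    for t in sorted(event_counts):
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - event_counts[t] / at_risk
        curve.append((t, survival))
    return curve


# Toy, made-up trajectories (hypothetical; not drawn from the synthetic data set).
patients = [
    [50000, 8000, 900, 400],   # first below threshold at month 2
    [20000, 700],              # first below threshold at month 1
    [90000, 30000, 12000],     # never suppressed, censored at month 2
    [15000, 4000, 2500, 800],  # first below threshold at month 3
]
pairs = [time_to_first_suppression(p) for p in patients]
durations = [d for d, _ in pairs]
events = [e for _, e in pairs]
km_curve = kaplan_meier(durations, events)
```

The same `(duration, event_observed)` pairs could be built for the CD4 endpoint by swapping the comparison to a count exceeding 500 cells/mm³.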
With Cox proportional hazards models [52] featuring
time-varying covariates, the team identified particular
antiretroviral agents associated with viral suppression. These
findings were purely associative due to data set limitations,
which did not account for factors such as age, socioeconomic
status, comorbidities, and concurrent medications for other
illnesses.
Findings of Team 2
Team 2 focused on predicting the necessity of altering an
individual’s ART regimen over a 5-year time span, factoring
in disease flare-ups, resistance, or side effects. They formulated
a “sliding search” function that generated individual records for
each 12-month period, with predictions for antiretroviral
modification and adherence to therapy in the subsequent year
by using neural networks. The team’s methodology produced
promising results, with an accuracy rate of 78% in predicting
antiretroviral modification and 93% in predicting adherence to
therapy. The algorithm detected trends in CD4 and VL results
across the 12-month periods, which appeared to be the key
predictive features. In addition, the team suggested that there
could be potential benefits from exploring recurrent neural
networks (eg, long short-term memory [53]).
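The windowing idea can be sketched as follows. This is a hypothetical illustration rather than Team 2's "sliding search" function: the name `sliding_windows`, the record keys, the definition of modification (any regimen in the following window that differs from the regimen at the end of the current window), and the toy inputs are all assumptions. Adherence labels are omitted because their encoding is not described in this section.

```python
def sliding_windows(regimens, labs, window=12, stride=12):
    """Build one record per `window`-month period for a single patient.

    regimens: regimen identifier per month since ART initiation.
    labs: (vl, cd4) tuple per month, same length as `regimens`.
    Each record holds the VL and CD4 series of the window and a label that
    is True when the regimen changes during the following `window` months.
    """
    if len(regimens) != len(labs):
        raise ValueError("regimens and labs must have the same length")
    records = []
    # The last usable window must still be followed by a full label period.
    last_start = len(regimens) - 2 * window
    for start in range(0, last_start + 1, stride):
        end = start + window
        baseline = regimens[end - 1]            # regimen at the end of the window
        following = regimens[end:end + window]  # the subsequent year
        records.append({
            "start_month": start,
            "vl": [vl for vl, _ in labs[start:end]],
            "cd4": [cd4 for _, cd4 in labs[start:end]],
            "modified_next_year": any(r != baseline for r in following),
        })
    return records


# Toy 5-year (60-month) patient: one regimen switch at month 30 (made-up values).
regimens = ["A"] * 30 + ["B"] * 30
labs = [(1000 - i, 300 + i) for i in range(60)]
records = sliding_windows(regimens, labs)
labels = [r["modified_next_year"] for r in records]
```

Each record's `vl` and `cd4` lists could then be flattened into a feature vector for a feedforward network, or kept as sequences for a recurrent one.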
Serving as UNSW Coursework Materials
Beyond their utility in the Datathon, our synthetic data sets
contribute to UNSW courses in the Master of Science in Health
Data Science Program [54], namely, HDAT9800 Visualization
& Communication and HDAT9510 Machine Learning II.
HDAT9800 teaches future health data scientists the skills to
visually communicate complex data effectively to diverse
audiences. The course emphasizes the significance of clear data
visualization and advocates for transparency and reproducibility
in scientific work. It employs R [55] and Python [56] to
demonstrate best practices in data analysis and visualization.
Our synthetic data sets provide rich resources to enhance the
learning in this setting. For instance, Marchesi et al [57] used
our data sets to present patient states via t-distributed stochastic
neighbor embedding visualization techniques [58].
Meanwhile, HDAT9510 explores advanced modern ML
algorithms and methods such as convolutional neural networks
[59], autoencoders [60], and reinforcement learning (RL) [61].
As the synthetic data sets consist of time-series variables,
students can develop both feedforward and recurrent neural
networks. Marchesi et al [57] provide example models built using our
data set with recurrent neural networks; decision trees [62] and
hidden Markov models [63] are further options, as Wu et al [64]
suggested for a similar data set. Furthermore, with
the presence of nonnumeric variables, students can learn about
embedding [65], which transforms nonnumeric levels into
real-valued vectors so that levels with similar meaning lie closer
together in the vector space. The presence of
missing data in the synthetic data sets also encourages students
to formulate plausible assumptions about the structure of the
clinical data set prior to data modeling.
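As a minimal sketch of the embedding step mentioned above, the following lookup table maps each nonnumeric level to a real-valued vector. The class name, dimension, and random initialization are assumptions for illustration; in a course setting the vectors would typically be learned jointly with a neural network, which is what moves related levels closer together.

```python
import math
import random


class EmbeddingTable:
    """Lookup table mapping each categorical level to a real-valued vector."""

    def __init__(self, levels, dim, seed=0):
        rng = random.Random(seed)
        # Assign a stable integer index to every distinct level.
        self.index = {level: i for i, level in enumerate(sorted(set(levels)))}
        # Small random vectors; training would update these values.
        self.vectors = [
            [rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in self.index
        ]

    def lookup(self, level):
        return self.vectors[self.index[level]]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy column of base drug combos (levels named in the text; values made up).
levels = ["FTC + TDF", "3TC + ABC", "FTC + TDF"]
table = EmbeddingTable(levels, dim=4)
vec = table.lookup("FTC + TDF")
```

After training, `cosine_similarity` between two looked-up vectors gives a simple way to inspect which levels the model treats as similar.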
We provide 3 adaptable worked examples using our ART for
HIV data set, suitable for workshops and lectures. The associated
code for the worked examples can be found in Section A of
Multimedia Appendix 1. Our synthetic data set supports a
variety of student engagements, from understanding complex
data structures to developing advanced RL algorithms for
optimizing clinical interventions. Moreover, the low patient
disclosure risk associated with our data sets (refer to Section B
in Multimedia Appendix 1) eliminates the need for ethics
approval [66]. This makes these data sets ideal for a range of
settings—from small seminars to larger lecture groups.
Worked Example 1
The first exercise, focused on data visualization using Python,
compares VL trends over time among patients who commenced
their ART with different base drug combos, against the general
trend in all patients. The results of our worked example are
depicted in Figure 4.
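The aggregation behind such a comparison can be sketched as below. This is not the code from Multimedia Appendix 1: the record keys (`patient`, `month`, `base_combo`, `vl`), the helper names, and the toy values are hypothetical. Patients are grouped by the base drug combo recorded at month 0, so a later switch does not move them to another group.

```python
from collections import defaultdict
from statistics import mean


def starting_combo(records):
    """Map each patient to the base drug combo recorded at month 0."""
    return {r["patient"]: r["base_combo"] for r in records if r["month"] == 0}


def mean_vl_by_month(records, patients=None):
    """Mean VL per month, over all patients or over a chosen subset."""
    by_month = defaultdict(list)
    for r in records:
        if patients is None or r["patient"] in patients:
            by_month[r["month"]].append(r["vl"])
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Toy long-format records (made up); patient 3 switches combo at month 1.
records = [
    {"patient": 1, "month": 0, "base_combo": "FTC + TDF", "vl": 10000},
    {"patient": 1, "month": 1, "base_combo": "FTC + TDF", "vl": 1000},
    {"patient": 2, "month": 0, "base_combo": "FTC + TDF", "vl": 20000},
    {"patient": 2, "month": 1, "base_combo": "FTC + TDF", "vl": 3000},
    {"patient": 3, "month": 0, "base_combo": "3TC + ABC", "vl": 30000},
    {"patient": 3, "month": 1, "base_combo": "FTC + TDF", "vl": 5000},
]

start = starting_combo(records)
overall = mean_vl_by_month(records)  # general trend in all patients
trends = {
    combo: mean_vl_by_month(
        records, {p for p, c in start.items() if c == combo}
    )
    for combo in ("FTC + TDF", "3TC + ABC")
}
```

The `overall` and `trends` dictionaries map month to mean VL and can be passed to any plotting library to draw one line per group against the general trend.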
This multifaceted exercise requires students to create sub–data
sets based on specific starting base drug combos (ie, FTC +
TDF [emtricitabine + tenofovir disoproxil fumarate] and 3TC
+ ABC [lamivudine + abacavir]), extract data for defined
JMIR Med Educ 2024 | vol. 10 | e51388 | p. 6
https://mededu.jmir.org/2024/1/e51388
JMIR MEDICAL EDUCATION, Kuo et al