The Big Data minefield as AI shapes the future of health care

As Artificial Intelligence technologies expand at an unprecedented rate, charting the unexplored frontiers of health-care AI has never been more urgent. In this three-part series, we explore the nascent legal landscape of health-care AI, appraise the value of patient data and question the appropriate use of AI. Read the first article, Legally Blind, here.

Why pearl-clutch over artificial intelligence (AI) in health care when we greeted the Internet with open arms, gleefully eschewing paper charting for dictation?

Patient advocate Ron Beleno knows the answer: “Data is your story based on moments in time.”

Just as every step you take sends a data point into a smartwatch’s fitness tracker, every clinical encounter generates phonebooks full of information. These datasets form the building blocks in training AI algorithms. The more complex the AI model, the more data is needed – to be devoured, processed and generated again.

The lives of countless patients and billions of dollars lie at stake. We are entering the age of Big Data, where every byte is worth its weight in gold.

“The majority [of patients] don’t know the value of their data,” Beleno states. In his 10 years advocating for patient rights in the health-care technology sector while juggling Alzheimer’s caregiving duties for his father, he sees health-care AI as a net benefit for patients and their families … with some caveats. “The minority who do [see issues] are usually concerned about privacy.”

Mohamed Alarakhia, Chief Executive Officer of the Centre of eHealth Excellence and practising family physician, says: “We have the regular framework in the legislature about patient privacy … the challenge is with AI, it is a new frontier in terms of what can be done with the data, and how these systems learn from the data.”

Currently, many AI models retrospectively scrape large datasets from existing databases, with and without permission, to varying consequences. Recently, the New York Times sued OpenAI and its backer, Microsoft, for allegedly training the large language model (LLM) behind ChatGPT on millions of copyrighted articles without permission. In health care, even the Mayo Clinic has struck licensing deals with 16 AI companies for access to de-identified patient data without notification or consent. Others, like Memorial Sloan Kettering Cancer Center, came under fire for conflict of interest after granting 25 million patient pathology slides to Paige.AI even though board members of the cancer centre held equity stakes in the AI company.

In Canada, the Consumer Privacy Protection Act (Bill C-11) died unceremoniously in Parliament, leaving no federal privacy protections for de-identified data. Such data is no longer considered personal information, rendering it exempt from privacy provisions such as the Freedom of Information and Protection of Privacy Act. As a result, big data transactions in the health-care sphere remain unrestrained, with datasets free to be brokered commercially without patient consent.

However, advances in AI call for a re-examination of the increasingly blurry boundary between personal health information and anonymized aggregate data.

The most obvious risk is a data leak in the event of a cybersecurity incident. A more insidious risk, one that grows as AI engines become more powerful, is the possibility of re-identifying individual patients from previously de-identified aggregate data. Multiple studies have shown that data triangulation can successfully re-identify individuals; one 2018 study used machine learning algorithms to re-identify up to 85 per cent of participants out of a pool of 14,451 people.
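The triangulation behind such studies can be sketched in miniature. The example below is purely illustrative – all names, fields and values are synthetic assumptions, not drawn from any real dataset – but it shows how a "de-identified" medical record can be matched to a named public record when both share quasi-identifiers such as postal code, birth year and sex:

```python
# Illustrative linkage attack on synthetic data: names re-attached to
# "de-identified" medical records via shared quasi-identifiers.

deidentified_records = [
    {"zip": "90210", "birth_year": 1958, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "10001", "birth_year": 1990, "sex": "M", "diagnosis": "asthma"},
]

# A public dataset (e.g., a voter roll) containing the same quasi-identifiers.
public_records = [
    {"name": "Jane Doe", "zip": "90210", "birth_year": 1958, "sex": "F"},
    {"name": "John Roe", "zip": "10001", "birth_year": 1990, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def reidentify(medical, public):
    """Match records whose quasi-identifiers agree, re-attaching names."""
    matches = []
    for m in medical:
        key = tuple(m[q] for q in QUASI_IDENTIFIERS)
        for p in public:
            if tuple(p[q] for q in QUASI_IDENTIFIERS) == key:
                matches.append({"name": p["name"], "diagnosis": m["diagnosis"]})
    return matches

print(reidentify(deidentified_records, public_records))
```

Real attacks are statistical rather than exact matches, but the principle is the same: the more auxiliary datasets exist, the fewer people share any given combination of quasi-identifiers.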

Although re-identifying such data is technically illegal, the mere feasibility of doing so erodes the distinction on which current protections rest. Consent becomes the last line of defence against the potentially disastrous consequences of uncontrolled distribution of re-identified data. In the U.S., the insurance companies Humana, Cigna and UnitedHealthcare are already facing class-action lawsuits over their use of AI to prematurely deny claims. Without regulatory protections, we may see deleterious consequences such as insurance premiums based on medical history, workplace discrimination and even predictive behavioural analytics with applications in everything from marketing to forensics.

“It takes one bad story to easily turn people away,” Beleno warns.

Despite these risks, courts appear to favour AI companies over user protections – at least for now. A court in the United Kingdom threw out a lawsuit against Google's DeepMind over alleged misuse of, and inadequate privacy protections for, the medical records of 1.6 million patients transferred without permission by the UK National Health Service (NHS). A similar lawsuit against the University of Chicago Medical Center and Google was thrown out by a federal judge. The common thread in these tossed cases is the failure to demonstrate tangible evidence of harm caused by data sharing, a difficult feat in such a nascent field.

For now, in the absence of clear legal protections and precedents, physicians are exploring grassroots workarounds.

Alarakhia runs pilot programs for Ontario physicians to trial and review AI scribe programs. He notes that while AI vendors take diverse approaches to data privacy, the most cautious vendors – favoured by some physicians – currently do not collect patient-interaction data for training purposes, treating the data generated from each clinical encounter as a dead end. Although this practice sidesteps the thorny questions around data privacy for now, vendors with dead-end data limitations will inevitably lose out in the data-hungry AI arms race.

Another physician, Jaron Chong, is speaking up as a subject-matter expert for various national AI advisory groups. One solution he proposes is adapting the tissue-donation model of consent to AI, in which consent is pre-emptively and explicitly obtained for specific uses, with possible commercial applications disclosed beforehand. Whether this modest proposal catches on, time will tell.

However, even if we optimize and map out the regulations around health data collection, how do we ensure data contributions will be fairly compensated and used for patient benefit?

Outside of health care, archival content has never fetched higher prices. OpenAI recently reached a deal with the Associated Press for access to decades of press archives to train its large language models (LLMs), while Apple paid between $25 million and $50 million to Shutterstock for access to its visual treasure trove. Reddit is projected to make more than $203 million from Google for the sale of more than 17 billion user posts and comments for AI training, drawing scrutiny from American regulators over whether it has the right to commercially license user-generated content without giving the creators a cut.

Cybersecurity incidents provide additional insight into the value of health-care data. Between 2022 and 2023, American health-care data breaches were the costliest of any sector, at an average of USD $10.93 million lost per breach. That figure was dwarfed last November when a $480 million class-action lawsuit was launched against a group of southwestern Ontario hospitals after 270,000 patients had their data sold by hackers on the dark web, with legal proceedings currently underway.

Another, less catastrophic way to quantify data value is the economic benefit its applications generate in saved health-care costs. For Canada, the economic benefit is unquestionable; McKinsey & Co. estimated a net savings opportunity of $14 billion to $26 billion per year with broad application of AI at scale in the health-care sector.

Regardless of valuation approaches, this gold rush raises the awkward question: Who owns health data?

“Data belongs to patients,” opines Rosemarie Lall, a family physician and early adopter of AI scribe technology in Ontario. Given the countless hours of administrative labour – often unpaid and on overtime – that physicians put into progress notes later used to train LLMs, and the fact that patient data is often collected literally from blood, sweat and tears, one wonders about the price owed for our digital pound of flesh.

Instinctively, one might jump to paying people for their health data whenever it is used, in a model similar to royalties. However, several ethical considerations arise. Socioeconomically disadvantaged groups would be disproportionately targeted by companies, trading personal privacy for discounts or benefits. Individual – often monetary – incentives may introduce selection bias and encourage behavioural modification to fit desirable datasets’ eligibility criteria, confounding data accuracy. A society-wide expectation of payment for data also disproportionately advantages wealthier commercial organizations while pricing out those with tighter purse strings – smaller startups, academic institutions and public hospitals – whose research ironically may align more closely with the public good.

Instead of individual payments, collective restitution may prove more equitable in redistributing the value created from public data. Quid pro quo solutions such as “free data for free service,” or adjusting corporate taxation based on the quantity of patient data collected and the social good of the AI application can incentivize socially responsible practices.

In this jungle of data giants, unfinished regulations, legal minefields, and bleeding-edge algorithms, the path forward demands a concerted effort from physicians to take ownership of their unique role bridging patients and the health-care system.

“Physicians should be the guardians of patient information, not large corporations or third-party companies that will profit from our patients’ data,” Lall states firmly. Such guardianship comes in many forms. Most visibly, lobbying and advocacy by physician groups are essential to ensure health-care ethics, patient protections and alignment with patient care.

Furthermore, physicians play a key role in collecting the health data feeding the algorithms. “The key bottleneck [in scaling AI models] is the information you provide,” Chong says. “Whoever has access to the data is who will be powerful.”

This may entail carving out new roles in the health-care ecosystem, such as patient data advocates, akin to current Power of Attorney models, and further exploring the nuances of data ownership.

Not only are physicians at a critical juncture to influence access; quality control from a professionally trained eye is not to be underestimated. A common saying in computer science – “garbage in, garbage out” – holds that poor-quality data inputs yield poor-quality outputs. Currently, there is considerable heterogeneity in the skill levels of data annotators employed by AI vendors, ranging from hires with graduate degrees to outsourced workers in lower-income countries trained only in a narrow subset of data screening. For high-impact, high-risk industries like health care, high-quality outputs are critical, and physicians play an integral role in auditing errors and applying clinical and research expertise to ensure internally and externally valid data inputs.
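The kind of audit a clinically trained reviewer might codify can be sketched as a simple automated check over annotated records. This is a minimal illustration under assumed conventions – the field names, label vocabulary and plausibility ranges below are invented for the example, not taken from any real annotation pipeline:

```python
# Minimal sketch of a data-quality audit for annotated clinical records:
# flag labels outside an agreed vocabulary and physiologically implausible
# values before they reach a training set.

ALLOWED_LABELS = {"normal", "abnormal", "indeterminate"}

def audit_record(record):
    """Return a list of quality problems found in one annotated record."""
    problems = []
    if record.get("label") not in ALLOWED_LABELS:
        problems.append(f"unknown label: {record.get('label')!r}")
    hr = record.get("heart_rate")
    if hr is None or not (20 <= hr <= 300):  # broad plausibility bounds (bpm)
        problems.append(f"implausible heart rate: {hr!r}")
    return problems

records = [
    {"label": "normal", "heart_rate": 72},
    {"label": "nrml", "heart_rate": 9000},  # annotator typo + unit error
]

for i, record in enumerate(records):
    for problem in audit_record(record):
        print(f"record {i}: {problem}")
```

Automated checks like this catch only the obvious garbage; judging whether a plausible-looking label is clinically correct still requires the trained eye the passage describes.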

Ongoing clinical feedback from everyday use of AI models forms the basis for AI’s self-learning and continuous improvement. Physicians will have to realize the agency – and the responsibility – they hold in interacting with this feedback loop.

“If something goes wrong, call it out. If something goes right, publicize its success,” Chong says. Such actions ensure algorithms move in a direction aligned with physician needs, patient goals and evidence-based medicine.

Data is the lifeblood of all AI. As a result, Chong urges physicians to realize their agency: “To leave your voice out of the equation is a major disservice to the ecosystem.”

As the stewards of patient data in this brave new world, physicians’ choices now – whether intentional or not – will shape the future of health care, one byte at a time.




Angela Dong


Angela (Hong Tian) Dong is an Internal Medicine resident at the University of Toronto. She sits on the CMA Ethics Committee, PARO Leadership Program, and has completed a diploma in Global Health Education Initiative (GHEI) at the University of Toronto. Angela has a passion for bridging medicine with policy and innovation. She has led multiple health advocacy Days of Action with the CFMS, founded the MP-MD Apprenticeship to teach medical students hands-on health policy, and is an active member in the healthcare-AI and the synthetic biology communities.

X: @AngelaHDong and Medium: @angela.h.dong
