AI and Mental Health Care: Issues, Challenges, and Opportunities

QUESTION 2: How can we ensure and monitor the safety of AI in mental health care?


Background

LLM chatbots and other digital tools are entering mental health settings faster than the formal oversight and regulatory systems meant to govern them are being developed. Given the vulnerability of many patients and the sensitivity of mental health data, these tools demand close scrutiny, not only for what they do well but for where they may fall short.

While these tools show promise, research has also flagged safety concerns, especially for at-risk populations. A recent study revealed that, in many user-chatbot interactions outside a clinical context, LLM companions failed to recognize or appropriately address indicators of mental health crises.31 Another analysis reported that generative chatbots used off-label as therapists frequently missed signs of suicidal ideation and occasionally delivered insensitive or harmful responses, potentially exacerbating user distress through inaccurate or culturally inappropriate advice.32 Additionally, a systematic review found significant gaps in safety evaluations, noting that only two out of numerous trials explicitly monitored adverse events, though more recent studies have started to address this oversight.33 Still, many trials continue to rely on self-reported user satisfaction or symptom improvement without systematically tracking potential harms.

Many wellness-oriented chatbots operate within regulatory gray areas, exempt from formal health agency oversight and governed by ambiguous responsibilities regarding their duty of care.34 While the FDA classifies certain AI mental health apps as medical devices, general-purpose or wellness-focused tools frequently avoid such designation, even when applied in contexts closely resembling clinical care. As of 2024, no uniform federal guidelines delineate when such tools cross the threshold into clinical decision support. Although professional bodies such as the American Psychological Association have developed guidelines addressing AI-driven mental health interventions, the majority of these technologies operate independently of such guidance.35

By comparison, regulatory systems in the United Kingdom and European Union take a more prescriptive approach that establishes clear accountability for developers and imposes structured oversight throughout the product lifecycle. In the United Kingdom, mental health tools must comply with the DCB-0129 clinical safety standard. The European Union’s AI Act goes further, classifying mental health applications as high-risk technologies subject to premarket review and postmarket monitoring.36

In the United States, the January 2025 Executive Order “Removing Barriers to American Leadership in Artificial Intelligence” signals a departure from this model. Framed as a means of driving global competitiveness, the order encourages deregulation of medical AI, including fast-track pathways and reduced scrutiny for tools categorized as “general wellness.” Reactions have been mixed. Many experts have pointed to the risk of underregulating complex, adaptive systems that have the potential to introduce clinical error or embed bias.37 Deprioritizing ethics and equity in regulatory design may also widen disparities in care quality and access. More broadly, the lack of clear standards may not accelerate innovation as intended but instead deter the development of rigorously validated tools, particularly those aimed at high-risk populations.

Responses

A photo of Arthur Kleinman, a person with light skin, gray hair, and a gray beard and mustache, wearing a brown jacket and blue shirt, and smiling at the viewer.

Arthur Kleinman
 

AI must be treated like any health care intervention. New AI-driven procedures should be evaluated by the U.S. Food and Drug Administration (FDA) and against professional association standards. These should include assessment of untoward effects, such as suicide, provoked psychosis, and the worsening of symptoms of depression, anxiety, and posttraumatic stress disorder. Tools currently used to ensure and monitor psychiatric and psychological interventions should be applied. That application needs to include input from human mental health experts so that we don’t simply end up with a potentially dangerous cycle of AI interventions evaluating themselves against AI, rather than clinical, standards.

The same regulatory bodies that assess pharmacological and psychotherapeutic interventions need to be active here, including appropriate federal, state, and local regulatory agencies (e.g., FDA) and professional organizations (e.g., both the American Psychological Association and the American Psychiatric Association). The measures of safety used by these organizations for the monitoring of pharmacological and psychotherapeutic outcomes should also be applied to AI-driven interventions.

Maintaining transparency and accountability within the strictures of confidentiality and privacy already adopted by national, regional, and local organizations will be critical. As is apparent in the enormous attention given to AI in the media, privacy is going to loom very large as a concern. Non-purpose-built AI tools such as ChatGPT cannot alone monitor safety and outcome. Human beings who are experts in the mental health field will have to oversee the use and evaluation of AI interventions. Informed consent and user autonomy are essential components of ethical care throughout our health care system. Monitoring of AI interventions should routinely assess both of these principles of ethical practice. The ethical and legal requirements for AI-driven mental health interventions should be no different from those that do not employ AI.

Besides informed consent and autonomy, other well-established ethical principles such as nonmaleficence, beneficence, acknowledgment and affirmation of the person, respectfulness, and empathic responses also are crucial. Emmanuel Levinas writes that face-to-face ethics should include acknowledgment, affirmation, and empathy as core to the human experience of interacting with others. Should these not also be core to the AI-driven experience of patients? How these qualities of ethical care will be provided by AI-driven interventions is, however, an entirely open question at present.

AI-driven interventions will need to be created and implemented in a way that provides sufficient feedback for evaluation. To cope with the vast number of legal implications expected to arise once a regulatory framework has been developed, a registry of interventions should systematically record legal problems. Because ethical and legal best practices for AI must be developed, input will be needed from legal and ethics experts on this topic.

The story of technological interventions in health care generally has been that their use and development have been appropriated by the political economy of health care so as to increase profits and thereby benefit those who have more financial resources while penalizing those who are poorer. I am skeptical that AI-driven interventions can alter a fundamental political-economic reality of our society.

In the same way, health bureaucracy emphasizes efficiency over care as its primary value. How will AI-driven interventions in mental health institutions avoid the trap of inadvertently contributing to greater efficiency while reducing quality of care? A cautionary example is the Electronic Medical Record (EMR), a technology that, Atul Gawande shows, held great promise for improving care but has actually contributed only to improving billing.38

The mental health field has been filled with examples of corruption and other illegal forms of behavior aimed at taking advantage of users. This, by the way, is no different from what goes on in the rest of health care. Here, standards need to be enforced through real human monitoring and evaluation, with particular attention to detecting abuse and misuse. We live in a society where scamming is so common that we have to assume it will be as problematic for AI-driven interventions as for all other digital interventions. Therefore, we need to put in place safeguards that anticipate unintended consequences and abusive behavior. I am most concerned in this regard with the use of chatbots that are not routinely monitored by human experts and might well be employed as scams or in other criminal ways. This has been the case with almost every technological intervention in society, and AI will need to prove itself by how it prevents such illegality.39

 

A photo of Daniel Barron, a person with light skin and short brown hair, wearing a dark business suit and white shirt and smiling at the viewer.

Daniel Barron
 

Safety is always defined in relation to a specific task and the parameters for that task’s success and failure (see Table 1). Ensuring that AI in mental health care is safe, and stays safe, depends on first clearly defining its specific clinical job and then thinking through the risks of failure or error in that particular job. An AI failing at medication reconciliation carries different risks from one confusing CST for PST in appointment scheduling. AI safety is not absolute; it depends entirely on the context of the task. In some cases, the risks of AI will outweigh the benefits. In other cases, the benefits trump the risks.

Attempting to regulate AI as a monolith is nonsensical. You might as well try to regulate “therapeutics” or “drugs” as single categories. Instead, regulators should evaluate AI tools based on their defined task and the risk profile associated with the specific job they were designed and tested for. Evaluation frameworks, like clinical trials or real-world data analysis, must be chosen based on the nature of the AI’s task and its associated risks. Some AI applications (e.g., AI as an appointment-scheduling assistant) may not require any formal regulation; real-world analysis would seem sufficient (e.g., can the proposed AI actually help patients schedule appointments?).

Safety metrics must be defined according to the potential harms arising if the AI fails at its specific job. For example, Shifat Islam and colleagues note that AI-based triage systems (a specific job) can outperform traditional methods if implemented correctly, but the adoption of such systems hinges on addressing transparency and ethical considerations for that task.40 Transparency is key: the AI’s operational domain (the task it performs), the data it uses for that job, its limitations, and potential consequences of task failure must be clearly communicated. Anastasiya Kiseleva and colleagues propose a multilayered system of accountabilities for AI’s job performance, emphasizing transparency for safety and informed choices.41 The use of non-purpose-built AI for clinical tasks it was not designed for carries unmitigated safety risks, underscoring the importance of this task-focused approach.

 

A photo of Alison Darcy, a person with light skin and long gray-brown hair, wearing a green top and facing the viewer.

Alison Darcy
 

The answer to this question depends on the intended use of the tool that deploys AI, the setting, the population, and the extent of the data supporting the tool’s deployment in that context.

While the area of therapeutic chatbots remains unhelpfully gray, we took an unusually risk-mitigating posture in developing and deploying our conversational agent to support individuals’ mental health: embedding safety into product design and development and then ensuring ongoing safety governance.

Safety by design

Safety and ethical design must be embedded into every aspect of product design and development. But while various laws and certifications govern data protection (e.g., HIPAA and HITRUST) and manufacturing standards (e.g., ISO 13485:2016) in health care settings, it is not always obvious which design practices will best ensure safety when a technology is new. We found that a first-principles approach to product design was beneficial. When approaching privacy, for example, we were inspired by Europe’s General Data Protection Regulation, which states that individuals should own their data and have the ability to delete it. We thus designed Woebot so people could simply ask it to delete their data while conversing with it, rather than having to seek out complicated settings or pursue lengthy exchanges with customer support. We also firmly believe that consent for any use of an individual’s data should be sought only after offering a plain-language description of each and every use.
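
As a minimal sketch of this design choice, the Python fragment below shows how a deletion request made in conversation might be honored immediately. The trigger phrases and in-memory store are hypothetical simplifications for illustration, not Woebot’s actual implementation.

```python
# Minimal sketch (hypothetical names): honoring a data-deletion request made
# in conversation, rather than through buried settings menus.
DELETE_PHRASES = ("delete my data", "erase my data", "forget everything about me")

def handle_message(user_id: str, message: str, store: dict) -> str:
    """Route an incoming chat message; deletion requests are honored immediately."""
    if any(phrase in message.lower() for phrase in DELETE_PHRASES):
        store.pop(user_id, None)  # drop everything held for this user
        return "Done. I've deleted the data I was storing about you."
    store.setdefault(user_id, []).append(message)  # otherwise, keep the conversation going
    return "Thanks for telling me. How are you feeling about that?"

# Example
conversations: dict[str, list[str]] = {}
handle_message("user-1", "I had a rough day.", conversations)
print(handle_message("user-1", "Please delete my data.", conversations))
print(conversations)  # {} -- the user's history is gone
```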

Another hugely important safety principle is ensuring that individuals understand the nature of the service, its intended use, and its limitations—otherwise known as informed consent. In mental health, especially for new kinds of services, eliminating misunderstandings, however small, about the nature of the service is a central safety issue. In 2017, when we launched Woebot, chatbots often tried to pass as human beings. Woebot thus proudly declares itself to be a software “robot” not only during its first conversation with a user but many times after. In fact, we rather belabored the point using the mechanism of fictional character development: Woebot had a cartoon robot appearance, robot friends, and robot bad habits (drinking too much hot oil when stressed). Lest it be mistaken for an entity that can intervene, Woebot also (over)stated that it was not a crisis service and that no human would be reading what the user writes in real time. We also reminded people of this whenever Woebot detected language that could be considered concerning, in addition to including standard warnings and offering to redirect users to appropriate emergency and other human-based services.

Safety governance architecture for a regulated product

The core objective of safety governance is to ensure the ongoing safety of all individuals who come into contact with the technology and to do so with a broad enough scope that it also directly or indirectly captures unintended or unanticipated harms as a result of the experience.

We established a significant safety infrastructure, operationalized and documented in standard operating procedures housed in a quality management system that meticulously managed document control, organization-wide training, and official signing practices. The safety infrastructure focused on the comprehensive identification, documentation, assessment, reporting, and management of safety events, including adverse events (AEs), serious adverse events (SAEs), and unanticipated adverse device events (UADEs).

A pivotal organizational structure was the safety assessment committee (SAC), which met regularly and consisted of the chief clinical officer (a clinical psychologist by training), representing clinical care; our vice president of regulatory affairs (a physician by training), who represented product and strategy; and an external (i.e., otherwise unaffiliated with the company) physician chair who was compensated for their time. The SAC, whose members were trained in regulatory-grade safety procedures, educated the rest of the organization on device vigilance standards, safety monitoring, and standard operating procedures, and conducted ongoing surveillance of all safety events observed during any deployment of Woebot, whether commercial or in the context of a clinical trial and regardless of whether the principal investigator was internal or external to the company.

In addition, a doctoral-level head of device vigilance (DV), responsible for ongoing monitoring of intervention performance, conducted post hoc reviews and evaluations of safety events; received notification of new AEs and SAEs; sent reports to the SAC and any appropriate regulatory authorities; provided the initial documentation of safety-related events; escalated and informed all appropriate parties and committees where necessary; entered data into a safety database; ensured quality control of data entry and narrative; reviewed and approved cases in the safety database; and prepared analyses of similar events for evaluation by the SAC.

Both the SAC and head of DV relied on an extensive staff, including a lead biostatistician who oversaw safety-data analysis for review by the SAC. In addition, research assistants and project and program managers assisted in the extensive documentation and reporting responsibilities, and a clinical operations lead oversaw much of the day-to-day operations of safety monitoring, vendor management, and clinical trials.

In summary, key aspects of an ongoing safety monitoring approach include:

  • Proactive safety procedures. All users are thoroughly informed about program limitations and safety procedures during app onboarding. Any screening processes that are necessary to exclude individuals who are deemed unsuitable occur prior to interaction with the program.

  • Multichannel safety monitoring. Safety events can be identified through spontaneous user reports (via email, phone, app support), solicited self-reports via follow-up surveys, and retrospective reviews of “language detection protocol” transcripts. A safety event can be anything from a negative experience caused by a software bug to significant worsening of symptoms detected in the context of a study.

  • Defined roles and responsibilities. For example, clear roles were established for the SAC, head of DV, and supporting staff. For all studies and trials, the principal investigator, sponsor (i.e., Woebot Health), decentralized trial vendor, and SAC were also involved in managing safety events.

  • Systematic event processing. Upon detection of a potential safety event, a structured process is used to make an initial determination of seriousness and causality, followed by review and approval by relevant staff members. This must occur promptly, since SAEs and UADEs have expedited reporting timelines (twenty-four hours from awareness); a minimal sketch of this triage logic appears after this list.

  • Documentation and reporting. All safety events are meticulously documented in electronic case report forms. Regular reports, including biweekly summaries and periodic safety reports, are generated for relevant staff, committees, and regulatory authorities as appropriate.

  • Risk mitigation. Procedures are in place to minimize risks, such as misunderstandings of the application’s capabilities, data breaches, and potential emotional upset among participants.

  • Continuous monitoring and reconciliation. Safety data are continuously monitored, and SAE reconciliations between the safety database and the electronic data capture (EDC) system are performed to ensure data quality and accuracy. Signal detection and management processes are in place to identify potential safety concerns.

  • Compliance and quality control. Adherence to applicable guidelines, regulations, and quality system documents is emphasized, with key performance indicators monitored for vendor performance and deviation resolution.
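
For concreteness, the sketch below illustrates the kind of triage logic implied by the systematic event processing and expedited reporting items above. It is a minimal Python sketch with hypothetical names and categories; it is not drawn from Woebot Health’s actual systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum

class Seriousness(Enum):
    NON_SERIOUS = "adverse event (AE)"
    SERIOUS = "serious adverse event (SAE)"
    UNANTICIPATED_DEVICE = "unanticipated adverse device event (UADE)"

class Causality(Enum):
    UNRELATED = "unrelated to the device"
    POSSIBLY_RELATED = "possibly related to the device"
    RELATED = "related to the device"

@dataclass
class SafetyEvent:
    """One record in a hypothetical safety database."""
    description: str        # narrative documented by the head of DV
    detected_at: datetime   # when the team became aware of the event
    seriousness: Seriousness
    causality: Causality

    def reporting_deadline(self):
        """SAEs and UADEs carry the expedited timeline noted above
        (twenty-four hours from awareness); other events follow routine
        periodic reporting, so this returns None for them."""
        if self.seriousness in (Seriousness.SERIOUS, Seriousness.UNANTICIPATED_DEVICE):
            return self.detected_at + timedelta(hours=24)
        return None

# Example: an SAE detected in a trial must be escalated within twenty-four hours.
event = SafetyEvent(
    description="Significant worsening of symptoms reported in a follow-up survey",
    detected_at=datetime(2024, 3, 1, 9, 0),
    seriousness=Seriousness.SERIOUS,
    causality=Causality.POSSIBLY_RELATED,
)
print(event.reporting_deadline())  # 2024-03-02 09:00:00
```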

The safety governance outlined above was considered appropriate for a “non–significant risk device.” Scientists and academic readers will notice that it goes far beyond the standard safety and data protections required by institutional review boards (IRBs) in the context of much human subjects research. In study contexts, the same elements exist—data and safety monitoring boards, for example, operate similarly to an SAC—but nowhere to the degree that is required here.

Woebot engaged with more than 1.5 million people, and no SAEs or UADEs were detected. We were able both to monitor which individuals triggered the “concerning language detection” algorithm and to assess its precision and recall (the share of its flags that were correct, and the share of genuinely concerning messages it caught) in each research setting by reviewing and labeling the data.
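
As a worked illustration of what assessing precision and recall from labeled transcripts involves, the minimal Python sketch below uses hypothetical flag/label pairs; it is not Woebot Health’s actual evaluation code.

```python
# Minimal sketch (hypothetical data): precision and recall of a
# concerning-language flag, computed from human-labeled transcript snippets.
labeled = [
    # (algorithm_flagged, human_judged_concerning)
    (True, True), (True, False), (False, False),
    (True, True), (False, True), (False, False),
]

tp = sum(flag and label for flag, label in labeled)       # flagged and truly concerning
fp = sum(flag and not label for flag, label in labeled)   # flagged but not concerning
fn = sum(not flag and label for flag, label in labeled)   # concerning but missed

precision = tp / (tp + fp)  # how often a flag was correct
recall = tp / (tp + fn)     # how much concerning language was caught

print(f"precision={precision:.2f}, recall={recall:.2f}")  # 0.67 and 0.67 for this toy data
```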

While we strongly advocate for healthy safety governance and the objectives it is meant to secure, in practice both were often fundamentally at odds with the current pace and realities of AI development and with opportunities to innovate. I now see a risk in failing to “right size” such efforts. A quick glance at the qualifications of individuals on the SAC or involved in the DV groups will confirm that this is not an inexpensive endeavor. When doing the right thing is prohibitively expensive and painfully slow, people may feel encouraged to operate as if they were not regulated. This is the counterproductive nature of safety governance that is not fit for purpose.

 

A photo of Nicholas Jacobson, a person with red hair and a red beard and mustache, wearing glasses, a gray suit, and a blue shirt, and smiling at the viewer.

Nicholas Jacobson
 

First and most important is that AI be held to a high bar: Both safety and efficacy can and should be quantified through trials. Direct oversight is important to show that the tools are actually safe and effective.

What individuals or organizations are or should be responsible for ensuring safety? What tools might they use to do so?

Initially, the developers and researchers creating the AI tool bear the primary responsibility for safety. This involves meticulous design, training on high-quality, evidence-based data (not just scraping the Internet), incorporating clinical expertise throughout development (we involved psychologists and psychiatrists extensively), and building in explicit safety features (e.g., crisis detection models that link to resources like 911 or hotlines). Tools include rigorous internal testing, adversarial testing to find failure points, and close human supervision during clinical trials to monitor interactions and intervene if necessary. Ongoing postdeployment monitoring is also essential.
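
As a hedged illustration of this kind of built-in safety feature, the minimal Python sketch below shows how a crisis-detection check might gate responses and route users to human-staffed resources. The keyword heuristic, threshold, and function names are hypothetical stand-ins for a trained model, not Therabot’s actual implementation.

```python
# Illustrative only: a purpose-built system might wrap response generation
# with a crisis check that short-circuits to human-staffed resources.
CRISIS_RESOURCES = (
    "If you are in immediate danger, call 911. "
    "You can also call or text 988 (Suicide & Crisis Lifeline) to reach a trained counselor."
)

def detect_crisis_score(message: str) -> float:
    """Stand-in for a trained crisis-detection model: a crude keyword
    heuristic returning a risk score between 0 and 1."""
    keywords = ("suicide", "kill myself", "end my life", "hurt myself")
    return 1.0 if any(k in message.lower() for k in keywords) else 0.0

def generate_supportive_reply(message: str) -> str:
    """Placeholder for the usual therapeutic response pipeline."""
    return "Thanks for sharing. Can you tell me more about what's been going on?"

def respond(message: str, risk_threshold: float = 0.5) -> str:
    if detect_crisis_score(message) >= risk_threshold:
        # The crisis path bypasses generation entirely and hands off to humans.
        return CRISIS_RESOURCES
    return generate_supportive_reply(message)

print(respond("I've been thinking about suicide lately."))  # prints the crisis resources
```

The point of the pattern is that the crisis path does not rely on the generative model to respond appropriately; it routes around it.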

What role should regulatory bodies and independent audits play in verifying AI safety and performance?

Regulatory bodies like the FDA have a critical role, particularly when tools make therapeutic claims. Such bodies should establish clear guidelines for evaluating safety and efficacy and require robust evidence from clinical trials before allowing market access or specific claims. Independent audits by third parties could verify AI performance, safety protocols, data privacy, and algorithmic bias, adding a crucial layer of accountability. The current regulatory landscape is lagging behind the technology’s rapid advancement.

What evaluation frameworks, such as clinical trials or real-world data, are most appropriate for assessing the safety of AI tools?

Randomized controlled trials (RCTs) are the most appropriate framework for assessing both the safety and efficacy of AI tools intended for clinical use, just as they are for other medical interventions. Our Therabot trial utilized an RCT design. Post-approval collection of real-world data is also vital for ongoing safety monitoring and for understanding effectiveness across diverse populations and contexts not perfectly captured in trials. Continuous monitoring within trials, including review of AI-user interactions by trained staff, is crucial for immediate risk mitigation.

How should safety metrics be defined and tailored for different use cases and clinical conditions?

Safety metrics must be defined clearly and tailored to the specific use case and clinical condition. This includes tracking the frequency and nature of inappropriate or harmful AI responses (we logged these), monitoring for any worsening of symptoms, assessing the effectiveness of crisis detection and response protocols, and evaluating potential biases in the AI’s interactions. Different conditions (e.g., eating disorders versus depression) carry distinct risks (e.g., reinforcing harmful weight-loss behaviors), so each requires its own metrics. Human oversight is important in determining whether safety metrics are applied successfully.

What levels of transparency and accountability are necessary?

Users need to provide informed consent and demonstrate an understanding that they are interacting with an AI. Accountability rests with the developers and deploying organizations, which must ensure the AI operates safely and ethically, must address issues promptly, and must be liable for harms caused by negligence or unsafe design. Clear mechanisms for reporting adverse events or problematic interactions are needed.

How should the availability of non-purpose-built tools (such as ChatGPT) figure in regulatory and evaluation assessments?

The availability of general-purpose tools like ChatGPT poses a significant challenge. People may use them for mental health support even though they were not designed, tested, or shown to be safe for this purpose. Generic, general-purpose systems can regularly act in ways that are profoundly unsafe. Regulatory and evaluation assessments must clearly differentiate between rigorously developed, purpose-built tools like Therabot and general AI. The public will need to be educated about the risks of using nontherapeutic AI for mental health needs. Work on purpose-built therapeutic AIs should be regulated, with regulations focused on tools marketed with therapeutic claims, while also acknowledging the reality of off-label use of general tools.

 

A photo of Henry T. Greely, a person with light skin, gray hair and gray mustache, wearing glasses and a blue shirt, facing the viewer with his hand on his chin.

Hank Greely
 

My greatest concerns about AI in mental health care are more in the area of “political economy” than in ethics or law, but this question seems to be my best chance to present them.

Proving safety and efficacy is not easy, but requiring such proof is even harder when social forces stand in the way. Ideally, we would know, from research, the safety and effectiveness of particular approaches to AI in mental health care before they are widely used. This could both avoid direct harm to some patients and prevent wasted effort on ineffective treatments.

An early safety and efficacy regime also has the advantage of acting when political difficulty is low. As the Collingridge dilemma acknowledges, “attempting to control a technology is difficult . . . because during its early stages, when it can be controlled, not enough can be known about its harmful social consequences to warrant controlling its development; but by the time these consequences are apparent, control has become costly and slow.”42 Once a technology has been widely adopted, vested interests will have developed among those who produce it, use it, or are affected by it. But before it is in substantial use, we may not know what harm, or good, it does.

The best answer seems to be to allow nonresearch uses of risky novelties only after they have been proven safe and effective. In the United States, this is largely limited to FDA medical product regulation, but such regulation is under constant assault from producers, physicians, disease organizations, and patients wanting faster and easier access.

The use of AI in mental health care raises special problems. Some lie on the AI side: AI is a tool widely hoped to fix all that ails everyone and everything, the subject of intensive and expensive research, and the basis of the high valuations of many huge and powerful firms. And it resides largely in Silicon Valley, with its ethos of “move fast and break things.”

Infotech has long eyed the more than $5 trillion U.S. health care market enviously. Its efforts to break into that market have largely failed. AI offers another chance, with fewer opponents. The FDA faces legitimate difficulties in figuring out how to regulate AI in health care, but the influence of many of the world’s largest companies on Congress and the administration will make its job even harder. The FDA has already announced relaxed and unclear standards for regulating AI.

That AI is to be used in mental health care exacerbates the problems. First, measuring the outcomes of mental health care is more difficult than, say, measuring mortality reductions from a treatment for pancreatic cancer. But mental health, an area largely left behind in recent medical advances, is also filled with particularly desperate patients (and those who care about them). The strong desire for treatments seems likely to make patients, families, and disease organizations eager to promote AI interventions. The broad rejection in many parts of society of scientific and medical expertise will not help.

Creating a useful scheme for regulating AI in mental health care will always be hard; figuring out how to get it implemented—by legislatures, regulators, or otherwise—will be even harder. Finding ways to solve that problem must be a high priority.

Endnotes

  • 31

    Julian De Freitas, Ahmet Kaan Uğuralp, Zeliha Oğuz‐Uğuralp, and Stefano Puntoni, “Chatbots and Mental Health: Insights into the Safety of Generative AI,” Journal of Consumer Psychology 34 (3) (2024): 481–491.

  • 32

    Zoha Khawaja and Jean-Christophe Bélisle-Pipon, “Your Robot Therapist Is Not Your Therapist: Understanding the Role of AI-Powered Mental Health Chatbots,” Frontiers in Digital Health 5 (2023): 1278186.

  • 33

    A. A. Abd-Alrazaq, A. Rababeh, M. Alajlani, B. M. Bewick, and M. Househ, “Effectiveness and Safety of Using Chatbots to Improve Mental Health: Systematic Review and Meta-Analysis,” Journal of Medical Internet Research 22 (7) (2020): e16021; and Jake Linardon, John Torous, Joseph Firth, Pim Cuijpers, Mariel Messer, and Matthew Fuller‐Tyszkiewicz, “Current Evidence on the Efficacy of Mental Health Smartphone Apps for Symptoms of Depression and Anxiety: A Meta-Analysis of 176 Randomized Controlled Trials,” World Psychiatry 23 (1) (2024): 139–149.

  • 34

    Julian De Freitas and I. Glenn Cohen, “The Health Risks of Generative AI-Based Wellness Apps,” Nature Medicine 30 (5) (2024): 1269–1275.

  • 35

    American Psychological Association, November 21, 2024.

  • 36

    Regulation of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act) (Text with EEA relevance).

  • 37

    Linda Malek and Allison Kwon, Crowell, May 5, 2025.

  • 38

    Atul Gawande, “Why Doctors Hate Their Computers,” The New Yorker, November 12, 2018.

  • 39

    Adriana Petryna, Andrew Lakoff, and Arthur Kleinman, eds., Global Pharmaceuticals: Ethics, Markets, Practices (Duke University Press, 2006).

  • 40

    Shifat Islam, Rifat Shahriyar, Abhishek Agarwala, et al., BMC Medical Informatics and Decision Making 25 (1) (2025).

  • 41

    Anastasiya Kiseleva, Dimitris Kotzinos, and Paul De Hert, “Transparency of AI in Healthcare as a Multilayered System of Accountabilities: Between Legal Requirements and Technical Principles,” Frontiers in Artificial Intelligence 5 (2022).

  • 42

    David Collingridge, The Social Control of Technology (St. Martin’s Press, 1980), 16.