A pilot program by the Pentagon has identified over 800 potential vulnerabilities and biases in the use of large language models (LLMs) for military medicine, according to the U.S. Department of Defense.
The study was conducted by the Chief Digital and Artificial Intelligence Office (CDAO) under the Crowdsourced AI Red-Teaming (CAIRT) Assurance Program, in collaboration with the Defense Health Agency and the Program Executive Office, Defense Healthcare Management Systems. The nonprofit technology organization Humane Intelligence also participated in the initiative.
The testing focused on assessing the risks associated with using LLMs for summarizing clinical records and powering a medical advisory chatbot. More than 200 individuals, including medical professionals and defense analysts, participated in the trials, which involved evaluating three popular AI models.
The Pentagon has announced the program's successful completion. The insights gathered will inform test frameworks for evaluating future vendors and tools against performance expectations, and will underpin the Pentagon's policy development on the responsible use of generative AI in military medicine.
Matthew Johnson, who leads Responsible AI at the CDAO, said the program would help surface critical issues and evaluate mechanisms for resolving them, thereby advancing and refining generative AI models.
Established in June 2022, the CDAO focuses on testing, scaling, and integrating AI across defense structures. In August 2023, the office launched Task Force Lima to explore emerging technologies, and in December 2024, it created the AI Rapid Capabilities Cell, in partnership with the Defense Innovation Unit, to accelerate the deployment of AI tools.
The Department of Defense noted that the work carried out under the Crowdsourced AI Red-Teaming Assurance program will accelerate the adoption of AI solutions in military medicine.