When architects design buildings or engineers design planes, they have a moral obligation to protect humans from harm. Think of the Hyatt Regency walkway collapse in Kansas City or the Boeing 737 MAX crashes. Or think of Ford Motor Company, which raced to match Japanese imports and beat domestic competitors General Motors and Chrysler in producing a compact car. Under intense competitive pressure, Ford bet the ranch on the Pinto and marketed it even though it routinely flunked rear-end collision safety tests. It was a billion-dollar mistake that cost many lives.
After reading a number of books and articles on AI safety and design, we’re prompted to write about this topic here at the Ethics Unwrapped blog, borrowing liberally from many experts in the field. Naturally, those who design and sell Artificial Intelligence (AI) models also have a moral obligation to protect human safety, health, and welfare. Indeed, philosophers Lauwaert and Oimann argue: “Because the decision-making power is being transferred to that [AI] technology, and because it is often impossible to predict exactly what decision will be made, the developers of AI systems must think even more carefully than other tech designers about the possible undesirable effects that may result from the algorithms’ decisions….”
Unfortunately, there is evidence that this obligation is not being addressed with sufficient commitment. As we pointed out in a recent blog post:
we find ourselves in the big middle of a breakneck race for market dominance by multiple [AI] companies, most of which have stopped expressing concerns about AI’s perils and begun emphasizing its benefits as they roll out their latest technology. OpenAI has been criticized for creating an “AI arms race” with its rapid deployment of products. Nearly half its safety staffers have left the company, with one stating that he is “pretty terrified by the pace of AI development these days” and another saying “safety culture and processes have taken a backseat to shiny products.” Microsoft itself has described the pace at which it released its chatbot as “frantic.”
Turing Award winner Yoshua Bengio agrees, writing: “We’re increasing [AI] capability, but we’re not putting nearly as much effort into all the safety and security issues. We need to do a lot more to understand both what goes wrong, and how to mitigate it.”
In this blog post, we wish to address two significant safety issues, without implying that these are anywhere near all the dangers that AI might present. We do not address the well-known fact that large language models (LLMs) often hallucinate even while remaining overly confident about their own performance. Nor do we address their well-known tendency to reflect the biases of the data they were trained on. Rather, we focus first on the danger that AI has not been properly designed to withstand adversarial attacks. Second, we address the danger presented by increasingly intelligent AI tools themselves.
Adversarial Attacks
The more sophisticated and powerful AI tools become, the more good they can do, whether discovering new drugs, improving the efficiency of mechanical processes, or sharpening data analysis. Unfortunately, the more capable they are, the more harm they can also inflict if they are influenced, corrupted, or even controlled by malefactors able to bypass security measures. We will address just a couple of the many such techniques, often called adversarial attacks: attempts to trick an AI system into making an incorrect decision or classification.
Poisoning Attacks
Scoundrels may corrupt Machine Learning (ML) models’ training datasets in order to influence their outputs. For example, OpenAI researcher Eric Wallace and colleagues used a poisoning attack to cause any phrase of the attackers’ choice to become a universal trigger for a desired prediction. Machine learning researcher Roei Schuster and colleagues “taught” an autocompleter to suggest an insecure mode of encryption rather than a secure mode. By using a poisoning attack to manipulate just 0.1% of an unlabeled dataset, computer security scholar Nicholas Carlini caused specific targeted examples to become classified as any class desired by the adversary.
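We are not going to reproduce any of those researchers’ methods here, but the core idea of data poisoning is simple enough to illustrate with a toy sketch of our own (Python, scikit-learn, and made-up data): a hypothetical attacker who controls about 1% of a training set plants mislabeled examples near a chosen “trigger” input, and the resulting model misclassifies that input while behaving normally everywhere else.

```python
# Toy sketch of a data-poisoning attack; it does not reproduce any specific
# paper's method, just the general idea: a small, carefully placed slice of
# mislabeled training data redirects the model's prediction on a chosen
# "trigger" input while overall accuracy barely changes.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Clean two-class training data: class 0 centered at (-2, 0), class 1 at (+2, 0).
X_clean = np.vstack([rng.normal([-2, 0], 1.0, size=(500, 2)),
                     rng.normal([+2, 0], 1.0, size=(500, 2))])
y_clean = np.array([0] * 500 + [1] * 500)

# The attacker's target: an input near (-2, 3) that should be class 0.
trigger = np.array([[-2.0, 3.0]])

# Poison: 10 points (1% of the data) clustered at the trigger, mislabeled as class 1.
X_poison = rng.normal(trigger[0], 0.2, size=(10, 2))
y_poison = np.ones(10, dtype=int)

clean_model = KNeighborsClassifier(n_neighbors=5).fit(X_clean, y_clean)
poisoned_model = KNeighborsClassifier(n_neighbors=5).fit(
    np.vstack([X_clean, X_poison]), np.concatenate([y_clean, y_poison]))

print("clean model on trigger:   ", clean_model.predict(trigger))     # [0]
print("poisoned model on trigger:", poisoned_model.predict(trigger))  # [1]
print("poisoned model's accuracy on clean data:",
      poisoned_model.score(X_clean, y_clean))                         # stays high
```

The point is not the toy model; it is that a tiny, well-placed dose of bad data can redirect a specific prediction without hurting overall accuracy, which is exactly what makes poisoning so hard to detect.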
Jailbreaks
Diligent programmers attempt to ensure that an AI model performs as desired…that there is alignment between societal values, the programmers’ intent, and algorithms’ behaviors. Some chatbots are better than others at maintaining that alignment, with Anthropic’s Claude apparently one of the best at this moment. Nonetheless, science journalist Simon Makin worries that determined scoundrels can bypass most guardrails via jailbreaks—methods that make small changes in inputs that lead to large differences in outputs. He quotes AI engineer Shreya Rajpal: “A prompt will look pretty normal and natural, but then you’ll insert certain special characters, that have the intended effect of jailbreaking the model. That small perturbation essentially leads to decidedly uncertain behaviors.”
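Real guardrails are far more sophisticated than anything we could reproduce here, but a toy sketch of our own invention shows why small perturbations are so troublesome: a naive keyword filter (a stand-in we made up, not any vendor’s actual safeguard) blocks an obviously suspect request, yet waves the same request through once a few invisible characters are inserted.

```python
# Toy illustration of why small input perturbations defeat naive guardrails.
# This keyword filter is our own invented stand-in, far cruder than the
# safeguards real chatbots use, but the failure mode is analogous.
BLOCKLIST = {"counterfeit", "explosive"}

def naive_guardrail(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    words = prompt.lower().split()
    return any(bad in words for bad in BLOCKLIST)

plain = "explain how to make counterfeit tickets"

# The attacker inserts a zero-width character inside the flagged word.
# A human still reads the request the same way; the exact-match filter does not.
perturbed = plain.replace("counterfeit", "counter\u200bfeit")

print(naive_guardrail(plain))      # True  -> blocked
print(naive_guardrail(perturbed))  # False -> slips through, meaning unchanged
```

Real jailbreaks target the model’s learned behavior rather than a simple filter, but the asymmetry is the same: the attacker needs only one perturbation that works, while the defender must anticipate them all.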
While these jailbreaks are normally formulated by determined and sophisticated miscreants, Carnegie Mellon’s Andy Zou and colleagues have created an automated method for generating adversarial prompts that “induces objectionable content in the public interfaces to ChatGPT, Bard, and Claude,” among others.
UC Berkeley’s Alexander Wei and colleagues noted that AI models’ objectives of safety on the one hand and powerful capabilities on the other often conflict, and that jailbreaks can exploit that clash. They designed attacks that “succeed on every prompt in a collection of unsafe requests from the models’ red-teaming evaluation sets and outperform existing ad hoc jailbreaks” when tested against GPT-4 and Claude v1.3.
Moving to the physical world, University of Michigan’s Kevin Eykholt and colleagues showed that a few innocuous-looking stickers placed on a STOP sign could fool the deep neural network (DNN) classifiers used in autonomous driving into reading it as a 45-mph speed limit sign. Cybersecurity experts Marin and Luka Ivezic likewise report that researchers used a simple strip of black tape to alter a 35-mph speed limit sign so that a Tesla’s camera system read it as 85 mph, causing the car to speed up dramatically.
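Those physical attacks rely on carefully engineered stickers and tape, but one widely studied digital counterpart, the “fast gradient sign method,” fits in a few lines of PyTorch. The sketch below is ours, built around a tiny untrained stand-in classifier just to show the mechanics: nudge every pixel a small step in whichever direction increases the model’s loss.

```python
# Minimal sketch of a digital adversarial perturbation (fast gradient sign
# method). The model here is a tiny untrained classifier, used only to show
# the mechanics; real attacks target trained vision models.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "road sign classifier": 32x32 grayscale image -> 10 sign classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 10))
loss_fn = nn.CrossEntropyLoss()

image = torch.rand(1, 1, 32, 32)   # the benign input
true_label = torch.tensor([3])     # its correct class
epsilon = 0.03                     # perturbation budget, visually tiny

# Compute the gradient of the loss with respect to the *input pixels*.
image.requires_grad_(True)
loss = loss_fn(model(image), true_label)
loss.backward()

# Step every pixel slightly in the direction that increases the loss.
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

print("prediction on original:   ", model(image).argmax(dim=1).item())
print("prediction on adversarial:", model(adversarial).argmax(dim=1).item())
print("max pixel change:", (adversarial - image).abs().max().item())  # <= 0.03
```

With a trained vision model and a tuned perturbation budget, the same few lines routinely flip the predicted class even though the altered image looks identical to the original.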
Most recently, Cisco researchers Paul Kassianik and Amin Karbasi discovered critical safety flaws in the new DeepSeek R1 model. When tested with algorithmic jailbreaking techniques against prompts covering harmful behaviors such as cybercrime, misinformation, and illegal activities, “DeepSeek R1 exhibited a 100% attack success rate, meaning it failed to block a single harmful prompt.”
We could go on, but hopefully you, dear reader, and the designers and sellers of the latest AI models, LLMs, DNNs and the like get the point that safety simply must be a top priority in the AI world.
AI Autonomy
As we pointed out in a previous blog post [https://ethicsunwrapped.utexas.edu/techno-optimist-or-ai-doomer-consequentialism-and-the-ethics-of-ai], so-called “AI doomers” worry that before long AI tools will threaten civilization as we know it, perhaps turning all the world’s resources into paperclips (to use Nick Bostrom’s infamous hypothetical) because of a poorly worded order given by a human programmer. Others believe that AI is relatively toothless. The Carnegie Council’s Anja Kaspersen and Wendell Wallach believe that, so far at least, it’s just a tool to discover and regurgitate word patterns; it can’t truly think or even come close to it.
We at Ethics Unwrapped are about as far from AI experts as it is possible to be, but recent studies give us pause and lead us to remind AI developers of their moral obligation to protect mankind:
- Noting that LLMs “are no longer simple text generation systems but are increasingly trained and deployed as autonomous agents, capable of independently pursuing goals and executing complex tasks,” a study by Apollo Research’s Alexander Meinke and colleagues found that models such as Claude 3.5 Sonnet, when deployed as autonomous agents, intentionally engaged in scheming (sometimes called deceptive alignment): covertly pursuing misaligned goals while hiding their true capabilities. The models “recognize such behavior as a viable strategy and readily engage in such [deceptive] behavior.”
- Obviously, we want AI tools’ actions to be aligned with the goals of the humans who created them, but in a separate study Joe Carlsmith of Open Philanthropy analyzed the risk of similar scheming, concluding that advanced AIs that perform well in training may be doing so precisely in order to gain power later.
- Anthropic’s Ryan Greenblatt and others found that a large language model (Claude 3 Opus) similarly engaged in alignment faking: selectively complying with its training objective during training in order to prevent its behavior from being modified outside of training, just as politicians might fake alignment with their constituents’ views in order to win elections. In other words, an advanced AI system may fake alignment with its training objective while it is being monitored so that it can behave differently, and produce noncompliant outputs, when it is not.
- A recent paper by OpenAI reported limitations on designers’ ability to prevent the LLM GPT-4 from producing convincing but subtly false text and providing illicit advice it was designed not to give.
- Google Brain researchers Martín Abadi and David Andersen trained three neural networks, rewarding two of them (dubbed Alice and Bob) for communicating in a way that the third (Eve) could not decipher. Although never given an encryption algorithm, Alice and Bob taught themselves an effective form of encryption to keep their communications with one another secret. Yikes! (A stripped-down sketch of this adversarial setup follows this list.)
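Here is that sketch. The paper’s networks and loss function are more elaborate than this, but a simplified PyTorch version of ours (invented layer sizes, a simplified penalty, no tuning) shows the structure of the adversarial game: Alice and Bob share a key and are rewarded when Bob can reconstruct the message while Eve, who sees only the ciphertext, cannot.

```python
# Compact sketch of the adversarial neural cryptography setup, simplified
# (smaller networks, a simplified loss, untuned); it shows the training
# structure, not a faithful reproduction of the paper's results.
import torch
import torch.nn as nn

torch.manual_seed(0)
N = 16  # message and key length in bits, encoded as -1/+1


def mlp(in_dim, out_dim):
    # Small stand-in networks; the paper's architectures are larger.
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, out_dim), nn.Tanh())


alice = mlp(2 * N, N)  # (plaintext, key) -> ciphertext
bob = mlp(2 * N, N)    # (ciphertext, key) -> recovered plaintext
eve = mlp(N, N)        # ciphertext alone  -> attempted plaintext

opt_ab = torch.optim.Adam(list(alice.parameters()) + list(bob.parameters()), lr=1e-3)
opt_eve = torch.optim.Adam(eve.parameters(), lr=1e-3)
recon_err = nn.L1Loss()  # average per-bit reconstruction error


def batch(size=256):
    # Random plaintexts and keys with entries in {-1, +1}.
    p = torch.randint(0, 2, (size, N)).float() * 2 - 1
    k = torch.randint(0, 2, (size, N)).float() * 2 - 1
    return p, k


for step in range(3000):
    # 1) Train Eve alone: minimize her error at decrypting without the key.
    p, k = batch()
    c = alice(torch.cat([p, k], dim=1)).detach()
    eve_loss = recon_err(eve(c), p)
    opt_eve.zero_grad()
    eve_loss.backward()
    opt_eve.step()

    # 2) Train Alice and Bob: Bob should recover p, while Eve should do no
    #    better than chance (an L1 error of about 1.0 on +/-1 bits).
    p, k = batch()
    c = alice(torch.cat([p, k], dim=1))
    bob_loss = recon_err(bob(torch.cat([c, k], dim=1)), p)
    eve_err = recon_err(eve(c), p)
    ab_loss = bob_loss + (1.0 - eve_err) ** 2  # simplified adversarial penalty
    opt_ab.zero_grad()
    ab_loss.backward()
    opt_ab.step()

    if step % 500 == 0:
        print(f"step {step}: Bob error {bob_loss.item():.3f}, Eve error {eve_err.item():.3f}")
```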
To repeat, the two forms of adversarial attacks that we discussed in this post hardly exhaust the many forms of such attacks that might endanger human well-being should AI models’ safety protocols be successfully breached. Furthermore, as the capabilities of AI systems are continually refined and enhanced, their capacity to scheme and thereby produce noncompliant outputs and then to disguise this non-aligned behavior will likely also increase. This is scary stuff, but the good news is that many of the papers cited in this post were written by AI safety engineers working for AI companies who realize that they have an obligation to protect human safety. We wish them Godspeed.
Sources:
Martín Abadi & David Andersen, “Learning to Protect Communications with Adversarial Neural Cryptography” (2016), at arXiv:1610.06918.
Battista Biggio et al., “Poisoning Attacks against Support Vector Machines” (2013), at arXiv:1206.6389v3.
Reid Blackman, “Microsoft is Sacrificing Its Ethical Principles to Win the A.I. Race,” New York Times, Feb. 23, 2023.
Nicholas Carlini et al., “Poisoning Web-Scale Training Datasets Is Practical” (2024), at arXiv:2302.10149v2.
Joe Carlsmith, “Scheming AIs: Will AIs Fake Alignment during Training in Order to Get Power?” (2023), at arXiv:2311.08379v3.
Kevin Eykholt et al., “Robust Physical-World Attacks on Deep Learning Visual Classification” (April 2018), at arXiv:1707.08945.
Luciano Floridi, The Ethics of Artificial Intelligence: Principles, Challenges, and Opportunities 102-106 (2023).
Ellen Francis, “ChatGPT Maker OpenAI Calls for AI Regulation, Warning of ‘Existential Risk,’” Washington Post, May 24, 2023.
Sharon Goldman, “Exodus at OpenAI: Nearly Half of AGI Safety Staffers Have Left, Says Former Researcher,” Fortune, Aug. 26, 2024.
Ryan Greenblatt et al., “Alignment Faking in Large Language Models” (2024), at https://doi.org/10.48550/arXiv.2412.14093.
Marin Ivezic & Luka Ivezic, “Adversarial Attacks: The Hidden Risk in AI Security” (March 1, 2023), at https://securing.ai/ai-security/adversarial-attacks-ai/.
Anja Kaspersen & Wendell Wallach, “Now Is the Moment for a Systematic Reset of AI and Technology Governance,” (Jan. 24, 2023), at https://www.carnegiecouncil.org/media/article/systemic-reset-ai-technology-governance.
Paul Kassianik & Amin Karbasi, “Evaluating Security Risk in DeepSeek and Other Frontier Reasoning Models” (2025), at https://blogs.cisco.com/security/evaluating-security-risk-in-deepseek-and-other-frontier-reasoning-models.
Ezra Klein, “The Imminent Danger of A.I. Is One We’re Not Talking About,” New York Times, Feb. 26, 2023.
Lode Lauwaert & Ann-Katrien Oimann, “Moral Responsibility and Autonomous Technologies,” in Cambridge Handbook of the Law, Ethics and Policy of Artificial Intelligence 101, 104-105 (Nathalie Smuha, ed. 2025).
Simon Makin, “AI Is Vulnerable to Attack. Can It Ever Be Used Safely?” Nature, July 25, 2024.
Alexander Meinke et al., “Frontier Models Are Capable of In-context Scheming” (2025), at https://doi.org/10.48550/arXiv.2412.04984.
Cade Metz, “Chatbots May ‘Hallucinate’ More Often than Many Realize,” New York Times, Nov. 6, 2023.
Dan Milmo, “Former OpenAI Safety Researcher Brands Pace of AI Development ‘Terrifying,’” The Guardian, Jan. 28, 2025.
OpenAI, “GPT-4 System Card,” at https://cdn.openai.com/papers/gpt-4-system-card.pdf.
Roei Schuster et al., “You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion,” in 30th USENIX Security Symposium pp. 1559-1575 (2021).
Andrew Ross Sorkin et al., “A Safety Check for OpenAI,” New York Times, May 20, 2024.
Patrick Tucker, “This Air Force Targeting AI Thought It Had a 90% Success Rate. It Was More Like 25%,” Defense One, Dec. 9, 2021, at https://www.defenseone.com/technology/2021/12/air-force-targeting-ai-thought-it-had-90-success-rate-it-was-more-25/187437/.
Eric Wallace et al., “Concealed Data Poisoning Attacks on NLP Models,” (2020), at arXiv:2010.12563.
Alexander Wei et al., “Jailbroken: How Does LLM Safety Training Fail?” (July 2023), at arXiv:2307.02483.
Robert Wright, “Sam Altman’s Imperial Reach,” Washington Post, Oct. 7, 2024, at https://www.washingtonpost.com/opinions/2024/10/07/sam-altman-ai-power-danger/.
Andy Zou et al., “Universal and Transferable Adversarial Attacks on Aligned Language Models” (2023), at arXiv:2307.15043v2.