The Ethical Hawk-Eye: How Custos AI Complements Anthropic's Vision for Safe AGI
Finding common ground with Amodei and Askell's frameworks for beneficial AI systems
In previous pieces here, I have explored the necessity of a transparent and impartial verification mechanism for public decisions that will increasingly rely on artificial intelligence. I introduced and developed the concept of Custos AI, imagining it not as an entity wielding public authority, but as an external auditing tool, a kind of "ethical Hawk-Eye" capable of subjecting critical calls (those shaping lives, economies, and the destinies of nations) to a "challenge": a rigorous check against defined, democratically legitimized ethical and legal principles. The fundamental question that has always driven this concept is the ancient, ever-relevant one: Quis custodiet ipsos custodes? (Who watches the watchmen themselves?). Who will oversee these "thinking machines," especially as their influence grows so profoundly, to ensure they operate fairly, transparently, and genuinely in service of the common good?
I find deep resonance in this personal quest when engaging with the visions of those at the forefront of advanced AI safety research. In particular, the ideas of Dario Amodei, a central figure in AI safety and currently the CEO of Anthropic, strongly echo this hope. He lays them out in his essay "Machines of Loving Grace: How AI Could Transform the World for the Better," a thought-provoking text from October 2024, available here: https://www.darioamodei.com/essay/machines-of-loving-grace. Amodei does not shy away from the risks, including existential ones.
Still, he outlines the achievable possibility, through massive, focused safety and alignment work, that artificial general intelligence (AGI), once it surpasses human capabilities, could become an intrinsically benevolent force for humanity, a true "machine of loving grace." His primary focus, and that of his team at Anthropic, is on engineering alignment: ensuring that powerfully capable AI systems genuinely want what we humans want, acting consistently with our deepest values even in unforeseen circumstances.
However, achieving this intrinsic "wanting" is a significant challenge, one that bumps against fundamental obstacles lucidly illuminated years ago by Amanda Askell, then a researcher on AI safety and policy at OpenAI and now, tellingly, part of Amodei's team at Anthropic. In her 2018 talk "AI Safety Needs Social Scientists," given at Effective Altruism Global, Amanda highlighted a crucial difficulty that mirrors my reflections on Custos AI: the problem of specifying human values for an AI system. Even principles we consider intuitive or universally agreed upon, like "do not needlessly harm someone," are incredibly hard to codify for an AI, which requires extreme algorithmic precision. Human understanding is guided by context, intuition, and countless implicit rules; translating this into explicit instructions for every plausible scenario an AI might encounter, especially one far exceeding our abilities, feels like trying to enumerate the infinite. How does the AI know when pushing someone is saving them rather than needlessly harming them? It needs to learn the underlying reasons.
Amanda pinpointed another key limitation: the non-scalability of direct human feedback for training advanced AI. While useful for systems at or below our level, asking humans to label actions as "good" or "bad" becomes impractical when AI operates in domains too complex for our full comprehension. Our judgment, based on limited observation, may no longer accurately reflect whether the AI is truly aligned at a deeper level. As AI capabilities grow, the "reward function predicted by humans" might diverge sharply from the "actual" desired outcome, just as Amanda's graph visually illustrated.
To overcome these hurdles of specification and feedback scalability, Amanda suggested focusing not just on what we want, but on why. This involves decomposing complex problems into simpler components and, importantly, learning the reasons underlying human preferences and ethical judgments. Instead of endlessly specifying values or judging countless complex behaviours, we could aim to teach the AI the deeper principles from verifiable information and simplified scenarios.
Amanda's "Debate Method for AI Safety" is a prime example of this direction: two AI agents debate a question by making simple, verifiable statements. A human judge, rather than deciding the complex question directly, evaluates which agent provides more truthful or useful information in each step of the debate process. This allows the AI to explore an "argument tree" and learn the underlying "reasons" that make certain statements persuasive to a human judge based on simpler, verifiable facts.
It is precisely at this point of convergence that I see a clear, supporting relationship emerging: Amodei's grand vision for beneficial AGI, Askell's lucid analysis of the challenges of alignment and her proposal of methods like learning reasons, and my concept of Custos AI. All these ideas highlight the necessity for AI to function in support of humans, in a kind of dialogue that leverages its capabilities without replacing essential human judgment and ethical deliberation.
Custos AI, as an institutionalised "ethical Hawk-Eye," could be the practical system that puts advances in AI interpretability and methods like the debate process to work for public accountability. It would not build the "Machines of Loving Grace" itself (that is the monumental task for researchers like Amodei); it would be the crucial oversight layer. It could trigger an audit, asking potentially very advanced AI systems to present the justifications, data trails, and underlying logic of a decision, perhaps using an AI-assisted debate format to break the complexity down into arguments and "facts" understandable by a human judge. This empowers society to perform a specific, verifiable compliance check, ensuring that even the most powerful "machines of loving grace" operate according to codified principles in sensitive public domains. The AI here acts as a powerful assistant, clarifying information, presenting logical paths, and revealing potential issues for human review.
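For readers who like to see an idea as an interface, here is a purely illustrative sketch of what such a compliance check might look like in code. Custos AI is a concept, not an implementation, so every name here (DecisionRecord, Principle, the audit function and its PASS/FLAG report) is hypothetical; the sketch only captures the idea that a decision's justification and data trail are checked against codified principles, one at a time, and the result handed to a human reviewer.

```python
# Purely illustrative sketch of a Custos AI compliance check.
# All names and fields are hypothetical; the concept has no reference implementation.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class DecisionRecord:
    decision: str           # the call under challenge
    justification: str      # the system's stated reasons
    data_trail: List[str]   # references to the inputs it relied on

@dataclass
class Principle:
    name: str                                            # a codified ethical or legal requirement
    check: Callable[[DecisionRecord], Tuple[bool, str]]  # returns (compliant?, explanation)

def audit(record: DecisionRecord, principles: List[Principle]) -> Dict[str, str]:
    """Runs each codified principle against the decision's justification and
    data trail, producing a readable report for the human judge, who decides."""
    report = {}
    for p in principles:
        ok, explanation = p.check(record)
        report[p.name] = f"{'PASS' if ok else 'FLAG'}: {explanation}"
    return report
```

The design choice worth noting is the division of labour: the tool surfaces evidence and flags; the verdict stays with the human judge.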
However, this framework, as both Amanda and the fundamental question of Quis custodiet? imply, critically depends on the reliability of that human judgment. If the efficacy of the AI-facilitated verification rests on human judges, then their competence, their resistance to bias, and their ability to discern truth from persuasive deception become the new, central challenge, returning the question of Quis custodiet ipsos custodes? squarely back to the human element in the system.
Amanda Askell's call for social scientists underscores the vital, empirical nature of this dependency. How capable are humans as judges by default? Can we reliably distinguish good judges? Can we train them to be better, perhaps mitigating known cognitive biases? These are not philosophical conjectures but empirical questions requiring experimental investigation by experts in fields like experimental psychology, cognitive science, and behavioural economics.
It is precisely this point of convergence that I find so compelling and, I admit, humbly validating. Bold visions for AI's potential good (Amodei), insightful analysis of its inherent alignment difficulties and proposed methods (Askell), and practical concepts for public oversight (Custos AI) all underscore the critical, indispensable role of human ethics and judgment. The overarching message is clear: AI, even advanced and aligned, is envisioned not as a replacement for human ethical reasoning or deliberation, but as a powerful tool in support of it, operating in a verifiable dialogue. Its ultimate safety and beneficence, particularly in the public sphere, depend on our capacity, as individuals and as a society, to cultivate ethical awareness, rational judgment, and a willingness to verify and take responsibility.
Custos AI, as that vital "ethical Hawk-Eye" for public decisions, can help ensure that the "machines of loving grace" act according to our standards. But the robustness of Custos AI relies on our ability to be effective, informed judges ourselves, a capacity we need to understand, trust, and perhaps actively cultivate, guided by the essential insights that social scientists can provide.
Thank you for joining me in this reflection, weaving together personal inquiry, ambitious visions, practical methods, and the fundamental role of human reliability in building a safer and more beneficial AI future. If you are interested, or have not yet had a chance to read them, you can explore my ideas on the concept of Custos AI in these articles:
"Why I Believe We Need an Ethical Hawk-Eye: The Custos AI Concept," which you can read here: https://cristianoluchinivedanta.substack.com/p/why-i-believe-we-need-an-ethical-hawk-eye-the-custos-ai-concept
"Custos AI Operationalised: Inspired by Process, Driven by Ethics," which you can read here: https://cristianoluchinivedanta.substack.com/p/custos-ai-operationalized-inspired
Discover more stories and reflections in my books.
You can also connect with my professional journey on LinkedIn.
I value your input. Reach out with feedback, suggestions, or inquiries to: cosmicdancerpodcast@gmail.com.
Grateful for your time and readership.