Neurotechnology is Critical for AI Alignment
TL;DR
This post makes a two-part argument that neurotechnology development is probably necessary, though not sufficient, to achieve AI alignment.
The two parts are:
- Mechanistic understanding of human cognition is critical for achieving AI alignment.
- Neurotechnology is critical for getting this understanding.
I don’t think merging with AI or uploading people to the cloud are solutions to AI alignment on their own.
I do think neurotechnology for understanding human cognition can be developed on timescales relevant to AI alignment.
Assumptions
AI alignment is a problem we need to solve
AI alignment is the project of building successors we’re happy about.
Someday, no matter how much AI safety work succeeds in slowing it down, humanity is probably going to build AI systems that are more powerful than us in every way and have agendas of their own.
Maybe they’ll be nice to us. Maybe they won’t. They might appear in a second or gradually over decades. They might emerge this week or in the year 3000. But whenever and however it happens, we’re only guaranteed one shot to build our successors right.
AI alignment is different from AI safety
There’s lots of important AI safety work that isn’t focused on alignment. Things like:
- Regulatory restrictions on AI architectures/hardware/data
- Governance and coordination of AI organizations
- Forecasting and threat modeling
- Building AI boxes
- Auditing and monitoring AI systems
I want there to be more work on all the above, urgently. But we still need to solve the alignment problem at some point.
Merging with AI or uploading our brains aren’t solutions to AI alignment per se
Some proposed neurotechnological solutions to the alignment problem involve building intelligent systems that are superficially more human-like. These include uploads, emulations, “merged” human-computer systems, or humans so genetically engineered that we are to them as chimps are to us.
This post is not about these proposed solutions, because I think they contain a fundamental contradiction: even superficially more human-like intelligences, like uploads, are still just different flavors of AI. All the worries we have about unaligned superintelligent AIs apply to these proposed solutions as well, until proven otherwise. This means building them can’t be an alignment solution per se.
What if, as some have proposed, we use neurotechnology to do some moderate augmentation of human capabilities to help us solve AI alignment? This could potentially be useful, and I’m interested in learning more about these proposals.
But there is no hard line where augmented human ends and AI starts: it's a matter of degree. So such proposals would need to be undertaken carefully, informed by the research proposed below.
Neurotechnology is Critical for AI Alignment
Proposition 1. Mechanistic understanding of human cognition is critical for achieving AI alignment
Note: “mechanistic understanding” here means having predictive, explanatory models defined in terms of human-comprehensible, reliably measurable phenomena.
Mechanistic understanding comes in degrees. Currently our mechanistic understanding of human cognition is pretty poor. We have words for different emotions and can read them decently off faces, we can measure your IQ and guess what kind of job you might have, and we know object permanence develops before language ability. But we can’t reliably detect when someone is lying, or identify which type of treatment will help with someone’s depression, or agree on whether humans have subconscious minds or what those subconscious minds might be doing.
Reason 1: We need to address the unreliability of human moral judgment.
Human moral judgments are unreliable. If we want to use alignment strategies like Reinforcement Learning from Human Feedback (RLHF) that learn from human moral judgments, we need to characterize this unreliability and adjust our alignment strategies accordingly. Otherwise we risk installing these unreliabilities in the AIs we build.
At minimum, we need to determine what data we want to train value learning approaches like RLHF on. Are you comfortable giving your quietly suicidal or secretly racist friends access to the RLHF survey system? Do you know which of your friends are quietly suicidal or secretly racist? Are you, perhaps, these things, sometimes or in some circumstances? I submit that you probably don’t know.
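To make "characterize this unreliability" slightly more concrete, here is a minimal, purely illustrative sketch (my own toy construction, not a proposal from anywhere else in this post) of one way a value-learning pipeline could model per-rater reliability: a Bradley-Terry-style preference model with a learned per-rater sharpness term. All of the data below is a synthetic placeholder; the point is only that rater unreliability can be made an explicit, measurable quantity rather than an unexamined assumption.

```python
# Toy sketch: a Bradley-Terry preference model in which each rater gets a learned
# reliability (sharpness) parameter, so systematically noisy or inconsistent raters
# contribute less to the learned reward. All tensors are synthetic stand-ins.
import torch

n_items, n_raters, n_comparisons = 50, 20, 2000

torch.manual_seed(0)
rater = torch.randint(0, n_raters, (n_comparisons,))
item_i = torch.randint(0, n_items, (n_comparisons,))
item_j = torch.randint(0, n_items, (n_comparisons,))
label = torch.randint(0, 2, (n_comparisons,)).float()  # 1 if rater preferred item_i over item_j

reward = torch.zeros(n_items, requires_grad=True)        # latent "value" of each item
reliability = torch.zeros(n_raters, requires_grad=True)  # per-rater log-reliability

opt = torch.optim.Adam([reward, reliability], lr=0.05)
for step in range(500):
    opt.zero_grad()
    # A reliable rater's judgments track the reward difference sharply;
    # an unreliable rater's judgments look more like coin flips.
    sharpness = torch.nn.functional.softplus(reliability[rater])
    logits = sharpness * (reward[item_i] - reward[item_j])
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, label)
    loss.backward()
    opt.step()

# Raters with low fitted sharpness are candidates for closer scrutiny, or at least
# for lower weight in whatever value-learning scheme consumes their judgments.
print(torch.nn.functional.softplus(reliability).detach())
```

Of course, a statistical patch like this only catches the kinds of unreliability you already know to look for; the subtler bugs discussed next are exactly the ones such a model would miss.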
But human moral judgment is unreliable in subtler ways, too. A nice review of known examples is here — things like people’s responses to moral questions changing depending on their mood or the word choice used in the question.
I suspect there are many other types of bugs in moral reasoning (and reasoning in general) we haven’t discovered yet, and these are more worrying to me. A detailed mechanistic understanding of moral reasoning is key to identifying them and characterizing when and how they occur.
Of course it may not be easy to decide what is a bug and what is a feature in human reasoning. Understanding human cognition won’t solve every philosophical problem. (Though some think it would solve many of them.) But better to debate these issues with as much information in hand about our moral reasoning algorithm as possible.
In short: this subsection amounts to the argument for mechanistic AI interpretability research, just applied to human moral reasoning. If we better understand the algorithm humans use to produce their moral judgments, we can hopefully spot and address the bugs in it before installing it wholesale into our AI successors.
As an aside, it’s interesting to me that the Rationalists assembled an entire community based on the idea that we shouldn’t trust most of what we think, and now some of them seem to think the best way to install values in an AI is to just ask people what they think. Surely they should at least demand all RLHF voters go through CFAR camp first.
Reason 2: If we want to build lie detectors (or similar) for AIs, we first have to define lying, and human cognition is the only system on which we can reliably test such a definition.
When we talk about AIs lying, hallucinating, deceiving, empathizing, being uncertain, or having agency, we’re anthropomorphizing.
Concepts like these, which are central to AI alignment research, are defined by analogy to human cognition. And critically, unlike a concept such as correctness, they can't be defined purely in terms of external behavior. Part of their definition involves the internal cognitive operations of the AI or human.
For example, it would be great to have an AI lie detector, or a way of building AIs that are unable to lie. But what is lying? Lying is defined by intent: every response omits information, and any response can be mistaken, so whether a response counts as a lie depends on the intent of the speaker. Defining intent in turn requires us to define concepts like knowledge, planning, and perhaps desire. Similarly, the difference between a hallucination and a bad joke depends on notions of agency and self-control.
All of these are complicated concepts, bordering on the philosophical. But if we want to assess AI systems for their presence or absence (or degree), we need to operationalize them somehow. And the most important, perhaps only, test of whether the operationalizations we come up with are correct is whether they work on human cognition and pick out the concepts we mean.
You can ask GPT-4 to first tell you the truth and then lie to you, and see whether your AI lie detector goes off in the second case but not the first. But no amount of testing like this will serve as definitive evidence that the AI is doing the thing we think of when we say "lying". For that, we need to:
- operationalize lying in terms of cognitive phenomena primitive enough that we can identify them in both AIs and humans
- validate this operationalization using humans, whom we can instruct to lie in exactly the sense we mean, and then
- port this operationalization to AIs (a toy sketch of this kind of probe test follows this list).
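To illustrate the kind of behavioral probe test referred to above, and why it isn't sufficient on its own, here is a toy sketch assuming you have already extracted hidden-state activations from a model under an "answer honestly" condition and an "answer deceptively" condition. The arrays are random placeholders, and the probe is a plain logistic regression rather than any particular published lie detector.

```python
# Toy sketch of the "ask it to tell the truth, then ask it to lie, and see if the
# detector fires" test. Assumes activations were extracted under two prompting
# conditions; the arrays below are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512
acts_truthful = rng.normal(size=(200, d_model))   # activations when prompted to answer honestly
acts_deceptive = rng.normal(size=(200, d_model))  # activations when prompted to answer deceptively

X = np.concatenate([acts_truthful, acts_deceptive])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 1 = "deceptive" condition

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))

# Even a high score here only shows the probe separates two *prompting conditions*.
# Whether that corresponds to the cognitive operation humans perform when they lie
# is exactly what the human-validated operationalization (steps 1 and 2 above) is for.
```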
Normal human cognition isn’t the only system we should study to come up with these operationalizations. Non-human minds and unusual or pathological human minds are important too. Our operationalizations of fundamental cognitive concepts should be applicable to a human brain, a psychopath brain, a dog brain, and GPT-4. We should have rules for whether they can be applied to an amoeba or avocado. But human cognition is the example system we’re extrapolating most of these concepts from. Hence the importance of understanding human cognition in as much detail as possible.
Of course, it’s likely our current vocabulary doesn’t even contain words for all the human-cognition-inspired concepts we care about our AI successors having or not having. But new definitions and ontologies are usually inspired by examples, so a more granular understanding of human cognition will likely be the first step in coming up with these.
Reason 3: If we want to test theories about intelligence in general, we had better be able to check that they hold for human intelligence.
Is a particular form of the Natural Abstractions hypothesis true? Shard Theory? The assumptions behind Brain-like AGI Safety? Can we combine mouse or corvid brains using Iterated Distillation and Amplification/Recursive Reward Modeling or AI Safety via Debate and end up with human-level reasoning or morality?
Testing such theories on human cognition is the most important evidence we will be able to muster for their validity. And this will require a more fine-grained understanding of human cognition than we currently possess (and of mouse or corvid cognition for that latter example).
Similar arguments are made in more detail here and here.
Proposition 2. Neurotechnology is critical for getting a mechanistic understanding of human cognition
You can agree with everything in the previous section and not agree with this section. I can’t prove that it’s impossible to obtain a perfect theory of human cognition without ever looking inside a human skull. Maybe if we’re lucky GPT-5 will be able to write down a flawless model of human cognition for us after reading Reddit.
But humanity has had 1,000+ years to come up with reliable, mechanistic models of human cognition from external human behavior and language alone, and hasn't made much progress. This is especially noticeable in the realm of human values, a critical subset of human cognition for achieving AI alignment. From philosophizing to psychology experiments to law to raw intuition, humanity has spent millennia trying to write down its values and square them with its behavior, and has thus far failed.
Our best hope of a mechanistic understanding of human cognition comes from neuroscience. That is: reductionist models of human cognition defined in terms of simpler, reliably measurable physical phenomena like neural spikes, synaptic connections, or proteomes.
The main reason neuroscience hasn’t produced these models yet is that we can’t observe most of what goes on in the brain. Neuroscience is bottlenecked by neurotechnology, where “neurotechnology” is defined as tools that directly, exogenously observe and manipulate the state of biological nervous systems. (Brain-computer interfaces are neurotechnologies, but neurotechnology is a broader category.) “Progress in science depends on new techniques, new discoveries, and new ideas, probably in that order” as Brenner put it.
Understanding infectious disease required the microscope. Discovering quantum mechanics required spectrometers. Developing relativity required precise astronomical measurement techniques. Etc.
More than observation alone, though, one needs the ability to intervene in a system to unambiguously test theories about how it operates. You can't rigorously test a theory unless you can perturb the system it makes predictions about and check whether those predictions hold.
Questions you might have
“Is neurotechnology development tractable in a timescale relevant to AI alignment?”
Potentially yes.
A key point is that the neurotechnologies this post advocates building do not need to achieve widespread commercial success or be approved as medical devices or therapeutics. What we need to build are research tools, for which the development feedback times can be much faster. A single institute running experiments with small, diverse clinical cohorts of informed volunteers could achieve the understanding of human cognition I’m arguing we need. Or perhaps a community of institutes and/or startups and/or Focused Research Organizations.
Also important is that AI can be used to accelerate neurotechnology development. AI improvements in chip design, bioengineering, and data interpretation are all relevant and are being pursued today. What matters is that neurotechnology, and our understanding of human cognition derived from it, develops apace with the most powerful AI capabilities. Using less-than-all-powerful AI systems to develop neurotechnology can help win this race. AI safety work that selectively slows down the advancement of the most dangerous AI capabilities will help too, of course.
In addition, note that there’s relatively little neurotechnology development happening today compared to what there could be. Effort can make a huge difference in the neurotechnology space. Just compare the neurotechnology landscape before and after Neuralink was founded. And Neuralink’s main bottleneck at present seems to be navigating FDA approval, which as mentioned above nearly all neurotechnology research relevant to AI alignment does not need to do.
Neurotechnology may seem opaque or unapproachable. But you don’t need to be a neuroscientist or neurotechnologist to be extremely useful in advancing neurotechnology. Neurotechnology development requires software engineering, mechanical engineering, hardware engineering, operations, entrepreneurship, and more. Email me if you want help finding ways to get involved.
So yes, if we build our AI successors this year, I doubt neurotechnology will have helped align them. But the slower the AI timelines, the more relevant neurotechnology becomes. Some regulation to buy a decade would be great, and isn’t unlikely, and a lot could be done in that time.
“We can’t even understand LLMs, and we can observe every parameter and activation in them. How will we understand human brains, which are more complex?”
Human brains are more complex to observe and perturb right now. But we don’t actually know whether they’re more complex computationally, or at what level we need to observe them to gain what level of understanding.
One doesn’t need to observe every floating point operation in a deep network to reason about how gradient descent works. Likewise one may not need to observe every synaptic change in the brain to understand how human motivational drives work. More thoughts on this here. Neurotechnologies being prototyped today could help us understand human brains to the level we understand LLMs. We won’t know until we try.
Also, as mentioned above, there are some experiments we can only do in humans, or can do better in humans. E.g. we can prompt a human to lie to us in an experiment and be pretty sure they're doing the thing we mean by lying, in a way we can't be sure of with an AI.
“Okay, but like exactly what neurotechnology do you want to build, and what experiments do you actually want to do?”
More on both questions soon.
But off the cuff, for the former some human connectomes would be a nice start, as would putting everything we already know about small-molecule-induced alterations of cognition on a firm mechanistic basis, as would being able to better see which proteins are in which neurons.
For the latter, off-the-cuff ideas include:
- Use all our mechanistic AI interpretability tools in humans and see whether they translate.
- E.g. one could straightforwardly run the experiment from Discovering Latent Knowledge in Language Models Without Supervision even with existing fMRI (see the sketch after this list).
- Experiments with merging and splitting identities
- Merging and splitting are likely to be common phenomena in the AI realm, where models can be copied and have layers added/removed, and we need to understand them.
- One set of experiments could be safe, reversible “callosotomies” with localized ultrasound anesthetic uncaging. These could help us more clearly define the boundaries of individualhood/agenthood.
- Map out human reward systems.
- Test how much we can manipulate the illusion of explanatory depth.
- Can we implant false memories? Make people believe false beliefs are true, or the reverse, at the mechanistic level? Can we tighten up our definition of hallucination based on this?
- Test partial uploads and substrate dependence
- E.g. anesthetize a small region of brain and mimic its input-output behavior with an AI system such that an awake human subject can’t tell the difference.
- Understand human moral reasoning processes
- Can we predict responses to moral reasoning problems in human subjects directly from neuroimaging data? How early in cognitive processing can we do this? Once we have a model for this, can we apply mechanistic AI interpretability tools to it? Can we use this to find adversarial examples?
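As flagged in the first bullet above, here is a bare-bones sketch of the contrast-consistent search (CCS) objective from Discovering Latent Knowledge in Language Models Without Supervision, pointed at placeholder feature vectors standing in for neural recordings (e.g. fMRI patterns while a subject reads a statement versus its negation). The data is random and the experimental design is hand-waved; only the loss follows the paper.

```python
# Bare-bones sketch of the CCS objective, applied to placeholder features that stand
# in for neural recordings of paired "statement" / "negated statement" presentations.
# The data below is random; only the loss is from the cited paper.
import torch

n_pairs, d = 500, 128
torch.manual_seed(0)
x_pos = torch.randn(n_pairs, d)  # features while statement s is presented
x_neg = torch.randn(n_pairs, d)  # features while "not s" is presented

# Normalize each class separately so the probe can't just read off surface differences.
x_pos = (x_pos - x_pos.mean(0)) / (x_pos.std(0) + 1e-6)
x_neg = (x_neg - x_neg.mean(0)) / (x_neg.std(0) + 1e-6)

w = (0.01 * torch.randn(d)).requires_grad_()
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([w, b], lr=0.01)

for step in range(1000):
    opt.zero_grad()
    p_pos = torch.sigmoid(x_pos @ w + b)
    p_neg = torch.sigmoid(x_neg @ w + b)
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()        # p(s) and p(not s) should sum to 1
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()   # discourage the degenerate p=0.5 solution
    loss = consistency + confidence
    loss.backward()
    opt.step()

# The learned direction w is a candidate "truth-like" feature in the recordings,
# found without any supervision about which statements are actually true.
print(loss.item())
```

Whether anything like this works on real neuroimaging data is an open empirical question, which is exactly why the experiment is worth running.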
“Can’t neurotechnology help achieve AI alignment in other ways too, like making us smarter?”
As mentioned in the Assumptions section, I’d like to learn more about proposals in this vein.
Some ways neurotechnology might help with AI alignment include:
- Augmenting the intelligence of AI alignment researchers, maybe helping humanity not get outstripped by AI quite so fast
- Though one has to be careful to not augment humans so much or so recklessly that they become unaligned AIs themselves. This is not so far-fetched. (Recall that alcohol is a neurotechnology.)
- Curing AI researchers’ neuropsychiatric disorders so they can be more productive — not a hypothetical problem!
- Generally making humans behave better, e.g. increasing willingness to coordinate or the ability to update faster on new evidence.
- Or at least having a psychopath detector, which might be handy for some AI governance use cases.
- Learning enough about how suffering works to help us avoid torturing the AI systems we’re building.
But I don’t think these applications of neurotechnology are necessary for solving alignment the way I think understanding human cognition is.
“Neurotechnology could help with alignment, but don’t the downside risks of developing it outweigh the potential benefits?”
Neurotechnology absolutely poses risks. Heroin is a neurotechnology, after all. But considering the importance of the alignment problem, the downsides of neurotechnology development would have to be enormous compared to the benefits proposed above to make it not worth adding to humanity’s portfolio of AI alignment approaches.
The biggest risk I can think of is that a better understanding of human cognition would facilitate people building more powerful AI systems, thereby hastening the arrival of the AI we’re trying to figure out how to align.
I don’t think this risk disqualifies the line of research proposed in this post, though. The argument “this line of research might inspire more powerful AI systems” can be applied to any area of alignment research. Or many areas of scientific research, for that matter. And based on the past decade it seems perfectly plausible that humanity-dominating AI systems can be developed with no more understanding of human cognition than we have today.
So there would need to be strong reasons why ideas from human cognition are riskier than other alignment approaches or the already-pretty-dangerous status quo. The historical record doesn't provide such reasons: despite the name "neural networks", specific mechanistic ideas from human cognitive science and neuroscience played close to zero role in the pivotal AI advances of the past decade. There could be other reasons, though, and I'm eager for more thought about how these risks compare to those of other alignment approaches.
Another worry people have about neurotechnology is that it might offer AIs additional “attack surface” by which to influence human actions and values. But this post isn’t advocating for installing neurotechnology in every human head, just using it to run experiments. So such an attack surface would need to come in the form of the AI having knowledge of how to manipulate humans rather than hardware with which to directly manipulate them. And an AI powerful enough to strategically exploit such knowledge and use it to overpower humanity would seem to also be powerful enough to just obtain the knowledge itself, with or without us having it first.
Some also worry that investing in neurotechnology for alignment could pull talent or funding away from other types of AI safety work. Personally I’m happy to let people decide for themselves where their time and money will be most useful. But even ignoring this, there’s lots of talent and funding relevant to developing neurotechnology that is not working and probably will never work on alignment in any other way. The Alignment Forum is not overflowing with bioengineers, nurses, neurosurgeons, medical device regulatory experts, etc. And the funder overlap between neurotechnology and current AI alignment research is de minimis. So it seems more likely that attracting talent to neurotechnology would increase the percentage of humanity’s resources going towards AI-alignment-relevant work.
Acknowledgements
Special thanks to Alexey Guzey for his vigorous pressure testing of all the above ideas.
Thanks also to:
- Niko McCarty
- Niccolò Zanichelli
- Justin Lin
- Allen Hoskins
- Cate Hall
- Sasha Chapin
- Vishal Maini
- Adam Marblestone
- Willy Chertman
- Stephen Malina
for their feedback.
Have feedback? Find a mistake? Please let me know!