Artificial Selection in AI Development

Derek Larson
5 min readAug 17, 2023

This was a quick post inspired by this tweet. It’s rough and I hope to polish it up as I read more relevant research and think through the ideas!

Discussions of AI doom and boom have reached the mainstream: extended family have begun asking me pointed questions. This ends up a bit frustrating — not because of the questioning, but because of the futility of providing quality, scientific answers. We’re in this tough state where neither party (the public nor the experts) are in a good position to dive into the topic together. There are well-known, trusted voices across the spectrum, from speaking about sub-decade humanity-ending scenarios to scoffing at recent accomplishments with a “nothing to see here” vibe.

This article aims to help normalize part of the discussion, by addressing one facet of AI risk: the hidden costs of progress. With an analogy to artificial selection, I believe we can grasp (and predict) how problems may show up as we develop stronger AI. Where we saw health problems show up in purebred dogs, and nutrient loss in modern crops, what will happen in artificial intelligence?

Immeasurable Complexity

The core ingredient in a recipe for unintended consequences is a system with very many variables that can’t be easily measured. Genetics certainly meets this requirement, where organisms like humans, dogs, and corn each have something like 10k–100k genes. And we’re making slow progress on understanding them (recall the junk DNA debate). Not only are there many genes, but the mechanisms for their expression is complicated (see epigenetics). Furthermore, it would be nice if genes were narrow in their function or effect, but this isn’t necessarily the case — genes can be pleiotropic, having multiple apparently unrelated functions.

Cognitive systems, such as the human brain and the aspirational targets of AGI, are arguably even more complex and we understand them even less. Compare 3 billion base pairs in the human genome to 100 billion neurons and 100 trillion synapses in the human brain. And our current language models are pushing trillions of parameters! With the human brain, we have rough map of functionality — e.g. the amygdala is central to emotion. I believe we mostly lack such vision into neural networks, as we’ve just begun to categorize some of the behavior (such as induction heads). Intriguingly, our early AI systems demonstrate a pleiotropic equivalent with neuron polysemanticity. All to say, these models are certainly complex enough that we will struggle to understand and measure fine-grained function.

Optimizing the System

It may seem both easy and hard to go about “improving” an organism. The echoes of 20th century conflict around eugenics still occasionally reverberate, promising various levels of human advancement. With so many “flaws” in the human genome, surely we can just selectively patch them up? Meanwhile, those who have thought through the implications of genetic enhancement resist, often on a morality basis. Less common are the arguments that we lack the scientific maturity to engineer humans, but I find these just as compelling.

First, let’s consider three degrees of genetic development: evolution, artificial selection, and gene editing. Humans owe our genetic gifts to evolution, and it sets a baseline for comparison. Evolution operates extremely slowly, and its fitness target (reproductive suitability) can conceivably tie in most genes. As such, over a long enough period, and subject to a relatively stable environment, we would expect evolution to yield statistically “positive” results.

Artificial selection is as old as human civilization. By only breeding plants and animals with desired traits, we have quickly guided their genes towards some desired goal: dogs that retrieve, crops that grow quickly. Compared to evolution, however, we’ve relaxed the fitness criteria: we’ve focused on specific improvements, what about other characteristics? For crops, there’s evidence that while improving things like hardiness and yield (targets), nutrient content has diminished (side effect).

Gene editing, thankfully, has been a cautious pursuit — it’s not easy. Curing hereditary blindness, while likely passing the moral hurdle, has a tougher time with the scientific maturity one. With selective breeding, we at least move at the pace of generational change and leverage the consequential viability check. Germline gene editing, say for crops, has no innate speed limit and it’s on us to perform the required checks — potentially long-term, large-scale trials.

Selecting AI

What can we learn from genetics that we could apply to the future of AI? First off, I think we’re kind of in the “primordial soup” phase. The various model architectures and datasets being developed are like the early single-celled organisms and proteins floating around, rapidly iterating. But let’s focus on taking foundation model and using reinforcement learning with human feedback (RLHF) to guide it towards deployability. This resonates, I feel, with artificial selection: humans are observing an iteration of the entity and nudging it towards a few positive metrics. In an example by Anthropic’s research (see figure 2 in CAI paper), we find a tradeoff between improving harmlessness and helpfulness for a given model and feedback method. New models and methods can push out that surface, but we must also take care that we have enumerated the quantities we care about. Overt bias might be easy to see, but what about subtle forms like dog whistling? Are answers remaining consistent across lightly modified prompts?

One example of unintended consequences for RLHF is increased sycophancy: models just say what the prompter seemingly wants to hear, regardless of factuality. This was pretty interesting to me as it’s the kind of thing we wouldn’t immediately care about, and perhaps requires a broader pattern to notice. It makes me wonder what other, deeper forms of behavior may be hidden away. With longer contexts, can we elicit frustration or boredom, or even forgetting? One of the issues here is that we’re potentially already asking these systems to go beyond human ability in areas we haven’t psychologically explored in humans.

Enter Interpretability Research

Scaling oversight of AI will be tough. We’ve already enlisted the help of AI models to review each other, with the complications that entails. Methods like RLHF (with aid from AI) act on a large surface: interpreting model output. In some ways, just like we can’t predict all the ways a new gene affects an organism, it will be hard to generate all of the needed inputs and analyze their outputs to track model alignment. If we crack open the black box, we can begin to shrink that surface.

This is one of the ways I see interpretability benefitting AI development. If we can begin to categorize internal functionality of LLMs — find their amygdalas — then we can imagine replacing large-scale prompting tests with fine-grained internal diagnostics. Imagine it like, for a human, skipping the hours of behavioral observation and therapy for a brain MRI.

--

--