Assumed audience: People thinking about the interaction between classical software engineering “foundations” and large language models.
This past Thursday (September 5, 2024), I spoke at the excellent StaffPlus New York conference — one of the best-run conferences I have attended or spoken at — on the subject of LLMs and how they intersect with our engineering foundations. You can watch the video here:
Below are my slides and script! What follows is not a word-for-word transcript but my speaker notes and slides.
Before the content itself, though, a meta note about this talk: I am, as a lot of folks out there know, not exactly bullish on LLMs in general. I think most of the hype is nonsense — even while I think some of their capabilities are really astonishing! — but I do think they’re here to stay. So:
First of all, if we’re going to be working in a world where people are authoring code with LLMs (Copilot etc.) and using them to “do things” as agent-based systems, the world of inputs to the models matters, as do the environments in which they run. They matter a lot!
Second, I think a lot of engineering organizations pretty consistently underinvest in foundations, because of the simple realities that many kinds of foundations projects take longer and have a less direct/measurable connection to profitability. They just do, no way around it.
Third, then, what are the ways that engineering leaders — including managers, but especially Staff+ engineers — can mitigate the risks of LLMs and improve the chances that they are net neutral or even positive, rather than negative, for software quality, UX, and business outcomes?
Secretly (and yes, I am saying it out loud now!) this talk is predicated on my belief that these are the things we should be doing anyway… but this is a moment when we have a really obvious way to show how and why these things matter, and therefore to make that case to decision-makers.
And because these are the things we should be doing anyway, an organization which uses this LLM-hype-moment to improve the foundations on which it runs LLM-related systems… will be in a better spot two years from now even if the LLM ecosystem craters. Seems like a win.
Good afternoon! It’s almost lunch-time, so let’s get into it!
Show of hands: How many of you think that all the C and C++ on GitHub is free of memory vulnerabilities and correctly implements thread safety?
Again, show of hands: who do you trust to write memory- and thread-safe code in C? Yourself? A new junior on your team who has never written C before? GitHub Copilot?
Same question, but now we’re talking about (safe) Rust. Now I can trust all three of these, because the compiler will check it.
This is not a talk about C and Rust. It is a talk about engineering systems, and how we think about them in a world where LLMs are pervasive, for good and for ill. Because my thesis is that our engineering systems are simply not ready for a world of pervasive LLMs.
If you are a big fan of LLMs: this talk is about making them better.
If you (like me) are a bit more skeptical of LLMs: this talk is about good safeguards — about giving ourselves a shot (not a guarantee!) at not making things much worse with LLMs. Because we really cannot afford that. Software quality is bad enough as it is!
One of the most common phrases we have all heard batted around about AI over the past few years is “prompt engineering” — the idea of learning how to get the best results out of a large language model by way of the specific ways you prompt it —
Specific choices in wording.
The amount of context to include.
The scope of the task you are giving it.
Creativity levels, or “temperature”.
What not to bother with because it tends to go sideways there.
Meta prompts, like my favorite: including “no blabbing”. (Try it, if you haven’t! It’s way less chatty and annoying this way!)
No matter how good your prompting skills are, though, there’s a problem: LLMs have been trained on real-world code. There is no other way to train them. Now, some of you might be thinking, “Wellll, we can train them on their own outputs, or with adversarial cycles with other LLMs” — and yes: there are many techniques which can help somewhat, but. At the end of the day, all the code LLMs have been trained on is… in that bucket. No amount of prompt engineering changes what C is, or what the real-world C codebases are like. Telling Copilot “No vulnerabilities please” will. not. make it. true! (It might help! But it won’t solve the problem.)
So: prompt engineering is all well and good, but it will always be constrained by the foundational limits of these systems. It is not enough.
More than that: It will never be enough.
That’s a strong claim. Why do I think prompt engineering will never be enough?
To answer that, I am going to take a detour through a definition:
A substrate is a layer that sits below the thing we are (notionally) interested in.
In biology, it is the material an organism grows on, maybe what it eats, where it lives. The organism is the active agent, but everything about its existence is shaped by its substrate.
Likewise, in chip manufacturing, the substrate is the silicon wafer that all the actual circuitry gets printed on. The circuitry is active and the silicon is passive — it “does nothing” — but good luck making a useful circuit without an appropriate substrate.
With large language models, the substrates are their training data and our engineering systems — their inputs and their outputs — the context in which they actually operate. That means that those two factors absolutely dominate any amount of prompt engineering when it comes to the results we will get from LLMs.
There is another problem, too, at the intersection of automation and attention.
Decades of research on automation, including aircraft autopilot and automated driving, have taught us an incredibly important lesson: human beings cannot pay attention only when we need to. When autopilot works really well, people disengage. And in fact, many of the worst accidents are situations where people are perfectly competent, but lose focus. But there are also long-term impacts: being able to rely on autopilot degrades pilots’ competence unless they go out of their way to stay in practice.
In fact: as a rule, the better automation works, the less we attend to whatever is being automated. I am going to repeat that: the better automation works, the less we attend to whatever is being automated. When the self-driving works really well, that is when terrible car accidents happen. You take your eyes off the road because the car is handling things just fine. And then you need to pay attention… and it’s too late.
There is absolutely no reason to think this does not apply to writing software. And yes, the consequences of accepting a result from ChatGPT or Copilot will tend to be less immediate. But that does not mean they will be less serious. In fact, the delay makes things harder, especially when we combine it with another reality we know empirically about software development.
Namely: that reviewing code is hard. Writing code is much, much easier than reading code later and understanding it, still less spotting bugs in it. Spotting a bug is a challenge even when you know there is a bug: that’s what debugging is, and it’s hard. And when we prompt Copilot or Cody and use the code it provides, code review is exactly what we are doing.
And when you put those factors together, it has a result that might be counterintuitive: The better LLMs get — the more they boost velocity by generating working code — the harder it will be to notice when they get things wrong.
Our software foundations need to get much better, across the board, or else widespread use of LLMs is going to make our software quality much worse. Our only chance is to double down on defense in depth: in making all of our software foundations better.
It also means making firm judgment calls about where LLMs should and should not be allowed. To wit: an LLM has no business anywhere near configuration for medical hardware, because of the challenges with attention and code review. And it means drawing lines elsewhere, too: internal engineering tools are one thing — maybe a way you can get an internal refactor done that you would never get funded otherwise. Decision-making that affects users is something else. Offering health advice? Offering legal advice? Making legal decisions? These are off the table!
Finally, we need to talk about the elephant in the room: the big problem with LLMs: hallucination.
And that’s because “hallucination” is not a “solvable problem.”
In fact, “hallucination” is the wrong word. It suggests that LLMs sometimes don’t hallucinate. And that is exactly backwards.
Hallucination is what LLMs are. It’s the whole thing. When ChatGPT gives you the right answer, and when it gives you the wrong answer… it’s doing the same thing.
And that’s fine, actually: as long as we understand it!
Because if we understand that reality, then we actually can (and must!) build software systems appropriately — and we can build our socio-technical systems appropriately, too.
So: let’s do some substrate engineering.
Whether that’s authoring code with them or deploying them as “agent systems” where they go do tasks for us, we need to design our engineering systems with these constraints in mind: the overwhelming influence of substrates, the effect of automation on attention, and how LLMs actually work.
The territory for substrate engineering is, well, all of our engineering foundations. And yeah, it would be interesting to think about these things! How does “correct by construction” API design help? What does it mean if we are generating our tests instead of using them for design? Should our package managers be able to run arbitrary code on installation? (No!) Can we leverage “capabilities” systems in our operating systems? (Yes!)
But that would be a very long talk, so I am going to focus on the design of our tooling and configuration, and trust you to apply the same kinds of thinking to the rest of them!
I am focusing on tooling and configuration for two reasons.
First, because they have a disproportionate impact on both our users and our developers, and therefore on business outcomes — because tooling and configuration sit at critical points in our software systems. They are key parts of developer workflows, and they also sit at points that determine whether our systems stay up or go down. And that’s true regardless of LLMs!
Second, these kinds of tools and configuration languages are both extremely amenable to use with LLM-based systems and also extremely vulnerable to the failure modes of LLMs.
They are amenable to them because the configuration systems are often extremely regular and predictable. Claude or Copilot or any other tool like that will probably generate them correctly, because there are so many valid GitHub Actions configs out there for them to have been trained on.
But tooling and configuration are vulnerable to the kinds of problems LLMs have because of how easy it is to get configuration wrong. “Hallucination” means we can never be confident in the outputs from those tools on their own.
So let’s think about how to make them better.
Here is a GitHub Actions config, written in YAML. It has a single job named “Test”; there are two steps in that job; and it specifies that it should run on the “push” and “release” events.
This config is wrong. But from just looking at it, I cannot see how. (I mean, I can now, because I wrote this slide, but two weeks ago? Nope.)
But if our editor can underline the error and give us a message describing what’s wrong — ‘“jobs” section is sequence node but mapping node is expected’ — we have a shot at actually fixing it.
I mean: it’s not a great error message, but… [thinking face, muttering to myself about sequence node and mapping node], oh, okay, the fix is that it needs to be a key-value pair, not a list. Cool. I can fix it.
Dare I say it: we know how to do better with programming languages than JSON Schema-based validation for YAML configuration files!
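To make that concrete: here is a rough sketch, in TypeScript, of what a typed model of that same config could look like. The type and value names are mine, and drastically simplified (this is not the real GitHub Actions schema), but they show the shape of the idea.

```typescript
// A deliberately simplified, hypothetical model of a workflow file:
// just enough structure to show what a type checker buys us.
type Step = { run?: string; uses?: string };
type Job = { "runs-on": string; steps: Step[] };

type Workflow = {
  on: ("push" | "release")[];
  jobs: Record<string, Job>; // a mapping from job name to job, not a list
};

// The mistake from the slide: `jobs` written as a list (a "sequence node")
// instead of a mapping. A type checker rejects it immediately.
const broken: Workflow = {
  on: ["push", "release"],
  // @ts-expect-error -- an array is not assignable to Record<string, Job>
  jobs: [{ "runs-on": "ubuntu-latest", steps: [{ run: "npm test" }] }],
};

// The fix the error message was hinting at: a key-value pair.
const fixed: Workflow = {
  on: ["push", "release"],
  jobs: {
    test: { "runs-on": "ubuntu-latest", steps: [{ run: "npm test" }] },
  },
};
```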
One move, which people are already doing for other reasons, is to use full-on programming languages here! Pulumi lets you use most any programming language for “infrastructure as code” and, for that matter, you have been able to write your build tooling end to end in F♯’s FAKE DSL for years!
But the upside and the downside of full programming languages is that they can do anything… which means we have all the peril and all the problems that come with that. At the disposal of LLMs.
For example: they can go into infinite loops during install. They can throw “undefined is not a function” during deploy, or a NullPointerException during CI runs.
What we want, then, are properties that are relatively rare in most mainstream programming languages. The first is soundness. This is the property that means that if your program type checks, you don’t end up with “undefined is not a function” or null pointer exceptions. Logic bugs, yes; type errors, no.
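Here is a tiny illustration of the flavor, sketched in TypeScript with strict checks turned on. (TypeScript is famously not fully sound, so treat this as an approximation of the property rather than the real thing, and the names are made up.)

```typescript
type User = { name: string; email?: string };

function emailDomain(user: User): string {
  // This next line does not type check, because `email` may be undefined:
  // return user.email.split("@")[1];

  // Instead, the compiler forces us to handle the missing case up front,
  // which is exactly the class of runtime crash soundness rules out.
  if (user.email === undefined) {
    return "unknown";
  }
  return user.email.split("@")[1] ?? "unknown";
}
```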
The second property we want is termination: the ability to guarantee that your program finishes running. If you’re thinking “But the Halting Problem!”: well, that’s only a problem we have to care about if we have a “full” programming language. We actually can prove that many programs do halt… if we restrict what the programming language can do.
To get termination, we need the functions in our language to have one property, totality, and not to have another, general recursion. Totality means that every input has an output — so if you’re doing division, you need to handle the divide-by-zero case; that way you can be sure the rest of your program’s logic is not invalidated by a panic. Totality is much easier if we also have purity, which means you get the same outputs for the same inputs, and no other state in the system gets mutated (no “side effects”). And then we need to get rid of general recursion — “while true” or anything equivalent, and any recursion we cannot prove is total itself. Then we can be sure that our program will stop, with meaningful results.
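Here is what that division example could look like, again sketched in TypeScript. The names and numbers are made up for illustration; the point is that the divide-by-zero case lives in the types, so nobody, human or LLM, can silently skip it.

```typescript
// A result type that makes the divide-by-zero case impossible to ignore.
type DivResult =
  | { ok: true; value: number }
  | { ok: false; reason: "divide-by-zero" };

// Total: every input produces a result. No exceptions, no panics.
// Pure: the output depends only on the inputs, and nothing else is touched.
function divide(numerator: number, denominator: number): DivResult {
  if (denominator === 0) {
    return { ok: false, reason: "divide-by-zero" };
  }
  return { ok: true, value: numerator / denominator };
}

// Made-up numbers, purely for illustration.
const totalMemoryMb = 8192;
const instanceCount = 4;

// Callers cannot forget the failure case: the type checker makes them
// look at `ok` before they can touch `value`.
const perInstance = divide(totalMemoryMb, instanceCount);
const memoryLimitMb = perInstance.ok ? perInstance.value : 512;
```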
The third property we want is related to the first one: we want a rich type system. More like F♯ or Swift or TypeScript than like Java or Go historically have been. We want to be able to express things like “Individual contributor or manager” directly and clearly (without subtyping shenanigans). We also want type inference to just work, so that it is as easy to write a language like this as it is a language like Python or Starlark — but with all the help and affordances that a good compiler can provide. We want error messages like the ones from Elm and Rust.
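For instance, here is the “individual contributor or manager” example as a plain sum type, TypeScript-style. This is a sketch, not a proposal for an actual schema.

```typescript
// "Individual contributor or manager", stated directly as a sum type,
// with no subclassing and no nullable field quietly doing double duty.
type Engineer =
  | { role: "individual-contributor"; level: number }
  | { role: "manager"; level: number; reports: string[] };

function reportCount(engineer: Engineer): number {
  switch (engineer.role) {
    case "individual-contributor":
      return 0;
    case "manager":
      // The compiler knows `reports` exists only on this branch.
      return engineer.reports.length;
  }
}
```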
That last point about error messages is suggestive: we don’t want a language which has too rich of a type system. That cuts against other goals: ease of use, predictability, the ability to guarantee that inference always works, and so on.
So if we had a language like that, what would it give us? Well, right up front, better and faster feedback loops for people writing configuration. If you take DevOps seriously, that means every engineer. And this is true with or without LLMs in the mix! Second, though, an LLM trained on this kind of language would be much more likely to generate consistently correct output, because that output could not be incorrect in certain key ways.
Now: it is no magic bullet. It does not solve the “we put in a number, but the number made no sense here” problem. But it could give us tools which do help with that, and make them easier to use, and therefore more viable. Having the right tools in the toolbox matters!
There are, to my knowledge, only two languages taking any serious swings in this space. So let’s talk about them and grade them on these goals! First up, Starlark, Bazel and Buck’s build language. It doesn’t do types — Facebook’s work-in-progress maybe-someday project notwithstanding — so soundness is out. On termination, it does better than a normal programming language by forbidding recursion, but since it doesn’t have types, it doesn’t have a way to statically say “please handle division by zero” or similar. So: it’s better than just using Python, but with some significant caveats.
The second interesting language here is Dhall — it compiles to JSON, YAML, etc., so you can use it as a more robust way to author existing configuration targets like GitHub Actions. It does a lot better than Starlark on these criteria: it has a sound type system and requires totality, so it can do pretty well on termination — though nothing can save you from looping from one to a billion — and it has a nicely-balanced type system with good inference but not too many features. It’s pretty interesting and good, and I encourage you to try it out if you haven’t.
Of course, there’s another huge difference between these, which is adoption. I would guess at least 10× as many of you have heard of Starlark as have heard of Dhall before just now. And I think “a Python-like language that came from Google” is probably an easier sell than “A Haskell or Elm-like language from some folks in open source” for a lot of organizations. But I would like to get the things Dhall has on offer! So I think folks should think about trying Dhall and investing in its success.
But if we take this space seriously — and we should! — I think there is room to build a language that has a lot of what Starlark has going for it in terms of familiarity and accessibility, but which also pulls in the good ideas from Dhall’s type system. And then we could see: does it work well? Where does it fall down? And does it in fact help with providing useful guardrails for using LLMs? I don’t know: it’s a hypothesis. But we should try it!
I think LLM-based tools are here to stay. But we are the ones responsible for how they get deployed. It is on software engineers, especially the kinds of engineering leaders in this room, to make sure that we are deploying them safely and wisely — if we deploy them at all.
That means putting all of our software engineering on better foundations: better tooling, better frameworks, better languages.
Because, to say it one more time: these are the substrates for all of our software development. That means that thinking well about how to make better substrates for where and how we deploy LLMs will have a compounding effect. It will make our software better for the people building the software. Even more importantly, it will make our software better for the people using the software.