The Before Times
We built computers for certainty.
For seventy years, a function meant: give me input, I'll give you the right output, every time. When it didn't, you had a bug, and it was only a matter of time and talent to find and fix it.
The story was simple: the machine is deterministic, the human is messy.
We still had complexity, but it lived at the very high end of computing, where things started to creak a bit: the Googles and Facebooks with more users than most countries have inhabitants (which is why I smile, or smirk, a bit when someone asks for 'hyperscale skills' in a country of fewer than 200 million). Anyway, as hardware got better, the boundary of what's doable with everyday tools kept expanding.
A few non-deterministic algorithms exist too, but those are few and far between and not really in practical daily use.
From Spam to Maybe
In the 2010s, things started changing. The first mainstream probabilistic (“machine learning”) use cases were spam classification and ad optimization. They were a real alternative to keyword filtering and regex blocklists. They weren’t perfect, but better than the alternative, so we accepted them.
The tech was still not good enough to tell certain kinds of dog apart from chocolate chip cookies. Machine translation was okay for some languages and funny for others.
In the last 8-10 years, we realized that training machine learning models on more data makes them develop general-purpose skills from the learning material. Researchers fed in a lot of text and the models started understanding language. Fed enough source code, they started writing better and better software. This was different from old-school programming, where we had to define exactly what the machine had to do - given enough data, the models have abilities no one explicitly programmed, specified, or asked for.
Maybe the machines are just finding patterns, but that’s ostensibly what humans do as well.
It's like building a function that takes any input and - instead of raising an exception - always returns something. Before, input validation was an important part of making sure users didn't feed the code something it doesn't understand. Now anything is a valid input, at least in the sense that it will get some sort of response.
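The contrast can be sketched in a few lines. The model call here is a hypothetical stub, not a real API; the point is the shape of the two contracts:

```python
def parse_age(value: str) -> int:
    """Old world: validate the input, or fail loudly and deterministically."""
    age = int(value)  # raises ValueError on "banana"
    if not 0 <= age <= 150:
        raise ValueError(f"implausible age: {age}")
    return age


def ask_model(prompt: str) -> str:
    """New world: every input is 'valid' and yields some response.
    (Stubbed here; a real model call would go over the network.)"""
    return f"Here is a plausible-sounding answer about: {prompt!r}"


print(parse_age("42"))      # deterministic: always 42
print(ask_model("banana"))  # always *something*, correct or not
```

The first function's contract is narrow and checkable; the second one's contract is "you will get a string back," which is exactly why testing it is so much harder.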
We can ask Claude to write code, but also poetry, and it does an okay-ish job at both. We can ask a terminal coding agent to write code, and it does. But the same agent can also review a contract or compose music; things the product's designers never intended.
This makes building new products on these models quite tricky: not only do you not know what they can do (anything?), you have no idea what your customers will use them for. That makes it next to impossible to test (at least for all use cases) or to ensure the product will do what you promise, consistently.
(it's also a goldmine for marketers, who can now really promise anything - and they may not be very wrong)
How do we solve hallucinations? Do we need to?
The little secret of LLMs is that we created hallucinations ourselves by adding randomness to the output during inference (those are the temperature, top-p, and top-k parameters). The models are static after training, so without that randomness they would return the same response to the same prompt every time. But removing the randomness would collapse the model into bland boringness and would not make the answers any more correct; that depends on the training data and all the things we don't actually understand yet about how the models work.
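A minimal sketch of that sampling-time randomness, using toy numbers rather than a real model: temperature rescales the scores, top-k keeps only the k most likely tokens, then we draw from the renormalized distribution (top-p works similarly, keeping tokens until their cumulative probability exceeds p):

```python
import math
import random


def sample_token(logits, temperature=0.8, top_k=3, rng=random):
    """Pick the index of the next token from raw scores ("logits").
    temperature=0 means greedy decoding: the same logits always
    yield the same token. Any positive temperature adds randomness."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # keep only the top_k highest-scoring token indices
    top = sorted(range(len(logits)), key=lambda i: logits[i],
                 reverse=True)[:top_k]
    # softmax over the survivors, rescaled by temperature
    weights = [math.exp(logits[i] / temperature) for i in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(top, weights=probs)[0]


logits = [2.0, 1.5, 0.3, -1.0]            # toy "next token" scores
print(sample_token(logits, temperature=0))  # greedy: always index 0
print(sample_token(logits))                 # stochastic: usually 0, sometimes 1 or 2
```

Turn the temperature down and the model repeats itself; turn it up and you get variety, hallucinations included. The knob controls variability, not correctness.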
Also, it's difficult to check correctness. A response that sounds reasonable and logical is not necessarily correct. And some questions, usually the more interesting ones, don't actually have a correct answer. Things like "which city should I move to" are difficult even if you know everything about the person asking and about the whole world. "What does the best website for our company look like?" is not something you can figure out programmatically either.
The most interesting and valuable questions are not that well-defined either. If you've ever been an engineer discussing something with sales or marketing, you know that. I used to call it "fuzzy thinking," but later I realized that's just the real world (plus some fuzziness).
Is randomness useful?
With this randomness (stochasticity, if you like), our new tools can produce results we - or the tools - haven't thought of, and that's valuable. They can also fail in ways we did not anticipate. Sometimes we won't know which one just happened. It's like the million monkeys with typewriters: one of them has typed a Shakespeare play, but we may or may not find it. (The hope is that this is exactly how it happened with humans too; we just didn't all have typewriters.)
We just need to find a way to get the good results and discard the bad (and learn to differentiate between the two).
Oh, and you pay for both.
Building products with genies inside
If you have worked with computers before, you'll find this new and discomforting. If you have ever had employees, it's nothing new.
If you are building products on these language models, one of the reasons your customers will get frustrated is the price. This is the first time the marginal cost of a digital product is much larger than zero. It is also the first time you cannot tell exactly what your user did well or badly.
Photoshop has a million buttons, sliders, and parameters, but with some practice it's possible to figure out what happens when you adjust them. Sure, there is a learning curve, but once you had paid for the software, each button press was free and the response was consistent.
The AI equivalent so far is a genie that gives you results dramatically faster. The results will also be somewhat different if you ask slightly differently - sometimes even when you ask the same question twice. There's no 'make it a bit lighter, but not there' button. And every press of the button will cost you money, regardless of the result.
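A back-of-the-envelope sketch of that per-press cost. The prices here are hypothetical placeholders; real per-token prices vary widely by provider and model:

```python
# HYPOTHETICAL prices, for illustration only
PRICE_IN_PER_M = 3.00    # $ per 1M input tokens (assumed)
PRICE_OUT_PER_M = 15.00  # $ per 1M output tokens (assumed)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Marginal cost of one request, in dollars."""
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000


# One retry of "make it a bit lighter" costs the same again,
# whether or not the result is any better.
cost = request_cost(input_tokens=2_000, output_tokens=800)
print(f"${cost:.4f} per press")  # -> $0.0180 per press
```

Fractions of a cent sound negligible until you multiply by every retry, every user, every day; that's the non-zero marginal cost showing up on someone's invoice.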
There is no manual, though it may come with a 'book of spells' once corporate lawyers can let go of the words 'best practice.'
As a product builder, you could try to constrain responses to create the illusion of control. That was the promise of an early "genius" LLM framework: the model could do whatever it wanted, but the result was one of four hardcoded responses.
There are use cases where the options truly are limited: you get no refund, a partial refund, or all your money back. That's it. You cannot get a pink elephant, and you don't get to be the king of France. I've found there are already good ways to solve these without magic tools.
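For that kind of closed-set decision, one common pattern is to validate the free-form reply against the allowed outcomes and escalate when it doesn't fit. A minimal sketch, where `call_model` is a hypothetical stub standing in for a real (nondeterministic) model call:

```python
REFUND_OUTCOMES = {"no_refund", "partial_refund", "full_refund"}


def call_model(prompt: str) -> str:
    """Stand-in for a real model call; a real one is nondeterministic."""
    return "full_refund"


def decide_refund(case: str, retries: int = 2) -> str:
    """Ask the model to pick one allowed outcome; escalate if it won't."""
    prompt = (f"Classify this refund case as exactly one of "
              f"{sorted(REFUND_OUTCOMES)}: {case}")
    for _ in range(retries + 1):
        answer = call_model(prompt).strip().lower()
        if answer in REFUND_OUTCOMES:  # no pink elephants allowed
            return answer
    return "needs_human_review"        # escalate instead of guessing


print(decide_refund("package arrived damaged"))
```

The guardrail lives outside the model: whatever the genie says, only the three legal outcomes (or a human escalation) ever reach the customer.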
With this new magic, the more you try to control the model, the more you'll nerf it, ultimately damaging the product itself.
To live with uncertainty
This uncertainty thing is only new to computer types who are seeing it at work for the first time. For most people, this is just how other people work: they say they will do (or did) something, and they won't (or didn't). They have good days and bad days, good work products and bad, while remaining the same person.
We have tried to mitigate (I don't want to say 'manage', everyone hates it when done to them) this uncertainty with humans for thousands of years already. We have processes inside companies that act as guardrails. They even work some of the time. We have other humans check things to make sure we don't make too many mistakes. Heck, programmers even pair program so they can bounce ideas off each other and produce something better than either could have done alone.
I see people discovering that building with AI feels less like programming and more like management. Products become less static and may do things you did not intend or even think of. You can set direction, review the work, and create guardrails, but you can't control every detail. That's uncomfortable, but it's also kind of freeing. You stop asking "is this right every time?" and start asking "is this good enough, often enough, to be useful?"
The other thing is workflow. You rarely want AI to “do it all.” The better use is sliding it into a process where its fuzziness is acceptable — summarizing, brainstorming, suggesting — while humans decide when it’s good enough to ship.
Uncertainty has a few interesting consequences for engineers, though. It's no longer possible to test all use cases. Not knowing how customers will (or do) use the product (or whatever we substitute for the product), you can't check for sure that everything still works after an update. A good example is the backlash OpenAI got when they rolled out the GPT-5 update: it was a better model, but not that nice. Many people preferred it to be nice.
Most people agree you should learn from how customers use the product. With AI, you may even get a better idea of what your product is. In the Before Times, "we'll test it in prod" was definitely a bad idea, but it may just be normal from now on, at least for certain kinds of products. So collecting and making sense of all that data becomes even more important. I'm not sure the rather static tools we have are good at this yet.
So maybe that’s the real shift: we’ve let the genie into our products. They don’t always follow instructions, they don’t always give what we expect, but sometimes they hand over something we didn’t know we wanted. The trick isn’t to trap the genie back in the lamp — it’s to build products, workflows, and trust systems that make living with one feel useful instead of chaotic.