LLMs can write code, but they can't design systems
Over the past two years, Large Language Models (LLMs) have evolved from research experiments into everyday tools for software engineers. They are great at generating code, drafting documentation, and accelerating prototypes. And yet, despite all the hype, LLMs hit real boundaries, especially in domains that require deep systems-level thinking. The gap becomes obvious the moment you ask them to design or reason about real systems.
I still run into the exact same patterns of failure whenever the problem extends beyond CRUD apps or routine boilerplate.
Case Study 1 - A small Kafka ingestion system
I recently started building a small internal service that accepts HTTP requests and produces messages into Kafka. I asked an LLM to help sketch out an architecture. Here's what it did:
- Hallucinated GitHub libraries that don't exist.
- Described Kafka features that Kafka doesn't have.
- Ignored core Kafka concerns like error handling, ordering guarantees, and batch sizing.
And when corrected, it contradicted its previous answers.
It behaved exactly like a junior developer who can write syntactically correct code but lacks the experience required to build ingestion pipelines.
Deep Kafka work requires an understanding of broker internals, consumer group protocols, partitioning, and ordering guarantees. This is not trivial knowledge.
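For contrast, here is roughly the level of detail an architecture sketch has to reach. This is a minimal producer sketch, assuming kafkajs as the client library; the client ID, broker address, topic name, and function names are placeholders of mine, not part of the actual service.

```ts
import { Kafka } from "kafkajs";

// Placeholder client ID, broker list, and topic; swap in real values.
const kafka = new Kafka({ clientId: "ingest-service", brokers: ["localhost:9092"] });

// Idempotent producer: retries after transient broker errors won't create duplicates.
const producer = kafka.producer({ idempotent: true });

export async function start(): Promise<void> {
  await producer.connect(); // connect once at service startup, not per request
}

export async function publishEvent(userId: string, payload: unknown): Promise<void> {
  try {
    await producer.send({
      topic: "page-events",
      // Messages sharing a key are routed to the same partition, which is what
      // actually gives you per-user ordering; ordering is never global.
      messages: [{ key: userId, value: JSON.stringify(payload) }],
      acks: -1, // wait for all in-sync replicas before acknowledging
    });
  } catch (err) {
    // Propagate the failure so the HTTP handler can return an error
    // instead of silently dropping the event.
    throw new Error(`Failed to publish event for ${userId}: ${(err as Error).message}`);
  }
}
```

Even this tiny sketch forces decisions the LLM glossed over: what the partition key is, what acks level you can afford, and what happens when a send fails.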
Case Study 2 - A privacy-auditing browser extension
In parallel, I was working on a side project: a browser extension that audits the privacy footprint of any webpage.
The idea was simple: if I visit a page, tell me what it is pulling from my browser. That meant scanning beyond the obvious data sources, such as cookies and local storage, and also covering signals like window/screen data, WebGL capabilities, user-agent fingerprinting, geolocation attempts, installed fonts, and device memory.
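To make the scope concrete, here is a rough sketch of the kind of signal inventory the extension gathers from the page context. The interface and function names are mine and the list is deliberately partial; this is an illustration, not the project's actual code.

```ts
// Rough inventory of browser-exposed signals, collected in the page/content context.
// Names like FingerprintSurface are illustrative, not from the real project.
interface FingerprintSurface {
  cookieCount: number;
  localStorageKeys: number;
  userAgent: string;
  screen: { width: number; height: number; pixelRatio: number };
  deviceMemoryGB?: number;     // non-standard, Chromium-only
  webglRenderer?: string;      // classic fingerprinting signal
  geolocationExposed: boolean; // whether the page *could* ask for location
}

function collectFingerprintSurface(): FingerprintSurface {
  // The unmasked WebGL renderer string often identifies the exact GPU.
  let webglRenderer: string | undefined;
  const gl = document.createElement("canvas").getContext("webgl");
  if (gl) {
    const dbg = gl.getExtension("WEBGL_debug_renderer_info");
    if (dbg) webglRenderer = String(gl.getParameter(dbg.UNMASKED_RENDERER_WEBGL));
  }

  return {
    cookieCount: document.cookie ? document.cookie.split(";").length : 0,
    localStorageKeys: localStorage.length,
    userAgent: navigator.userAgent,
    screen: { width: screen.width, height: screen.height, pixelRatio: window.devicePixelRatio },
    deviceMemoryGB: (navigator as any).deviceMemory, // not in the standard Navigator type
    webglRenderer,
    geolocationExposed: "geolocation" in navigator,
  };
}
```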
Once again, I asked an LLM to help map out the architecture: project structure, permission handling, script injection, and message passing. It:
- Mixed up extension APIs. It used deprecated APIs, invented APIs that don't exist, and confused content scripts with extension scripts.
- Didn't understand the runtime separation between scripts.
- Misidentified which data is accessible. It would say things like “you can read the user's IP using the browser network API,” which is incorrect.
- Didn't understand the inference chain. A big part of privacy auditing is not what's explicitly retrieved, but what can be derived (for example, screen size, installed fonts, and device memory combined into a device fingerprint).
A browser extension is a tightly sandboxed system with strict context boundaries. Content scripts run in an isolated world alongside the webpage, injected scripts run in the page's own JavaScript environment, and service workers/background scripts run separately, with no DOM access at all.
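Getting those boundaries wrong is fatal, because the pieces can only talk through message passing. Here is a minimal sketch, assuming a Manifest V3 Chrome extension where content-script.ts is registered as a content script and background.ts as the service worker (both file names are placeholders of mine).

```ts
// content-script.ts: runs in an isolated world next to the page. It can read the
// DOM and things like document.cookie, but not the page's own JS variables, and it
// only gets a limited slice of the extension APIs.
chrome.runtime.sendMessage({
  kind: "privacy-report",
  url: location.href,
  cookieCount: document.cookie ? document.cookie.split(";").length : 0,
});

// background.ts: the MV3 service worker. No DOM, no window; it only ever sees
// what content scripts send it over the messaging channel.
chrome.runtime.onMessage.addListener((message, sender) => {
  if (message.kind === "privacy-report") {
    console.log("Report from", sender.tab?.url, message);
  }
});
```

Reading page-world variables needs yet another hop: injecting a script into the page's main world and relaying the result back through the same messaging channel.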
Why did it fail?
- LLMs don't understand systems. They generate descriptions of architectures, not actual architectures. In my Kafka example, the model couldn't perceive the impact of poor partitioning or an overloaded producer queue.
- They can't verify correctness. An engineer is expected to check whether a GitHub repository exists before recommending it (a check as simple as the sketch after this list); LLMs will confidently invent one.
- They are great with syntax but shallow with semantics. They can fix a loop, join an array, or generate boilerplate for a SaaS app. But they struggle the moment a task requires deep domain intuition.
- They can't invent solutions, just imitate patterns. Engineers have to handle unfamiliar problem shapes, not just familiar ones.
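The verification gap is the most fixable one. The repository-exists check from the second bullet, for instance, is a single call against the public GitHub REST API; the function name and the example repository below are my own illustration.

```ts
// Returns true if the repository exists on GitHub, false if the API answers 404.
// Unauthenticated requests are rate-limited, so a real tool would pass a token.
async function repoExists(fullName: string): Promise<boolean> {
  const res = await fetch(`https://api.github.com/repos/${fullName}`, {
    headers: { Accept: "application/vnd.github+json" },
  });
  if (res.status === 404) return false;
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  return true;
}

// Usage: await repoExists("tulios/kafkajs") should be true; a hallucinated name, false.
```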
No execution, state, or causality
LLMs are statistical sequence models that predict the next token using attention over previous tokens. Internally, attention weights decide which parts of the input matter; these are combined with token embeddings to capture semantic relationships, and the model then predicts the next token as a probability distribution over its vocabulary.
Nowhere in this architecture, however, is there an execution engine, a compiler, or a state machine.
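A toy sketch of that loop makes the point clearer. Everything below (the predict function, the vocabulary, the greedy decoding) is illustrative; the only thing it shares with a real LLM is its shape: context in, distribution out, append, repeat.

```ts
// Toy autoregressive generation loop: pick the most likely next token from a
// probability distribution and append it. Nothing here executes code, tracks
// program state, or models cause and effect; it only extends a sequence.
type Vocab = string[];

function argmax(probs: number[]): number {
  let best = 0;
  for (let i = 1; i < probs.length; i++) if (probs[i] > probs[best]) best = i;
  return best;
}

function generate(
  predict: (context: string[]) => number[], // the model: context -> distribution over vocab
  vocab: Vocab,
  prompt: string[],
  maxTokens: number
): string[] {
  const tokens = [...prompt];
  for (let step = 0; step < maxTokens; step++) {
    const probs = predict(tokens);     // attention and embeddings live inside here
    tokens.push(vocab[argmax(probs)]); // greedily choose the next token, append, repeat
  }
  return tokens;
}
```

There is no step where the output is executed, type-checked, or compared against a running system; correctness lives entirely outside the loop.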

What's next? How LLMs will (and should) evolve
Future models should use tools such as compilers, linters, reasoning modules, and code-execution sandboxes. That would transform the model from a text generator into something closer to a genuine developer assistant.
They also need system awareness: an understanding of project structure, build steps, dependencies, and architecture diagrams. They should check, not just guess.
Finally, explicit reasoning traces are essential, so that humans can inspect and challenge the model's reasoning.
Final thoughts
LLMs aren't replacing engineers anytime soon; they are just reshaping the workflow. Today, they are fantastic at the easy parts but unreliable for the hard ones.
Judgment, tradeoffs, debugging instincts, and architectural sense come only from shipping real systems, and that remains human territory for now. Honestly, I think that's a good thing.