For a while, NemoRouter ran LiteLLM the way most people do: as a separate service on its own port, with our backend calling it over HTTP. It worked. It also meant every LLM request crossed an extra network boundary, we operated two processes instead of one, and there was a port on the network whose only job was to talk to the process right next to it. In 2026 we mounted LiteLLM in-process as an ASGI app inside the backend. This post explains why we did it and the honest trade-off it introduced.

The separate-service shape

BEFORE
client → Nemo Backend (:8090) ──HTTP──► LiteLLM (:4000) → providers
                                  ▲
                          a network hop between two processes
                          on the same host, every request

The backend did auth, credit reservation, and guardrails, then made an HTTP call to LiteLLM on :4000 for the actual provider dispatch. That hop bought us nothing — both processes lived on the same host — but cost us latency, a second process to deploy and monitor, and an extra listening port to secure.

The in-process shape

AFTER
client → Nemo Backend (:8090, LiteLLM mounted as ASGI sub-app) → providers
                          ▲
                  handoff is a function call, not a network request

LiteLLM is a pinned PyPI dependency, and we mount its ASGI app inside our FastAPI backend. The handoff from our pre-flight (auth + reserve credits + guardrails) to LiteLLM's dispatch is now an in-process ASGI call, not an HTTP round trip. There is no :4000 anymore.
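To make "an in-process ASGI call" concrete, here is a minimal sketch using only the standard library. It is not our production code: `llm_dispatch_app` is a stand-in for the mounted proxy, `preflight_ok` is a hypothetical placeholder for auth, credit reservation, and guardrails, and the `/llm` mount path is assumed for illustration. In a FastAPI backend the same shape is usually expressed with `app.mount(path, sub_app)`.

```python
import asyncio
import json


async def llm_dispatch_app(scope, receive, send):
    """Stand-in for the mounted proxy's ASGI app (hypothetical)."""
    body = json.dumps({"handled_by": "mounted-app", "path": scope["path"]}).encode()
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"application/json")]})
    await send({"type": "http.response.body", "body": body})


def preflight_ok(scope):
    """Hypothetical pre-flight: only checks that a bearer token is present."""
    headers = dict(scope.get("headers", []))
    return headers.get(b"authorization", b"").startswith(b"Bearer ")


class Backend:
    """Minimal ASGI app that runs pre-flight, then calls the sub-app directly."""

    def __init__(self, mount_path, sub_app):
        self.mount_path = mount_path.rstrip("/")
        self.sub_app = sub_app

    async def __call__(self, scope, receive, send):
        path = scope["path"]
        if scope["type"] != "http" or not path.startswith(self.mount_path + "/"):
            await self._reply(send, 404, b"not found")
            return
        if not preflight_ok(scope):
            await self._reply(send, 401, b"unauthorized")
            return
        # Strip the mount prefix and hand off: an awaited call, no socket involved.
        child = dict(scope, path=path[len(self.mount_path):], root_path=self.mount_path)
        await self.sub_app(child, receive, send)

    @staticmethod
    async def _reply(send, status, body):
        await send({"type": "http.response.start", "status": status, "headers": []})
        await send({"type": "http.response.body", "body": body})


def call(app, path, headers=()):
    """Drive an ASGI app once, without a server, and return (status, body)."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "POST", "path": path, "headers": list(headers)}
    asyncio.run(app(scope, receive, send))
    return sent[0]["status"], sent[1]["body"]
```

Calling `call(Backend("/llm", llm_dispatch_app), "/llm/v1/chat/completions", [(b"authorization", b"Bearer sk-virtual")])` reaches the stand-in app through a plain awaited call, while the same request without the header is rejected before the sub-app ever runs.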

What we gained

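|                  | Separate service          | In-process mount   |
|------------------|---------------------------|--------------------|
| Handoff          | HTTP round trip           | ASGI function call |
| Processes to run | 2                         | 1                  |
| Ports to harden  | `:8090` + `:4000`         | `:8090`            |
| Deploy units     | 2 (version skew possible) | 1 (always in sync) |
| Latency          | + hop per request         | none added         |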

The deploy-units point is subtle but real: two services can drift to incompatible versions; one process can't disagree with itself. Mounting the dependency means the gateway and the proxy ship as a single, always-consistent unit.

The backend is still always in the path

In-process doesn't mean the frontend talks to LiteLLM. Every request still enters through the Nemo Backend, which does auth, credit reservation, and guardrails before handing off to the mounted LiteLLM. The architecture boundary is unchanged — we removed a network hop, not a control point.

The honest trade-off: the master key

Mounting LiteLLM in-process means the backend process now holds the LITELLM_MASTER_KEY, because the mounted LiteLLM reads it at startup for admin-token validation and encryption salts. The master key therefore lives in the backend's environment, whereas before it lived only in the separate service's environment. That's a real expansion of blast radius: compromise the backend and the master key is in scope.

We accepted it deliberately, with an offset. Removing :4000 eliminated an entire service process that would otherwise need hardening, so the attack surface shrank in one dimension while the master key grew it in another. And the thing that actually matters, customer LLM traffic, still uses only virtual keys, which are rate-limited, budgeted, and revocable. The master key never touches customer inference; it's management-only. We wrote the trade-off down rather than pretending the mount was free, because security decisions you don't document are security decisions you forget you made.
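One way to enforce "management-only" in code is a guard that runs before dispatch. The sketch below is an illustration under assumptions, not our actual pre-flight: the function names and the path lists are hypothetical, and it only separates the two key classes. Validating, rate-limiting, and budgeting the virtual key itself is left out.

```python
import hmac

# Illustrative inference paths; the real list depends on which routes are exposed.
INFERENCE_PREFIXES = ("/v1/chat/completions", "/v1/completions", "/v1/embeddings")


def classify_key(presented, master_key):
    """Return 'master' if the presented key equals the master key, else 'virtual'."""
    # compare_digest runs in constant time, so a mismatch does not leak its position.
    if hmac.compare_digest(presented.encode(), master_key.encode()):
        return "master"
    return "virtual"


def allow_request(path, presented, master_key):
    """Assumed policy: master key on management paths only, virtual keys on inference only."""
    kind = classify_key(presented, master_key)
    is_inference = path.startswith(INFERENCE_PREFIXES)
    if kind == "master":
        return not is_inference
    return is_inference
```

Under this policy, `allow_request("/v1/chat/completions", master, master)` is refused even though the key is valid, which keeps the master key off the inference path by construction rather than by convention.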

Why a pinned dependency, not a fork

Mounting an upstream project in-process only works if you control its version precisely. LiteLLM is pinned to an exact version in our dependency manifest; bumping it is a deliberate, tested step, not an automatic drift. We read the upstream source for reference but don't fork it — forking would mean inheriting maintenance of a fast-moving project forever. Pinning gives us the in-process benefits with a clean upgrade path. (More on why we build on LiteLLM OSS.)
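A pin in the manifest can be backed by a startup check, so a mismatched install fails loudly instead of running. This is a minimal sketch, assuming a hypothetical `check_pin` helper; the version string in the usage note is a placeholder, not the version we actually pin.

```python
from importlib import metadata


def check_pin(package, pinned, get_version=metadata.version):
    """Raise if the installed version of `package` differs from the exact pin.

    `get_version` is injectable so the check can be exercised without the
    package being installed; by default it reads the installed metadata.
    """
    installed = get_version(package)
    if installed != pinned:
        raise RuntimeError(f"{package} is {installed}, but the pin is {pinned}")
    return installed
```

Calling something like `check_pin("litellm", "1.2.3")` during application startup would turn an accidental upgrade into an immediate, visible failure rather than a subtle behavior change.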

The takeaway

A separate service for a dependency that lives on the same host is a network hop and an operational unit you may not need. Mounting LiteLLM in-process as an ASGI app removed both: one process, one port, no added latency, and always-consistent versions. The cost is that the backend now holds the master key, which we offset by removing a whole service that would otherwise need hardening and by keeping customer traffic on virtual keys only. The lesson generalizes: question the hops that exist only because of how you happened to deploy, not because the architecture needs them.

Mounting LiteLLM In-Process: One Fewer Network Hop
