Keycloak on AWS: the difference between “it works” and “it holds up”


SSO is one of those things that looks like a checkbox until it becomes a dependency for everything. Marketing sites, customer apps, internal dashboards, partner portals. Suddenly, login is the front door and the fire escape.

That’s why a solid Keycloak deployment on AWS isn’t about spinning up a container and calling it a day. It’s about getting the boring parts right: redirects that don’t break, sessions that survive restarts, upgrades that don’t turn into incident calls at 2 a.m.

When teams don’t want to babysit IAM infrastructure (and, honestly, most don’t), it helps to bring in people who live in AWS all week. The Perfsys cloud engineering team is built around that kind of work: production setups, migration cleanup, and the unglamorous “make it reliable” stuff.

Why Keycloak keeps getting picked (even by teams that could buy something)

Keycloak isn’t popular because it’s pretty. It’s popular because it’s useful.

It speaks the languages apps already use: OIDC and SAML. It can plug into LDAP/AD, social logins, external IdPs, or sit as the primary identity layer. It’s flexible enough to support both “employee SSO” and “customer identity” without forcing everything into one awkward shape.

And there’s another reason nobody says out loud: control. Many orgs are tired of building their auth strategy around someone else’s pricing model and roadmap.

AWS is a great place to run Keycloak… if it’s treated like a platform component

Keycloak isn’t a stateless API. It has a database heartbeat. It has caches. It has session behavior that changes depending on configuration and version. It’s also the kind of service that causes immediate user pain when it degrades.

On AWS, the win is clear: managed services, repeatable infra, scaling options, solid networking primitives. The risk is also clear: it’s easy to assemble a “working” architecture that quietly carries landmines.

A good AWS setup answers a few uncomfortable questions early:

What happens when one node dies mid-login?
Where do secrets live, and who can rotate them without breaking prod?
How is traffic routed so cookies, redirects, and hostnames behave?
What’s the upgrade plan when Keycloak releases security fixes?

If those answers are fuzzy, the project isn’t done yet.

The architecture pattern that tends to survive real traffic

There are dozens of valid builds, but the setups that don’t crumble under load usually share the same shape.

Compute: containers are the default for a reason

Running Keycloak on ECS or EKS is common because deployments become repeatable, scaling becomes normal, and rollback becomes a button instead of a late-night manual fix.

ECS is often the “less drama” option. Fewer moving parts.
EKS makes sense when Kubernetes is already the standard and the team actually operates it well (not “we once installed it”).
EC2 still shows up in regulated environments or where ops teams want total control, but it shifts more responsibility onto the team.

Database: treat it like the engine, not a sidecar

Keycloak leans heavily on its database. Not just for users and configs, but for sessions and operational behavior depending on how it’s set up.

A typical production choice is Amazon RDS for PostgreSQL with Multi-AZ, automated backups, and sane monitoring. The important part isn’t the service name. It’s the discipline: capacity planning, tuning, and understanding that the DB is part of auth latency.

If the DB stalls, login stalls. Simple math.

Load balancing and “why are redirects broken?”

A lot of Keycloak pain comes from proxy and hostname settings. ALB, CloudFront, custom domains, TLS termination, and path-based routing don’t “just work” unless Keycloak is told the truth about how it’s being reached.

Symptoms are familiar:

endless redirect loops
cookies not sticking
“invalid redirect URI” surprises
users getting logged out randomly because session cookies behave differently behind the proxy

It’s fixable. It just needs careful setup, not guesswork.
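
To make “told the truth about how it’s being reached” concrete, here is a minimal Python sketch of the general idea: rebuild the public base URL from forwarded headers, and only when the proxy in front is trusted. The function name and the `trust_proxy` flag are hypothetical, and this illustrates the concept rather than Keycloak’s own implementation.

```python
from typing import Mapping


def external_base_url(headers: Mapping[str, str], local_scheme: str,
                      local_host: str, trust_proxy: bool) -> str:
    """Return the base URL a browser would see for this request."""
    # Header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    scheme, host = local_scheme, local_host
    if trust_proxy:
        # Chained proxies may append values; the first entry is typically
        # the one the browser actually used.
        scheme = h.get("x-forwarded-proto", scheme).split(",")[0].strip()
        host = h.get("x-forwarded-host", h.get("host", host)).split(",")[0].strip()
    return f"{scheme}://{host}"


# Hypothetical request as seen by a container behind a TLS-terminating proxy.
hdrs = {"X-Forwarded-Proto": "https", "Host": "login.example.com"}
print(external_base_url(hdrs, "http", "10.0.1.23:8080", trust_proxy=True))
print(external_base_url(hdrs, "http", "10.0.1.23:8080", trust_proxy=False))
```

When the second form leaks into redirects or cookie settings, the browser sees an internal address or the wrong scheme, and the symptoms listed above follow.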

Logs and metrics: auth problems are loud, but the root cause isn’t

Keycloak failures create immediate noise (users can’t log in). Finding out why is another story.

The teams that stay calm during incidents usually have:

centralized logs with search that doesn’t feel like punishment
dashboards that show latency and error rates, not just CPU
alerts that trigger on real impact, not every harmless spike
visibility into DB health, because that’s often the bottleneck
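
One way to read “alerts that trigger on real impact” is a rule that fires only on a sustained login error rate with enough traffic behind it. The sketch below assumes login attempts and failures are already counted per time window; the thresholds and names are hypothetical and would need tuning per environment.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Login attempts and failures counted over one time window."""
    attempts: int
    failures: int


def should_alert(windows: list[Window], min_attempts: int = 50,
                 max_error_rate: float = 0.05, sustained: int = 3) -> bool:
    """Alert only when the error rate stays high for several consecutive windows."""
    if len(windows) < sustained:
        return False
    for w in windows[-sustained:]:
        # Ignore low-traffic windows: 1 failure out of 2 logins is not an incident.
        if w.attempts < min_attempts:
            return False
        if w.failures / w.attempts <= max_error_rate:
            return False
    return True
```

A single bad window between two healthy ones stays quiet; three bad windows in a row page someone.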

Multi-app SSO: where “simple” turns into actual design work

SSO isn’t one app. It’s many apps agreeing on a contract.

That contract includes token lifetimes, claims, scopes, redirect URIs, client types, and role mappings. Get it wrong and it’s not just “oops.” It’s broken logins across half the org.

A few decisions matter more than teams expect.

Realm strategy: one realm isn’t always better, and many realms aren’t always smarter

One realm can be clean for a single organization and unified user base.
Multiple realms can isolate tenants, brands, or security boundaries.

But more realms also mean more configs, more drift, and more chances for “it works in staging, why not in prod?” The best setup is usually the one that stays understandable six months later.

Client configuration: the place where security and UX collide

Web apps, SPAs, mobile apps, and backend services each behave differently. Treating them as the same type of client is how insecure shortcuts happen (or how the UX becomes unbearable).

A healthy setup pays attention to:

public vs confidential clients
refresh token strategy and rotation
redirect URI discipline (tight allowlists, no wildcards unless there’s a strong reason)
token size (stuffing tokens with every group and role is a classic performance footgun)
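
Two of the items above are easy to check mechanically. The following sketch shows exact-match redirect validation and a rough token-size measurement. The allowlist entries are hypothetical, and the size helper only decodes the JWT payload without verifying the signature, so it is a diagnostic aid and not a validation step.

```python
import base64

ALLOWED_REDIRECTS = {  # hypothetical allowlist for one client
    "https://app.example.com/callback",
    "https://app.example.com/silent-renew",
}


def is_allowed_redirect(uri: str) -> bool:
    """Exact string match only: no prefixes, no wildcards."""
    return uri in ALLOWED_REDIRECTS


def jwt_payload_size(token: str) -> int:
    """Return the decoded payload size in bytes of a JWT (no signature check)."""
    payload_b64 = token.split(".")[1]
    # base64url payloads drop padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return len(base64.urlsafe_b64decode(padded))
```

Tracking the payload size per client over time is a cheap way to notice when group and role claims start to bloat the token.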

Authorization: Keycloak can help, but it can’t do everything

Keycloak can store roles and groups, yes. But apps still need to enforce authorization correctly. Otherwise it becomes theater: beautiful tokens, weak enforcement.

Good patterns often include:

realm roles for broad access (“employee”, “admin”)
client roles for app-specific permissions
group-based assignment to keep management sane
mappers that keep claims clean and consistent
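
On the application side, enforcement can be as small as the sketch below. It assumes the token has already been signature-verified and decoded into a dict, and that roles sit under `realm_access.roles` and `resource_access.<client>.roles`, which is Keycloak’s default claim layout unless mappers change it. The `billing-app` client and the `refund` role are made up for illustration.

```python
from typing import Any, Mapping


def realm_roles(claims: Mapping[str, Any]) -> set[str]:
    """Roles granted at realm level (broad access)."""
    return set(claims.get("realm_access", {}).get("roles", []))


def client_roles(claims: Mapping[str, Any], client_id: str) -> set[str]:
    """Roles granted for one specific client (app-level permissions)."""
    return set(claims.get("resource_access", {}).get(client_id, {}).get("roles", []))


def can_refund(claims: Mapping[str, Any]) -> bool:
    """Hypothetical rule: employees who also hold the billing app's 'refund' role."""
    return ("employee" in realm_roles(claims)
            and "refund" in client_roles(claims, "billing-app"))
```

The point is that the check lives in the app and fails closed: a missing claim yields an empty set, not an exception and not an accidental allow.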

Branding and MFA: not a vanity project, not a science experiment

Keycloak branding can be done well, but there’s a line between “branded” and “customized into a maintenance nightmare.” Themes should be versioned, tested, and deployed via CI/CD like any other artifact. No manual edits on a live container. That’s how teams end up afraid to upgrade.

MFA is similar. It’s easy to say “turn on MFA.” The real question is: for whom, when, and under what conditions?

A practical MFA design often uses:

step-up auth for sensitive actions (billing changes, admin access)
conditional rules (location, device posture, risk signals)
a clear recovery process that doesn’t drown support
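
A step-up rule for sensitive actions might look like the sketch below. It relies on the standard OIDC `acr` and `auth_time` claims; the action names, the 15-minute freshness window, and the convention that `acr` equal to `"2"` means an MFA-backed login are all assumptions for illustration, since `acr` values are deployment-specific.

```python
import time
from typing import Any, Mapping, Optional

SENSITIVE_ACTIONS = {"billing.change", "admin.access"}  # hypothetical action names
MAX_MFA_AGE_SECONDS = 15 * 60                           # hypothetical freshness window


def needs_step_up(action: str, claims: Mapping[str, Any],
                  now: Optional[float] = None) -> bool:
    """Return True when the user should be sent back for stronger authentication."""
    if action not in SENSITIVE_ACTIONS:
        return False
    now = time.time() if now is None else now
    strong = claims.get("acr") == "2"  # assumption: "2" marks an MFA-backed login
    fresh = now - claims.get("auth_time", 0) <= MAX_MFA_AGE_SECONDS
    return not (strong and fresh)
```

Everyday actions never trigger the prompt, which keeps MFA from becoming a tax on every click.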

And yes, SMS still exists. But it’s rarely the best option when security is taken seriously.

The security checklist nobody wants, but everyone needs

Security reviews don’t care that the login page looks nice. They care about blast radius and auditability.

A hardened Keycloak-on-AWS setup usually includes:

TLS end-to-end (and no “temporary internal HTTP” that sticks around for years)
secrets stored in AWS Secrets Manager or Parameter Store, not pasted into task definitions
least-privilege IAM roles for workloads
locked-down security groups (and a habit of reviewing them)
restricted admin console access (VPN, IP allowlists, or a well-designed admin access strategy)
tested backups and restore procedures
log retention that matches compliance needs, not “whatever the default was”
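
The “not pasted into task definitions” item can be checked in CI. ECS container definitions distinguish plain `environment` entries (name and value) from `secrets` entries (name and `valueFrom`), so a small lint can flag secret-looking names that were set as plain values. The name pattern and the sample definition below are illustrative assumptions, not a complete policy.

```python
import re
from typing import Any, Mapping

# Heuristic: variable names that usually carry credentials.
SUSPICIOUS = re.compile(r"(PASSWORD|SECRET|TOKEN|API_KEY)", re.IGNORECASE)


def plaintext_secret_names(container_def: Mapping[str, Any]) -> list[str]:
    """Return env var names that look like secrets but are set as plain values."""
    return [
        item["name"]
        for item in container_def.get("environment", [])
        if SUSPICIOUS.search(item["name"])
    ]


# Hypothetical container definition with a password pasted in as plain text.
bad = {
    "environment": [
        {"name": "KC_DB_URL", "value": "jdbc:postgresql://db.internal:5432/keycloak"},
        {"name": "KC_DB_PASSWORD", "value": "not-a-real-password"},
    ]
}
print(plaintext_secret_names(bad))
```

A pipeline could fail the build whenever the returned list is non-empty, which nudges the value into the `secrets` section instead.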

None of this is exotic. It’s just what production looks like.

Upgrades and operations: the part that separates adults from hobby projects

Keycloak gets updates. Some are feature-y. Some are security-critical. Either way, upgrades can’t be treated like an annual ritual.

A mature workflow has:

staging environments that mimic prod closely enough to be useful
IaC so environments can be reproduced without archaeology
repeatable deployment pipelines
a rollback plan that isn’t “hope”
load testing for auth flows, not just API endpoints

Because when auth goes down, everything looks down. Even if the rest of the system is fine.

The quiet truth: most Keycloak projects are actually migrations

A lot of teams don’t start fresh. They’re untangling a history of:

separate logins for separate apps
legacy SAML glued to modern OIDC clients
inconsistent MFA rules
old Keycloak versions nobody wants to touch

The safest migrations are usually incremental: onboard one app, validate tokens and claims, watch real user behavior, then move to the next. Big-bang cutovers can work, but they demand serious testing and coordination. Most orgs underestimate that part.
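
The “validate tokens and claims” step of an incremental migration can be partly automated. The sketch below compares the claims an app relied on under the old login with what the new Keycloak client issues for the same user. The function and claim names are hypothetical, and both inputs are assumed to be already-decoded claim dicts.

```python
from typing import Any, Iterable, Mapping


def claim_drift(legacy: Mapping[str, Any], new: Mapping[str, Any],
                required: Iterable[str]) -> dict[str, tuple[Any, Any]]:
    """Return {claim: (legacy_value, new_value)} for every required claim that differs."""
    drift = {}
    for name in required:
        if legacy.get(name) != new.get(name):
            drift[name] = (legacy.get(name), new.get(name))
    return drift


# Hypothetical example: the app expects "email" and "department" to survive the move.
legacy_claims = {"sub": "old-42", "email": "a@example.com", "department": "finance"}
new_claims = {"sub": "kc-9f1", "email": "a@example.com"}
print(claim_drift(legacy_claims, new_claims, ["email", "department"]))
```

Running a check like this for a sample of real users before each app is onboarded turns “watch real user behavior” into something a pipeline can report on.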