Brief Thoughts on PKI and Certificates at Scale

This started as a reply to @davidseidl@mstdn.social and a thread about certificate expiration in a small organization (https://mstdn.social/@davidseidl/109638543580938963). He made some good points, but there are scaling issues for enterprises, which I outline and address below.

At the scale of a couple dozen certs, calendar alerts and individual/backup responsibility are okay. Once you get into hundreds and thousands of certs, you need to plan and automate as much as possible.

At enterprise scale, you're probably using certs for a number of tasks:

* User Authentication
* Device (server, service, container) Authentication
* Data encryption in transit
* Data encryption at rest

When you're working in an enterprise (1,000+ employees), maintaining the necessary infrastructure and helping developers understand how to accomplish their goals absolutely requires its own team. You're likely dealing with multiple certificate issuers (internal and external), along with ensuring all the moving parts of certificates (issuers, the certificate management tool or CMT, CRL/OCSP, and the servers/databases underpinning them) are working smoothly.

That also means thousands of certs on hundreds or thousands of devices, services, or containers. Unless you want your entire day to be consumed with manually updating certs (and maybe you like to do the boring stuff like that), automation is key.

A good Certificate Management Tool will do several things:

* Find what certs are already out there through scanning
* Manage certificate life-cycles
  * New certificate provisioning/installation
  * Renewing existing certificates
  * Maintaining certificate history
  * Centralized revocation in the event of a breach
* Report what you have in appropriate granularity
* Alert appropriate parties in cases where automation isn't yet available

Certificate Inventory: A CMT should be able to scan targets (through an IP range, an Active Directory OU, a list of URLs, et cetera) and find the certificates that are either offered through various interfaces (like HTTPS) or stored on the device (like in the Windows CertMgr). The second option will require an account the scanner can use to authenticate to the device.
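
As a rough illustration of the "offered through an interface" half of that scan, here is a minimal Python sketch using only the standard library. The function names and the shape of the inventory row are my own assumptions, not how any particular CMT does it. Note that with the default context `getpeercert()` only returns details for certificates that validate, so a real scanner that also wants to catalog expired or untrusted certs would need to fetch the raw DER and parse it separately.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_peer_cert(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Return the certificate a TLS endpoint presents, as a dict.

    Raises ssl.SSLCertVerificationError if the cert does not validate.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def summarize(host: str, cert: dict) -> dict:
    """Reduce a getpeercert() dict to a (hypothetical) inventory row."""
    # subject/issuer are tuples of RDNs; each RDN is a tuple of (name, value) pairs.
    subject = dict(rdn[0] for rdn in cert.get("subject", ()))
    issuer = dict(rdn[0] for rdn in cert.get("issuer", ()))
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return {
        "host": host,
        "common_name": subject.get("commonName"),
        "issuer": issuer.get("organizationName") or issuer.get("commonName"),
        "expires": expires,
    }
```

Something like `summarize(host, fetch_peer_cert(host))` in a loop over a list of hostnames would give you a crude starting inventory.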

Certificate Life-Cycle Management: The bread and butter of installing, renewing, and revoking certificates. Maybe you want one cert for a service/application on a dozen servers. Maybe you don't want to have to manually deal with your public-facing .com cert every 60 days. Maybe you have a honeypot farm with a valid cert for $reasons that you want to be able to revoke with one button. That's the heavy lift a CMT provides. It can also maintain a history of previous certificates, so you have more pieces of the “when did this stop working” puzzle.
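
To make the renewal part concrete: one common rule of thumb (an assumption here, not something a given CMT mandates) is to renew once about two-thirds of a certificate's validity period has elapsed, which for a 90-day certificate lands on day 60. A minimal Python sketch, with hypothetical function names:

```python
from datetime import datetime, timedelta, timezone
from fractions import Fraction


def renewal_time(not_before, not_after, elapsed=Fraction(2, 3)):
    """Return the moment at which `elapsed` of the validity period has passed."""
    if not 0 < elapsed < 1:
        raise ValueError("elapsed must be strictly between 0 and 1")
    lifetime = not_after - not_before
    # Integer multiply/divide keeps the timedelta arithmetic exact.
    return not_before + lifetime * elapsed.numerator / elapsed.denominator


def is_due(not_before, not_after, now=None):
    """True once the certificate has reached its renewal point."""
    now = now or datetime.now(timezone.utc)
    return now >= renewal_time(not_before, not_after)
```

A scheduler could call `is_due` for each managed certificate and kick off whatever renewal workflow applies to it.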

Reporting: Execs love pretty graphs, and some accountants love internal billing. Reporting from your CMT can make this literally automatic. Need to migrate from $OldCertIssuer to $NewCertIssuer, and Management wants some numbers on who's behind the curve? Security needs to audit all your externally-trusted certificates? Reporting!
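
For the issuer-migration question, the report can be as simple as a group-and-count over the inventory. The sketch below assumes each inventory row is a dict with hypothetical `owner` and `issuer` keys; a real CMT would expose this through its own reporting interface.

```python
from collections import defaultdict


def migration_report(inventory, old_issuer):
    """Per owner, count total certs and how many are still issued by old_issuer.

    Returns (owner, remaining, total) tuples, worst offenders first.
    """
    totals = defaultdict(lambda: {"total": 0, "remaining": 0})
    for row in inventory:
        entry = totals[row["owner"]]
        entry["total"] += 1
        if row["issuer"] == old_issuer:
            entry["remaining"] += 1
    return sorted(
        ((owner, e["remaining"], e["total"]) for owner, e in totals.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

Feed the result into whatever makes the pretty graphs, and the "who's behind the curve" conversation gets a lot shorter.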

Automation should be the target for the majority of your certificates' life, but sometimes automation just isn't available. Old line-of-business applications can be picky, and maybe you don't have the maturity yet for automation success. There are also some high-security edge cases where a manual process is required. Even if your CMT can't talk to the device (say because it's in a segregated network), the certs will still expire when the clock says they do. Or perhaps you have a third-party service that can't request certificates on its own. This is where monitoring and alerting come into play. Monitoring and alerting on certificates before they expire lets you plan and communicate changes in a calm, orderly fashion instead of “oh gods the cert expired and we need to replace it five minutes ago!”.
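
A bare-bones version of that expiry alerting might look like the Python sketch below. The 30-day and 7-day thresholds and the `host`/`expires` row keys are placeholders I picked for illustration; tune them to how long a manual replacement actually takes in your environment.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: (days remaining, severity), checked in order.
THRESHOLDS = ((0, "expired"), (7, "critical"), (30, "warning"))


def severity(expires, now):
    """Return the alert level for a certificate, or None if it is not yet due."""
    remaining = expires - now
    for days, label in THRESHOLDS:
        if remaining <= timedelta(days=days):
            return label
    return None


def due_alerts(inventory, now=None):
    """Return (severity, host, expires) tuples, soonest expiry first."""
    now = now or datetime.now(timezone.utc)
    alerts = []
    for row in inventory:
        label = severity(row["expires"], now)
        if label:
            alerts.append((label, row["host"], row["expires"]))
    return sorted(alerts, key=lambda alert: alert[2])
```

Run on a schedule, the output can go to a ticket queue or chat channel so the owning team hears about it weeks ahead rather than at outage time.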

An end-state goal is essentially the same as a well-oiled CI/CD pipeline (and in fact interacting with your CMT could be part of that process). Review reports, alerts, and observability metrics. Let the computers handle the boring parts while your team handles the interesting choices of fitting use-cases and designing good, scaling solutions.

When you're using certificates for data-at-rest encryption, that data is only useful if you can decrypt it. We use our certificate management system (CMS) to handle key escrow for our servers. There are specific additional security requirements around that, and we work with our internal security teams to ensure everything is handled properly.

Our CMS acts as a proxy/relay for most certificate use-cases in our environment. We've got a couple of distinct certificate authorities that do different things, but half the certs flow through our CMS. Sometimes that's “store and forward a CSR, return signed cert”, sometimes it's “Fill out a few fields, we'll take some default data, and handle everything behind the scenes”. The other half is just grabbing data from an Active Directory Microsoft Certificate Authority for reporting purposes.

Just because you're not ready for heavy automation doesn't mean you won't see value in a CMT/S. Step one of solving a problem is always identifying the problem. A CMT/S will help with that too.

A good CMT/S will integrate with your existing toolsets. If you've got a smooth container deployment pipeline, ideally you can integrate your cert management with an API call or two to include standard, short-lived certs automatically.

The journey through automation (and away from waterfall development) is a long and winding road. How do you eat an elephant? One bite at a time. When you're looking to change and mature a culture, start with small wins. Build momentum. Get some easy-to-understand examples (especially within your own team) you can quickly (elevator-pitch style) demonstrate to others. Just as important is knowing when to say “This is a bigger challenge than anticipated, and we can leave it as a manual process for now.”