Copy Fail: 732 Bytes to Root, Half a Day to Find

The Linux 'Copy Fail' kernel exploit (CVE-2026-31431) is 732 bytes of Python. The hard part isn't patching it -- it's the half-day inventory hunt to figure out which of your VMs, container hosts, and Kubernetes nodes are actually running the vulnerable kernel. This post is about the metric you're not measuring, and what to do about it.


On April 29, 2026, Theori disclosed a Linux kernel flaw that a 732-byte Python script can turn into root on essentially every mainstream distribution shipped since 2017. It is called Copy Fail, tracked as CVE-2026-31431, and scored CVSS 7.8. Within forty-eight hours, CISA had added it to the Known Exploited Vulnerabilities catalog and given US federal agencies two weeks to patch.

The proof of concept is public. The mainline fix landed April 1. Vendor backports are arriving in waves. The technical work to remediate is well understood: install the patched kernel package and reboot the host.

That is not the hard part. The hard part is answering this, today, before you patch anything: how many of your VMs, container hosts, and Kubernetes nodes are actually running a vulnerable kernel right now, in which clouds, in which accounts, in which regions?

For most teams, the honest answer is some variant of "give us a few hours and a couple of engineers." That window is the breach.

What Copy Fail Actually Is

Copy Fail is a logic flaw at the intersection of three pieces of long-standing kernel code: AF_ALG (the userspace cryptographic socket interface), the algif_aead module, and the authencesn AEAD template used for IPsec Extended Sequence Numbers.

In 2017 the kernel gained an in-place optimization that allowed AF_ALG to chain page cache pages directly into the destination scatterlist of an AEAD operation. The authencesn template, separately, uses its destination buffer as scratch space when rearranging bytes -- and writes a 4-byte sequence-number value at an offset that sits just past the legitimate output region. With splice() delivering page cache pages into that destination, the 4-byte write lands inside a chained page cache page belonging to any file the attacker can read.

That is the entire primitive: an unprivileged local user obtains a deterministic, controllable 4-byte write into the page cache of any readable file on the system. From there, exploitation is mechanical. Edit a single instruction in a setuid binary. Wait. The next time a privileged process loads that file from cache, the patched bytes execute as root.

Three things make this much worse than a typical local privilege escalation:

  1. No race, no offset leak. The exploit is deterministic -- no spray, no probabilistic step. A 732-byte Python script obtains root on the first try.
  2. Every kernel since 2017. Ubuntu 24.04 LTS, Amazon Linux 2023, RHEL 10.1, SUSE 16, Debian, Fedora, and Arch are all in scope in their default builds. The mainline fix landed on April 1, 2026, but vendor backports are arriving in waves.
  3. The page cache is shared across containers. Containers do not have their own kernel. They share the host kernel and the host page cache. A 4-byte write from inside an unprivileged container can edit a binary on the host filesystem and be re-executed at host privilege. That is a clean container-escape primitive and a Kubernetes-node compromise vector in one.

Why This Lands Harder In Cloud

A modern cloud environment is, by design, full of places where untrusted code is invited in and asked to run: CI/CD runners executing PR-submitted code, multi-tenant Kubernetes nodes, shared compute platforms, bastion hosts. In every one of those settings, "unprivileged local user with code execution" is the normal operating mode, not a breach indicator.

Copy Fail collapses that boundary. A compromised CI job, a malicious dependency in a build, a misbehaving tenant container -- any of these now has a one-step path to root on the host. From the host, IAM-attached service accounts, cloud metadata endpoints, and lateral movement to other workloads on the same node all open up. Microsoft, CISA, and CERT-EU have all called out Kubernetes nodes and CI/CD runners as the priority patching targets. Those are the workloads where the threat model just inverted.

The Lookup Nobody Admits To

Within hours of disclosure, every cloud security team in the world was asked the same question by leadership: are we affected? The honest answer requires three nested lookups, each of which the industry's standard tooling quietly gets wrong.

Lookup 1 -- fleet inventory across accounts. How many VMs do we run, total, across every cloud account in every region? In a multi-account org, the cloud console gives you a per-account count. Federating those counts requires either a configured AWS Config aggregator, an Azure Resource Graph query against management groups, a gcloud asset query, or a CMDB that nobody fully trusts. For most teams this step alone consumes the morning.
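Whatever the per-cloud source, the federation step comes down to normalizing each export into a common record and merging the results. The sketch below assumes the exports have already been fetched and flattened into dictionaries; the `Instance` record, the field names, and the account identifiers are all hypothetical and not any provider's schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    cloud: str        # e.g. "aws", "azure", "gcp"
    account: str      # account, subscription, or project identifier
    region: str
    instance_id: str


def merge_inventories(*exports):
    """Merge per-account exports into one de-duplicated, sorted fleet list."""
    fleet = set()
    for export in exports:
        for row in export:
            fleet.add(Instance(row["cloud"], row["account"], row["region"], row["id"]))
    return sorted(fleet, key=lambda i: (i.cloud, i.account, i.region, i.instance_id))


def count_by_scope(fleet):
    """Count instances per (cloud, account, region)."""
    return Counter((i.cloud, i.account, i.region) for i in fleet)


# Hypothetical exports, one list per account
aws_prod = [
    {"cloud": "aws", "account": "111111111111", "region": "us-east-1", "id": "i-0aaa"},
    {"cloud": "aws", "account": "111111111111", "region": "eu-west-1", "id": "i-0bbb"},
]
aws_ci = [
    {"cloud": "aws", "account": "222222222222", "region": "us-east-1", "id": "i-0ccc"},
]
gcp_main = [
    {"cloud": "gcp", "account": "example-project", "region": "europe-west1", "id": "vm-1"},
    # The same VM exported twice; the merge should count it once.
    {"cloud": "gcp", "account": "example-project", "region": "europe-west1", "id": "vm-1"},
]

fleet = merge_inventories(aws_prod, aws_ci, gcp_main)
for scope, n in sorted(count_by_scope(fleet).items()):
    print(scope, n)
print("total:", len(fleet))
```

The merge itself is trivial. The morning goes into producing those per-account lists reliably in the first place.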

Lookup 2 -- installed kernel vs claimed kernel. This is the trap most tools quietly fall into. AWS reports the AMI ID an instance was launched from. The AMI is not the running kernel. An instance launched from an AMI 90 days ago and upgraded with apt every Tuesday since has a different kernel than an instance launched from the same AMI yesterday. The metadata is a lie of omission. The truth lives on the disk.
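On Debian and Ubuntu, "the truth on the disk" is the dpkg status database, commonly found at /var/lib/dpkg/status. As a rough illustration of reading it, the sketch below pulls out the installed linux-image-* packages and their versions. The sample stanzas and version strings are made up for the example, and the parser handles only the simple "Field: value" layout.

```python
def installed_kernel_packages(dpkg_status_text):
    """Return {package: version} for installed linux-image-* packages.

    Expects the text of a dpkg status database. Only packages whose
    Status is exactly "install ok installed" are reported.
    """
    kernels = {}
    # Stanzas are separated by blank lines; each field is "Name: value".
    for stanza in dpkg_status_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            if line.startswith((" ", "\t")) or ":" not in line:
                continue  # continuation of a multi-line field such as Description
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        name = fields.get("Package", "")
        if name.startswith("linux-image-") and fields.get("Status") == "install ok installed":
            kernels[name] = fields.get("Version", "")
    return kernels


# Illustrative sample; package versions here are placeholders.
SAMPLE = """Package: linux-image-6.8.0-31-generic
Status: install ok installed
Version: 6.8.0-31.31
Description: Signed kernel image generic
 Continuation line that the parser skips.

Package: linux-image-6.8.0-30-generic
Status: deinstall ok config-files
Version: 6.8.0-30.30

Package: bash
Status: install ok installed
Version: 5.2.21-2ubuntu4
"""

print(installed_kernel_packages(SAMPLE))
```

The second stanza is a removed kernel whose configuration files remain. Filtering on the Status field keeps it out of the result.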

Lookup 3 -- installed kernel vs running kernel. Even reading the disk only tells you what is installed. The kernel actually executing is whatever was loaded at boot. A host can have the patched kernel package installed and still be running the vulnerable one because nobody rebooted. Conversely, a host can show a vulnerable package version and be running a different kernel pinned via GRUB. This is where most patching declarations turn out to be premature.
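The pending-reboot case can be checked by comparing the running release (what `uname -r` reports) against the releases installed on disk. This is a minimal sketch under two assumptions: that the release strings come from the same distribution, and that comparing their numeric components in order is good enough. It is not a distribution's own version-ordering algorithm, and the release strings are examples.

```python
import re


def release_key(release):
    """Turn a release such as '6.8.0-31-generic' into a sortable list of ints.

    Simplification: only the numeric runs are compared, in order.
    """
    return [int(part) for part in re.findall(r"\d+", release)]


def reboot_pending(running_release, installed_releases):
    """True when a newer kernel is installed than the one currently running."""
    if not installed_releases:
        return False
    newest = max(installed_releases, key=release_key)
    return release_key(newest) > release_key(running_release)


installed = ["6.8.0-31-generic", "6.8.0-45-generic"]

# Patched package installed, host never rebooted: still running the old kernel.
print(reboot_pending("6.8.0-31-generic", installed))

# Host rebooted into the newest installed kernel.
print(reboot_pending("6.8.0-45-generic", installed))
```

On a live host, `platform.release()` supplies the running release and the directory names under /lib/modules are one common source for the installed ones. A kernel pinned via GRUB shows up the same way, as a running release that is not the newest installed.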

By the time a senior engineer has reconciled the three lookups, half a day is gone. The exploit code has been public the entire time. CISA's two-week federal patching deadline does not reset for the time you spent gathering inventory. And the engineer who did the work just lost the morning they had planned to spend on something else.

The Metric You're Not Measuring: MTTI

Most security teams measure Mean Time To Remediate (MTTR) -- how long from "we know we're affected" to "we patched it." It is the headline KPI in every cloud security board deck.

The metric upstream of MTTR -- and almost universally unmeasured -- is Mean Time To Inventory (MTTI): how long from "the advisory drops" to "we know exactly which of our resources are affected." MTTI puts a floor under your total time to remediate: if it takes you four hours to know, you cannot be patched less than four hours after the advisory. Every minute spent on inventory is a minute in which the attacker has a public exploit and you do not yet know what to patch.
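Measuring MTTI needs only two timestamps per advisory: when it was published and when the affected-resource list was complete. The sketch below averages that delay over past incidents and shows how the inventory phase sits inside the total advisory-to-patched time. All timestamps are hypothetical and the function names are illustrative.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_inventory(events):
    """Average the advisory-to-inventory delay over past incidents.

    events: iterable of (advisory_published, inventory_complete) datetimes.
    """
    delays = [(done - published).total_seconds() for published, done in events]
    return timedelta(seconds=mean(delays))


def split_exposure(published, inventory_complete, patched):
    """Split advisory-to-patched time into inventory and remediation phases."""
    return inventory_complete - published, patched - inventory_complete


# Hypothetical timestamps for three past kernel advisories
history = [
    (datetime(2026, 1, 12, 9, 0), datetime(2026, 1, 12, 15, 0)),  # 6 hours
    (datetime(2026, 3, 3, 14, 0), datetime(2026, 3, 3, 18, 0)),   # 4 hours
    (datetime(2026, 4, 29, 8, 0), datetime(2026, 4, 29, 10, 0)),  # 2 hours
]
mtti = mean_time_to_inventory(history)
print("MTTI:", mtti)

inventory, remediation = split_exposure(
    datetime(2026, 4, 29, 8, 0),    # advisory published
    datetime(2026, 4, 29, 12, 0),   # affected list complete
    datetime(2026, 4, 29, 13, 30),  # fleet patched
)
# The inventory phase is a lower bound on the total, however fast patching is.
print("inventory:", inventory, "remediation:", remediation, "total:", inventory + remediation)
```

In this made-up incident the patching itself takes ninety minutes, while the four hours before it are spent finding out what to patch.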

Most teams cannot tell you their MTTI for kernel CVEs because they have never measured it -- the lookup has always been a one-off scramble. That scramble is the MTTI. For the median cloud team running across two or three providers and a handful of accounts, a realistic Copy Fail MTTI is somewhere between two hours and two days. For a team that has continuous filesystem-level inventory pre-collected, it is single-digit minutes.

The teams that ship resilient cloud security in 2026 are not the ones with the cleverest preventive controls. They are the ones with the lowest MTTI. Naming the metric is the first step to driving it down.

What VikingCloud Actually Does

We will not claim to have prevented Copy Fail. No external scanner can prevent a kernel logic bug. What we do is collapse MTTI for this class of CVE from a half-day scramble into seconds of querying a continuously refreshed inventory.

Filesystem-truth, not metadata-truth. When you connect a cloud account, VikingCloud performs an agentless, deep filesystem inventory of every VM. We extract the actual installed kernel package -- linux-image-* on Debian and Ubuntu, kernel-* on RHEL and Amazon Linux, kernel-default on SUSE -- at its real installed version. Not the AMI's claimed version. Not what the launch template said. What is on the disk, today. No agent to deploy. The same approach runs on AWS, Azure, and GCP.

One query across every account, every region, every cloud. The inventory is normalized into a single table. A filter on CVE-2026-31431 on the Risks page returns every affected workload across every connected account in seconds. No federation queries, no per-cloud console crawl, no asking the AWS team and the Azure team separately.

KEV, EPSS, and severity pre-correlated. Each finding carries its CISA Known Exploited Vulnerabilities flag, its KEV due date, and its EPSS exploit-probability score on the same row as the CVE. The teams who need to triage 1,700 vulnerabilities down to the 12 that actually matter build that filter set in spreadsheets. We pre-compute it.
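The spreadsheet filter described above is roughly the following: keep what is KEV-listed or has a high EPSS score, then order by urgency. This is a sketch of the general technique and not the product's implementation. The record fields, the 0.5 threshold, the CVE identifiers, and the scores are all hypothetical.

```python
def triage(findings, epss_threshold=0.5):
    """Keep KEV-listed or high-EPSS findings, most urgent first."""
    urgent = [f for f in findings if f["kev"] or f["epss"] >= epss_threshold]
    # KEV-listed findings first (earliest due date first), then descending EPSS.
    # ISO-8601 date strings sort correctly as plain strings.
    return sorted(
        urgent,
        key=lambda f: (not f["kev"], f["kev_due"] or "9999-12-31", -f["epss"]),
    )


# Hypothetical findings; identifiers, dates, and scores are placeholders.
findings = [
    {"cve": "CVE-EXAMPLE-1", "kev": False, "kev_due": None, "epss": 0.02},
    {"cve": "CVE-EXAMPLE-2", "kev": True, "kev_due": "2026-05-15", "epss": 0.40},
    {"cve": "CVE-EXAMPLE-3", "kev": False, "kev_due": None, "epss": 0.91},
    {"cve": "CVE-EXAMPLE-4", "kev": True, "kev_due": "2026-05-01", "epss": 0.10},
]

for f in triage(findings):
    print(f["cve"], f["kev_due"], f["epss"])
```

The low-EPSS, non-KEV finding drops out, and the KEV entry with the nearest due date sorts to the top even though its EPSS score is the lowest of the three that remain.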

One issue, not one alert per VM. Twelve VMs across four accounts running the same vulnerable kernel show up as a single issue with twelve affected workloads, not twelve duplicate tickets. Each affected resource carries its account, region, kernel version, and patch availability for its distro.
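The deduplication described here is a group-by on the CVE. Below is a minimal sketch using the twelve-VM, four-account example from the paragraph above. The finding fields, account names, and VM names are hypothetical.

```python
from collections import defaultdict


def group_by_issue(findings):
    """Collapse per-VM findings into one issue per CVE with a list of workloads."""
    issues = defaultdict(list)
    for f in findings:
        issues[f["cve"]].append(
            {"account": f["account"], "region": f["region"], "vm": f["vm"]}
        )
    return dict(issues)


# Twelve hypothetical VMs spread across four accounts, all with the same finding
findings = [
    {
        "cve": "CVE-2026-31431",
        "account": f"account-{n % 4}",
        "region": "us-east-1",
        "vm": f"vm-{n:02d}",
    }
    for n in range(12)
]

issues = group_by_issue(findings)
for cve, workloads in issues.items():
    accounts = {w["account"] for w in workloads}
    print(cve, "workloads:", len(workloads), "accounts:", len(accounts))
```

Each workload keeps its own account and region inside the issue, so the grouped view loses no detail.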

Recurring scans verify your patching landed. The standard scan cadence is every 24 hours. When you patch a fleet, the next scan tells you whether the patched package is now installed everywhere it was missing. The verification step that most teams forget happens automatically.

What VikingCloud Does Not Do

The honest limits, because this matters more than the marketing:

  • We see the installed kernel package, not the running kernel. Our scan reads the installed-package database from the filesystem. If a host has the patched package installed but has not rebooted, we will report it as patched. The remediation step "reboot after patching" is on you, and verifying it landed in the running kernel is your monitoring stack's job, not ours. We will tell you what is true at the next reboot. Whether the reboot has happened is a question we do not answer.
  • We are not real-time. Scans run on a schedule, typically daily. If a CVE drops at 02:00 and your scan ran at 01:00, the affected workloads surface at the next cycle. For higher-frequency triage, scans can be triggered manually from the UI.
  • We do not detect exploitation in flight. We tell you what is vulnerable. We do not tell you what is being attacked. Pair this with an EDR or runtime sensor for the other half of the picture.
  • We do not have a Copy Fail-specific detector. We surface CVE-2026-31431 the same way we surface every CVE: by package version match against the vulnerability database. There is no special heuristic and there does not need to be.
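The scheduling limit in the second bullet can be put in numbers. Assuming a fixed scan interval and no manual trigger, the sketch below computes the first scheduled scan after an advisory and the resulting wait. The date is arbitrary and the function name is illustrative.

```python
from datetime import datetime, timedelta


def next_scan_after(advisory_time, last_scan, interval=timedelta(hours=24)):
    """Return the first scheduled scan at or after advisory_time."""
    if advisory_time <= last_scan:
        return last_scan
    elapsed = advisory_time - last_scan
    cycles = -(-elapsed // interval)  # ceiling division on timedeltas
    return last_scan + cycles * interval


last_scan = datetime(2026, 4, 29, 1, 0)   # scan ran at 01:00
advisory = datetime(2026, 4, 29, 2, 0)    # CVE drops at 02:00

upcoming = next_scan_after(advisory, last_scan)
print("next scan:", upcoming)
print("wait:", upcoming - advisory)
```

With a daily cadence, the worst case is an advisory that lands just after a scan. The wait then approaches the full interval unless someone triggers a scan by hand.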

If those limits are deal-breakers for your environment, you should know that going in. For the very specific job of knowing in seconds which of your cloud workloads is running a vulnerable kernel, the boundary of what we do maps cleanly onto the boundary of what shrinks MTTI.

What We Found In Our Own Environment -- And The Honest Reframe

When the Copy Fail advisory landed, we logged into VikingCloud, filtered the Risks page to CVE-2026-31431, and had the affected VM in front of us in seconds -- by name, by cloud account, by region, by exact kernel package version. The lookup the rest of the industry was running for half a day was a single filter click.

Here is the part most security blogs would not write: finding exactly one affected VM should make a security team more nervous, not less. If the scanner found one, the next question is whether every account, every region, and every cluster was actually being scanned in the first place. The number of resources your tool can see is the ceiling on the answer it can give. A clean inventory page on a partial scope is the most dangerous report a security team can read.

The first thing we did after finding our affected VM was not patch it. It was confirm scan coverage: every connected account, every region, every cluster, and any cloud organisation with subaccounts that were not yet onboarded. Inventory completeness is not implicit in inventory cleanliness, and any honest tool should make you check both.
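A coverage check of this kind is a set difference between what the organisation contains and what the scanner actually visited. The sketch below assumes both sides can be expressed as (cloud, account, region) tuples; the identifiers are hypothetical.

```python
def coverage_gaps(expected, scanned):
    """Return scopes that exist in the organisation but were never scanned."""
    return sorted(set(expected) - set(scanned))


# Hypothetical organisation footprint
expected = [
    ("aws", "111111111111", "us-east-1"),
    ("aws", "111111111111", "eu-west-1"),
    ("aws", "222222222222", "us-east-1"),
    ("gcp", "example-project", "europe-west1"),
]

# Scopes that appear in the latest scan results
scanned = [
    ("aws", "111111111111", "us-east-1"),
    ("gcp", "example-project", "europe-west1"),
]

gaps = coverage_gaps(expected, scanned)
print(f"{len(scanned)}/{len(expected)} scopes scanned")
for gap in gaps:
    print("not scanned:", gap)
```

A result with zero findings and two unscanned scopes is a different report from zero findings and zero gaps, which is why both numbers belong side by side.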

What This Means For Your Team

Copy Fail is not the last kernel CVE. It is unusually severe -- every distribution, every cloud, deterministic exploit, container-escape primitive, public PoC, KEV-listed within forty-eight hours -- but the structural problem it exposes has been quietly sitting under every Linux fleet for years. The question "what kernel is running on each of our boxes, right now, today?" should be a one-second answer. For most teams it is not, and that is the gap worth closing.

If your MTTI is sub-hour, you patch in time, decide what to take out of rotation, show your auditor a list, and tell your CISO honestly whether the company is exposed. If your MTTI is in days, you patch under stress, and you find out the hard way which boxes you missed.

Driving MTTI to single-digit minutes is not a feature that prevents attacks. It is the operating posture that makes you ungameable by the next CVE drop. That is the fight worth investing in.

Start your 14-day free trial to see your VM, container, and Kubernetes inventory cross-referenced against the latest CVE and KEV feeds in seconds, or book a demo if you would prefer we walk through the platform with your team.