Laura Voicu, Clement Fouque

Time-to-Patch Metrics: A Survival Analysis Approach Using Qualys and Elastic

In this article, we describe how we applied survival analysis to vulnerability management (VM) data from Qualys VMDR, using the Elastic Stack.


Introduction

Understanding how quickly vulnerabilities are remediated across different environments and teams is critical to maintaining a strong security posture. In this article, we describe how we applied survival analysis to vulnerability management (VM) data from Qualys VMDR, using the Elastic Stack. This allowed us to not only confirm general assumptions about team velocity (how quickly teams complete work) and remediation capacity (how much fixing they can take on) but also derive measurable insights. Since most of our security data is in the Elastic Stack, this process should be easily reproducible with other security data sources.

Why We Did It

Our primary motivation was to move from general assumptions to data-backed insights about:

  • How quickly different teams and environments patch vulnerabilities
  • Whether patching performance meets internal service level objectives (SLOs)
  • Where bottlenecks or delays commonly occur
  • What other factors can affect patching performance

Why Survival Analysis? A Better Alternative to Mean Time to Remediate

Mean Time to Remediate (MTTR) is commonly used to track how quickly vulnerabilities are patched, but both the mean and median suffer from significant limitations (we provide an example later in this article). The mean is highly sensitive to outliers[^1] and assumes the remediation times are evenly balanced around the average remediation time, which is rarely the case in practice. The median is less sensitive to extremes but discards information about the shape of the distribution and says nothing about the long tail of slow-to-patch vulnerabilities. Neither accounts for unresolved cases, i.e. vulnerabilities that remain open beyond the observation window, which are often excluded entirely. In practice, the vulnerabilities that remain open the longest are precisely the ones we should be most concerned about.

Survival analysis addresses these limitations. Originating in medical and actuarial contexts, it models time-to-event data while explicitly incorporating censored observations, meaning in our context vulnerabilities that remain open. (For more details on its application to vulnerability management we strongly recommend “The Metrics Manifesto”). Instead of collapsing remediation behavior into a single number, survival analysis estimates the probability that a vulnerability remains unpatched over time (e.g. 90% of vulnerabilities are remediated within 30 days). This allows for more meaningful assessments, such as the proportion of vulnerabilities patched within SLO (for example within 30, 90, or 180 days).

Survival analysis provides us with a survival function that estimates the probability a vulnerability remains unpatched over time.
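
In standard notation (this is the general formulation, not something specific to our dataset), the survival function is the probability that the time to patch T exceeds t, and the Kaplan-Meier estimator used later in this article approximates it from the observed (and censored) data:

S(t) = \Pr(T > t)

\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)

where t_i are the distinct times at which patches were observed, d_i is the number of vulnerabilities patched at time t_i, and n_i is the number still open (at risk) just before t_i.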

This method offers a better view of remediation performance, allowing us to assess not just how long vulnerabilities persist, but also how remediation behavior differs across systems, teams, or severity levels. It’s particularly well-suited to security data, which is often incomplete, skewed, and resistant to assumptions of normality.

Context

Although we have applied survival analysis across different environments, teams and organizations, in this blog we focus on the results for the Elastic Cloud production environment.

Vulnerability age calculation

There are different methods to calculate vulnerability age.

For our internal metrics, like vulnerability SLO adherence, we define vulnerability age as the difference between when a vulnerability was last found and when it was first detected (usually a few days after publication). This approach deliberately penalizes vulnerabilities that are reintroduced from an outdated base image. In the past, our base images were not updated as frequently as we would have liked, so when a new instance is created, its vulnerabilities can already have a significant age (e.g., 100 days) on day one of discovery.

For this analysis, we find it more relevant to calculate the age as the number of days between the last found date and the first found date. In this case, age represents the number of days the system was effectively exposed.

“Patch everything” strategy

In our Cloud environment, we maintain a policy to patch everything. This is because we almost exclusively use the same base image across all instances. Since Elastic Cloud operates fully on containers, there are no specific application packages (e.g., Elasticsearch) installed directly on our systems. Our fleet remains homogeneous as a result.

Data Pipeline

Ingesting and mapping data into the Elastic Stack can be cumbersome. Luckily, we have many security integrations that handle this natively, Qualys VMDR being one of them.

This integration has three main advantages over custom ingestion methods (e.g. scripts, Beats, etc.):

  • It natively enriches vulnerability data from the Qualys Knowledge Base, which adds CVE IDs, threat intel information, etc., without needing to configure enrich pipelines.
  • Qualys data is already mapped to the Elastic Common Schema which is a standardized way of representing data, whether it’s coming from one source or another: for example, CVEs are always stored in field vulnerability.id, independent of the source.
  • A transform that maintains the latest vulnerability state is already set up. This index can be queried to get the latest status of each vulnerability.

Qualys agent integration configuration

For survival analysis, we need to ingest both active and patched vulnerabilities. To analyze a specific period, set the number of days in the max_days_since_detection_updated field. In our environment, we ingest Qualys data daily, so there is no need to pull a long history of fixed vulnerabilities: they have already been ingested.

The Qualys VMDR Elastic Agent integration has been configured with the following:

| Property | Value | Comment |
| --- | --- | --- |
| Username (Settings section) | | |
| Password (Settings section) | | Since there are no API keys available in Qualys, we can only authenticate with Basic Authentication. Make sure SSO is disabled on this account. |
| URL | https://qualysapi.qg2.apps.qualys.com (for US2) | https://www.qualys.com/platform-identification/ |
| Interval | 4h | Adjust it based on the number of ingested events. |
| Input parameters | show_asset_id=1&include_vuln_type=confirmed&show_results=1&max_days_since_detection_updated=3&status=New,Active,Re-Opened,Fixed&filter_superseded_qids=1&use_tags=1&tag_set_by=name&tag_include_selector=all&tag_exclude_selector=any&tag_set_include=status:running&tag_set_exclude=status:terminated,status:stopped,status:stale&show_tags=1&show_cloud_tags=1 | show_asset_id=1: retrieve the asset ID. show_results=1: details about the currently installed package and which version should be installed. max_days_since_detection_updated=3: filter out any vulnerabilities that haven’t been updated over the last 3 days (e.g. patched more than 3 days ago). status=New,Active,Re-Opened,Fixed: all vulnerability statuses are ingested. filter_superseded_qids=1: ignore superseded vulnerabilities. use_tags and tag_* parameters: filter assets by tags. show_tags=1: retrieve Qualys tags. show_cloud_tags=1: retrieve cloud tags. |

Once data is fully ingested, it can be reviewed either in Kibana Discover (logs-* data view -> data_stream.dataset : "qualys_vmdr.asset_host_detection" ) or in the Kibana Security App (Findings -> Vulnerabilities).

Loading data into Python with the elasticsearch client

Since the survival analysis calculation will be done in Python, we need to extract data from Elasticsearch into a pandas DataFrame. There are several ways to achieve this; in this article we’ll focus on two of them.

With ES|QL

The easiest and most convenient way is to leverage ES|QL with the Arrow format. It will automatically populate the DataFrame (rows and columns). We recommend reading the blog post From ES|QL to native Pandas dataframes in Python for more details.

from elasticsearch import Elasticsearch
import pandas as pd

client = Elasticsearch(
    "https://[host].elastic-cloud.com",
    api_key="...",
)

# Compute the per-detection vulnerability age (days between first and last found)
# and aggregate it, returning the result as an Arrow table.
response = client.esql.query(
    query="""
    FROM logs-qualys_vmdr.asset_host_detection-default
    | WHERE elastic.owner.team == "platform-security" AND elastic.environment == "production"
    | WHERE qualys_vmdr.asset_host_detection.vulnerability.is_ignored == FALSE
    | EVAL vulnerability_age = DATE_DIFF("day", qualys_vmdr.asset_host_detection.vulnerability.first_found_datetime, qualys_vmdr.asset_host_detection.vulnerability.last_found_datetime)
    | STATS
        mean=AVG(vulnerability_age),
        median=MEDIAN(vulnerability_age)
    """,
    format="arrow",
)
df = response.to_pandas(types_mapper=pd.ArrowDtype)
print(df)

Today, ES|QL has a limitation: we can’t paginate through results, so we are limited to 10K output documents (100K if the server configuration is modified). Progress can be followed through this enhancement request.

With DSL

In the elasticsearch Python client, there is a native helper (scan) to extract all the data matching a query with transparent pagination. The challenging part is creating the DSL query. We recommend building the query in Discover, then clicking Inspect and opening the Request tab to get the DSL query.

# The scan helper handles the scroll-based pagination transparently.
from elasticsearch.helpers import scan

query = {
    "track_total_hits": True,
    "query": {
        "bool": {
            "filter": [
                {
                    "match": {
                        "elastic.owner.team": "awesome-sre-team"
                    }
                },
                {
                    "match": {
                        "elastic.environment": "production"
                    }
                },
                {
                    "match": {
                        "qualys_vmdr.asset_host_detection.vulnerability.is_ignored": False
                    }
                }
            ]
        }
    },
    "fields": [
        "@timestamp",
        "qualys_vmdr.asset_host_detection.vulnerability.unique_vuln_id",
        "qualys_vmdr.asset_host_detection.vulnerability.first_found_datetime",
        "qualys_vmdr.asset_host_detection.vulnerability.last_found_datetime",
        "elastic.vulnerability.age",
        "qualys_vmdr.asset_host_detection.vulnerability.status",
        "vulnerability.severity",
        "qualys_vmdr.asset_host_detection.vulnerability.is_ignored"
    ],
    "_source": False
}

source_index = "logs-qualys_vmdr.asset_host_detection-default"

results = list(scan(
    client=client,        # the Elasticsearch client created earlier
    query=query,
    scroll='30m',
    index=source_index,
    size=10000,
    raise_on_error=True,
    preserve_order=False,
    clear_scroll=True
))
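
As a minimal sketch (assuming the scan results above, with the requested fields present on each hit), the hits can be flattened into a pandas DataFrame with one row per detection, an exposure duration in days, and an event flag marking whether the vulnerability was fixed. The helper and column names below are our own choices, not part of the integration:

import pandas as pd

def hits_to_dataframe(hits):
    """Flatten fields-style hits into one row per vulnerability detection."""
    prefix = "qualys_vmdr.asset_host_detection.vulnerability."
    rows = []
    for hit in hits:
        fields = hit.get("fields", {})

        def first(name):
            # Values in the "fields" section are always arrays
            values = fields.get(name, [])
            return values[0] if values else None

        rows.append({
            "vuln_id": first(prefix + "unique_vuln_id"),
            "first_found": first(prefix + "first_found_datetime"),
            "last_found": first(prefix + "last_found_datetime"),
            "status": first(prefix + "status"),
            "severity": first("vulnerability.severity"),
        })
    df = pd.DataFrame(rows)
    df["first_found"] = pd.to_datetime(df["first_found"], utc=True)
    df["last_found"] = pd.to_datetime(df["last_found"], utc=True)
    # Exposure in days; for open vulnerabilities this is their age so far (censored)
    df["duration_days"] = (df["last_found"] - df["first_found"]).dt.days
    # Event indicator for survival analysis: 1 = patched, 0 = still open (censored)
    df["is_fixed"] = (df["status"] == "Fixed").astype(int)
    return df

df = hits_to_dataframe(results)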

Survival Analysis

You can refer to the accompanying code to understand the approach or to reproduce it on your own dataset.
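
As an illustrative sketch (not the exact notebook behind the figures in this article), the Kaplan-Meier estimator can be fitted with the lifelines library on the DataFrame built in the previous section, where duration_days is the observed age and is_fixed marks whether the patch event was observed:

# pip install lifelines
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(
    durations=df["duration_days"],   # days between first and last found
    event_observed=df["is_fixed"],   # 1 = patched, 0 = still open (censored)
    label="all vulnerabilities",
)

# Probability that a vulnerability is still unpatched after 30, 90, and 180 days
for horizon in (30, 90, 180):
    print(f"P(still open after {horizon} days) = {kmf.predict(horizon):.2f}")

# Median survival time: the point at which half of the vulnerabilities have been patched
print("Median survival:", kmf.median_survival_time_, "days")

# Plot the survival curve with its confidence interval
ax = kmf.plot_survival_function()
ax.set_xlabel("Days since first detection")
ax.set_ylabel("Probability still unpatched")

The key difference from a plain mean or median is that censored observations (still-open vulnerabilities) reduce the at-risk population over time without ever being counted as patches.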

What We Learned

Leaning on the research from the Cyentia Institute, we looked at a few different ways to measure how long it takes to remediate vulnerabilities: means, medians, and survival curves. Each method gives a different lens through which to understand time-to-patch data, and the comparison is important because, depending on which method we use, we would draw very different conclusions about how well vulnerabilities are being addressed.

The first method focuses only on vulnerabilities that have already been closed. It calculates the median and mean time it took to patch them. This is intuitive and simple, but it leaves out a potentially large and important portion of the data (the vulnerabilities that are still open). As a result, it tends to underestimate the true time it takes to remediate, especially if some vulnerabilities stay open much longer than others.

The second method tries to include both closed and open vulnerabilities by using the time they have been open so far. There are many options for approximating a time-to-patch for the open vulnerabilities; for simplicity, we assumed here that they were patched at the time of reporting. We know that isn’t true, but it does offer a way to factor in their existence.

The third method uses survival analysis. Specifically, we used the Kaplan-Meier estimator to model the likelihood that a vulnerability is still open at any given time. This method handles the open vulnerabilities properly: instead of pretending they’re patched, it treats them as “censored” data. The survival curve it produces drops over time, showing the proportion of vulnerabilities still open as days or weeks pass.

How Long Do Vulnerabilities Last?

In the current 6-month snapshot[^2], the closed-only time-to-patch has a median of ~33 days and a mean of ~35 days. On the surface that looks reasonable, but the Kaplan-Meier curve shows what those numbers hide: at 33 days, ~54% of vulnerabilities are still open; at 35 days, ~46% are still open. So even around the “typical” one-month mark, about half of the issues remain unresolved.

We also computed observed-so-far statistics (treating open vulnerabilities as if they were patched at the end of the measurement window). In this window they happen to be almost the same (median ~33 days, mean ~35 days) because the ages of today’s open items cluster near one month. That coincidence can make averages look reassuring, but it’s incidental and unstable: if we shift the snapshot to just before the monthly patch push, these same statistics drop sharply (we’ve seen an observed median of ~19 days and an observed mean of ~15 days) without any change in the underlying process.

The survival curve avoids that trap, because it answers the question of “% still open after 30/60/90 days”, and offers visibility into the long tail that stays open well past a month.

Patch Everything Everywhere The Same Way?

Stratified survival analysis takes the idea of survival curves one step further. Instead of looking at all vulnerabilities together in one big pool, it separates them into groups (or “strata”) based on some meaningful characteristic. In our analysis, we have stratified vulnerabilities by severity, asset criticality, environment, cloud provider, and team/division/organization. Each group gets its own survival curve; in the example graph we compare how quickly different vulnerability severities are remediated over time.

The benefit of this approach is that it exposes differences that would otherwise be hidden in the aggregate. If we only looked at the overall survival curve, we could only draw conclusions about remediation performance across the board. But stratification reveals whether different teams, environments, or severity levels are addressed faster than the rest, and in our case it confirms that the patch-everything strategy is indeed applied consistently. This level of detail is important for making targeted improvements, helping us understand not just how long remediation takes in general, but if and where real bottlenecks exist.
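
A minimal sketch of such a stratified view, assuming the same DataFrame as before and using severity as the stratum (any of the other groupings works the same way), fits one Kaplan-Meier curve per group and overlays them on a single plot:

import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

fig, ax = plt.subplots(figsize=(8, 5))

# One survival curve per stratum; swap "severity" for team, environment, etc.
for severity, group in df.groupby("severity"):
    kmf = KaplanMeierFitter()
    kmf.fit(group["duration_days"], event_observed=group["is_fixed"], label=f"severity {severity}")
    kmf.plot_survival_function(ax=ax, ci_show=False)

ax.set_xlabel("Days since first detection")
ax.set_ylabel("Probability still unpatched")
ax.set_title("Time-to-patch survival curves by severity")
plt.show()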

How Fast Do Teams Act?

While the survival curve emphasizes how long vulnerabilities remain open, we can flip the perspective by using the cumulative distribution function (CDF) instead. The CDF focuses on how quickly vulnerabilities are patched, showing the proportion of vulnerabilities that have been remediated by a given point in time.

Our choice of plotting the CDF provides a clear picture of remediation speed; however, it’s important to note that this version includes only vulnerabilities that were patched within the observed time window. Unlike the survival curve, which we compute over a rolling 6-month cohort to capture full lifecycles, the CDF is computed month-over-month on items closed in that month[^3].

As such, it tells us how quickly teams remediate vulnerabilities once they do so, and it doesn’t reflect how long unresolved vulnerabilities remain open. For example, we see that 83.2% of the vulnerabilities closed in the current month were resolved within 30 days of the first detection. This highlights patching velocity for recent, successful patches but does not account for longer-standing vulnerabilities that remain open and are likely to have longer time-to-patch durations. Therefore, we use the CDF for understanding short-term response behavior, whereas the full lifecycle dynamics are given by a combination of CDF alongside survival analysis: the CDF describes how fast teams act once they patch, whereas the survival curve shows how long vulnerabilities truly last.
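
A small sketch of this month-over-month view, under the same DataFrame assumptions (the month value below is a placeholder), computes the empirical CDF over items closed in a given month:

import numpy as np
import pandas as pd

def monthly_cdf(df: pd.DataFrame, month: str) -> pd.DataFrame:
    """Empirical CDF of time-to-patch for vulnerabilities closed during a month (e.g. "2025-06")."""
    closed = df[(df["is_fixed"] == 1) & (df["last_found"].dt.strftime("%Y-%m") == month)]
    durations = np.sort(closed["duration_days"].to_numpy())
    # Proportion of that month's closed vulnerabilities remediated within each duration
    proportion = np.arange(1, len(durations) + 1) / len(durations)
    return pd.DataFrame({"days": durations, "proportion_patched": proportion})

cdf = monthly_cdf(df, "2025-06")  # placeholder month
# Share of that month's closed vulnerabilities that were patched within 30 days
within_30 = (cdf["days"] <= 30).mean()
print(f"{within_30:.1%} of vulnerabilities closed that month were resolved within 30 days")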

Difference Between Survival Analysis and Mean/Median

Wait, we said that survival analysis is better for analyzing time to patch because it avoids the impact of outliers. Yet in this example, mean/median and survival analysis provide similar results, so what is the added value? The reason is simple: we don’t have outliers in our production environments, since our patching process is fully automated and effective.

To demonstrate the impact on heterogeneous data, we’ll use an example from an outdated, non-production environment that lacks automated patching.

ES|QL query:

FROM qualys_vmdr.vulnerability_6months
  | WHERE elastic.environment == "my-outdated-non-production-environment"
  | WHERE qualys_vmdr.asset_host_detection.vulnerability.is_ignored == FALSE
  | EVAL vulnerability_age = DATE_DIFF("day", qualys_vmdr.asset_host_detection.vulnerability.first_found_datetime, qualys_vmdr.asset_host_detection.vulnerability.last_found_datetime)
  | STATS
      count=COUNT(*),
      count_closed_only=COUNT(*) WHERE qualys_vmdr.asset_host_detection.vulnerability.status == "Fixed",
      mean_observed_so_far=AVG(vulnerability_age),
      mean_closed_only=AVG(vulnerability_age) WHERE qualys_vmdr.asset_host_detection.vulnerability.status == "Fixed",
      median_observed_so_far=MEDIAN(vulnerability_age),
      median_closed_only=MEDIAN(vulnerability_age) WHERE qualys_vmdr.asset_host_detection.vulnerability.status == "Fixed"
|  | Observed so far | Closed only |
| --- | --- | --- |
| Count | 833 | 322 |
| Mean | 178.7 days | 163.8 days |
| Median | 61 days | 5 days |
| Median survival | 527 days | N/A |

In this example, the mean and the median yield very different results, and choosing a single representative metric can be challenging and potentially misleading. The survival analysis graph accurately represents our effectiveness in addressing vulnerabilities within this environment.

Final Thoughts

The benefits of using survival analysis come not only from more accurate measurement but also from the insights into the dynamics of patching behavior: where bottlenecks occur, which factors affect patching velocity, and whether it aligns with our SLO. From a technical integration perspective, survival analysis can become part of our operational workflows and reporting with minimal changes to our current Elastic Stack setup: it can run on the same cadence as our patching cycle, with the results pushed back into Kibana for visualization. The definitive advantage is pairing our existing operational metrics with survival analysis for both long-term trends and short-term performance tracking.
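
As a sketch of that feedback loop (the index and field names below are hypothetical, and kmf is the estimator fitted earlier), the points of the survival function can be pushed back into Elasticsearch with the bulk helper and then visualized in Kibana like any other index:

from elasticsearch.helpers import bulk

def survival_curve_docs(kmf, snapshot_date, index="vm-survival-curves"):
    """Yield one document per point of the fitted survival function."""
    survival = kmf.survival_function_   # DataFrame indexed by timeline (days)
    cohort = survival.columns[0]        # the label passed to fit()
    for days, row in survival.iterrows():
        yield {
            "_index": index,
            "_source": {
                "@timestamp": snapshot_date,
                "days_since_first_found": float(days),
                "probability_still_open": float(row.iloc[0]),
                "cohort": cohort,
            },
        }

# Reuses the client created in the data-loading section
bulk(client, survival_curve_docs(kmf, snapshot_date="2025-07-01"))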

Looking forward, we’re experimenting with additional new metrics like Arrival Rate, Burndown Rate, and Escape Rate that give us a way to move toward a more dynamic understanding of how vulnerabilities are really handled.

Arrival Rate is the measure of how quickly new vulnerabilities are entering the environment. Knowing that fifty new CVEs show up each month, for example, tells us what to expect in the workload before we even start measuring patches. So arrival rate is a metric that tells us less about the backlog and more about the pressure applied to the system.

Burndown Rate (trend) shows the other half of the equation: how quickly vulnerabilities are being remediated relative to how fast they arrive.

Escape Rate adds yet another dimension by focusing on vulnerabilities that slip past the points where they should have been contained. In our context, an escape is a CVE that misses its patching window or exceeds an SLO threshold. An elevated escape rate doesn’t just show that vulnerabilities exist; it shows that the process designed to control them is failing, whether because patching cycles are too slow, automation is lacking, or compensating controls are not working as intended.

Together, the metrics create a better picture: arrival rate tells us how much new risk is being introduced; burndown trends show whether we are keeping pace with that pressure or being overwhelmed by it; escape rates expose where vulnerabilities persist despite planned controls.
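
These metrics are still experimental for us, but a rough sketch of how they could be derived from the same DataFrame (with a hypothetical 30-day SLO threshold) looks like this:

import pandas as pd

SLO_DAYS = 30  # hypothetical SLO threshold

# Arrival rate: new vulnerabilities entering the environment each month
arrived = df.groupby(df["first_found"].dt.strftime("%Y-%m")).size()

# Burndown: vulnerabilities remediated each month
closed_df = df[df["is_fixed"] == 1]
closed = closed_df.groupby(closed_df["last_found"].dt.strftime("%Y-%m")).size()

# Escapes: vulnerabilities that exceeded the SLO window, by month of arrival
escaped_df = df[df["duration_days"] > SLO_DAYS]
escaped = escaped_df.groupby(escaped_df["first_found"].dt.strftime("%Y-%m")).size()

monthly = pd.DataFrame({"arrived": arrived, "closed": closed, "escaped": escaped}).fillna(0)
# Burndown rate > 1 means we close faster than new issues arrive; < 1 means the backlog grows
monthly["burndown_rate"] = monthly["closed"] / monthly["arrived"]
monthly["escape_rate"] = monthly["escaped"] / monthly["arrived"]
print(monthly)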

[^1]: An outlier in statistics is a data point that is very far from the central tendency (or far from the rest of the values in a dataset). For example, if most vulnerabilities are patched within 30 days, but one takes 600 days, that 600-day case is an outlier. Outliers can pull averages upward or downward in ways that don’t reflect the “typical” experience. In the patching context, these are the especially slow-to-patch vulnerabilities that sit open far longer than the norm. They may represent rare but important situations, like systems that can’t be easily updated, or patches that require extensive testing.

[^2]: The current 6-month dataset includes both all vulnerabilities that remain open at the end of the observation period (regardless of how long ago they were first seen) and all vulnerabilities that were closed during the 6-month window. Despite this mixed cohort approach, survival curves from prior observation windows show consistent trends, particularly in the early part of the curve. The shape and slope over the first 30–60 days have proven remarkably stable across snapshots, suggesting that metrics like median time-to-patch and early-stage remediation behavior are not artifacts of the short observation window. While long-term estimates (e.g. the 90th percentile) remain incomplete in shorter snapshots, the conclusions drawn from these cohorts still reflect persistent and reliable patching dynamics.

[^3]: We kept the CDF on a monthly cadence for operational reporting (throughput and SLO adherence for work completed during the current month), while the Kaplan-Meier uses a 6-month window to properly handle censoring and expose tail risk across the broader cohort.
