Automating GOAD and Live Malware Labs

Introduction: The Need for a Scalable, Automated Simulation Range

In modern security operations, detection engineering is no longer a “set it and forget it” discipline. The central challenge for any security team, and the question that underpins the entire purple-team approach, is simple: how do you know whether your detection rules genuinely work? Continually validating detection logic against an ever-shifting adversary toolkit is now a fundamental requirement.

Arguably, the largest hurdle for this exercise has always been setting up the lab. Manually provisioning a multi-domain Active Directory forest, configuring it with specific vulnerabilities, and deploying a separate, contained malware analysis environment is a complex and time-consuming process. This repetitive setup work is a significant drain on an organization's most valuable resource: the time of its senior security analysts. Community discussions echo this frustration, highlighting the hours lost to manual setup before a single test can be run.

This blog details a modern solution that eliminates this bottleneck by combining rapid infrastructure automation with a unified security analytics platform. The solution leverages two key components:

Ludus: An open-source automation overlay that deploys and configures complex, multi-VM cyber ranges from a single command.
Elastic Security: The platform that unifies Security Information and Event Management (SIEM), eXtended Detection and Response (XDR), and cloud security, providing a consolidated solution to ingest, detect, and respond to threats. It offers the "limitless visibility" required to observe every action within the simulated environment.

The goal of this guide is to provide a definitive, step-by-step blueprint for building this integrated system. It will show how to move from slow, manual, and inconsistent lab testing to a continuous, automated, and scalable detection-engineering workflow beyond what Elastic Cortado provides.

The Solution Architecture: Ludus + Elastic

This architecture represents a high-fidelity simulation of a modern hybrid enterprise. The Ludus range acts as the "on-prem" or IaaS data center, while the Elastic Cloud deployment represents the "SaaS" security stack. This model perfectly mirrors the hybrid and multi-cloud environments that Elastic Security is designed to protect, making the architecture of the test as valuable as the attacks themselves.

The build consists of the following core components.

Component	Technology	Function
Foundation (Infrastructure)	Ludus (Proxmox/Ansible)	Deploys VM ranges from a single YAML config.
Targets (Identity)	GOAD (Windows Server)	Multi-domain AD forest with intentional vulnerabilities (Kerberoasting, PrintNightmare).
Targets (Supply Chain)	XZbot (Debian)	Linux host infected with CVE-2024-3094 for supply chain simulation.
The Sensor Grid (Visibility)	Elastic Agent	Unified telemetry collection (EDR + Logs).
The Brain (Analysis)	Elastic Security	SIEM/XDR platform for correlation and AI-driven investigation.

Component 1: The Foundation (Ludus)

Ludus serves as the Infrastructure-as-a-Service (IaaS) layer. Built to run on Proxmox 8/9 or Debian 12/13, it uses YAML configuration files to define complex virtual networks, supporting up to 255 distinct VLANs. Behind the scenes, Ludus uses Packer and Ansible to build, configure, and deploy the virtual machine templates from that single file.
Review and follow the installation steps and hardware requirements in the Ludus quick-start.

Component 2: The Targets (The Labs)

This guide merges two distinct Ludus environments into a single, comprehensive range to test a wider spectrum of threats:

Game of Active Directory (GOAD): A purpose-built Active Directory lab designed by security researchers at Orange Cyberdefense. It is pre-configured with the specific misconfigurations and vulnerabilities needed to simulate common identity-based attack paths, such as Kerberoasting, NTLM Relay, and Active Directory Certificate Services (ADCS) abuse.
XZbot Malware Lab: A high-risk, high-fidelity malware environment. This lab contains the actual, functional CVE-2024-3094 backdoor. This provides a perfect, modern test case for a sophisticated software supply-chain attack.

Important Disclaimer

Handling live malware, even for research, can violate Acceptable Use Policies (AUPs) of ISPs or cloud providers. Ensure you own the infrastructure (Ludus is on-prem) and ensure your upstream ISP allows for such research, or route traffic through a VPN.

Component 3: The Sensor Grid (Elastic Agent & Defend)

To gain visibility, every virtual machine in the Ludus range across both GOAD and XZbot labs will be instrumented with Elastic Agent, a single, unified agent for data collection and protection (via Elastic Defend).

This instrumentation is automated via the badsectorlabs/ludus_elastic_agent Ansible role. This role is the critical lynchpin that programmatically bridges the infrastructure provisioning phase (Ludus/Ansible) with the security instrumentation phase (Elastic), enabling a true "infrastructure-as-code" workflow.

Crucially, the Elastic Agent policy will be configured with the Elastic Defend integration. This elevates the agent from a simple log collector to a full-powered Endpoint Detection & Response (EDR)/eXtended Detection & Response (XDR) solution, providing host-based detections (including Machine Learning (ML) driven malware and ransomware detection) and the deep, kernel-level telemetry essential for detection.

Note: For the purple team approach outlined in this blog, set policies to Detect mode.

Component 4: The Brain (Elastic Cloud Hosted / Elastic Serverless)

All security telemetry and alerts from the Elastic Agents in the Ludus range are streamed to a centralized Elastic Cloud Hosted (ECH) or Elastic Serverless deployment. This is where the unified platform's analytical power comes to life. Using a cloud-native platform is not just for hosting; it is what unlocks Elastic's most advanced, force-multiplying features, including Attack Discovery and the AI Assistant. A trial can be started on Elastic Cloud.

The diagram below provides an overview of the build, which is based on the GOAD lab.

Phase 1: Building and Instrumenting the Range

This section provides a technical, step-by-step guide to configuring and deploying the automated range. The process follows a clear "infrastructure-as-code" (IaC) model, where the security instrumentation is defined alongside the infrastructure itself, ensuring a consistent and repeatable monitoring posture for every deployment. The Elastic Cloud instance and its configurations can be managed with the Elastic Cloud and Elastic Stack Terraform provider for a full IaC model of the range and the SIEM.

3.1 Configuring the Elastic Agent Policy (in Kibana)

Before running the Ludus range deployment, the agent policy must be created in the Elastic Cloud instance. This policy is what enables the powerful EDR/XDR telemetry.

The operational flow is as follows:

1. Log in to the Elastic Cloud (ECH) or Elastic Serverless Kibana instance.
2. Navigate to Management > Fleet.
3. Create a new Agent policy (e.g., "ludus-range-policy"). The ludus_elastic_agent role will enroll agents into the policy you specify in your VM-level customization or into the default policy linked to the global variable.
4. Add the Elastic Defend integration to this policy.
5. Configure the Elastic Defend integration to run in Detect mode. This activates the full suite of EDR telemetry.
6. Save the policy and click "Add agent." This will provide the Enrollment token (for ludus_elastic_enrollment_token) and Fleet server URL (for ludus_elastic_fleet_server) needed for the ludus.yml file.
7. (Optional) Repeat steps 3-6 to create customized policies that align with each host's functions and capabilities for VM-level customization of policies.

Once this policy is created and the token is pasted into the ludus.yml file, running ludus range deploy will execute the full, automated workflow. Ludus provisions the VMs, and Ansible installs the Elastic Agent, which then enrolls in Fleet and automatically pulls down the policy containing the Elastic Defend integration. This provides the rich EDR telemetry (kernel-level process, file, network, and registry events) from the moment the lab is born.
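For teams that prefer to script step 3 instead of clicking through Kibana, the sketch below builds the HTTP request that would create the agent policy through the Kibana Fleet API. It is a minimal illustration, not the workflow used in this guide: the Kibana hostname and API key are placeholders, the helper name build_agent_policy_request is hypothetical, and the request is only constructed, never sent. Adding the Elastic Defend integration to the policy is a separate call that is not shown here.

```python
import json
import urllib.request


def build_agent_policy_request(kibana_url: str, api_key: str, name: str) -> urllib.request.Request:
    """Build (but do not send) a Fleet API request that creates an agent policy."""
    body = {
        "name": name,
        "namespace": "default",
        "description": "Ludus range policy (Elastic Defend is added separately)",
    }
    return urllib.request.Request(
        url=f"{kibana_url.rstrip('/')}/api/fleet/agent_policies",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "kbn-xsrf": "true",  # Kibana expects this header on write requests
            "Authorization": f"ApiKey {api_key}",
        },
    )


# Placeholder values: replace with your own Kibana endpoint and API key.
req = build_agent_policy_request("https://kibana.example.com", "<api-key>", "ludus-range-policy")
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

Keeping the request construction separate from the network call makes it easy to review exactly what would be sent before pointing the script at a real deployment.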

3.2 The Ludus YAML Configuration (ludus.yml)

Ludus provides the steps to deploy the GOAD range here. The configuration for the range is stored in the ludus.yml configuration file. For the GOAD range, it is located in ad/GOAD/providers/ludus/config.yml.
The full configuration in the appendix is an example based on a sample running configuration that merges a full GOAD lab (on VLAN 10) with the XZbot lab (on VLAN 20).

To deploy a customized version during installation, update the ad/GOAD/providers/ludus/config.yml file before running the goad.sh script in step 2.

git clone https://github.com/Orange-Cyberdefense/GOAD.git
cd GOAD
sudo apt install python3.11-venv
export LUDUS_API_KEY='myapikey'  # put your Ludus admin api key here
nano ad/GOAD/providers/ludus/config.yml  # customize the configuration here
./goad.sh -p ludus
GOAD/ludus/local > check
GOAD/ludus/local > set_lab GOAD # GOAD/GOAD-Light/NHA/SCCM
GOAD/ludus/local > install

Two key configuration options can be used to customize the range:

Global Variables: To simplify the config and avoid repetition, the Elastic Agent variables are defined once at the top level in a global_role_vars block and are inherited by all VMs.

The enrollment token determines the Elastic Agent policy used.

# ludus.yml
---
# --- GLOBAL ANSIBLE VARS (Simplification) ---
# Define Elastic agent vars once and apply globally
global_role_vars:
  ludus_elastic_fleet_server: "<your-fleet.example.com:443>" # Use 443 for cloud
  ludus_elastic_enrollment_token: "<your_enrollment_token>"
  ludus_elastic_agent_version: "9.2.1"

VM-level Variables: The Elastic Agent variables can be configured at the VM-level to customize the policy applied. These can be combined with the global variable, for example, where the agent version and fleet_server are set via global variables, and the enrollment tokens are set at the VM-level to apply different policies to VMs.

# --- VM DEFINITIONS ---
vms:
  # --- GOAD LAB (VLAN 10) ---
  - name: "{{ range_id }}-GOAD-DC01"
    hostname: "{{ range_id }}-DC01"
    template: win2019-server-x64-template
    vlan: 10
    ip_last_octet: 10
    ram_gb: 4
    cpus: 2
    windows: { sysprep: true }
    ansible:
      roles:
        - badsectorlabs.ludus_elastic_agent
      role_vars:
        ludus_elastic_enrollment_token: "<your_enrollment_token>" # different token for different policies
  # (Definitions for GOAD-DC02, GOAD-DC03, GOAD-SRV02, GOAD-SRV03 
  #  would follow, all inheriting the global ansible vars)
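The inheritance described above (global values apply everywhere, VM-level values win where both are set) can be pictured as a simple dictionary merge. The sketch below is only a mental model under that assumption; it is not Ludus's implementation, the helper name effective_role_vars is hypothetical, and the VM dictionaries mirror the shape of the snippet above with placeholder tokens.

```python
def effective_role_vars(global_vars: dict, vm: dict) -> dict:
    """Return the variables a VM's roles would see: globals first, VM-level values on top."""
    vm_vars = vm.get("ansible", {}).get("role_vars", {})
    return {**global_vars, **vm_vars}


global_role_vars = {
    "ludus_elastic_fleet_server": "<your-fleet.example.com:443>",
    "ludus_elastic_enrollment_token": "<default_policy_token>",
    "ludus_elastic_agent_version": "9.2.1",
}

# DC01 overrides only the token, so it enrolls into a different policy.
dc01 = {
    "name": "GOAD-DC01",
    "ansible": {
        "roles": ["badsectorlabs.ludus_elastic_agent"],
        "role_vars": {"ludus_elastic_enrollment_token": "<dc_policy_token>"},
    },
}
# SRV02 sets nothing at the VM level and inherits every global value.
srv02 = {"name": "GOAD-SRV02", "ansible": {"roles": ["badsectorlabs.ludus_elastic_agent"]}}

dc01_vars = effective_role_vars(global_role_vars, dc01)
srv02_vars = effective_role_vars(global_role_vars, srv02)
print(dc01_vars["ludus_elastic_enrollment_token"])
print(srv02_vars["ludus_elastic_enrollment_token"])
```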

Automating Elastic Agent Deployment

The ludus.yml snippet above demonstrates the automation. By adding the badsectorlabs.ludus_elastic_agent role to the ansible.roles section of each VM definition, Ludus will automatically install and configure the agent during deployment.

This single Ansible role is compatible with all operating systems in our heterogeneous lab, including Windows (for GOAD), Kali, and Debian (for XZbot).

As shown in the simplified YAML, the global_role_vars block at the top level passes the critical parameters to the role:

ludus_elastic_fleet_server: The Fleet server URL and port for your Elastic Cloud deployment (e.g., your-fleet.example.com:443).
ludus_elastic_enrollment_token: The token that enrolls the agent. The full example sets this token at the VM level to demonstrate the ability to use different policies.
ludus_elastic_agent_version: The specific agent version to install (e.g., 9.2.1).

Note: The Kali host also has Elastic Defend deployed so that attacker behavior can be monitored. This would not be possible in a real-world scenario.

Safety First: Isolation, OPSEC, and Live Malware

This section contains a critical safety and operational security (OPSEC) warning. This configuration involves a significant, non-trivial risk that must be professionally managed.

4.1 The Threat: This is Not a Simulation

It must be stated unequivocally: The Ludus XZbot lab guide and its associated Ansible role install the actual, functional CVE-2024-3094 backdoor. This is not benign, simulated code. The lab's own documentation states: "Danger: This role contains malware (on purpose)."

While described as a "passive backdoor" (meaning it requires an attacker to actively trigger it), any virtual machine running this code with an open internet connection is a catastrophic liability. It could be scanned, exploited by unknown actors, or used as a pivot point to attack other networks.

4.2 The Contradiction: Isolation vs. Cloud Connectivity

This architecture creates a direct and critical operational conflict:

Requirement 1 (Safety): The malware lab must be isolated from the public internet to prevent compromise or breakout.
Requirement 2 (Function): The Elastic Agent must have outbound internet connectivity to reach the Elastic Cloud Hosted / Elastic Serverless endpoints for enrollment and data streaming.

A novice user would fail here, either by exposing their infected lab to the world or by isolating it so completely that no security telemetry can be collected.

4.3 The Solution: Pinhole Egress via Ludus Testing mode

The conflict is resolved using Ludus's built-in "testing" mode, which provides granular control over network egress. This feature is used for the pinhole egress, which enables agent control, telemetry, and log output.

# 1. Start the isolated testing session
ludus testing start
# Note: external DNS resolvers may also need to be added
# ludus testing allow -i 1.1.1.1,8.8.8.8

# 2. Allow Elastic Fleet Server (Control Plane)
# Replace <your-deployment-id> with your specific deployment ID
# Note: the endpoint will differ based on the cloud provider
ludus testing allow -d <your-deployment-id>.fleet.us-central1.gcp.cloud.es.io

# 3. Allow Elasticsearch Ingest (Data Plane)
# Note: the endpoint will differ based on the cloud provider
ludus testing allow -d <your-deployment-id>.es.us-central1.gcp.cloud.es.io

This configuration delivers an expert-level solution: the malware is safely contained, while the Elastic Agent is granted only the minimal connectivity required to make policy updates (via communication with the fleet endpoint) and to ingest data (via communication with the ES endpoint).
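When several ranges or deployments are in play, it can help to generate the two allow rules rather than type them by hand. The following sketch simply formats the commands shown above from a deployment ID. The hostname pattern is taken from the GCP us-central1 example in this guide and is an assumption for any other region or provider, so always copy the real endpoints from your deployment before relying on the output. The helper name pinhole_allow_commands is hypothetical.

```python
def pinhole_allow_commands(deployment_id: str, region: str = "us-central1", provider: str = "gcp") -> list[str]:
    """Format the Fleet (control plane) and Elasticsearch (data plane) allow rules."""
    # Assumption: endpoints follow <id>.<service>.<region>.<provider>.cloud.es.io,
    # as in the example above. Other providers may use a different domain suffix.
    suffix = f"{region}.{provider}.cloud.es.io"
    return [
        f"ludus testing allow -d {deployment_id}.fleet.{suffix}",
        f"ludus testing allow -d {deployment_id}.es.{suffix}",
    ]


commands = pinhole_allow_commands("abc123")  # "abc123" is a placeholder deployment ID
for command in commands:
    print(command)
```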

4.4 Accessing the Range in Testing Mode (WireGuard)

Once Testing Mode is active, standard routing fails. You cannot simply SSH into your Kali VM from your local LAN because the router drops the traffic. Ludus provides an out-of-band management channel using WireGuard.

Ludus configures a WireGuard interface (wg0) on the router VM (198.51.100.1) and assigns you a static client IP (e.g., 198.51.100.2).

Persistent Allow Rules: The router's firewall configuration includes specific rules in the LUDUS_DEFAULTS chain. These rules explicitly ACCEPT traffic sourced from or destined to the WireGuard subnet (198.51.100.0/24).
Priority: Because these rules exist in the LUDUS_DEFAULTS chain, they override the DROP rules applied by Testing Mode.
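The effect of the persistent allow rules can be modeled as a single test: traffic is accepted whenever either end of the connection sits inside the WireGuard subnet. The sketch below expresses that check with Python's ipaddress module. It is a simplified model of the behavior described here, not the router's actual firewall rules, and the function name is_management_traffic is hypothetical.

```python
import ipaddress

# Subnet used for the WireGuard management channel, as described above.
WIREGUARD_SUBNET = ipaddress.ip_network("198.51.100.0/24")


def is_management_traffic(src: str, dst: str) -> bool:
    """True when the source or destination address belongs to the WireGuard subnet."""
    return any(ipaddress.ip_address(ip) in WIREGUARD_SUBNET for ip in (src, dst))


# A WireGuard client reaching a lab VM is accepted even in testing mode.
print(is_management_traffic("198.51.100.2", "10.10.10.11"))
# A host on the local LAN reaching the same VM is not covered by these rules.
print(is_management_traffic("192.168.1.50", "10.10.10.11"))
```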

How to connect:

Generate your config: ludus user wireguard > ludus.conf
Import this into your local WireGuard client and activate the tunnel.
Connect directly to the private IPs of your VMs (e.g., 10.10.10.11) over the tunnel.

Phase 2: Executing the Attacks

With the high-fidelity, fully instrumented range deployed, the "Red Team" phase can begin. This involves logging into a dedicated attacker VM (like the included Kali VM or a remnux-analyzer VM) and executing the attacks. This activity generates the rich, malicious telemetry that Elastic Defend will capture.

This combined range allows for testing defenses against the two dominant, macro-level threat vectors: identity-based "living-off-the-land" (LotL) attacks and vulnerability-based supply-chain intrusions.

5.1 Active Directory Simulation (GOAD)

Initial Access (Credential Stuffing)
1. You target the external perimeter. Using a list of breached credentials, you execute a credential stuffing attack against the Essos.local domain and successfully validate the credentials for the user khal.drogo.
2. Sample Tool: kerbrute or smartbrute
3. Result: Valid credentials for a low-privilege domain user.
Privilege Escalation (PrintNightmare)
1. khal.drogo has limited rights. To gain a foothold on the CastelBlack server, you exploit PrintNightmare (CVE-2021-34527). This vulnerability in the Windows Print Spooler service allows any authenticated user to install a malicious print driver. You upload a driver that adds a new local admin user to the box.
2. Sample Tool: CVE-2021-34527.py exploit script
3. Result: Local SYSTEM access on CastelBlack.
Credential Dump (DCSync Preparation)
1. Now running as SYSTEM/Admin on CastelBlack, you inspect the machine for cached credentials. You run Impacket's secretsdump to pull hashes from the SAM database and LSASS memory. You discover the NTLM hash for the built-in Administrator account, which was left in memory from a previous support session.
2. Sample Tool: impacket-secretsdump
3. Result: NTLM Hash of a Domain Admin or high-privilege account.
Kerberoasting
1. With valid domain credentials, you pivot to the internal network. You request Kerberos Service Tickets (TGS) for Service Principal Names (SPNs) in the environment. You target the MSSQLSvc account. You take the encrypted ticket offline and crack it to reveal the plaintext password for the SQL service account.
2. Sample Tool: Rubeus or GetUserSPNs.py
3. Result: Plaintext password for the MSSQL service account.
MSSQL Attacks
1. You use the cracked SQL credentials to authenticate directly to the Braavos SQL Server. Since the service account has sysadmin rights, you abuse the xp_cmdshell stored procedure. This feature allows you to spawn a Windows command shell directly from a SQL query, effectively giving you Remote Code Execution (RCE) on the database server.
2. Sample Tool: mssqlclient.py
3. Result: RCE on the Database Server.
Persistence (Scheduled Task)
1. To ensure you don't lose access if the SQL password changes, you establish persistence. You create a Windows Scheduled Task on the compromised SQL server. This task is configured to execute a beacon binary every day, running as SYSTEM.
2. Sample Tool: schtasks.exe or PowerShell
3. Result: Long-term persistence.

5.2 Malware Lab Simulation (XZbot)

Step 7: Supply Chain Pivot (XZ Backdoor)
Simultaneously, you target the Linux infrastructure in the DMZ. You trigger the pre-implanted XZ Backdoor (CVE-2024-3094) on the xz-backdoor-dect VM. By manipulating the SSH handshake with a specific cryptographic key, you bypass authentication entirely and execute commands as root without leaving standard SSH logs.
Tool: xzbot
Result: Root access on Linux infrastructure via supply chain compromise.
The attacker uses the xzbot client provided in the Ludus lab.
From the attacker VM, the following command is run to trigger the backdoor on the vulnerable Debian host:
xzbot --ssh-addr '10.X.X.X:22' -cmd 'setsid sh -c "echo test"' 2>&1
This action causes the sshd process on the target to anomalously spawn a shell and execute the command as root, creating definitive proof of execution.

Phase 3: Unified Detection & Investigation with Elastic Security

This is the "Blue Team" payoff. The telemetry and alerts generated in Phase 2 are now available for analysis within the unified Elastic Security platform.

6.1 The "Powerful SIEM": Centralized Visibility & Prebuilt Detections

The power of the Elastic SIEM is not just in its ability to passively collect logs. Its power comes from the active analysis it performs on the deep, contextual data provided by Elastic Defend. The "Complete Endpoint Visibility" from Defend provides not just basic logs, but kernel-level telemetry - process creations, file modifications, network connections, and registry changes.

This rich data, all normalized to the Elastic Common Schema (ECS), feeds Elastic's extensive library of ~1500+ prebuilt, MITRE-mapped detection rules. These rules are researched, developed, and maintained by the Elastic Security Labs team, providing out-of-the-box detection value.

The Ludus range serves as the perfect validation platform for this value. The attacks executed in Phase 2 are not theoretical; they are mapped directly to specific expected artifacts ("smoking gun"). A combination of prebuilt rules and custom rules is intentionally used together in the example to alert on specific behaviors.

Attack Step	MITRE ATT&CK	Elastic Detection Rule	Expected Artifact ("smoking gun")
1. Credential Stuffing	T1110 (Brute Force)	Potential Account Brute Force (Custom)	Abnormal Auth Success (Event 4624 and ssh login) across hosts.
2. PrintNightmare	T1068 (Exploitation)	Unusual Print Spooler Child Process	Unusual Print Spooler service (spoolsv.exe) child processes.
3. Credential Dump	T1003.002 (OS Credential Dumping: Security Account Manager)	Potential Remote Credential Access via Registry	Abnormal access to the Security Account Manager (SAM) registry hive.
4. Kerberoasting	T1558.003 (Kerberoasting)	Suspicious Kerberos Authentication Ticket Request (Custom)	Event ID 4769 with 0x17 (RC4) encryption requested.
5. MSSQL Attacks	T1505.001 (SQL Stored Procedures)	Execution via MSSQL xp_cmdshell Stored Procedure	Execution via MSSQL xp_cmdshell stored procedure
6. Persistence	T1053.005 (Scheduled Task)	A scheduled task was created	Event ID 4698 or schtasks.exe /create.
7. XZ Backdoor	T1210 (Exploitation of Remote Services)	Potential Execution via XZBackdoor	sshd spawns unusual child processes like sh or bash.

Note: Elastic detection rules are open and transparent. You can view the logic, contribute, or raise issues directly on the detection-rules GitHub repository (https://github.com/elastic/detection-rules).
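To make the Kerberoasting row of the table concrete, the sketch below expresses its expected artifact (Event ID 4769 requesting RC4, encryption type 0x17) as a plain Python predicate over already-parsed events. It is an illustration of the idea, not the logic of the custom rule used in this lab. The field names follow the common winlog.event_data layout and are an assumption about your ingest pipeline, and the exclusion of machine accounts (names ending in $) is a typical tuning choice added here, not something stated in the table.

```python
RC4_HMAC = "0x17"  # Kerberos encryption type commonly requested when Kerberoasting


def is_suspicious_tgs_request(event: dict) -> bool:
    """Flag service ticket requests (4769) that ask for RC4 on a non-machine account."""
    data = event.get("winlog", {}).get("event_data", {})
    service = data.get("ServiceName", "")
    return (
        event.get("event", {}).get("code") == "4769"
        and data.get("TicketEncryptionType", "").lower() == RC4_HMAC
        and not service.endswith("$")  # assumption: ignore machine accounts
    )


# Hypothetical sample events for illustration only.
events = [
    {"event": {"code": "4769"}, "winlog": {"event_data": {"ServiceName": "sql_svc", "TicketEncryptionType": "0x17"}}},
    {"event": {"code": "4769"}, "winlog": {"event_data": {"ServiceName": "CASTELBLACK$", "TicketEncryptionType": "0x17"}}},
    {"event": {"code": "4769"}, "winlog": {"event_data": {"ServiceName": "sql_svc", "TicketEncryptionType": "0x12"}}},
    {"event": {"code": "4624"}, "winlog": {"event_data": {}}},
]

hits = [e for e in events if is_suspicious_tgs_request(e)]
print(len(hits), hits[0]["winlog"]["event_data"]["ServiceName"])
```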

6.2 Deep Dive: Tracing Process Chains with Event Analyzer

The two labs (GOAD and XZbot) provide a perfect opportunity to use Elastic's specialized investigation tools. The user interface of the Event Analyzer is designed to abstract the complexity of JSON logs into a cognitive model that aligns with how security analysts think: Process Chains. The interface is comprised of three primary interaction zones: the Graphical Canvas, the Detail Panel, and the Timeline integration.

What are we seeing?

The Graphical Canvas (The Process Tree)

The central view is a directed acyclic graph where:

Nodes (Cubes): Each cube represents a distinct process execution. The visualization distinguishes between the "Anchor" event (highlighted with a blue halo) and the surrounding context.
Edges (Lines): Lines represent the parent-child relationship. The directionality is implicit (top-down or left-right), showing the flow of execution.
Visual Badging: Nodes are not static icons; they are dynamic indicators.
- Alert Badges: If a specific process triggered a detection rule (e.g., "Malware Detected"), a colored badge appears on the cube. This allows an analyst to instantly identify which step in the chain was flagged by the detection engine.
- User Context: Visual cues may indicate if a process changed user context (e.g., from a local user to SYSTEM), signaling privilege escalation.

The Detail Panel (Forensic Metadata)

Clicking on any node triggers the Detail Panel, typically sliding in from the right. This panel is the primary source of "What you can see" at a granular level. It exposes fields critical for verification:

Command Line Arguments: This is arguably the single most valuable forensic artifact. The Analyzer displays the full string, exposing flags, scripts, and encoded payloads (e.g., powershell.exe -w hidden -enc Base64).
Process Path and Hash: The full file path helps identify masquerading (e.g., svchost.exe running from C:\Temp instead of C:\Windows\System32). File hashes (MD5, SHA-1, SHA-256) are presented for cross-referencing with threat intelligence.
Signer Information: Information about the binary's digital signature helps distinguish between trusted Microsoft binaries and unsigned malware.
Related Event Counts: Instead of cluttering the graph with thousands of file modifications, the node displays summary statistics (e.g., "15 File Events," "3 Network Connections"). Clicking these stats usually drills down into a list view or timeline of those specific actions.

The Temporal Dimension (Time Filter)

A critical, often overlooked aspect of the Analyzer is its handling of time. Attacks can have long "dwell times." A parent process might have started weeks ago (e.g., a legitimate service), while the malicious child spawned today. The Analyzer includes a time slider that allows the analyst to expand the query window. By default, it might look at a narrow window around the alert, but expanding this allows the graph to "reach back" into the Warm or Cold data tiers to find the long-running parent process.

How does it work?

The operational capability of the Event Analyzer leverages the Elastic Common Schema (ECS). In a heterogeneous security environment, logs originate from diverse sources (Windows endpoints, Linux servers, network firewalls, and cloud service providers), each with a unique taxonomy. A CrowdStrike agent might label a process ID as TargetProcessId, while a Sysmon event uses ProcessId. Without normalization, correlating these events into a single chain is algorithmically impossible.
ECS solves this by enforcing a strict field hierarchy. The Event Analyzer relies on specific, high-fidelity ECS fields to construct the visual graph:

process.entity_id: This is the cornerstone of the Analyzer's logic. Operating systems recycle Process IDs (PIDs). A PID of 1234 might belong to svchost.exe at 09:00 and malware.exe at 14:00. Relying on PID for long-term historical analysis introduces collisions that would corrupt the visual graph, linking unrelated events. The process.entity_id is a unique string generated by the Elastic Agent (or ECS-compliant beats) that persists uniquely in the index, ensuring that the graph represents a distinct execution instance, regardless of PID reuse.
process.parent.entity_id: This field establishes the directed edge between nodes. By recursively querying for events where the process.entity_id of one event matches the process.parent.entity_id of another, the Analyzer reconstructs the lineage.

event.sequence: In high-velocity environments, the order of events (e.g., did the file modification happen before or after the network connection?) is critical. ECS timestamps and sequence numbers allow the Analyzer to order events chronologically within the visual node details.
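The lineage reconstruction described above can be sketched in a few lines: index events by process.entity_id, then follow process.parent.entity_id upward until no parent is found. This is a simplified illustration with hypothetical sample events, not the Analyzer's actual query logic. Note how the two processes that share PID 1234 stay separate because the walk never looks at the PID.

```python
def build_lineage(events: list[dict], entity_id: str) -> list[str]:
    """Walk parent links from entity_id to the root and return process names, root first."""
    by_id = {e["process.entity_id"]: e for e in events}
    chain, seen = [], set()
    current = by_id.get(entity_id)
    while current is not None and current["process.entity_id"] not in seen:
        seen.add(current["process.entity_id"])  # guard against malformed, cyclic data
        chain.append(current["process.name"])
        current = by_id.get(current.get("process.parent.entity_id"))
    return chain[::-1]


# Hypothetical events: PID 1234 is reused, but the entity IDs differ.
events = [
    {"process.entity_id": "a1", "process.parent.entity_id": None, "process.pid": 700, "process.name": "services.exe"},
    {"process.entity_id": "a2", "process.parent.entity_id": "a1", "process.pid": 1234, "process.name": "spoolsv.exe"},
    {"process.entity_id": "a3", "process.parent.entity_id": "a2", "process.pid": 4321, "process.name": "cmd.exe"},
    {"process.entity_id": "b9", "process.parent.entity_id": "a1", "process.pid": 1234, "process.name": "svchost.exe"},
]

lineage = build_lineage(events, "a3")
print(" -> ".join(lineage))
```

Keying the lookup on PID instead would leave only one of the two PID 1234 entries in the index, which is exactly the collision that process.entity_id avoids.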

6.3 Deep Dive: Reconstructing User Activity with Session Viewer

For the XZbot (Linux) attack, the Session Viewer is the superior tool. It is specifically designed for "monitoring and investigating session activity on Linux infrastructure".

When the Potential Execution via XZBackdoor alert fires, the analyst investigates the associated sshd process. The Session Viewer presents a "highly readable format inspired by the terminal". It reconstructs the attacker's session, showing the sshd process and its anomalous child process (sh).

Furthermore, it will show the exact command that was executed (sh -c setsid sh -c "usermod -aG sudo sysadmin_backup") and can even display the output of that command. This is the definitive "smoking gun", presented to the analyst in plain, human-readable text, effectively allowing them to watch the attacker's TTY session after the fact.

What are we seeing?

The user interface of the Session Viewer is explicitly designed to bridge the gap between abstract log analysis and the native terminal experience of a Linux administrator. Unlike the Event Analyzer, which focuses on malware process chains, the Session Viewer presents a time-ordered, tree-based visualization that reconstructs the linear narrative of a shell session.

The Process Tree and Timeline

The central component of the view is a Directed Acyclic Graph (DAG) displayed as a hierarchical list.

Vertical Flow: The Session Viewer arranges processes vertically, mimicking the flow of a terminal history file but preserving hierarchy. Child processes are indented relative to their parents. This allows an analyst to immediately distinguish between a command run directly by the user (e.g., curl) and a process spawned by a script execution (e.g., curl executing inside a setup.sh script).
Verbose Mode: A toggle allows analysts to switch between a filtered view (showing significant user activity) and "Verbose Mode." When enabled, this mode reveals typically noisy events like shell startup scripts (.bashrc execution), shell completion helpers, and forks caused by built-in commands. This is crucial for detecting persistence mechanisms hidden in profile scripts.
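The indented layout and the Verbose Mode toggle can be approximated with a small recursive renderer. The sketch below is only a model of the presentation described above: the id, parent, cmd, and noisy keys are hypothetical stand-ins for session data, and the real viewer decides what counts as noise using its own logic.

```python
def render_session(processes: list[dict], verbose: bool = False) -> list[str]:
    """Render a session as indented lines; entries marked noisy appear only in verbose mode."""
    children: dict = {}
    for proc in processes:
        children.setdefault(proc["parent"], []).append(proc)

    lines: list[str] = []

    def walk(parent_id, depth: int) -> None:
        for proc in children.get(parent_id, []):
            if proc.get("noisy") and not verbose:
                continue
            lines.append("  " * depth + proc["cmd"])
            walk(proc["id"], depth + 1)

    walk(None, 0)
    return lines


# Hypothetical session: one curl is run by a script, the other directly by the user.
session = [
    {"id": 1, "parent": None, "cmd": "bash"},
    {"id": 2, "parent": 1, "cmd": "source ~/.bashrc", "noisy": True},
    {"id": 3, "parent": 1, "cmd": "./setup.sh"},
    {"id": 4, "parent": 3, "cmd": "curl https://example.com/a"},
    {"id": 5, "parent": 1, "cmd": "curl https://example.com/b"},
]

filtered = render_session(session)
full = render_session(session, verbose=True)
print("\n".join(filtered))
```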

Visual Badging and Indicators

The UI employs a sophisticated system of badges and icons to provide immediate context without requiring the analyst to drill down into every node. These visual cues are essential for rapid triage.

Visual Indicators in Elastic Session Viewer

Badge/Icon	Visual Appearance	Meaning	Forensic Implication
Exec User Change	Explicit Text Badge	The user context changed (e.g., su, sudo).	Critical for identifying privilege escalation. Shows exactly when a standard user became root.
Process Alert	Gear Icon	A process event triggered a detection rule.	Indicates execution of malicious binaries or suspicious arguments (e.g., whoami).
File Alert	Page Icon	A file modification triggered a rule.	Indicates tampering, persistence creation (cron/systemd), or exfiltration staging.
Network Alert	Page Icon (Secondary)	A network event triggered a rule.	Indicates C2 communication, lateral movement, or exfiltration.
Multiple Alerts	Combined Badge	Single event triggered multiple rule types.	High-confidence indicator of malicious activity (e.g., a process dropped a file and executed it).
Alert Count	Numeric (e.g., (2))	Total alerts associated with a node.	Helps prioritize which steps in the chain were most "noisy" to detection logic.

Terminal Output View

Hovering over the Terminal Output button on a process node reveals a badge indicating the size of the captured output. Clicking this button opens the Terminal Output view, which renders the process.io.text data. This is the "Smoking Gun" feature for Linux investigations.

Replay Capability: It allows the analyst to see exactly what the user saw. If an attacker ran cat /etc/passwd, the process tree shows the execution; the Terminal Output view shows the content of the passwd file as it was displayed to the attacker.
Input Reconstruction: Because the viewer captures TTY I/O, it captures not just the command execution, but the typing. This can reveal backspaces, typos, and corrections (e.g., typing sdo [backspace] sudo), which are strong behavioral indicators of a human adversary rather than an automated script.
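The typing behavior mentioned above can be illustrated by replaying raw keystrokes and applying each backspace as it arrives. The sketch below assumes the captured input is available as a plain string in which backspace appears as \x7f or \x08. Real TTY captures also contain escape sequences and output, which this minimal example does not handle, and both function names are hypothetical.

```python
BACKSPACE_KEYS = {"\x7f", "\x08"}  # DEL and BS, the usual backspace bytes on a TTY


def replay_keystrokes(raw: str) -> str:
    """Apply backspaces in order and return the command as finally submitted."""
    buffer: list[str] = []
    for ch in raw:
        if ch in BACKSPACE_KEYS:
            if buffer:
                buffer.pop()
        else:
            buffer.append(ch)
    return "".join(buffer)


def count_corrections(raw: str) -> int:
    """Number of backspaces typed; a non-zero count hints at a human at the keyboard."""
    return sum(ch in BACKSPACE_KEYS for ch in raw)


typed = "sdo\x7f\x7f\x7fsudo id"  # "sdo", three backspaces, then "sudo id"
final_command = replay_keystrokes(typed)
corrections = count_corrections(typed)
print(final_command, corrections)
```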

The Elastic Advantage: AI-Powered Automated Hunting

The process described in Phase 3 demonstrates a powerful, analyst-driven investigation. However, the primary advantage of using Elastic Cloud Hosted (ECH) or Elastic Serverless is the programmatic access to an integrated Generative AI stack. This stack elevates the process from manual correlation to AI-driven automated hunting.

Note: Elastic's AI features work with the out-of-the-box Elastic Managed LLMs or with third-party LLMs configured using one of the available connectors.

7.1 From Alerts to Attacks: Automated Correlation with Attack Discovery

The GOAD + XZbot labs will generate multiple discrete alerts, as shown in the table above. A junior analyst would be faced with a queue of alerts (Potential Kerberoasting, Suspicious Certificate Request, and Potential XZBackdoor) and would have to manually "stitch together" this complex, cross-domain attack.

This is the problem solved by Attack Discovery. This GenAI feature, available in Enterprise and Serverless tiers, "delivers fully automated threat hunting at scale". In Elastic's words, "AI analyzes every alert to uncover hidden threats", automatically correlating the disparate signals from the Ludus lab into a single, high-fidelity "Attack" investigation.

The primary value of Attack Discovery for a forensic analyst is the compression of time. It automates the "mental stitching" that defines tier-one and tier-two analysis.

Deconstructing the "Mental Stitching"

Consider an example investigation without Attack Discovery.

Trigger: You see an alert: "Suspicious PowerShell Execution."
Query: You pivot to the host timeline.
Scan: You scroll back 15 minutes. You see a "File Download" event.
Hypothesis: "Maybe the user downloaded a bad file, which launched PowerShell."
Verification: You check the file name. It is invoice.js.
Conclusion: "Confirmed malware download."

This process takes between 10 and 30 minutes, depending on the analyst's skill and familiarity with the environment. Attack Discovery performs this entire sequence in seconds. It looks at the PowerShell alert, sees the file download event in the related context, and presents a Discovery stating: "User executed suspicious PowerShell script likely originating from downloaded file 'invoice.js'."
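
The manual "scan back 15 minutes" step can be sketched in a few lines of code. This is a simplified illustration of time-window correlation, not Attack Discovery's actual logic: the event shape (`host`, `type`, `timestamp`, `file`) and the function name `find_preceding_downloads` are assumptions made for the example.

```python
from datetime import datetime, timedelta


def find_preceding_downloads(alert, events, window=timedelta(minutes=15)):
    """Return file-download events on the alert's host within `window` before the alert."""
    start = alert["timestamp"] - window
    matches = [
        e for e in events
        if e["host"] == alert["host"]
        and e["type"] == "file_download"
        and start <= e["timestamp"] <= alert["timestamp"]
    ]
    # Oldest first, so the likely root cause reads top to bottom.
    return sorted(matches, key=lambda e: e["timestamp"])


# Invented sample data mirroring the walkthrough above.
alert = {
    "host": "ws01",
    "rule": "Suspicious PowerShell Execution",
    "timestamp": datetime(2025, 1, 1, 12, 0),
}
events = [
    {"host": "ws01", "type": "file_download", "file": "invoice.js",
     "timestamp": datetime(2025, 1, 1, 11, 52)},
    {"host": "ws01", "type": "file_download", "file": "old.zip",
     "timestamp": datetime(2025, 1, 1, 9, 0)},   # outside the window
    {"host": "ws02", "type": "file_download", "file": "other.js",
     "timestamp": datetime(2025, 1, 1, 11, 55)}, # different host
]

if __name__ == "__main__":
    for event in find_preceding_downloads(alert, events):
        print(event["file"], event["timestamp"].isoformat())
```

The fixed window and exact host match are the parts an analyst normally adjusts by hand, which is where most of the 10 to 30 minutes goes.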

This feature includes Data Persistence (results are saved for historical tracking) and Scheduling & Actions (it runs automatically and can trigger responses or subsequent Elastic Workflows), moving the SOC from a reactive to a proactive posture.

Example

In our example, alerts start to appear as the attack unfolds. Instead of triaging the alerts individually, we leverage Attack Discovery for triage, compressing the mean time to triage down to seconds and quickly identifying the two attacks.

7.2 Accelerating Triage with the AI Assistant

The Elastic Security Assistant uses generative AI to help you find, fix and understand security threats. It works directly inside Elastic Security. You interact with it through a chat interface to investigate alerts and write code.

In our example, once Attack Discovery identifies a correlated attack, we then use the AI Assistant to investigate. The assistant provides two key capabilities:

Natural Language Investigations: The analyst can ask plain-English questions like "Summarize this attack", "What is the MITRE Tactic for this process?", "What is print spooler?", or "Provide some remediation suggestions."

Agentic Query Validation workflow: This advanced feature allows the AI to "generate bespoke, validated ES|QL queries". An analyst can ask, "Find all network connections from the host involved in the XZbot alert", and the assistant will write, validate, and self-correct the query before presenting it, drastically lowering the skill barrier to high-end threat hunting.
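
For a sense of what such a generated query might look like, the sketch below builds an ES|QL string for the "network connections from a host" question. It is a hand-written approximation, not output captured from the AI Assistant. The index pattern `logs-endpoint.events.network-*`, the choice of ECS fields, the host name `lab-xz-backdoor`, and the helper name `build_network_query` are all assumptions for illustration.

```python
def build_network_query(host_name: str, limit: int = 100) -> str:
    """Build an ES|QL query listing recent network events for one host."""
    # Escape backslashes and double quotes so the host name stays inside
    # the string literal.
    safe_host = host_name.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        "FROM logs-endpoint.events.network-*",
        f'| WHERE host.name == "{safe_host}"',
        "| KEEP @timestamp, process.name, destination.ip, destination.port",
        "| SORT @timestamp DESC",
        f"| LIMIT {int(limit)}",
    ])


if __name__ == "__main__":
    print(build_network_query("lab-xz-backdoor"))
```

The value of the assistant's validation loop is that it catches the mistakes a hand-built query like this invites, such as a wrong field name or an index pattern that matches nothing in your deployment.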

How It Works

The Assistant connects your Elastic Stack to an LLM of your choice (e.g., GPT-5, Claude, Gemini). It uses Retrieval Augmented Generation (RAG) to fetch relevant data (logs, alerts, and internal documentation) from your environment. You can configure it to anonymize sensitive fields (PII or host/IP metadata) before sending the prompt to the model, so that your data remains private while the model reasons about the behavioral patterns.
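
The anonymization step can be pictured as a reversible token swap performed before the prompt leaves your environment. The sketch below is a conceptual illustration only, not Elastic's implementation. The field list, token format, sample alert values, and the helper names `anonymize` and `deanonymize` are assumptions.

```python
# Fields treated as sensitive in this example (an assumed list).
SENSITIVE_FIELDS = {"host.name", "user.name", "source.ip"}


def anonymize(alert: dict, sensitive=SENSITIVE_FIELDS) -> tuple[dict, dict]:
    """Replace sensitive values with placeholder tokens.

    Returns the cleaned alert plus a token-to-value mapping that never
    leaves the local environment.
    """
    cleaned, mapping = {}, {}
    for field, value in alert.items():
        if field in sensitive:
            token = f"<{field}:{len(mapping) + 1}>"
            mapping[token] = value
            cleaned[field] = token
        else:
            cleaned[field] = value
    return cleaned, mapping


def deanonymize(text: str, mapping: dict) -> str:
    """Restore original values in the model's response."""
    for token, value in mapping.items():
        text = text.replace(token, str(value))
    return text


if __name__ == "__main__":
    alert = {"host.name": "lab-DC01", "user.name": "example.user",
             "process.name": "powershell.exe"}
    cleaned, mapping = anonymize(alert)
    print(cleaned)
    print(deanonymize(f"Isolate {cleaned['host.name']}", mapping))
```

Behavioral fields such as `process.name` pass through untouched, which is what lets the model reason about the activity without seeing who or where it happened.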

7.3 Intelligent Automation with Elastic Workflows

The attacks described above generate complex, multi-stage alerts. Handling these manually is slow. Elastic has addressed this by acquiring Keep, an open-source AIOps and alert management platform. In Elastic 9.3, this technology is integrated directly into Kibana in Technical Preview as Elastic Workflows.

What are Workflows?

Elastic Workflows is an automation engine built into the Elasticsearch platform. You define Workflows in YAML - what triggers them, what steps they take, what actions they perform - and the platform handles execution. A Workflow can query your environment, transform and enrich security data, branch based on conditions, call external APIs, and integrate with services like Slack, Jira, PagerDuty and more through connectors you've already configured. Workflows can also call AI agents to reason through complex investigations, then continue with response actions based on what the agent discovers. Elastic Workflows combines scripted automation with AI reasoning natively in your SIEM, where your security data already lives.

How It Works: The "Alert Aggregator & Workflow Engine"

Workflows become the middleware layer between detection and remediation, working through three primary mechanisms:

Multi-Source Ingestion: Workflows extend beyond Elastic, pulling in additional data for enrichment, analysis, or initial triage.
Workflow-as-Code (YAML): Workflows are defined in YAML files. This allows teams to version control their incident response procedures as code.
The Workflow Engine: When an alert triggers in Elastic (or an external tool), the Workflow Engine executes a series of steps:
1. Enrichment: Querying an API (like VirusTotal or Active Directory) to add context.
2. Logic: Using if/else statements to determine severity.
3. Action: Sending a Slack message, creating a Jira ticket, or triggering an Elastic Defend response action.
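
The enrichment, logic, and action sequence can be sketched as a plain function to show the control flow. This is a conceptual model, not Elastic Workflows YAML or its API. The reputation lookup is a caller-supplied stub, the score thresholds are arbitrary, and the action names are hypothetical labels.

```python
from typing import Callable


def run_alert_workflow(alert: dict, lookup_reputation: Callable[[str], int]) -> dict:
    """Enrich an alert, derive a severity, and pick response actions."""
    # 1. Enrichment: add context from an external source (stubbed by the caller).
    score = lookup_reputation(alert["file_hash"])

    # 2. Logic: plain if/else on the enrichment result. Thresholds are illustrative.
    if score >= 80:
        severity = "critical"
    elif score >= 40:
        severity = "high"
    else:
        severity = "low"

    # 3. Action: decide what to do; a real workflow would call connectors here.
    actions = ["notify_chat"]
    if severity in ("high", "critical"):
        actions.append("create_ticket")
    if severity == "critical":
        actions.append("isolate_host")

    return {**alert, "reputation_score": score, "severity": severity, "actions": actions}


if __name__ == "__main__":
    demo_alert = {"rule": "Malicious Detection Alert", "file_hash": "abc123"}
    print(run_alert_workflow(demo_alert, lambda _hash: 90))
```

Passing the lookup in as a parameter keeps the decision logic testable without network access, which mirrors the benefit of keeping response procedures in version-controlled files.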

Consider an example Alert and Action flow.

Trigger: You connect the workflow to a specific rule, such as "Malicious Detection Alert".
Steps: You define a sequence of actions.
1. Triage (Agentic): Pass the alert to the AI Assistant and ask the question: "How would we remediate and respond to the alert below?"
2. Enrich: Attach the AI Assistant's response as a note to the alert.
3. Respond: Create a case with a link to the alert note.

Example

In our example, we have alerts that trigger our Workflow - Alert Enrichment & Case Creation.
We will also directly trigger it from the Workflows UI to demonstrate the various steps.

The alert context is provided as an input to the Security AI Assistant.
The response is added as a note to the security alert.
A case is created with metadata from the alert (timestamp, severity, rule name, and alert reason).
A link to the alert is added to the case as a comment. Note: this is not shown in the GIF.
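
As a rough picture of the case-creation step, the sketch below maps the four alert fields mentioned above into a case-like record. The output shape is hypothetical and is not the Kibana Cases API schema. The function name `build_case_from_alert`, the tags, and the sample alert are assumptions for illustration.

```python
def build_case_from_alert(alert: dict) -> dict:
    """Map alert metadata (timestamp, severity, rule name, reason) into a case record."""
    rule_name = alert.get("kibana.alert.rule.name", "Unknown rule")
    severity = alert.get("kibana.alert.severity", "low")
    reason = alert.get("kibana.alert.reason", "")
    timestamp = alert.get("@timestamp", "")
    return {
        "title": f"[{severity.upper()}] {rule_name}",
        "description": f"Alert raised at {timestamp}.\n\nReason: {reason}",
        "severity": severity,
        "tags": ["workflow", "auto-triage"],  # hypothetical tags
    }


if __name__ == "__main__":
    sample_alert = {
        "@timestamp": "2025-01-01T12:00:00Z",
        "kibana.alert.rule.name": "Potential Kerberoasting",
        "kibana.alert.severity": "high",
        "kibana.alert.reason": "example reason text",
    }
    print(build_case_from_alert(sample_alert))
```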

Conclusion: From Manual Setup to Continuous Emulation

This blog has provided a complete blueprint for an advanced, scalable, and, most importantly, safe simulation range.

We built: A complex, multi-lab range (GOAD + XZbot) was deployed with a single command using Ludus.
We instrumented: The entire range was seamlessly instrumented with Elastic Agent and Defend as part of the automated deployment, using the ludus_elastic_agent Ansible role.
We secured: The critical conflict between malware isolation and cloud-agent connectivity was solved using Ludus's granular "OPSEC" networking controls.
We validated: The platform's powerful SIEM capabilities were proven by validating Elastic's prebuilt, out-of-the-box detection rules against live, known-bad attacks.
We investigated: The specialized investigation tools, Event Analyzer and Session Viewer, were used to trace the exact attack paths on both Windows and Linux hosts.
We automated: The "force-multiplier" of Elastic's GenAI stack was demonstrated, with Attack Discovery automatically correlating disparate alerts into a single attack and the AI Assistant accelerating the final investigation.
We responded: Elastic Workflows provided the brains and automation for complex response actions and remediation flows.

This architecture is not a one-off build. It is a blueprint for a continuous detection engineering pipeline. It "modernizes security operations" by empowering purple teams to tear down, rebuild, and re-test their defenses on demand, ensuring their detection posture evolves as fast as the threats do.

Take the Next Step: Enable Your Security Team

The architecture in this blog is more than a technical exercise; it's a blueprint for continuous security validation. By pairing this automated range with Elastic’s unified SIEM and XDR platform, you can move from periodic testing to a state of constant readiness.

We invite you to start your own trial, leverage this guide to test and evaluate the platform against real-world threats, and enable your security team with the tools to stay one step ahead of the adversary.

Using another SIEM?

No problem. You can leverage Elastic Serverless and augment your existing SIEM, then gain all of the insights above while using your native SIEM's underlying data. Get started with an Elastic Serverless deployment today. The Elastic AI SOC Engine (EASE) package delivers these AI-driven capabilities, enabling organizations to rapidly add powerful analytics and an AI layer on top of their existing tools before the full migration.

Appendix

Example Full Range

Note: The Kali VM VLAN is outside of the GOAD and XZ backdoor hosts to simulate a segmented network or a remote attacker. The Kali VM VLAN can be changed to 10 or 20 to simulate "assumed breach" or internal attack scenarios.

global_role_vars:
  ludus_elastic_fleet_server: "https://<fleet_domain>:<fleet_port>" # Port 443 by default for cloud; an on-prem Fleet Server defaults to 8220
  ludus_elastic_agent_version: "9.2.1"
ludus:
  - vm_name: "{{ range_id }}-GOAD-DC01"
    hostname: "{{ range_id }}-DC01"
    template: win2019-server-x64-template
    vlan: 10
    ip_last_octet: 10
    ram_gb: 4
    cpus: 2
    windows:
      sysprep: true
    dns_rewrites:           # Any values in this array will be added to DNS for the range and return an A record for this VM's IP
      - sevenkingdoms.local
      - kingslanding.sevenkingdoms.local
      - kingslanding
    roles:
      - badsectorlabs.ludus_elastic_agent
    role_vars:
      ludus_elastic_enrollment_token: "<goad_policy_enrollment_token>"
  - vm_name: "{{ range_id }}-GOAD-DC02"
    hostname: "{{ range_id }}-DC02"
    template: win2019-server-x64-template
    vlan: 10
    ip_last_octet: 11
    ram_gb: 4
    cpus: 2
    windows:
      sysprep: true
    dns_rewrites:
      - winterfell.north.sevenkingdoms.local
      - north.sevenkingdoms.local
      - winterfell
    roles:
      - badsectorlabs.ludus_elastic_agent
    role_vars:
      ludus_elastic_enrollment_token: "<goad_policy_enrollment_token>"
  - vm_name: "{{ range_id }}-GOAD-DC03"
    hostname: "{{ range_id }}-DC03"
    template: win2016-server-x64-template
    vlan: 10
    ip_last_octet: 12
    ram_gb: 4
    cpus: 2
    windows:
      sysprep: true
    dns_rewrites:
      - essos.local
      - meereen.essos.local
      - meereen
    roles:
      - badsectorlabs.ludus_elastic_agent
    role_vars:
      ludus_elastic_enrollment_token: "<goad_policy_enrollment_token>"
  - vm_name: "{{ range_id }}-GOAD-SRV02"
    hostname: "{{ range_id }}-SRV02"
    template: win2019-server-x64-template
    vlan: 10
    ip_last_octet: 22
    ram_gb: 4
    cpus: 2
    windows:
      sysprep: true
    dns_rewrites:
      - castelblack.north.sevenkingdoms.local
      - castelblack
    roles:
      - badsectorlabs.ludus_elastic_agent
    role_vars:
      ludus_elastic_enrollment_token: "<goad_policy_enrollment_token>"
  - vm_name: "{{ range_id }}-GOAD-SRV03"
    hostname: "{{ range_id }}-SRV03"
    template: win2019-server-x64-template
    vlan: 10
    ip_last_octet: 23
    ram_gb: 4
    cpus: 2
    windows:
      sysprep: true
    dns_rewrites:
      - braavos.essos.local
      - braavos
    roles:
      - badsectorlabs.ludus_elastic_agent
    role_vars:
      ludus_elastic_enrollment_token: "<goad_policy_enrollment_token>"
  - vm_name: "{{ range_id }}-xz-backdoor-dect"
    hostname: "{{ range_id }}-xz-backdoor-dect"
    template: debian-12-x64-server-template
    vlan: 20
    ip_last_octet: 1
    ram_gb: 2
    cpus: 2
    linux:
      packages: # You can define packages to install on Linux hosts
        - ca-certificates
        - netcat-openbsd
        - net-tools
    roles:
      - badsectorlabs.ludus_xz_backdoor
      - badsectorlabs.ludus_elastic_agent
    role_vars:
      ludus_xz_backdoor_install_xzbot: true
      ludus_xz_backdoor_install_backdoor: true
      ludus_elastic_enrollment_token: "<linux_policy_enrollment_token>"
  - vm_name: "{{ range_id }}-kali"
    hostname: "{{ range_id }}-kali"
    template: kali-x64-desktop-template
    vlan: 50
    ip_last_octet: 99
    ram_gb: 8
    cpus: 4
    linux: true
    testing:
      snapshot: false # Snapshot this VM going into testing, and revert it coming out of testing. Default: true
      block_internet: false # Allow internet access for Kali, default is true
    roles:
      - badsectorlabs.ludus_xz_backdoor
      - badsectorlabs.ludus_elastic_agent
    role_vars:
      ludus_xz_backdoor_install_xzbot: true
      ludus_elastic_enrollment_token: "<linux_policy_enrollment_token>"
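
Because a range file this long is mostly repeated blocks, copy-paste mistakes are easy to make. The sketch below is one possible pre-deployment check, assuming the `ludus:` list has already been parsed into Python dictionaries by a YAML library of your choice. The function name `validate_range` and the sample entries are hypothetical, and this check is not part of Ludus.

```python
from collections import Counter

AGENT_ROLE = "badsectorlabs.ludus_elastic_agent"


def validate_range(vms: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in the VM definitions."""
    problems = []

    # Two VMs sharing a VLAN and last octet would collide on the same IP.
    addresses = Counter((vm["vlan"], vm["ip_last_octet"]) for vm in vms)
    for (vlan, octet), count in addresses.items():
        if count > 1:
            problems.append(f"duplicate address: vlan {vlan}, ip_last_octet {octet}")

    # Every VM that installs the agent role needs an enrollment token.
    for vm in vms:
        if AGENT_ROLE in vm.get("roles", []):
            token = vm.get("role_vars", {}).get("ludus_elastic_enrollment_token")
            if not token:
                problems.append(f"{vm['vm_name']}: agent role without enrollment token")

    return problems


if __name__ == "__main__":
    sample = [
        {"vm_name": "DC01", "vlan": 10, "ip_last_octet": 10,
         "roles": [AGENT_ROLE],
         "role_vars": {"ludus_elastic_enrollment_token": "<token>"}},
        {"vm_name": "DC02", "vlan": 10, "ip_last_octet": 10,
         "roles": [AGENT_ROLE]},
    ]
    for problem in validate_range(sample):
        print(problem)
```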

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
