Cisco Talos has released EvidenceForge, an open-source synthetic log generation tool designed to address a persistent gap in security operations: the lack of high-quality, labeled telemetry for training analysts, validating detections, and building machine learning models.

The Problem with Existing Datasets

Security teams routinely need realistic log data with known ground truth, but the available options carry significant trade-offs. Production telemetry raises compliance concerns. Public datasets such as the LANL dataset and OpTC have been so heavily anonymized that they no longer resemble actual log sources. Attack simulation frameworks like Atomic Red Team and MITRE Caldera require real infrastructure, scale poorly, and are time-consuming to operate. Hiring a red team is expensive and yields data for only the specific scenario the team executed.

Existing synthetic generators address some of these constraints but share a common architectural weakness: they emit events independently, one format at a time, with no shared state across log sources. The result is a dataset in which a process recorded in Sysmon does not line up with the same process in Windows Security Events, or a network logon leaves no matching connection trace elsewhere. That missing causal thread across sources is what lets experienced analysts recognize the data as synthetic.

How EvidenceForge Works

EvidenceForge takes a different approach by grounding all output in a single canonical SecurityEvent object. Every log entry in every format derives from that shared object, which carries a timestamp, an event type, and more than 30 composable context objects covering process identity, network connection details, authentication metadata, DNS, and HTTP protocol data.

Because all emitters read from the same shared context, they cannot produce conflicting values. There is one process ID, one logon ID, one Zeek UID, and one timestamp across all correlated output. The engine is also OS-aware, generating Windows Security Events and Sysmon output for Windows hosts while producing syslog and bash history for Linux hosts, according to the role each system plays in the scenario.
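
The shared-object design can be illustrated with a minimal Python sketch. It is not EvidenceForge's actual code: the class layout, the field names on SecurityEvent, and the two emitter functions are assumptions made for illustration, and the output records are heavily simplified versions of a Sysmon process creation event (Event ID 1) and a Windows Security process creation event (Event ID 4688).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProcessContext:
    pid: int
    image: str
    command_line: str


@dataclass(frozen=True)
class AuthContext:
    logon_id: int
    user: str


@dataclass(frozen=True)
class SecurityEvent:
    """Hypothetical canonical event; the real object carries 30+ contexts."""
    timestamp: datetime
    event_type: str
    host: str
    process: Optional[ProcessContext] = None
    auth: Optional[AuthContext] = None


def emit_sysmon(ev: SecurityEvent) -> dict:
    # Simplified Sysmon-style process creation record (Event ID 1).
    return {
        "EventID": 1,
        "UtcTime": ev.timestamp.isoformat(),
        "Computer": ev.host,
        "ProcessId": ev.process.pid,
        "Image": ev.process.image,
        "CommandLine": ev.process.command_line,
        "LogonId": hex(ev.auth.logon_id),
        "User": ev.auth.user,
    }


def emit_windows_security(ev: SecurityEvent) -> dict:
    # Simplified Windows Security process creation record (Event ID 4688).
    return {
        "EventID": 4688,
        "TimeCreated": ev.timestamp.isoformat(),
        "Computer": ev.host,
        "NewProcessId": hex(ev.process.pid),
        "NewProcessName": ev.process.image,
        "SubjectLogonId": hex(ev.auth.logon_id),
        "SubjectUserName": ev.auth.user,
    }


# One canonical event, built once, feeds both emitters.
event = SecurityEvent(
    timestamp=datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc),
    event_type="process_create",
    host="WS-01",
    process=ProcessContext(
        pid=4312,
        image=r"C:\Windows\System32\cmd.exe",
        command_line="cmd.exe /c whoami",
    ),
    auth=AuthContext(logon_id=0x3E7A1, user="jdoe"),
)

sysmon_record = emit_sysmon(event)
security_record = emit_windows_security(event)
```

Because neither emitter invents its own values, the process ID, logon ID, and timestamp in the two records can differ only in formatting, never in substance.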

Scenarios are defined in a YAML configuration file that describes the simulated environment, including hosts, users, and network topology, along with an optional attack storyline. The engine incorporates timing models, assigns roles to users and systems, and generates realistic background noise and red herrings alongside malicious activity.
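
To make the scenario idea concrete, the sketch below expresses a scenario as a Python dictionary of the kind a YAML file might load into, then interleaves background noise with storyline steps on one labeled timeline. The article does not show the real schema, so every key name here is hypothetical, and the uniform random spacing of noise events is only a stand-in for whatever timing models the engine actually uses.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical scenario structure; the real YAML schema may differ.
scenario = {
    "start": datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc),
    "duration_minutes": 60,
    "hosts": [
        {"name": "WS-01", "os": "windows", "role": "workstation"},
        {"name": "WEB-01", "os": "linux", "role": "web_server"},
    ],
    "users": [{"name": "jdoe", "role": "analyst"}],
    # The storyline is optional; omit it for a benign-only dataset.
    "storyline": [
        {"offset_minutes": 20, "host": "WS-01", "action": "payload_executed"},
        {"offset_minutes": 35, "host": "WEB-01", "action": "lateral_movement"},
    ],
}


def build_timeline(scenario, noise_events=10, seed=0):
    """Merge benign noise and attack steps into one time-ordered list."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    start = scenario["start"]
    timeline = []

    # Background noise: benign activity at random times on random hosts.
    for _ in range(noise_events):
        host = rng.choice(scenario["hosts"])
        offset = rng.uniform(0, scenario["duration_minutes"])
        timeline.append({
            "time": start + timedelta(minutes=offset),
            "host": host["name"],
            "action": "routine_activity",
            "malicious": False,
        })

    # Attack steps: placed at the offsets the storyline specifies.
    for step in scenario.get("storyline", []):
        timeline.append({
            "time": start + timedelta(minutes=step["offset_minutes"]),
            "host": step["host"],
            "action": step["action"],
            "malicious": True,
        })

    timeline.sort(key=lambda entry: entry["time"])
    return timeline


timeline = build_timeline(scenario)
```

The "malicious" flag on each entry plays the role of ground truth: it is known at generation time rather than inferred afterward.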

Supported Log Formats

A single scenario produces correlated output across more than 20 formats, including:

  • Windows Security Events (covering authentication, process lifecycle, Kerberos, persistence, and account management)
  • Sysmon
  • EDR and XDR telemetry
  • Linux syslog and bash history
  • Zeek logs in JSON format
  • Snort IDS alerts
  • Firewall, web server access, and forward HTTP proxy logs

Intended Use and Limitations

Talos positions EvidenceForge as useful for training threat hunters and incident responders, validating detection logic against labeled ground truth, and producing the balanced, multi-source telemetry that supervised ML models require at scale. The project also generates ground truth documentation and an analyst briefing alongside each dataset.
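
As a rough illustration of what validating detection logic against labeled ground truth can look like, the Python sketch below scores a toy detection rule on a handful of labeled events. The event layout, the "malicious" label field, and the rule itself are assumptions made for this example; the article does not describe the format of EvidenceForge's ground truth documentation.

```python
def evaluate_detection(events, rule):
    """Score a detection rule against events labeled with ground truth."""
    true_pos = false_pos = false_neg = 0
    for ev in events:
        flagged = rule(ev)
        if flagged and ev["malicious"]:
            true_pos += 1
        elif flagged:
            false_pos += 1
        elif ev["malicious"]:
            false_neg += 1
    flagged_total = true_pos + false_pos
    malicious_total = true_pos + false_neg
    return {
        "precision": true_pos / flagged_total if flagged_total else 0.0,
        "recall": true_pos / malicious_total if malicious_total else 0.0,
    }


# Hypothetical labeled events; a real dataset would be far larger.
labeled_events = [
    {"command_line": "powershell -enc SQBFAFgA", "malicious": True},
    {"command_line": "powershell Get-ChildItem", "malicious": False},
    {"command_line": "cmd.exe /c whoami", "malicious": True},
    {"command_line": "notepad.exe report.txt", "malicious": False},
]


def flags_encoded_powershell(ev):
    # Toy rule: flag any command line that passes an encoded command.
    return "-enc" in ev["command_line"].lower()


result = evaluate_detection(labeled_events, flags_encoded_powershell)
```

Here the rule flags one event correctly and misses the second malicious one, so precision is 1.0 and recall is 0.5. With labels attached to every event, this kind of scoring needs no manual triage.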

Talos is candid about the tool’s ceiling: no purely synthetic dataset will consistently fool a seasoned analyst, and EvidenceForge does not attempt to achieve that. The stated goal is fidelity sufficient to be operationally useful, not output that is indistinguishable from production data.