Cybersecurity statistics:
Methodology and sources
Purpose of this page
This page explains how the cybersecurity statistics presented on our Cybersecurity Statistics page are collected, processed, and interpreted, and provides full transparency regarding the data sources referenced. The main Cybersecurity Statistics page presents summarized findings and NordVPN research insights.
Data sources and attribution
Source discovery is performed via Google Custom Search API (GCS), using multiple Custom Search Engines (CSEs) configured for:
media outlets: 44 mainstream and tech media sources (e.g., BBC, CNN, The New York Times, WSJ, FT, Reuters, Bloomberg, TechCrunch, Wired, Ars Technica, Time, Forbes).
authoritative/reference sites: 25 industry and expert sources (e.g., CISA, KrebsOnSecurity, The Hacker News, Dark Reading, BleepingComputer, SecurityWeek, Infosecurity Magazine).
local news: 100+ regional and national outlets across APAC, EMEA, and the Americas (e.g., Channel NewsAsia, CSA.gov.sg, Zaobao; HK01, unwire.hk; Japan Times, NISC, JPCERT, ITMedia).
unrestricted/general: broad searches with no source restrictions.
Queries are keyword-driven from a maintained keyword list that groups terms by category.
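For illustration, a single discovery query against one CSE might look like the minimal sketch below, using the public Custom Search JSON API. The function name and result handling here are illustrative, not the production code; `dateRestrict` is the API's built-in recency filter.

```python
import requests

GCS_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def discover_links(api_key: str, cse_id: str, keyword: str, days: int = 1) -> list[dict]:
    """Run one keyword query against one Custom Search Engine (CSE)."""
    params = {
        "key": api_key,              # GCS API key
        "cx": cse_id,                # CSE configured for a source group (media, expert, local, general)
        "q": keyword,                # seed keyword from the maintained list
        "dateRestrict": f"d{days}",  # restrict results to the last N days (daily runs use d1)
        "num": 10,                   # API maximum results per request
    }
    resp = requests.get(GCS_ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return [{"url": it["link"], "title": it.get("title", "")} for it in items]
```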
All records include explicit attribution:
Original article link
Media outlet (domain extracted from the URL)
Publication date and collection date
We synthesize information from many sources for statistics and event aggregation; each statistic is derived from article-level evidence stored with links.
Content retrieval and collection cadence
The pipeline fetches full-text content from discovered links with:
Primary: NewsPlease
Fallback: direct HTML download via a hardened requests session, with trafilatura extraction.
Timeouts, retries, TLS fallbacks, and referer headers are used to reduce transient failures.
Publication date and title are taken from the extractor when available; date parsing is normalized to date-only.
Daily runs query content published within the last day.
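A simplified sketch of this primary/fallback retrieval chain, assuming the news-please and trafilatura packages; the timeout values, retry policy, and referer header shown are illustrative:

```python
import requests
import trafilatura
from newsplease import NewsPlease
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def fetch_article(url: str) -> dict | None:
    """Primary: NewsPlease. Fallback: hardened requests session + trafilatura."""
    try:
        art = NewsPlease.from_url(url)
        if art and art.maintext:
            date = art.date_publish.date() if art.date_publish else None  # normalize to date-only
            return {"title": art.title, "text": art.maintext, "date": date}
    except Exception:
        pass  # fall through to the fallback extractor

    session = requests.Session()
    retries = Retry(total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503])
    session.mount("https://", HTTPAdapter(max_retries=retries))
    try:
        resp = session.get(url, timeout=30, headers={"Referer": "https://news.google.com/"})
        resp.raise_for_status()
    except requests.RequestException:
        return None  # persistent failure: the article is skipped
    text = trafilatura.extract(resp.text)
    return {"title": None, "text": text, "date": None} if text else None
```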
Feature extraction
Extracted fields include:
Media outlet (from URL)
First paragraph (first 3–5 sentences)
Keyword features: total count in text, presence in title, sentences containing the seed keyword, and presence of any keywords from the maintained list
Word count
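A minimal sketch of how these features might be derived from the extracted text; the field names are illustrative:

```python
import re

def keyword_features(title: str, text: str, seed_keyword: str, keyword_list: list[str]) -> dict:
    """Derive per-article keyword features (field names are illustrative)."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    body, seed = text.lower(), seed_keyword.lower()
    return {
        "keyword_count": body.count(seed),            # total occurrences in the text
        "keyword_in_title": seed in title.lower(),    # presence in the title
        "seed_sentences": [s for s in sentences if seed in s.lower()],
        "any_list_keyword": any(k.lower() in body for k in keyword_list),
        "word_count": len(text.split()),
    }
```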
LLM relevance assessment
Each article is evaluated by an LLM with a deterministic setting (temperature 0) and a constrained prompt that requires explicit, structured outputs:
Whether the article is cyber-event relevant
If relevant, a high-level event type is assigned:
Incident: A confirmed cyberattack or breach has already occurred (e.g., ransomware deployment, data exfiltration, DDoS, system compromise).
Vulnerability: Discovery or disclosure of a security flaw in software/hardware/systems that could be exploited (potential risk rather than confirmed exploitation).
Threat Intelligence: Reporting on threat actors, tools, TTPs, and campaigns—focuses on “who/how,” not a specific victim incident.
Regulatory‑Legal: Laws, regulations, enforcement actions, court decisions, or major policy changes that affect cybersecurity obligations.
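For illustration only, a relevance call could be shaped as below, assuming an OpenAI-compatible client; the model name, prompt wording, and text truncation are placeholders rather than the production prompt:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible API; the model below is a placeholder

SYSTEM_PROMPT = (
    "Decide whether the article describes a cybersecurity event. Respond ONLY with JSON: "
    '{"relevant": true|false, "event_type": "Incident" | "Vulnerability" | '
    '"Threat Intelligence" | "Regulatory-Legal" | null}'
)

def assess_relevance(title: str, text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        temperature=0,                            # deterministic setting
        response_format={"type": "json_object"},  # constrain output to JSON
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Title: {title}\n\nText: {text[:4000]}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```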
Article type and categorization
Relevant articles are categorized via structured taxonomy prompts (primary: attack status, event type, regulatory/legal; secondary: impact metrics/class, technical specifics, sectors, geography, size, approximate damage).
Event clustering (article-to-event aggregation)
Objective: group articles that describe the same underlying incident into a single ‘event’.
Method:
Retrieve existing events from the database to provide context (titles, known organizations affected, threat actors, links).
For each candidate article (where Article Type = Single Incident), the LLM compares article details against batches of existing events and either:
1. Assigns an existing event ID when there’s a high-confidence match, or
2. Creates a new event otherwise.
Prompts emphasize high precision: only link to an existing event when highly confident. Organization(s) affected and threat actor signals are treated as strong indicators.
Events maintain aggregated fields: first/last seen dates, article count, organizations affected, threat actors, titles, links.
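A simplified sketch of the match-or-create loop; the batch size, event-ID scheme, and the `llm_match` stand-in are illustrative, not the production implementation:

```python
import uuid

BATCH_SIZE = 20  # illustrative batch size for event comparison

def create_event(article: dict) -> str:
    """Open a new event record seeded from one article (schema is illustrative)."""
    return str(uuid.uuid4())

def cluster_article(article: dict, existing_events: list[dict], llm_match) -> str:
    """Assign an article to an existing event ID or create a new event.

    `llm_match` stands in for a deterministic LLM call that compares the article
    against one batch of events and returns {"event_id": str | None, "confident": bool}.
    """
    for i in range(0, len(existing_events), BATCH_SIZE):
        verdict = llm_match(article, existing_events[i:i + BATCH_SIZE])
        if verdict.get("confident") and verdict.get("event_id"):
            return verdict["event_id"]  # high-confidence match: link to existing event
    return create_event(article)        # no confident match in any batch: new event
```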
Accuracy and quality assurance
Determinism and constraints:
LLM temperature set to 0 to maximize determinism and reduce hallucinations.
Constrained prompts require explicit fields and JSON outputs; parsing enforces schema.
Non-content articles (missing title/text) are rejected early.
LLM-governed, schema-validated metrics:
All metric fields are produced by deterministic LLM runs (temperature 0) under strict, documented guidelines and JSON schemas; only schema‑compliant outputs are counted, with periodic human QA to calibrate and prevent drift.
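As an example of this kind of schema enforcement, outputs could be validated with pydantic; the field set shown is illustrative, not the full documented schema:

```python
from pydantic import BaseModel, ValidationError

class MetricOutput(BaseModel):
    """Illustrative schema; the real field set follows the documented guidelines."""
    relevant: bool
    event_type: str | None = None
    organizations_affected: list[str] = []
    threat_actors: list[str] = []

def parse_metric(raw_llm_output: str) -> MetricOutput | None:
    try:
        return MetricOutput.model_validate_json(raw_llm_output)
    except ValidationError:
        return None  # non-compliant output is rejected and never counted
```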
Event/article classification for precision filtering:
Event-type and article-focus classification serves as a strict relevance gate, filtering out off-topic, low-signal, or roundup-style content. This focus on single-incident reporting reduces noise and measurably improves dataset precision and accuracy.
Multi-source validation:
Event clustering references previously stored event context; mismatches reduce the chance of incorrect merges.
Aggregations include the list of source links per event for manual verification.
Human-in-the-loop:
High-impact or ambiguous cases can be flagged for editorial review and fact-checking.
Regular QA reviews: sampled articles and events are audited on a monthly cadence, including a precision review; any drift triggers prompt/model or keyword adjustments.
Traceability:
Every statistic can be traced to articles and links contained in the database for auditability.
Limitations
Coverage limits:
GCS-based discovery depends on keywords and CSE configuration; not all incidents are captured, especially those outside configured languages or behind paywalls.
Some sites block automated retrieval; such articles may be partially or fully missing.
LLM-specific risks:
Despite deterministic settings and structured prompts, misclassification can occur, particularly with sparse or ambiguous texts.
Event clustering may split the same incident into multiple events or merge similar but distinct incidents in edge cases.
How statistics are computed
Article-level fields are derived from direct extraction and LLM outputs (stored per record).
Event-level metrics aggregate constituent articles by event_id:
article counts, first/last seen dates
de-duplicated organizations affected and threat actors
representative titles and canonical link lists
Report statistics pull from these stored tables; each figure can be traced back to event rows and underlying article records.
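A minimal sketch of that roll-up using pandas; the column names are illustrative:

```python
import pandas as pd

def aggregate_events(articles: pd.DataFrame) -> pd.DataFrame:
    """Roll article rows up to event-level metrics (column names are illustrative)."""
    return (
        articles.groupby("event_id")
        .agg(
            article_count=("url", "count"),
            first_seen=("publication_date", "min"),
            last_seen=("publication_date", "max"),
            organizations=("organization", lambda s: sorted(set(s.dropna()))),
            threat_actors=("threat_actor", lambda s: sorted(set(s.dropna()))),
            links=("url", list),
        )
        .reset_index()
    )
```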
Scope of the data
The statistics and insights referenced across our cybersecurity content are derived from a combination of:
Publicly available cybersecurity incident reporting
Media coverage of confirmed cyber incidents
Industry reports and surveys
Government and regulatory disclosures
The data reflects publicly observable and reported activity, not the full universe of all cyber incidents that occur globally. Many cyber events are never disclosed, reported, or covered by the media.
Data sources and discovery
Source types
Cybersecurity-related articles and reports are collected from multiple source categories, including:
Mainstream and technology media: major international news organizations and technology publications.
Authoritative and expert cybersecurity sources: government agencies, cybersecurity research organizations, and established industry publications.
Regional and local news outlets: covering cybersecurity incidents across North America, Europe, Asia-Pacific, and other regions.
Industry and research reports: annual breach reports, threat landscape reports, surveys, and economic analyses.
Each source is attributed at the article or report level, with publication date, outlet, and original URL preserved.
Discovery process
Content discovery is performed using automated search queries based on a maintained cybersecurity keyword list. Keywords are grouped by topic (for example: data breaches, ransomware, phishing, vulnerabilities, regulation).
Searches run daily to capture newly published content. Each run queries recent material only, so the dataset reflects current reporting.
Content collection and processing
Article retrieval
Once a source is discovered, the full article text is retrieved using automated extraction tools. Where primary extraction fails, fallback methods are used to ensure robust coverage.
Deduplication
To avoid double counting:
Identical URLs are processed only once
Re-published or syndicated content is deduplicated at the article level
Event-level aggregation (described below) further reduces duplication across outlets
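A simplified sketch of the URL- and content-level deduplication described above; the normalization rules are illustrative:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

seen_urls: set[str] = set()
seen_bodies: set[str] = set()

def canonical_url(url: str) -> str:
    """Drop query strings/fragments and normalize case so tracking variants collapse."""
    p = urlsplit(url)
    return urlunsplit((p.scheme, p.netloc.lower(), p.path.rstrip("/"), "", ""))

def body_fingerprint(text: str) -> str:
    """Hash whitespace-normalized body text to catch syndicated re-publications."""
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

def is_duplicate(url: str, text: str) -> bool:
    u, b = canonical_url(url), body_fingerprint(text)
    if u in seen_urls or b in seen_bodies:
        return True
    seen_urls.add(u)
    seen_bodies.add(b)
    return False
```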
Relevance filtering and classification
Cybersecurity relevance assessment
Each article is evaluated to determine whether it is relevant to cybersecurity statistics. Articles must meaningfully describe or analyze a cybersecurity event, threat, vulnerability, or regulatory action.
Event type classification
Relevant articles are classified into high-level categories, including:
Incident – A confirmed cyberattack or breach that has already occurred
Vulnerability – Disclosure of a security weakness that could be exploited
Threat intelligence – Reporting on threat actors, tools, campaigns, or techniques
Regulatory / legal – Laws, enforcement actions, policy changes, or legal proceedings related to cybersecurity
This classification ensures that statistics referring to “incidents,” “breaches,” or “attacks” are not conflated with vulnerability disclosures or general commentary.
Event clustering (article-to-event aggregation)
Multiple articles often report on the same underlying cyber incident. To prevent overcounting:
Articles describing the same incident are grouped into a single event
Events are assigned stable internal identifiers
Articles are linked to existing events only when there is high confidence they describe the same occurrence
Indicators used for clustering include affected organizations, threat actors, timelines, and incident descriptions.
Event-level records maintain:
First and last appearance dates
Number of related articles
Affected organizations
Referenced threat actors
Source links for verification
Use of automated analysis and quality controls
Automated classification
Structured, deterministic language-model analysis is used for classification, extraction, and aggregation. All automated outputs follow predefined schemas to ensure consistency.
The models operate with deterministic settings to reduce variability and hallucination risk.
Quality assurance
To maintain accuracy:
Schema validation ensures only properly structured outputs are counted
Monthly sampling and review procedures, including a precision review, detect classification drift; detected drift triggers prompt, model, or keyword adjustments
Ambiguous or high-impact cases are flagged for human review
Aggregated statistics retain traceability to individual articles and events
How statistics are calculated
Article-level vs event-level metrics
Some statistics are based on:
Article-level counts (e.g., volume of media coverage)
Event-level counts (e.g., number of distinct breaches or incidents)
Where applicable, event-level metrics are preferred to reduce duplication.
Interpretation of counts and frequencies
Statistics such as “incidents per day” or “breaches per year” represent reported or media-visible activity, not total global activity.
Vendor telemetry, government complaint systems, and economic projections often report significantly higher volumes due to differences in scope and methodology. These differences are noted where relevant.
Limitations and considerations
While care is taken to ensure accuracy and consistency, the data has inherent limitations:
Not all incidents are publicly disclosed or reported
Media coverage varies by region, sector, and incident scale
Some sources restrict automated access
Classification errors may occur in edge cases
Economic loss figures may change as investigations evolve
Statistics should therefore be interpreted as directional indicators, not exhaustive measurements.
Sources Index
Each numbered source below corresponds to a superscript reference used on the Cybersecurity Statistics page. Superscripts link directly to the relevant source entry on this page.
Source 1 Statista –
Source 2 Identity Theft
Source 3 Identity Theft
Source 4 Verizon –
Source 5 IBM –
Source 6 South Korean
Source 7 Aflac – June
Source 8 HIPAA Journal –
Source 9 California Attorney
Source 10 Iowa Attorney
Source 11 Rhode Island
Source 12 Rhode Island
Source 13 Aflac Newsroom –
Source 14 HIPAA Journal –
Source 15 Office of the
Source 16 Qantas – Information
Source 17 Qantas Newsroom –
Source 18 Michigan Attorney
Source 19 Maine Attorney
Source 20 California Attorney
Source 21 University of
Source 22 Microsoft Digital
Source 23 WIRED – NotPetya
Source 24 Reuters – UnitedHealth
Source 25 The Guardian – Jaguar
Source 26 NBC News –
Source 27 Delaware Department
Source 28 Cybersecurity
Source 29 JumpCloud – Phishing
Source 30 Hornetsecurity – Email
Source 31 Spearshield –
Source 32 APWG – Phishing
Source 33 arXiv – Academic
Source 34 DeepStrike – Password
Source 35 NordPass – Top 200
Source 36 Financial Times –
Source 37 SecurityScorecard –
Source 38 National Technology &
Source 39 Palo Alto Networks –
Source 40 IBM – Threat
Source 41 Tenable –
Source 42 Cybersecurity
Source 43 Statista Market
Source 44 Statista – Cost of
Source 45 FTC – Consumer
Source 46 FBI IC3 – 2024 Internet
Source 47 Kroll – Data Breach
Source 48 IBM – Cost of a Data
Source 49 SailPoint – 2024
Source 50 DeepStrike –
Source 51 Proofpoint &
Source 52 Check Point –
Source 53 Thales – 2024
Source 54 Cyfirma – Energy &
Source 55 World Economic
Source 56 DeepStrike – Cyber
Source 57 Devolutions – State of
Source 58 TotalAssure –
Source 59 Cisco – Cybersecurity
Source 60 IANS Research –
Source 61 Munich Re –
Source 62 Gartner – 2025
Source 63 Forrester – 2024
Source 64 Ivanti – State of
Source 65 U.S. Department of
Source 66 U.S. Department of
Source 67 Google Cloud –
Source 68 Gartner – Generative AI
Source 69 Splashtop – Top
Source 70 ENISA – Threat