Cybersecurity statistics:
Methodology and sources
Purpose of this page
This page explains how the cybersecurity statistics presented on our Cybersecurity Statistics page are collected, processed, and interpreted, and provides full transparency regarding the data sources referenced. The main Cybersecurity Statistics page presents summarized findings and NordVPN research insights.
Data sources and attribution
Source discovery is performed via Google Custom Search API (GCS), using multiple Custom Search Engines (CSEs) configured for:
media outlets: 44 mainstream and tech media sources (e.g., BBC, CNN, The New York Times, WSJ, FT, Reuters, Bloomberg, TechCrunch, Wired, Ars Technica, Time, Forbes).
authoritative/reference sites: 25 industry and expert sources (e.g., CISA, KrebsOnSecurity, The Hacker News, Dark Reading, BleepingComputer, SecurityWeek, Infosecurity Magazine).
local news: 100+ regional and national outlets across APAC, EMEA, and the Americas (e.g., Channel NewsAsia, CSA.gov.sg, Zaobao; HK01, unwire.hk; Japan Times, NISC, JPCERT, ITMedia).
unrestricted/general: broad searches with no source restrictions.
Queries are keyword-driven from a maintained keyword list that groups terms by category.
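For illustration, a single discovery query against one CSE might look like the minimal sketch below, using the public Custom Search JSON API. The function name and result handling here are illustrative, not the production code; `dateRestrict` is the API's built-in recency filter.

```python
import requests

GCS_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def discover_links(api_key: str, cse_id: str, keyword: str, days: int = 1) -> list[dict]:
    """Run one keyword query against one Custom Search Engine (CSE)."""
    params = {
        "key": api_key,              # GCS API key
        "cx": cse_id,                # CSE configured for a source group (media, expert, local, general)
        "q": keyword,                # seed keyword from the maintained list
        "dateRestrict": f"d{days}",  # restrict results to the last N days (daily runs use d1)
        "num": 10,                   # API maximum results per request
    }
    resp = requests.get(GCS_ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return [{"url": it["link"], "title": it.get("title", "")} for it in items]
```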
All records include explicit attribution:
Original article link
Media outlet (domain extracted from the URL)
Publication date and collection date
We synthesize information from many sources for statistics and event aggregation; each statistic is derived from article-level evidence stored with links.
Content retrieval and collection cadence
The pipeline fetches full-text content from discovered links with:
Primary: NewsPlease
Fallback: direct HTML download via a hardened requests session, with trafilatura extraction.
Timeouts, retries, TLS fallbacks, and referer headers are used to reduce transient failures.
Publication date and title are taken from the extractor when available; date parsing is normalized to date-only.
Daily runs query content published within the last day.
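A simplified sketch of this primary/fallback retrieval chain, assuming the news-please and trafilatura packages; the timeout values, retry policy, and referer header shown are illustrative:

```python
import requests
import trafilatura
from newsplease import NewsPlease
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def fetch_article(url: str) -> dict | None:
    """Primary: NewsPlease. Fallback: hardened requests session + trafilatura."""
    try:
        art = NewsPlease.from_url(url)
        if art and art.maintext:
            date = art.date_publish.date() if art.date_publish else None  # normalize to date-only
            return {"title": art.title, "text": art.maintext, "date": date}
    except Exception:
        pass  # fall through to the fallback extractor

    session = requests.Session()
    retries = Retry(total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503])
    session.mount("https://", HTTPAdapter(max_retries=retries))
    try:
        resp = session.get(url, timeout=30, headers={"Referer": "https://news.google.com/"})
        resp.raise_for_status()
    except requests.RequestException:
        return None  # persistent failure: the article is skipped
    text = trafilatura.extract(resp.text)
    return {"title": None, "text": text, "date": None} if text else None
```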
Feature extraction
Extracted fields include:
Media outlet (from URL)
First paragraph (first 3–5 sentences)
Keyword features: total count in text, presence in title, sentences containing the seed keyword, and presence of any keywords from the maintained list
Word count
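A minimal sketch of how these features might be derived from the extracted text; the field names are illustrative:

```python
import re

def keyword_features(title: str, text: str, seed_keyword: str, keyword_list: list[str]) -> dict:
    """Derive per-article keyword features (field names are illustrative)."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    body, seed = text.lower(), seed_keyword.lower()
    return {
        "keyword_count": body.count(seed),            # total occurrences in the text
        "keyword_in_title": seed in title.lower(),    # presence in the title
        "seed_sentences": [s for s in sentences if seed in s.lower()],
        "any_list_keyword": any(k.lower() in body for k in keyword_list),
        "word_count": len(text.split()),
    }
```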
LLM relevance assessment
Each article is evaluated by an LLM with a deterministic setting (temperature 0) and a constrained prompt that requires explicit, structured outputs:
Whether the article is cyber-event relevant
If relevant, a high-level event type is assigned:
Incident: A confirmed cyberattack or breach has already occurred (e.g., ransomware deployment, data exfiltration, DDoS, system compromise).
Vulnerability: Discovery or disclosure of a security flaw in software/hardware/systems that could be exploited (potential risk rather than confirmed exploitation).
Threat Intelligence: Reporting on threat actors, tools, TTPs, and campaigns—focuses on “who/how,” not a specific victim incident.
Regulatory‑Legal: Laws, regulations, enforcement actions, court decisions, or major policy changes that affect cybersecurity obligations.
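For illustration only, a relevance call could be shaped as below, assuming an OpenAI-compatible client; the model name, prompt wording, and text truncation are placeholders rather than the production prompt:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible API; the model below is a placeholder

SYSTEM_PROMPT = (
    "Decide whether the article describes a cybersecurity event. Respond ONLY with JSON: "
    '{"relevant": true|false, "event_type": "Incident" | "Vulnerability" | '
    '"Threat Intelligence" | "Regulatory-Legal" | null}'
)

def assess_relevance(title: str, text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        temperature=0,                            # deterministic setting
        response_format={"type": "json_object"},  # constrain output to JSON
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Title: {title}\n\nText: {text[:4000]}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```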
Article type and categorization
Relevant articles are categorized via structured taxonomy prompts (primary: attack status, event type, regulatory/legal; secondary: impact metrics/class, technical specifics, sectors, geography, size, approximate damage).
Event clustering (article-to-event aggregation)
Objective: group articles that describe the same underlying incident into a single ‘event’.
Method:
Retrieve existing events from the database to provide context (titles, known organizations affected, threat actors, links).
For each candidate article (where Article Type = Single Incident), the LLM compares article details against batches of existing events and either:
1. Assigns an existing event ID when there’s a high-confidence match, or
2. Creates a new event otherwise.
Prompts emphasize high precision: only link to an existing event when highly confident. Organization(s) affected and threat actor signals are treated as strong indicators.
Events maintain aggregated fields: first/last seen dates, article count, organizations affected, threat actors, titles, links.
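A simplified sketch of the match-or-create loop; the batch size, event-ID scheme, and the `llm_match` stand-in are illustrative, not the production implementation:

```python
import uuid

BATCH_SIZE = 20  # illustrative batch size for event comparison

def create_event(article: dict) -> str:
    """Open a new event record seeded from one article (schema is illustrative)."""
    return str(uuid.uuid4())

def cluster_article(article: dict, existing_events: list[dict], llm_match) -> str:
    """Assign an article to an existing event ID or create a new event.

    `llm_match` stands in for a deterministic LLM call that compares the article
    against one batch of events and returns {"event_id": str | None, "confident": bool}.
    """
    for i in range(0, len(existing_events), BATCH_SIZE):
        verdict = llm_match(article, existing_events[i:i + BATCH_SIZE])
        if verdict.get("confident") and verdict.get("event_id"):
            return verdict["event_id"]  # high-confidence match: link to existing event
    return create_event(article)        # no confident match in any batch: new event
```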
Accuracy and quality assurance
Determinism and constraints:
LLM temperature set to 0 to maximize determinism and reduce hallucinations.
Constrained prompts require explicit fields and JSON outputs; parsing enforces schema.
Non-content articles (missing title/text) are rejected early.
LLM-governed, schema-validated metrics:
All metric fields are produced by deterministic LLM runs (temperature 0) under strict, documented guidelines and JSON schemas; only schema‑compliant outputs are counted, with periodic human QA to calibrate and prevent drift.
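As an example of this kind of schema enforcement, outputs could be validated with pydantic; the field set shown is illustrative, not the full documented schema:

```python
from pydantic import BaseModel, ValidationError

class MetricOutput(BaseModel):
    """Illustrative schema; the real field set follows the documented guidelines."""
    relevant: bool
    event_type: str | None = None
    organizations_affected: list[str] = []
    threat_actors: list[str] = []

def parse_metric(raw_llm_output: str) -> MetricOutput | None:
    try:
        return MetricOutput.model_validate_json(raw_llm_output)
    except ValidationError:
        return None  # non-compliant output is rejected and never counted
```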
Event/article classification for precision filtering:
Event-type and article-focus classification serves as a strict relevance gate, filtering out off-topic, low-signal, or roundup-style content. This focus on single-incident reporting reduces noise and measurably improves dataset precision and accuracy.
Multi-source validation:
Event clustering references previously stored event context; mismatches reduce the chance of incorrect merges.
Aggregations include the list of source links per event for manual verification.
Human-in-the-loop:
High-impact or ambiguous cases can be flagged for editorial review and fact-checking.
Regular QA reviews: sampled articles and events are audited on a monthly cadence, including a precision review; any drift triggers prompt/model or keyword adjustments.
Traceability:
Every statistic can be traced to articles and links contained in the database for auditability.
Limitations
Coverage limits:
GCS-based discovery depends on keywords and CSE configuration; not all incidents are captured, especially those outside configured languages or behind paywalls.
Some sites block automated retrieval; such articles may be partially or fully missing.
LLM-specific risks:
Despite deterministic settings and structured prompts, misclassification can occur, particularly with sparse or ambiguous texts.
Event clustering may split the same incident into multiple events or merge similar but distinct incidents in edge cases.
How statistics are computed
Article-level fields are derived from direct extraction and LLM outputs (stored per record).
Event-level metrics aggregate constituent articles by event_id:
article counts, first/last seen dates
de-duplicated organizations affected and threat actors
representative titles and canonical link lists
Report statistics pull from these stored tables; each figure can be traced back to event rows and underlying article records.
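A minimal sketch of that roll-up using pandas; the column names are illustrative:

```python
import pandas as pd

def aggregate_events(articles: pd.DataFrame) -> pd.DataFrame:
    """Roll article rows up to event-level metrics (column names are illustrative)."""
    return (
        articles.groupby("event_id")
        .agg(
            article_count=("url", "count"),
            first_seen=("publication_date", "min"),
            last_seen=("publication_date", "max"),
            organizations=("organization", lambda s: sorted(set(s.dropna()))),
            threat_actors=("threat_actor", lambda s: sorted(set(s.dropna()))),
            links=("url", list),
        )
        .reset_index()
    )
```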
Scope of the data
The statistics and insights referenced across our cybersecurity content are derived from a combination of:
Publicly available cybersecurity incident reporting
Media coverage of confirmed cyber incidents
Industry reports and surveys
Government and regulatory disclosures
The data reflects publicly observable and reported activity, not the full universe of all cyber incidents that occur globally. Many cyber events are never disclosed, reported, or covered by the media.
Data sources and discovery
Source types
Cybersecurity-related articles and reports are collected from multiple source categories, including:
Mainstream and technology media: major international news organizations and technology publications.
Authoritative and expert cybersecurity sources: government agencies, cybersecurity research organizations, and established industry publications.
Regional and local news outlets: covering cybersecurity incidents across North America, Europe, Asia-Pacific, and other regions.
Industry and research reports: annual breach reports, threat landscape reports, surveys, and economic analyses.
Each source is attributed at the article or report level, with publication date, outlet, and original URL preserved.
Discovery process
Content discovery is performed using automated search queries based on a maintained cybersecurity keyword list. Keywords are grouped by topic (for example: data breaches, ransomware, phishing, vulnerabilities, regulation).
Searches run daily to capture newly published content. Each run queries recent material only, so the dataset reflects current reporting.
Content collection and processing
Article retrieval
Once a source is discovered, the full article text is retrieved using automated extraction tools. Where primary extraction fails, fallback methods are used to ensure robust coverage.
Deduplication
To avoid double counting:
Identical URLs are processed only once
Re-published or syndicated content is deduplicated at the article level
Event-level aggregation (described below) further reduces duplication across outlets
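A simplified sketch of the URL- and content-level deduplication described above; the normalization rules are illustrative:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

seen_urls: set[str] = set()
seen_bodies: set[str] = set()

def canonical_url(url: str) -> str:
    """Drop query strings/fragments and normalize case so tracking variants collapse."""
    p = urlsplit(url)
    return urlunsplit((p.scheme, p.netloc.lower(), p.path.rstrip("/"), "", ""))

def body_fingerprint(text: str) -> str:
    """Hash whitespace-normalized body text to catch syndicated re-publications."""
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

def is_duplicate(url: str, text: str) -> bool:
    u, b = canonical_url(url), body_fingerprint(text)
    if u in seen_urls or b in seen_bodies:
        return True
    seen_urls.add(u)
    seen_bodies.add(b)
    return False
```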
Relevance filtering and classification
Cybersecurity relevance assessment
Each article is evaluated to determine whether it is relevant to cybersecurity statistics. Articles must meaningfully describe or analyze a cybersecurity event, threat, vulnerability, or regulatory action.
Event type classification
Relevant articles are classified into high-level categories, including:
Incident – A confirmed cyberattack or breach that has already occurred
Vulnerability – Disclosure of a security weakness that could be exploited
Threat intelligence – Reporting on threat actors, tools, campaigns, or techniques
Regulatory / legal – Laws, enforcement actions, policy changes, or legal proceedings related to cybersecurity
This classification ensures that statistics referring to “incidents,” “breaches,” or “attacks” are not conflated with vulnerability disclosures or general commentary.
Event clustering (article-to-event aggregation)
Multiple articles often report on the same underlying cyber incident. To prevent overcounting:
Articles describing the same incident are grouped into a single event
Events are assigned stable internal identifiers
Articles are linked to existing events only when there is high confidence they describe the same occurrence
Indicators used for clustering include affected organizations, threat actors, timelines, and incident descriptions.
Event-level records maintain:
First and last appearance dates
Number of related articles
Affected organizations
Referenced threat actors
Source links for verification
Use of automated analysis and quality controls
Automated classification
Structured, deterministic language-model analysis is used for classification, extraction, and aggregation. All automated outputs follow predefined schemas to ensure consistency.
The models operate with deterministic settings to reduce variability and hallucination risk.
Quality assurance
To maintain accuracy:
Schema validation ensures only properly structured outputs are counted
Monthly sampling and review procedures, including a precision review, detect classification drift; detected drift triggers prompt, model, or keyword adjustments
Ambiguous or high-impact cases are flagged for human review
Aggregated statistics retain traceability to individual articles and events
How statistics are calculated
Article-level vs event-level metrics
Some statistics are based on:
Article-level counts (e.g., volume of media coverage)
Event-level counts (e.g., number of distinct breaches or incidents)
Where applicable, event-level metrics are preferred to reduce duplication.
Interpretation of counts and frequencies
Statistics such as “incidents per day” or “breaches per year” represent reported or media-visible activity, not total global activity.
Vendor telemetry, government complaint systems, and economic projections often report significantly higher volumes due to differences in scope and methodology. These differences are noted where relevant.
Limitations and considerations
While care is taken to ensure accuracy and consistency, the data has inherent limitations:
Not all incidents are publicly disclosed or reported
Media coverage varies by region, sector, and incident scale
Some sources restrict automated access
Classification errors may occur in edge cases
Economic loss figures may change as investigations evolve
Statistics should therefore be interpreted as directional indicators, not exhaustive measurements.
Sources Index
Each numbered source below corresponds to a superscript reference used on the Cybersecurity Statistics page. Superscripts link directly to the relevant source entry on this page.
Source 1 Statista –
Source 2 Identity Theft
Source 3 Identity Theft
Source 4 Verizon –
Source 5 IBM –
Source 6 South Korean
Source 7 Aflac – June
Source 8 HIPAA Journal –
Source 9 California Attorney
Source 10 Iowa Attorney
Source 11 Rhode Island
Source 12 Rhode Island
Source 13 Aflac Newsroom –
Source 14 HIPAA Journal –
Source 15 Office of the
Source 16 Qantas – Information
Source 17 Qantas Newsroom –
Source 18 Michigan Attorney
Source 19 Maine Attorney
Source 20 California Attorney
Source 21 University of
Source 22 Microsoft Digital
Source 23 WIRED – NotPetya
Source 24 Reuters – UnitedHealth
Source 25 The Guardian – Jaguar
Source 26 NBC News –
Source 27 Delaware Department
Source 28 Cybersecurity
Source 29 JumpCloud – Phishing
Source 30 Hornetsecurity – Email
Source 31 Spearshield –
Source 32 APWG – Phishing
Source 33 arXiv – Academic
Source 34 DeepStrike – Password
Source 35 NordPass – Top 200
Source 36 Financial Times –
Source 37 SecurityScorecard –
Source 38 National Technology &
Source 39 Palo Alto Networks –
Source 40 IBM – Threat
Source 41 Tenable –
Source 42 Cybersecurity
Source 43 Statista Market
Source 44 Statista – Cost of
Source 45 FTC – Consumer
Source 46 FBI IC3 – 2024 Internet
Source 47 Kroll – Data Breach
Source 48 IBM – Cost of a Data
Source 49 SailPoint – 2024
Source 50 DeepStrike –
Source 51 Proofpoint &
Source 52 Check Point –
Source 53 Thales – 2024
Source 54 Cyfirma – Energy &
Source 55 World Economic
Source 56 DeepStrike – Cyber
Source 57 Devolutions – State of
Source 58 TotalAssure –
Source 59 Cisco – Cybersecurity
Source 60 IANS Research –
Source 61 Munich Re –
Source 62 Gartner – 2025
Source 63 Forrester – 2024
Source 64 Ivanti – State of
Source 65 U.S. Department of
Source 66 U.S. Department of
Source 67 Google Cloud –
Source 68 Gartner – Generative AI
Source 69 Splashtop – Top
Source 70 ENISA – Threat