Cybersecurity statistics:
Methodology and sources

Purpose of this page

This page explains how the cybersecurity statistics presented on our Cybersecurity Statistics page are collected, processed, and interpreted, and provides full transparency regarding the data sources referenced. The main Cybersecurity Statistics page presents summarized findings and NordVPN research insights.

Data sources and attribution

Source discovery is performed via Google Custom Search API (GCS), using multiple Custom Search Engines (CSEs) configured for:

  • media outlets: 44 mainstream and tech media sources (e.g., BBC, CNN, The New York Times, WSJ, FT, Reuters, Bloomberg, TechCrunch, Wired, Ars Technica, Time, Forbes).

  • authoritative/reference sites: 25 industry and expert sources (e.g., CISA, KrebsOnSecurity, The Hacker News, Dark Reading, BleepingComputer, SecurityWeek, Infosecurity Magazine).

  • local news: 100+ regional and national outlets across APAC, EMEA, and the Americas (e.g., Channel NewsAsia, CSA.gov.sg, Zaobao; HK01, unwire.hk; Japan Times, NISC, JPCERT, ITMedia).

  • unrestricted/general: a catch-all engine with no site restrictions.

Queries are keyword-driven from a maintained keyword list that groups terms by category.
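For illustration, a discovery call against the Custom Search JSON API might be sketched as below. The API key, engine IDs, and function names are placeholders, not the production code; only the endpoint and parameter names (`key`, `cx`, `q`, `dateRestrict`) come from Google's public API documentation.

```python
import requests

# Hypothetical credentials; the real pipeline's keys and CSE IDs are not public.
API_KEY = "YOUR_API_KEY"
CSE_IDS = {
    "media": "cse-id-media",          # 44 mainstream/tech outlets
    "authoritative": "cse-id-expert", # 25 industry/expert sources
    "local": "cse-id-local",          # 100+ regional/national outlets
    "general": "cse-id-general",      # unrestricted web search
}

def build_query(engine: str, keyword: str, days: int = 1) -> dict:
    """Build parameters for the Custom Search JSON API.

    dateRestrict=d1 limits results to the past day, matching the
    daily collection cadence.
    """
    return {
        "key": API_KEY,
        "cx": CSE_IDS[engine],
        "q": keyword,
        "dateRestrict": f"d{days}",
    }

def discover(engine: str, keyword: str) -> list[str]:
    """Return result URLs for one keyword on one engine."""
    resp = requests.get("https://www.googleapis.com/customsearch/v1",
                        params=build_query(engine, keyword), timeout=10)
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]
```

In practice each keyword in the maintained list is run against each configured engine, and the returned links feed the retrieval stage.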

All records include explicit attribution:

  • Original article link

  • Media outlet (domain extracted from the URL)

  • Publication date and collection date

We synthesize information from many sources for statistics and event aggregation; each statistic is derived from article-level evidence stored with links.

Content retrieval and collection cadence

Full-text content is fetched from discovered links using:

  • Primary: NewsPlease

  • Fallback: direct HTML download with hardened requests session and trafilatura extraction.

Timeouts, retries, TLS fallbacks, and referer headers are used to reduce transient failures.

Publication date and title are taken from the extractor when available; date parsing is normalized to date-only.

Daily runs query content published within the past day.
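The primary/fallback retrieval and date-only normalization described above can be sketched as a small orchestration function. The extractors are injected as callables so the pattern is visible without the real dependencies; in the pipeline described here the primary would be NewsPlease (`NewsPlease.from_url`) and the fallback a hardened requests session feeding trafilatura. All names below are illustrative.

```python
from datetime import datetime, date
from typing import Callable, Optional

Article = dict  # illustrative shape: {"title": str, "text": str, "date": ...}

def normalize_date(raw) -> Optional[date]:
    """Normalize extractor timestamps to date-only, as the pipeline does."""
    if isinstance(raw, datetime):
        return raw.date()
    if isinstance(raw, date):
        return raw
    if isinstance(raw, str):
        try:
            return datetime.fromisoformat(raw).date()
        except ValueError:
            return None
    return None

def fetch_article(url: str,
                  primary: Callable[[str], Article],
                  fallback: Callable[[str], Article]) -> Article:
    """Try the primary extractor; fall back to direct download plus
    trafilatura-style extraction on failure or empty text."""
    try:
        art = primary(url)
        if art.get("text"):
            art["date"] = normalize_date(art.get("date"))
            return art
    except Exception:
        pass  # timeouts / TLS errors: fall through to the fallback path
    art = fallback(url)
    art["date"] = normalize_date(art.get("date"))
    return art
```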

Feature extraction

Extracted fields include:

  • Media outlet (from URL)

  • First paragraph (first 3–5 sentences)

  • Keyword features: total count in text, presence in title, sentences containing the seed keyword, and presence of any keywords from the maintained list

  • Word count
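These fields are simple direct extractions; the sketch below (illustrative names, naive regex sentence splitting) shows one way they could be computed.

```python
import re
from urllib.parse import urlparse

def extract_features(url: str, title: str, text: str,
                     seed: str, keyword_list: list[str]) -> dict:
    """Derive the article-level features listed above (illustrative)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    low_text, low_title = text.lower(), title.lower()
    return {
        "media_outlet": urlparse(url).netloc,          # domain from URL
        "first_paragraph": " ".join(sentences[:5]),    # first 3-5 sentences
        "keyword_count": low_text.count(seed.lower()),
        "keyword_in_title": seed.lower() in low_title,
        "keyword_sentences": [s for s in sentences if seed.lower() in s.lower()],
        "any_list_keyword": any(k.lower() in low_text for k in keyword_list),
        "word_count": len(text.split()),
    }
```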

LLM relevance assessment

Each article is evaluated by an LLM with a deterministic setting (temperature 0) and a constrained prompt that requires explicit, structured outputs:

Whether the article is cyber-event relevant

If relevant, a high-level event type is assigned:

  • Incident: A confirmed cyberattack or breach has already occurred (e.g., ransomware deployment, data exfiltration, DDoS, system compromise).

  • Vulnerability: Discovery or disclosure of a security flaw in software/hardware/systems that could be exploited (potential risk rather than confirmed exploitation).

  • Threat Intelligence: Reporting on threat actors, tools, TTPs, and campaigns—focuses on “who/how,” not a specific victim incident.

  • Regulatory‑Legal: Laws, regulations, enforcement actions, court decisions, or major policy changes that affect cybersecurity obligations.
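A schema-enforcing parse of the constrained LLM output might look like the following. The field names and helper are hypothetical; the point is that non-conforming output is rejected rather than repaired.

```python
import json

# Hypothetical output schema for the relevance step; names are illustrative.
ALLOWED_EVENT_TYPES = {"Incident", "Vulnerability",
                       "Threat Intelligence", "Regulatory-Legal"}

def parse_relevance_output(raw: str) -> dict:
    """Enforce the structured-output contract: only schema-compliant
    JSON is accepted into the dataset."""
    data = json.loads(raw)  # raises on malformed output
    if not isinstance(data.get("relevant"), bool):
        raise ValueError("missing boolean 'relevant' field")
    if data["relevant"] and data.get("event_type") not in ALLOWED_EVENT_TYPES:
        raise ValueError(f"invalid event_type: {data.get('event_type')!r}")
    return data
```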

Article type and categorization

Relevant articles are categorized via structured taxonomy prompts (primary: attack status, event type, regulatory/legal; secondary: impact metrics/class, technical specifics, sectors, geography, size, approximate damage).

Event clustering (article-to-event aggregation)

Objective: group articles that describe the same underlying incident into a single ‘event’.

Method:

  • Retrieve existing events from the database to provide context (titles, known organizations affected, threat actors, links).

  • For each candidate article (where Article Type = Single Incident), the LLM compares article details against batches of existing events and either:

    1. Assigns an existing event ID when there is a high-confidence match, or

    2. Creates a new event otherwise.

  • Prompts emphasize high precision: only link to an existing event when highly confident. Organization(s) affected and threat actor signals are treated as strong indicators.

Events maintain aggregated fields: first/last seen dates, article count, organizations affected, threat actors, titles, links.
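The match-or-create logic above can be sketched as follows, with the deterministic LLM comparison abstracted as an injected `llm_match` callable. All names, and the batch size of 20, are illustrative assumptions.

```python
import itertools

def batched(events, size):
    """Yield existing events in fixed-size batches for comparison."""
    it = iter(events)
    while batch := list(itertools.islice(it, size)):
        yield batch

def assign_event(article: dict, existing_events: list[dict],
                 llm_match, next_id) -> int:
    """Match-or-create: compare the article against batches of stored
    events; link only on a high-confidence match, else mint a new event.
    llm_match(article, batch) stands in for the LLM call and returns a
    matching event ID or None."""
    for batch in batched(existing_events, 20):
        event_id = llm_match(article, batch)
        if event_id is not None:
            return event_id   # high-confidence link to an existing event
    return next_id()          # no confident match: create a new event
```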

Accuracy and quality assurance

Determinism and constraints:

  • LLM temperature set to 0 to maximize determinism and reduce hallucinations.

  • Constrained prompts require explicit fields and JSON outputs; parsing enforces schema.

  • Non-content articles (missing title/text) are rejected early.

LLM-governed, schema-validated metrics:

  • All metric fields are produced by deterministic LLM runs (temperature 0) under strict, documented guidelines and JSON schemas; only schema‑compliant outputs are counted, with periodic human QA to calibrate and prevent drift.

Event/article classification for precision filtering:

  • Event-type and article-focus classification serves as a strict relevance gate, filtering out off-topic, low-signal, and roundup-style content. Restricting counts to single-incident reporting reduces noise and improves dataset precision.

Multi-source validation:

  • Event clustering references previously stored event context; mismatches reduce the chance of incorrect merges.

  • Aggregations include the list of source links per event for manual verification.

Human-in-the-loop:

  • High-impact or ambiguous cases can be flagged for editorial review and fact-checking.

  • Regular QA reviews: sampled articles and events are audited on a monthly cadence, with precision review; any drift triggers prompt/model or keyword adjustments.

Traceability:

  • Every statistic can be traced to articles and links contained in the database for auditability.

Limitations

Coverage limits:

  • GCS-based discovery depends on keywords and CSE configuration; not all incidents are captured, especially outside configured languages or paywalled content.

  • Some sites block automated retrieval; such articles may be partially or fully missing.

LLM-specific risks:

  • Despite deterministic settings and structured prompts, misclassification can occur, particularly with sparse or ambiguous texts.

  • Event clustering may split the same incident into multiple events or merge similar but distinct incidents in edge cases.

How statistics are computed

Article-level fields are derived from direct extraction and LLM outputs (stored per record).

Event-level metrics aggregate constituent articles by event_id:

  • article counts, first/last seen dates

  • de-duplicated organizations affected and threat actors

  • representative titles and canonical link lists

Report statistics pull from these stored tables; each figure can be traced back to event rows and underlying article records.
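The event-level roll-up described above reduces to a group-by on `event_id`; this sketch uses illustrative field names.

```python
def aggregate_events(articles: list[dict]) -> dict[int, dict]:
    """Roll article records up to event-level metrics by event_id."""
    events: dict[int, dict] = {}
    for a in articles:
        ev = events.setdefault(a["event_id"], {
            "article_count": 0,
            "first_seen": a["date"],
            "last_seen": a["date"],
            "organizations": set(),   # sets de-duplicate across articles
            "threat_actors": set(),
            "links": [],              # canonical link list for verification
        })
        ev["article_count"] += 1
        ev["first_seen"] = min(ev["first_seen"], a["date"])
        ev["last_seen"] = max(ev["last_seen"], a["date"])
        ev["organizations"].update(a.get("organizations", []))
        ev["threat_actors"].update(a.get("threat_actors", []))
        ev["links"].append(a["url"])
    return events
```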

Scope of the data

The statistics and insights referenced across our cybersecurity content are derived from a combination of:

  • Publicly available cybersecurity incident reporting

  • Media coverage of confirmed cyber incidents

  • Industry reports and surveys

  • Government and regulatory disclosures

The data reflects publicly observable and reported activity, not the full universe of all cyber incidents that occur globally. Many cyber events are never disclosed, reported, or covered by the media.

Data sources and discovery

Source types

Cybersecurity-related articles and reports are collected from multiple source categories, including:

  • Mainstream and technology media.
    Examples include major international news organizations and technology publications.

  • Authoritative and expert cybersecurity sources.
    Including government agencies, cybersecurity research organizations, and established industry publications.

  • Regional and local news outlets.
    Covering cybersecurity incidents across North America, Europe, Asia-Pacific, and other regions.

  • Industry and research reports.
    Including annual breach reports, threat landscape reports, surveys, and economic analyses.

Each source is attributed at the article or report level, with publication date, outlet, and original URL preserved.

Discovery process

Content discovery is performed using automated search queries based on a maintained cybersecurity keyword list. Keywords are grouped by topic (for example: data breaches, ransomware, phishing, vulnerabilities, regulation).

Searches are run on a daily basis to capture newly published content. Each run queries recent material only, ensuring the dataset reflects current reporting.

Content collection and processing

Article retrieval

Once a source is discovered, the full article text is retrieved using automated extraction tools. Where primary extraction fails, fallback methods are used to ensure robust coverage.

Deduplication

To avoid double counting:

  • Identical URLs are processed only once

  • Re-published or syndicated content is deduplicated at the article level

  • Event-level aggregation (described below) further reduces duplication across outlets
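Article-level URL deduplication can be sketched as below. The normalization shown (lowercased host, dropped query string and fragment, trimmed trailing slash) is a simplified illustration; a production pipeline may preserve significant query parameters.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    """Normalize a URL so trivially different links compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, "", ""))

def deduplicate(urls):
    """Yield each article URL only once, keyed on its canonical form."""
    seen = set()
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            yield url
```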

Relevance filtering and classification

Cybersecurity relevance assessment

Each article is evaluated to determine whether it is relevant to cybersecurity statistics. Articles must meaningfully describe or analyze a cybersecurity event, threat, vulnerability, or regulatory action.

Event type classification

Relevant articles are classified into high-level categories, including:

  • Incident – A confirmed cyberattack or breach that has already occurred

  • Vulnerability – Disclosure of a security weakness that could be exploited

  • Threat intelligence – Reporting on threat actors, tools, campaigns, or techniques

  • Regulatory / legal – Laws, enforcement actions, policy changes, or legal proceedings related to cybersecurity

This classification ensures that statistics referring to “incidents,” “breaches,” or “attacks” are not conflated with vulnerability disclosures or general commentary.

Event clustering (article-to-event aggregation)

Multiple articles often report on the same underlying cyber incident. To prevent overcounting:

  • Articles describing the same incident are grouped into a single event

  • Events are assigned stable internal identifiers

  • Articles are linked to existing events only when there is high confidence they describe the same occurrence

Indicators used for clustering include affected organizations, threat actors, timelines, and incident descriptions.

Event-level records maintain:

  • First and last appearance dates

  • Number of related articles

  • Affected organizations

  • Referenced threat actors

  • Source links for verification

Use of automated analysis and quality controls

Automated classification

Structured, deterministic language-model analysis is used for classification, extraction, and aggregation. All automated outputs follow predefined schemas to ensure consistency.

The models operate with deterministic settings to reduce variability and hallucination risk.

Quality assurance

To maintain accuracy:

  • Schema validation ensures only properly structured outputs are counted

  • Monthly sampling and precision reviews are conducted to detect classification drift; detected shifts trigger adjustments to prompts, models, or keywords

  • Ambiguous or high-impact cases are flagged for human review

  • Aggregated statistics retain traceability to individual articles and events

How statistics are calculated

Article-level vs event-level metrics

Some statistics are based on:

  • Article-level counts (e.g., volume of media coverage)

  • Event-level counts (e.g., number of distinct breaches or incidents)

Where applicable, event-level metrics are preferred to reduce duplication.

Interpretation of counts and frequencies

Statistics such as “incidents per day” or “breaches per year” represent reported or media-visible activity, not total global activity.

Vendor telemetry, government complaint systems, and economic projections often report significantly higher volumes due to differences in scope and methodology. These differences are noted where relevant.

Limitations and considerations

While care is taken to ensure accuracy and consistency, the data has inherent limitations:

  • Not all incidents are publicly disclosed or reported

  • Media coverage varies by region, sector, and incident scale

  • Some sources restrict access

  • Classification errors may occur in edge cases

  • Economic loss figures may change as investigations evolve

Statistics should therefore be interpreted as directional indicators, not exhaustive measurements.

Sources Index

Each numbered source below corresponds to a superscript reference used on the Cybersecurity Statistics page. Superscripts link directly to the relevant source entry on this page.

Source 1
Statista – Cybercrime worldwide

Source 2
Identity Theft Resource Center (ITRC) – Weekly Breach Breakdown Q3 2025

Source 3
Identity Theft Resource Center (ITRC) – H1 2025 Data Breach Analysis

Source 4
Verizon – Data Breach Investigations Report (DBIR) 2025

Source 5
IBM – Cost of a Data Breach Report 2025

Source 6
South Korean Ministry of Science and ICT – SK Telecom data exfiltration incident

Source 7
Aflac – June 2025 security incident regulatory filing

Source 8
HIPAA Journal – Largest healthcare data breaches of 2025

Source 9
California Attorney General – Aflac breach report (SB24-616010)

Source 10
Iowa Attorney General – Aflac data breach notification

Source 11
Rhode Island Attorney General – Data-breach notifications

Source 12
Rhode Island AG – Data-breach notification

Source 13
Aflac Newsroom – June 2025 security incident update

Source 14
HIPAA Journal – Aflac data breach article

Source 15
Office of the Australian Information Commissioner – Statement on Qantas cyber incident

Source 16
Qantas – Information for customers on cyber incident

Source 17
Qantas Newsroom – Update on Qantas cyber incident (9 July 2025)

Source 18
Michigan Attorney General – Consumer alert on data breaches (TransUnion)

Source 19
Maine Attorney General – Allianz Life cyber incident notice

Source 20
California Attorney General – Allianz data breach report (SB24-612078)

Source 21
University of Maryland – Cyber Security Statistics

Source 22
Microsoft Digital Defense Report 2023

Source 23
WIRED – NotPetya cyberattack article

Source 24
Reuters – UnitedHealth tech unit hack article

Source 25
The Guardian – Jaguar Land Rover hack article

Source 26
NBC News – MGM Resorts cyberattack cost article

Source 27
Delaware Department of Technology & Information – eSecurityNews (Oct 2023)

Source 28
Cybersecurity Ventures – Global ransomware damage cost projection

Source 29
JumpCloud – Phishing attack statistics

Source 30
Hornetsecurity – Email threats in 2024

Source 31
Spearshield – Click-to-credential phishing study

Source 32
APWG – Phishing Activity Trends Reports

Source 33
arXiv – Academic password/credential research (2025)

Source 34
DeepStrike – Password statistics 2025

Source 35
NordPass – Top 200 Most Common Passwords

Source 36
Financial Times – Supply-chain cybersecurity article

Source 37
SecurityScorecard – 2025 Supply Chain Cybersecurity Trends

Source 38
National Technology & Security Coalition – 2025 Software Supply Chain Security Report

Source 39
Palo Alto Networks – State of Cloud Native Security

Source 40
IBM – Threat Intelligence Report

Source 41
Tenable – Cloud Security Risk Report 2025

Source 42
Cybersecurity Ventures – Cybersecurity Cost Report

Source 43
Statista Market Insights – Estimated cost of cybercrime worldwide 2018-2029 (ResearchGate)

Source 44
Statista – Cost of cybercrime worldwide forecast

Source 45
FTC – Consumer Sentinel Network Data Book 2024

Source 46
FBI IC3 – 2024 Internet Crime Report

Source 47
Kroll – Data Breach Outlook 2025

Source 48
IBM – Cost of a Data Breach 2024: Financial Industry

Source 49
SailPoint – 2024 State of Identity Security in Financial Services

Source 50
DeepStrike – Healthcare data breach statistics 2025

Source 51
Proofpoint & Ponemon – Healthcare Cybersecurity Report

Source 52
Check Point – Cyber Security Report 2025

Source 53
Thales – 2024 Data Threat Report: Critical Infrastructure Edition

Source 54
Cyfirma – Energy & Utilities industry report

Source 55
World Economic Forum – Global Cybersecurity Outlook 2025

Source 56
DeepStrike – Cyber attacks on small businesses

Source 57
Devolutions – State of IT Security Report 2025

Source 58
TotalAssure – Small business cybersecurity statistics 2025

Source 59
Cisco – Cybersecurity Readiness Index 2025

Source 60
IANS Research – Security budgets press release (2024)

Source 61
Munich Re – Cyber insurance risks and trends 2025

Source 62
Gartner – 2025 information security spending forecast

Source 63
Forrester – 2024 Cybersecurity Benchmarks (Global)

Source 64
Ivanti – State of Cybersecurity Report

Source 65
U.S. Department of Homeland Security – FY 2025 Budget in Brief

Source 66
U.S. Department of Defense – CYBERCOM Budget Justification

Source 67
Google Cloud – Cybersecurity forecast

Source 68
Gartner – Generative AI attack survey (Sep 22 2025)

Source 69
Splashtop – Top cybersecurity trends and predictions for 2026

Source 70
ENISA – Threat Landscape 2024