CNVD Severity Classification and RMSV Effects: Honest Metrics & Data Leakage (www.vulnerability-lookup.org)
from cedric@lemmy.ml to security@lemmy.ml on 03 Apr 09:28
https://lemmy.ml/post/45401891

We recently made significant improvements to our CNVD severity classifier and the underlying Vulnerability-CNVD dataset, prompted by a thorough independent review from Eric Romang. These changes ship in VulnTrain v3.0.0, released today.

What happened

Eric opened VulnTrain#19 with a detailed technical analysis of the dataset and model; his key findings are addressed point by point in the sections below.

His full analysis, code, and data are available at eromang/researches/CNVD-Dataset-Validation.

What we fixed

Data leakage

We implemented a deduplicate_split function that groups entries by description text before splitting. All entries sharing a description land in the same split. The result: our retrained model scores 76.8% accuracy on the deduplicated test set, matching Eric’s independently measured unleaked accuracy of 76.6%. The model quality was always ~77% — we just have honest metrics now.
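In outline, the approach is simple: group entries by description text, shuffle the groups, then assign whole groups to one split or the other. A minimal stand-alone sketch of the idea; the actual VulnTrain implementation and its signature may differ:

```python
import random
from collections import defaultdict

def deduplicate_split(entries, test_fraction=0.2, seed=42):
    """Split entries so that all entries sharing a description land in
    the same split, preventing train/test leakage via duplicates.
    A sketch of the approach, not the actual VulnTrain code."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["description"]].append(entry)

    keys = sorted(groups)                 # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    n_test = max(1, int(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [e for k in keys if k not in test_keys for e in groups[k]]
    test = [e for k in test_keys for e in groups[k]]
    return train, test
```

The key property: no description string can ever appear on both sides of the split, which is exactly what inflated the pre-fix accuracy.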

Class imbalance experiments

We tested four loss strategies to improve Low-class recall:

| Strategy | Low recall | Medium recall | Overall accuracy |
|---|---|---|---|
| Uniform (baseline) | 41.0% | 81.7% | 76.8% |
| Sqrt-dampened weights | 49.0% | 74.8% | 74.6% |
| Balanced weights | 60.8% | 70.2% | 73.2% |
| Focal loss (gamma=2) | 63.3% | 64.4% | 71.1% |

Every strategy that improved Low recall caused disproportionate Medium recall loss. The Low/Medium vocabulary overlap in CNVD descriptions makes this a data-level ceiling, not a loss-function problem. Eric’s own experience with the CyberScale Phase 1 project — predicting 4-class CVSS bands from CVE descriptions using ModernBERT-base — reached the same conclusion: nothing moved the needle beyond ~2pp. Adjacent severity classes share vocabulary because vulnerability descriptions are formulaic.

We defaulted to uniform loss and documented the Low-class limitation.
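For reference, the three non-uniform strategies in the table above reduce to small formulas. This is a hedged per-example sketch of those formulas, not the actual training code, which applies them inside the model's cross-entropy loss:

```python
import math

def focal_loss(p_true, gamma=2.0, weight=1.0):
    """Focal loss for one example: -w * (1 - p)^gamma * log(p),
    where p is the predicted probability of the true class.
    gamma=0 recovers (weighted) cross-entropy; higher gamma
    down-weights examples the model already classifies confidently."""
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)

def balanced_weights(class_counts):
    """Inverse-frequency class weights, normalized to mean 1.0."""
    raw = {c: 1.0 / n for c, n in class_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

def sqrt_dampened_weights(class_counts):
    """Inverse-sqrt class weights, normalized to mean 1.0 —
    a milder correction than fully balanced weights."""
    raw = {c: 1.0 / math.sqrt(n) for c, n in class_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}
```

All three push the optimizer toward the rare Low class; the table shows why that is not free when Low and Medium share vocabulary.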

Dataset improvements

The Vulnerability-CNVD dataset now includes:

The RMSV effect

The RMSV regulations deserve attention. CNVD reserves 50,000–100,000 vulnerability IDs per year. Before September 2021, it published full details for most of the IDs it reserved; after the regulations took effect, the publication rate dropped sharply, and details are now published for only a fraction. As a result, the CNVD dataset is increasingly sparse for recent years and the model's training data is concentrated in pre-2022 entries. Users should be aware of this temporal bias.
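One way to see this temporal bias in your own copy of the data is to compute per-year publication rates directly. A sketch assuming hypothetical field names (`cnvd_id` formatted like `CNVD-2021-12345`, and a boolean `published` flag); the real dataset schema may differ:

```python
from collections import Counter

def publication_rate_by_year(entries):
    """Fraction of reserved CNVD IDs with published details, per year.
    Field names are illustrative assumptions, not the actual schema."""
    reserved = Counter()
    published = Counter()
    for e in entries:
        year = e["cnvd_id"].split("-")[1]   # "CNVD-2021-12345" -> "2021"
        reserved[year] += 1
        if e["published"]:
            published[year] += 1
    return {y: published[y] / reserved[y] for y in sorted(reserved)}
```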

Model card

The model card is now dynamically generated from actual training metrics and documents the known limitations: Low-class recall, keyword dependency, negation blindness, and CVE overlap.
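Generating the card from measured metrics keeps the documentation honest by construction: the published numbers cannot drift from the evaluation. A minimal sketch of the idea; the actual generator's template and field names differ:

```python
def render_model_card(metrics, limitations):
    """Render a model-card fragment from actual training metrics and a
    list of known limitations. Illustrative template only."""
    lines = ["## Evaluation", ""]
    lines += [f"- {name}: {value:.1%}" for name, value in sorted(metrics.items())]
    lines += ["", "## Known limitations", ""]
    lines += [f"- {item}" for item in limitations]
    return "\n".join(lines)
```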

Links

Acknowledgments

Thanks to Eric Romang for his detailed and constructive analysis. His work directly led to these improvements and confirmed that the model adds real value (+12pp over a keyword heuristic baseline) despite its limitations.
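For context, a keyword-heuristic baseline of the kind referenced above can be as simple as a few substring checks over the description. The keyword lists below are hypothetical illustrations, not Eric's actual baseline:

```python
def keyword_baseline(description):
    """Toy keyword-heuristic severity classifier of the kind the model
    was compared against. Keyword lists are illustrative assumptions."""
    text = description.lower()
    high_markers = ("remote code execution", "command injection", "sql injection")
    low_markers = ("information disclosure", "cross-site scripting")
    if any(k in text for k in high_markers):
        return "High"
    if any(k in text for k in low_markers):
        return "Low"
    return "Medium"   # fall back to the majority class
```

Beating such a baseline by 12 percentage points is a meaningful signal that the model has learned more than surface keywords.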

Funding


AIPITCH aims to create advanced artificial-intelligence-based tools supporting key operational services in cyber defense, including technologies for early threat detection, automatic malware classification, and improved analytical processes through the integration of large language models (LLMs). The project has the potential to set new standards in the cybersecurity industry.

The project leader is NASK National Research Institute. The international consortium includes:

Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Cybersecurity Competence Centre. Neither the European Union nor the European Cybersecurity Competence Centre can be held responsible for them.

#security
