LKML-Mining Dataset:
Bridging the Knowledge Gap in Linux Kernel Maintenance
Abstract
The Linux Kernel Mailing List (LKML) is one of the largest and most complex open-source collaboration environments. Existing datasets often focus solely on committed code and ignore the iterative code review that precedes it. The LKML-Mining Dataset reconstructs the full lifecycle of over 440,000 patch series from 2015 to 2025. By linking initial submissions ($v_1$) to their subsequent revisions ($v_2 \dots v_N$) and final outcomes, the dataset enables researchers to analyze maintainer interactions, quantify the "cost" of security fixes, and develop automated tools that bridge the knowledge gap in kernel maintenance.
Dataset Overview
Data Structure
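The dataset is split into two parts: "Series", which stores the revision graph of each patch series, and "Events", which stores the metadata for each event that a series references by event_id. A simplified Patch Series entry looks like this: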
{
  "series_id": "lkml_2024_9_9_62",
  "subject": "vhost: Add support of kthread API",
  "topic_type": "PATCH",
  "variants": {
    "v1": {
      "event_id": "lkml_2024_9_9_62",
      "message_count": 11
    },
    "v2": {
      "event_id": "lkml_2024_10_4_68",
      "message_count": 30
    }
  },
  "connections": [
    { "from": "v1", "to": "v2" }
  ]
}
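As a rough illustration of how such an entry might be consumed, the sketch below parses the example above and follows the "connections" edges to recover the revision order. It assumes each series forms a single linear chain (one root, at most one successor per variant), which the example satisfies but which may not hold for every entry. The helper names `revision_chain` and `total_messages` are hypothetical and not part of any tooling shipped with the dataset.

```python
import json

# The example Patch Series entry from this README, embedded as a string.
SERIES_JSON = """
{
  "series_id": "lkml_2024_9_9_62",
  "subject": "vhost: Add support of kthread API",
  "topic_type": "PATCH",
  "variants": {
    "v1": {"event_id": "lkml_2024_9_9_62", "message_count": 11},
    "v2": {"event_id": "lkml_2024_10_4_68", "message_count": 30}
  },
  "connections": [
    {"from": "v1", "to": "v2"}
  ]
}
"""


def revision_chain(series):
    """Return variant labels ordered from first submission to last revision.

    Assumption: the connections form one linear chain. A ValueError is
    raised if there is not exactly one variant without an incoming edge.
    """
    next_of = {edge["from"]: edge["to"] for edge in series.get("connections", [])}
    targets = set(next_of.values())
    roots = [label for label in series["variants"] if label not in targets]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root variant, found {roots}")

    chain, seen = [], set()
    node = roots[0]
    # Follow successor edges; the 'seen' set guards against accidental cycles.
    while node is not None and node not in seen:
        chain.append(node)
        seen.add(node)
        node = next_of.get(node)
    return chain


def total_messages(series):
    """Sum message_count over all variants of a series."""
    return sum(v["message_count"] for v in series["variants"].values())


series = json.loads(SERIES_JSON)
for label in revision_chain(series):
    variant = series["variants"][label]
    print(label, variant["event_id"], variant["message_count"])
print("total messages:", total_messages(series))
```

The "event_id" of each variant is the key one would use to look up the matching record in the "Events" part of the dataset. How those records are stored on disk is not specified here, so that lookup is left out of the sketch.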
Citation
If you find this dataset useful for your research, please cite:
@inproceedings{bai2026lkmlmining,
  title={LKML-Mining Dataset: Bridging the Knowledge Gap in Linux Kernel Maintenance},
  author={Bai, Luyao and [Co-authors]},
  booktitle={Proceedings of [Conference Name]},
  year={2026},
  note={Under Review}
}