LKML-Mining Dataset: Bridging the Knowledge Gap in Linux Kernel Maintenance

Luyao Bai1, [Co-author Name]2, [Co-author Name]1
1University of Illinois Chicago (UIC)    2[Other Institution]

Abstract

The Linux Kernel Mailing List (LKML) is one of the largest and most complex open-source collaboration environments. Existing datasets often focus solely on committed code and ignore the iterative code review that precedes it. The LKML-Mining Dataset reconstructs the full lifecycle of over 440,000 patch series from 2015 to 2025. By linking initial submissions ($v_1$) to their subsequent revisions ($v_2 \dots v_N$) and final outcomes, the dataset enables researchers to analyze maintainer interactions, quantify the "cost" of security fixes, and develop automated tools to bridge the knowledge gap in kernel maintenance.

Dataset Overview

Patch series: 440k+
Years of history: 10
Raw data: 200GB+
Structured format: JSON

Data Structure

The dataset is split into two parts: "Series" (the graph) and "Events" (the metadata). Here is a simplified view of a patch series entry:

{
  "series_id": "lkml_2024_9_9_62",
  "subject": "vhost: Add support of kthread API",
  "topic_type": "PATCH",
  "variants": {
    "v1": {
      "event_id": "lkml_2024_9_9_62",
      "message_count": 11
    },
    "v2": {
      "event_id": "lkml_2024_10_4_68",
      "message_count": 30
    }
  },
  "connections": [
    { "from": "v1", "to": "v2" }
  ]
}

View full schema on Hugging Face →
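
To illustrate how the "variants" and "connections" fields fit together, the following is a minimal Python sketch that orders the revisions of one series and sums the messages across them. It relies only on the field names shown in the simplified entry above; the full schema may contain more fields. The helper names revision_chain and total_messages are hypothetical, and the sketch assumes that the connections form a single linear chain, which may not hold for every series.

```python
import json

# Example entry copied from the simplified view above.
ENTRY_JSON = """
{
  "series_id": "lkml_2024_9_9_62",
  "subject": "vhost: Add support of kthread API",
  "topic_type": "PATCH",
  "variants": {
    "v1": {"event_id": "lkml_2024_9_9_62", "message_count": 11},
    "v2": {"event_id": "lkml_2024_10_4_68", "message_count": 30}
  },
  "connections": [{"from": "v1", "to": "v2"}]
}
"""


def revision_chain(series):
    """Return variant labels ordered from first submission to last revision.

    Hypothetical helper. Assumes the connections describe one linear chain,
    i.e. exactly one root variant and at most one successor per variant.
    """
    successors = {c["from"]: c["to"] for c in series["connections"]}
    targets = set(successors.values())
    # A root is a variant that no connection points to.
    roots = [name for name in series["variants"] if name not in targets]
    if len(roots) != 1:
        raise ValueError("expected exactly one root variant")

    chain = [roots[0]]
    seen = {roots[0]}
    while chain[-1] in successors:
        nxt = successors[chain[-1]]
        if nxt in seen:
            raise ValueError("connections contain a cycle")
        chain.append(nxt)
        seen.add(nxt)
    return chain


def total_messages(series):
    """Sum message_count over all variants of a series (hypothetical helper)."""
    return sum(v["message_count"] for v in series["variants"].values())


series = json.loads(ENTRY_JSON)
chain = revision_chain(series)
event_ids = [series["variants"][name]["event_id"] for name in chain]

print(chain)                   # ['v1', 'v2']
print(event_ids)               # ['lkml_2024_9_9_62', 'lkml_2024_10_4_68']
print(total_messages(series))  # 41
```

The ordered event_id values could then serve as keys into the "Events" part of the dataset, assuming events are indexed by that identifier.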

Citation

If you find this dataset useful for your research, please cite:

@inproceedings{bai2026lkmlmining,
  title={LKML-Mining Dataset: Bridging the Knowledge Gap in Linux Kernel Maintenance},
  author={Bai, Luyao and [Co-authors]},
  booktitle={Proceedings of [Conference Name]},
  year={2026},
  note={Under Review}
}