An interpretable framework combining linguistic features, machine learning, and SHAP explanations for transparent AI text detection.
As Large Language Models (LLMs) become widely used in education and online communication, distinguishing AI-generated text from human writing has emerged as a critical challenge. While many detection systems report high benchmark accuracy, their real-world reliability remains uncertain, particularly in high-stakes settings such as academic assessment.
This paper investigates whether modern detectors truly identify machine authorship or instead learn dataset-specific artefacts. We introduce an interpretable detection framework that combines linguistic feature engineering, classical machine learning, and explainable AI. Across two major benchmark corpora (PAN-CLEF 2025 and COLING 2025), models trained on 38 linguistic features achieve leaderboard-competitive performance, reaching an F1 score of 0.9734 without relying on large pretrained language models.
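To make the feature-engineering idea concrete, here is a minimal sketch of a few handcrafted linguistic features of the kind such a pipeline might compute. The function name and the specific features are illustrative only, not the paper's actual 38-feature set.

```python
import re
from statistics import mean

def linguistic_features(text: str) -> dict:
    """Compute a handful of illustrative stylometric features."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # Lexical diversity: unique words over total words
        "type_token_ratio": len(set(w.lower() for w in words)) / max(len(words), 1),
        # Mean characters per word
        "avg_word_length": mean(len(w) for w in words) if words else 0.0,
        # Mean words per sentence
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        # Punctuation density
        "comma_rate": text.count(",") / max(len(words), 1),
    }
```

Feature vectors of this kind can then be fed to any classical classifier (e.g. gradient boosting), keeping every input dimension human-interpretable.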
However, performance drops substantially under cross-domain and cross-generator evaluation. Using SHAP-based explanations, we show that the most influential features differ markedly between datasets, indicating that detectors often rely on corpus-specific stylistic cues rather than stable signals of machine authorship. Ensemble models partially improve robustness but do not eliminate this effect.
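To illustrate what a SHAP value is, here is a minimal exact Shapley-value computation by coalition enumeration. This brute-force form is feasible only for small feature counts; the SHAP library uses efficient approximations in practice, and all names below are illustrative, not the paper's implementation.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a baseline input."""
    n = len(x)
    phis = []
    for i in range(n):
        phi = 0.0
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Classic Shapley coalition weight |S|! (n - |S| - 1)! / n!
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                x_with = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                x_without = [x[j] if j in S else baseline[j] for j in range(n)]
                phi += weight * (f(x_with) - f(x_without))
        phis.append(phi)
    return phis
```

For a linear model the Shapley value of each feature reduces to its weighted deviation from the baseline, which makes the attribution easy to sanity-check.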
These findings show that benchmark accuracy is not reliable evidence of authorship detection: strong in-domain performance can coincide with substantial failure under domain and generator shift. We argue that explainability should be treated not only as a transparency aid, but as a validity diagnostic that reveals what evidence a detector is actually using. In educational and other high-stakes contexts, detectors should therefore be used only as probabilistic, explainable support for human judgment and never as an automated basis for punitive decisions. To support replication and practical use, we release an open-source Python package that returns both predictions and instance-level explanations for individual texts.
See how our explainable classifier analyzes real text samples, providing transparent insights into its predictions.
Install our Python package to analyze your own text with full SHAP explainability.
# Installation
pip install explain-ai-generated-text
# Usage
from xai import shap_explainer
text = "Your text to analyze goes here..."
result = shap_explainer(text)
# Get prediction
print(result['prediction']) # 'AI-generated' or 'Human-written'
# Explore feature explanations
for feature, data in result.items():
    if feature != 'prediction':
        print(f"{feature}: SHAP={data['SHAP_Value']:.3f}, Direction={data['Direction']}")
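Building on the loop above, a small helper can rank features by the magnitude of their SHAP values. The result layout (a `prediction` key plus per-feature dicts with `SHAP_Value` and `Direction`) is assumed from the snippet; `top_features` is an illustrative helper, not part of the package.

```python
def top_features(result: dict, k: int = 5) -> list:
    """Return the k features with the largest absolute SHAP value."""
    feats = {name: data for name, data in result.items() if name != "prediction"}
    ranked = sorted(feats.items(),
                    key=lambda item: abs(item[1]["SHAP_Value"]),
                    reverse=True)
    return [(name, data["SHAP_Value"], data["Direction"]) for name, data in ranked[:k]]
```

This is useful when a text triggers many features and you only want the handful that actually drove the prediction.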