- 📜 Abstract
- 📂 File Structure
- 💾 Datasets
- 🤖 Models & Checkpoints
- 🚀 Training
- ⚡ Inference & Evaluation
- 📄 Citation
Recent advancements have expanded the role of Large Language Models in board games from playing agents to creative co-designers. However, a critical gap remains: current systems lack the capacity to offer constructive critique grounded in the emergent user experience. Bridging this gap is fundamental for harmonizing Human-AI collaboration, as it empowers designers to refine their creations via external perspectives while steering models away from biased or unpredictable outcomes. Automating critique for board games presents two challenges: inferring the latent dynamics connecting rules to gameplay without an explicit engine, and modeling the subjective heterogeneity of diverse player groups. To address these, we curate a dataset of 1,727 structurally corrected rulebooks and 150K reviews selected via quality scoring and facet-aware sampling. We augment this data with Mechanics-Dynamics-Aesthetics (MDA) reasoning to explicitly bridge the causal gap between written rules and player experience. We further distill player personas and introduce MeepleLM, a specialized model that internalizes persona-specific reasoning patterns to accurately simulate the subjective feedback of diverse player archetypes. Experiments demonstrate that MeepleLM significantly outperforms the latest commercial models (e.g., GPT-5.1, Gemini3-Pro) in community alignment and critique quality, achieving a 70% preference rate in user studies assessing utility. MeepleLM serves as a reliable virtual playtester for general interactive systems, marking a pivotal step towards audience-aligned, experience-aware Human-AI collaboration.
```
.
├── assets/        # Project images and figures
├── data/
│   ├── metadata/    # Meta-info (Game IDs, names, BGG stats, splits)
│   ├── finetuning/  # Alpaca-formatted datasets
│   ├── reviews/     # Filtered review data
│   └── rulebooks/   # Structured Markdown rulebooks
├── checkpoints/   # LoRA adapters for MeepleLM & Ablations
├── training/      # YAML configurations for LLaMA-Factory
├── inference/     # Inference scripts (vLLM example)
└── results/       # Generated critiques
```
We provide the complete pipeline data, from raw sources to instruction-tuning ready files.
- `data/metadata/`: `game_info.json` maps Game IDs to metadata (Name, Rank, Weight, Year); `test_games_list.json` is the official evaluation split (207 games) used in the paper.
- `data/finetuning/`: Ready-to-use Alpaca-format datasets for SFT. Each folder contains `_train.json` and `_test.json` files.
  - `MeepleLM/`: Full dataset with MDA CoT reasoning chains.
  - `wo_MDA/`: Ablation without reasoning chains.
  - `wo_Persona/`: Ablation without persona profiles.
  - `wo_Rulebook/`: Ablation without rule context.
- `data/rulebooks/`: The corpus of 1,727 processed rulebooks in Markdown format.
- `data/reviews/`: The filtered high-quality review corpus used to construct the training data.
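The metadata files above can be joined to recover the evaluation split. A minimal sketch, assuming `game_info.json` maps game-ID strings to metadata dicts and `test_games_list.json` is a JSON list of game IDs (adjust the keys if the released files use a different schema):

```python
import json

def load_test_games(game_info_path, test_list_path):
    """Return the metadata entries for the official evaluation split.

    Assumes game_info.json maps game IDs to metadata dicts and
    test_games_list.json is a list of game IDs; the exact schema of
    the released files may differ.
    """
    with open(game_info_path, encoding="utf-8") as f:
        game_info = json.load(f)
    with open(test_list_path, encoding="utf-8") as f:
        test_ids = json.load(f)
    # Keep only games that appear in both files, preserving split order.
    return {gid: game_info[gid] for gid in test_ids if gid in game_info}
```
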
We provide LoRA adapters trained on Qwen3-8B. These can be loaded easily using vLLM.
| Model Variant | Description | Path |
|---|---|---|
| MeepleLM (Ours) | Full model with Persona-conditioning and MDA reasoning. | ./checkpoints/MeepleLM/ |
| w/o MDA | Ablation removing Chain-of-Thought reasoning. | ./checkpoints/wo_MDA/ |
| w/o Persona | Ablation using a generic player prompt. | ./checkpoints/wo_Persona/ |
| w/o Rulebook | Ablation relying solely on internal knowledge. | ./checkpoints/wo_Rulebook/ |
You can serve the model with the LoRA adapter enabled. For example, to serve MeepleLM:

```bash
vllm serve Qwen/Qwen3-8B \
  --enable-lora \
  --lora-modules MeepleLM=checkpoints/MeepleLM \
  --served-model-name MeepleLM \
  --port 8000
```
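Once the server is running, it exposes an OpenAI-compatible chat-completions endpoint. A minimal query sketch (the prompt wording and `build_payload` helper below are illustrative, not the exact template used by the repository's scripts):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # local vLLM server

def build_payload(rulebook_excerpt, persona, model="MeepleLM"):
    """Assemble a chat-completions request for the served adapter.

    The system/user prompt structure here is a placeholder; the paper's
    inference script applies its own persona template.
    """
    return {
        "model": model,  # must match --served-model-name
        "messages": [
            {"role": "system",
             "content": f"You are a board-game playtester with this persona: {persona}"},
            {"role": "user",
             "content": f"Critique the following rulebook:\n{rulebook_excerpt}"},
        ],
        "temperature": 0.7,
    }

def query(payload):
    """POST the payload to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

payload = build_payload("Players take turns placing meeples...",
                        "strategy-focused veteran")
# query(payload) returns the generated critique once the server is up.
```
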
All models were trained using the LLaMA-Factory framework. We provide the exact YAML configurations used for our experiments in the training/ directory.
To reproduce the training process:
- Install LLaMA-Factory: Please refer to the official LLaMA-Factory repository for installation instructions.
- Register Datasets: Add the paths from `data/finetuning/` to LLaMA-Factory's `data/dataset_info.json`.
- Run Training:

  ```bash
  llamafactory-cli train training/train_meeplelm.yaml
  ```
(Note: Config files for ablation studies are also provided in the training/ folder.)
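The dataset registration step might look like the following entry in LLaMA-Factory's `data/dataset_info.json`. The entry name and file name here are illustrative assumptions; match the entry name to the `dataset` field of the YAML config and the file name to the actual `_train.json` file in `data/finetuning/`:

```json
{
  "meeplelm_train": {
    "file_name": "/path/to/data/finetuning/MeepleLM/meeplelm_train.json",
    "formatting": "alpaca"
  }
}
```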
The inference/ directory contains scripts to generate virtual playtest results.
- `playtest_inference.py`: A sample script designed to work with the MeepleLM checkpoint served via vLLM. It iterates through the test-set games, applying the persona constraints to generate reviews.
- `results/`: Stores the output JSON files generated by the model (e.g., `results/inference_meeplelm/`).
Note: The provided inference script is configured for the MeepleLM LoRA adapter and a local vLLM server. If you wish to evaluate other models or use different API endpoints, please modify the `API_URL` and `MODEL_NAME` parameters in the script accordingly.
If you use MeepleLM, the rulebook dataset, or the persona taxonomy in your research, please cite our paper:
@article{li2026meeplelm,
title={MeepleLM: A Virtual Playtester Simulating Diverse Subjective Experiences},
author={Li, Zizhen and Li, Chuanhao and Wang, Yibin and Feng, Yukang and Sun, Jianwen and Ai, Jiaxin and Zhang, Fanrui and Sun, Mingzhu and Huang, Yifei and Zhang, Kaipeng},
journal={arXiv preprint arXiv:2601.07251},
year={2026}
}