Changmin Lee, Jaemin Kim, and Taesik Gong
ICML '26: Proceedings of the 43rd International Conference on Machine Learning
conda create -n epic python=3.10 -y
conda activate epic
pip install -r requirements.txt
A GPU with CUDA is recommended for the embedding models and vLLM.
All preprocessing artifacts are written to the repository root (where you run the scripts).
- Download and extract Wikipedia (e.g. Wikimedia dump + WikiExtractor).
- Sample documents:
python preprocess/sample_documents.py \
--doc_type wiki \
--input_dir /path/to/filtered_wiki_json \
--sample_size 10000
# → sampled_wiki_doc_10000.jsonl
- Chunk and embed:
python preprocess/build_chunks.py --doc_file sampled_wiki_doc_10000.jsonl
# → sampled_wiki_chunk_10000.jsonl
python preprocess/build_embeddings.py \
--model_name facebook/contriever \
--chunk_file sampled_wiki_chunk_10000.jsonl
# → sampled_wiki_embedding_facebook_contriever_10000.npy
- Download ELI5 supporting documents.
- Sample:
python preprocess/sample_documents.py \
--doc_type eli5 \
--input_dir /path/to/eli5_repo \
--sample_size 2000
# → sampled_eli5_doc_2000.jsonl
- Chunk and embed:
python preprocess/build_chunks.py --doc_file sampled_eli5_doc_2000.jsonl
python preprocess/build_embeddings.py \
--model_name facebook/contriever \
--chunk_file sampled_eli5_chunk_2000.jsonl
cd dataset/creation/prefeval
python collect.py # downloads lmsys/lmsys-chat-1m
python preprocess.py # → lmsys_chat1m_conv_chunks_text.jsonl
cd ../../..
Next, sample 2,000 chunks into the format expected by build_embeddings.py (fields: id, title, text) and save them as sampled_lmsys_doc_2000.jsonl in the repo root. The chunking and embedding commands below take that file as input.
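The sampling step for the LMSYS corpus has no dedicated command in this README. A minimal sketch of one way to do it follows. It rests on assumptions that should be checked against the actual output of preprocess.py: that each input line is a JSON object with a `text` field, that `id` can be a string counter, and that a placeholder `title` is acceptable. The function name `sample_docs` and its parameters are hypothetical and not part of the repository.

```python
import json
import random
from pathlib import Path


def sample_docs(src, dst, k=2000, seed=0, text_key="text"):
    """Sample up to k records from a JSONL file and write id/title/text records.

    ASSUMPTION: every input record stores its chunk text under `text_key`.
    """
    with open(src, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    # A fixed seed keeps the sample reproducible across runs.
    rng = random.Random(seed)
    picked = rng.sample(records, min(k, len(records)))

    with open(dst, "w", encoding="utf-8") as f:
        for i, rec in enumerate(picked):
            # ASSUMPTION: string ids and a placeholder title are acceptable downstream.
            doc = {"id": str(i), "title": f"lmsys_{i}", "text": rec[text_key]}
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return len(picked)


if __name__ == "__main__":
    src = Path("dataset/creation/prefeval/lmsys_chat1m_conv_chunks_text.jsonl")
    if src.exists():
        n = sample_docs(src, "sampled_lmsys_doc_2000.jsonl")
        print(f"wrote {n} documents")
```

If the preprocessed file uses a different key for the chunk text, pass it via `text_key`. Once sampled_lmsys_doc_2000.jsonl is written, the chunking and embedding commands apply as for the other corpora.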
python preprocess/build_chunks.py --doc_file sampled_lmsys_doc_2000.jsonl
python preprocess/build_embeddings.py \
--model_name facebook/contriever \
--chunk_file sampled_lmsys_chunk_2000.jsonl
Edit run_vllm_llama.sh and set your Hugging Face token (required for gated Llama weights):
export HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxx
Then launch:
bash run_vllm_llama.sh 0 8008
# Usage: bash run_vllm_llama.sh <GPU_IDS> <PORT>
# Example with 2 GPUs: bash run_vllm_llama.sh 0,1 8008
Other backends (optional):
bash run_vllm_qwen.sh 0 8008 # Qwen3-4B-Instruct
bash run_vllm_oss.sh 0 8008 # gpt-oss-20b
Default settings: Contriever embeddings, Llama-3.1-8B-Instruct via vLLM on port 8008.
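Before starting a long run, it can help to confirm that a server is actually answering on the chosen port. The sketch below assumes the launch scripts start vLLM's OpenAI-compatible server, which exposes a `/v1/models` endpoint; that is an assumption about run_vllm_*.sh, not something this README states. The helper names `parse_model_ids` and `list_served_models` are hypothetical.

```python
import json
import urllib.request


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in json.loads(payload).get("data", [])]


def list_served_models(port=8008, host="localhost", timeout=5.0):
    """Return the model ids served on host:port, or [] if nothing usable answers.

    ASSUMPTION: the server speaks the OpenAI-compatible HTTP API.
    """
    url = f"http://{host}:{port}/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_model_ids(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a response body that is not valid JSON.
        return []


if __name__ == "__main__":
    models = list_served_models()
    print(models if models else "no vLLM server answering on port 8008")
```

If the returned list does not contain the name passed to --llm_model_name, the server was probably launched with a different backend script.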
Single persona (PrefWiki, full pipeline):
python EPIC_main.py \
--method EPIC \
--persona_index 0 \
--device cuda:0 \
--mode all \
--output_dir output \
--dataset PrefWiki \
--emb_model_name facebook/contriever \
--doc_mode wiki \
--vllm_server_url 8008 \
--llm_model_name meta-llama/Llama-3.1-8B-Instruct
All personas:
python EPIC_main.py \
--method EPIC \
--persona_index all \
--device cuda:0 \
--mode all \
--output_dir output \
--dataset PrefWiki \
--emb_model_name facebook/contriever \
--doc_mode wiki \
--vllm_server_url 8008 \
--llm_model_name meta-llama/Llama-3.1-8B-Instruct
For other datasets, change --dataset and --doc_mode:
# PrefELI5
python EPIC_main.py --method EPIC --persona_index all --mode all \
--output_dir output --dataset PrefELI5 --doc_mode eli5 \
--vllm_server_url 8008 --device cuda:0
# PrefRQ (wiki corpus)
python EPIC_main.py --method EPIC --persona_index all --mode all \
--output_dir output --dataset PrefRQ --doc_mode wiki \
--vllm_server_url 8008 --device cuda:0
# PrefEval (lmsys corpus)
python EPIC_main.py --method EPIC --persona_index all --mode all \
--output_dir output --dataset PrefEval --doc_mode lmsys \
--vllm_server_url 8008 --device cuda:0
Run stages separately with --mode indexing, generation, or evaluation.
| --dataset | --doc_mode | Personas (--persona_index all) |
|---|---|---|
| PrefWiki | wiki | 0–56 (57) |
| PrefRQ | wiki | 0–89 (90) |
| PrefELI5 | eli5 | 0–72 (73) |
| PrefEval | lmsys | 0–56 (57) |
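Because each dataset must be paired with the right --doc_mode and a valid persona range, a small wrapper can catch mismatches before a run starts. The following is only a sketch: the `DATASETS` mapping is copied from the table, the flags are the ones shown in the commands above, and the helper `build_command` is hypothetical rather than part of the repository.

```python
import shlex

# doc_mode and persona count per dataset, copied from the table above.
DATASETS = {
    "PrefWiki": ("wiki", 57),
    "PrefRQ": ("wiki", 90),
    "PrefELI5": ("eli5", 73),
    "PrefEval": ("lmsys", 57),
}
MODES = {"all", "indexing", "generation", "evaluation"}


def build_command(dataset, persona_index="all", mode="all",
                  port=8008, device="cuda:0", output_dir="output"):
    """Return the EPIC_main.py argv for one dataset, validating index and mode."""
    doc_mode, n_personas = DATASETS[dataset]
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if persona_index != "all" and not 0 <= int(persona_index) < n_personas:
        raise ValueError(
            f"{dataset} has personas 0..{n_personas - 1}, got {persona_index}"
        )
    return [
        "python", "EPIC_main.py",
        "--method", "EPIC",
        "--persona_index", str(persona_index),
        "--mode", mode,
        "--output_dir", output_dir,
        "--dataset", dataset,
        "--doc_mode", doc_mode,
        "--vllm_server_url", str(port),
        "--device", device,
    ]


if __name__ == "__main__":
    # Print one full-pipeline command per dataset without executing anything.
    for name in DATASETS:
        print(shlex.join(build_command(name)))
```

The returned list can be passed to `subprocess.run` directly; printing it first gives a dry run.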
Results are written under paths like:
output_prefwiki/wiki/EPIC/<persona_index>/
gen_EPIC_flat_<persona_index>.json
eval_EPIC_flat_<persona_index>.json
Persona-level FAISS indices are stored under data/indexing/<doc_mode>/EPIC_<dataset>/.
Completed steps are skipped automatically if output files already exist.
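To see which personas already have results before re-running, the file naming above can be checked directly. This sketch only mirrors the path pattern shown in this section and does not reproduce the repository's own skip logic; the function `persona_status` is hypothetical, and the run directory is passed in explicitly rather than derived from --output_dir.

```python
from pathlib import Path


def persona_status(run_dir, n_personas, method="EPIC"):
    """Report, per persona index, whether generation and evaluation files exist.

    run_dir is a directory such as output_prefwiki/wiki/EPIC, which holds one
    subdirectory per persona index.
    """
    run_dir = Path(run_dir)
    status = {}
    for idx in range(n_personas):
        pdir = run_dir / str(idx)
        status[idx] = {
            "generation": (pdir / f"gen_{method}_flat_{idx}.json").is_file(),
            "evaluation": (pdir / f"eval_{method}_flat_{idx}.json").is_file(),
        }
    return status


if __name__ == "__main__":
    report = persona_status("output_prefwiki/wiki/EPIC", n_personas=57)
    missing = [i for i, s in report.items() if not all(s.values())]
    print(f"{len(missing)} of {len(report)} personas still incomplete")
```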
EPIC_main.py # Entry point
EPIC_indexing.py # Retrieval index construction
EPIC_generation.py # LLM generation with retrieved context
EPIC_evaluation.py # Preference-violation evaluation
EPIC_utils.py # Shared utilities
preprocess/ # Corpus sampling, chunking, embeddings
dataset/ # Benchmark task JSON files
prompt/ # Prompt templates
run_vllm_*.sh # vLLM launch scripts
assets/ # Figures

