Wenda Xu

I am a final-year PhD student in the UCSB NLP Group, co-advised by Prof. William Wang and Prof. Lei Li. I am currently a visiting scholar at the CMU Language Technologies Institute (Email: wendaxu@ucsb.edu, wendaxu@andrew.cmu.edu).

Research Interests

My research interests lie in large language model (LLM) evaluation and alignment (at the pre-training, post-training, and inference stages). In one sentence: I want to learn metrics that can assess LLMs' generation quality and align LLMs with well-defined feedback.

I am the first author of SEScore 1&2 and InstructScore (best unsupervised text generation metrics at the WMT22 shared task). Currently, I am actively working on LLM post-training techniques, in both preference learning and knowledge distillation.

CV  /  Linkedin  /  Google Scholar  /  Twitter  /  Github

profile photo

Looking for full time industry positions!

Industry Experience

Google Cloud AI Research

Duration: 06/2024 - 10/2024
Mentors: Chen-Yu Lee, Rishabh Agarwal
Hosts: Rujun Han, Zifeng Wang
Publication: Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Google Translate Research

Duration: 06/2023 - 12/2023
Mentor: Markus Freitag
Hosts: Dan Deutsch, Mara Finkelstein, Juraj Juraska
Publication: LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback

TikTok AI Lab

Duration: 06/2022 - 10/2022
Mentor: Mingxuan Wang
Host: Xian Qian
Publication: SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes

Large Language Model Alignment

Aligning student LLM to teacher LLM (Post-training stage)
Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling
Wenda Xu, Rujun Han, Zifeng Wang, Long Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister
Under submission
project page / arXiv / code

We propose interleaved sampling, which trains on on-policy student samples that the teacher itself is likely to generate, mitigating the low-quality samples of purely on-policy KD while dynamically switching between supervised and on-policy KD. Our experiments consistently show SKD's superiority in both task-specific and task-agnostic distillation, and SKD is robust across model families, initializations, and dataset sizes.
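The interleaving idea can be sketched as a toy token-level loop. Everything here is an illustrative stand-in (the distributions, function names, and top-k filter are assumptions for exposition, not SKD's actual models or API):

```python
import random

# Toy next-token distributions standing in for the student and teacher
# models; the vocabularies and probabilities are made up for illustration.
def student_next(prefix):
    return {"good": 0.5, "ok": 0.3, "bad": 0.2}

def teacher_next(prefix):
    return {"good": 0.7, "ok": 0.25, "bad": 0.05}

def sample(dist, rng):
    """Draw one token from a (possibly unnormalized) distribution."""
    r, acc = rng.random() * sum(dist.values()), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point rounding

def interleaved_sample(prompt, steps, k=2, rng=None):
    """The student proposes each next token; proposals outside the
    teacher's top-k are replaced by a token sampled from the teacher,
    so training sequences stay on-policy yet teacher-plausible."""
    rng = rng or random.Random(0)
    seq = list(prompt)
    for _ in range(steps):
        proposal = sample(student_next(seq), rng)
        t_dist = teacher_next(seq)
        top_k = sorted(t_dist, key=t_dist.get, reverse=True)[:k]
        if proposal not in top_k:
            proposal = sample({t: t_dist[t] for t in top_k}, rng)
        seq.append(proposal)
    return seq
```

With k=2 above, every emitted token is guaranteed to be one the teacher ranks highly, which is the sense in which the samples bridge the teacher-student gap.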

Aligning LLM with AI or human feedback (Post-training stage)
BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
Wenda Xu*, Jiachen Li*, William Yang Wang, Lei Li
*Two authors contributed equally
EMNLP Main 2024
project page / arXiv / code

Online BPO: 1) we collect data in an on-policy/online fashion; 2) we respect the behavior LLM when constructing the trust region; 3) with only two phases of online data collection, we significantly improve over offline DPO (TL;DR: 72.0%→89.5%; Helpfulness: 82.2%→93.5%; Harmlessness: 77.5%→97.7%).
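As a rough illustration of the preference-learning machinery involved, here is a schematic DPO-form pairwise loss where the reference model would be the behavior LLM that collected the data. The function name and scalar log-probability inputs are simplifications for exposition, not BPO's actual implementation:

```python
import math

def dpo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Schematic DPO-form pairwise preference loss: -log sigmoid of the
    policy's margin over the reference on a (chosen, rejected) pair.
    Using the behavior LLM as the reference keeps updates near the policy
    that generated the data (an illustration, not the paper's code)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

At zero margin the loss is log 2, and it decreases as the policy prefers the chosen response more strongly relative to the reference.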

Data augmentation to align LLM (Pre-training stage)
Self-Supervised Knowledge Assimilation for Expert-Layman Text Style Transfer
Wenda Xu, Michael Saxon, Misha Sra, William Yang Wang
AAAI2022
project page / arXiv / code

We develop a self-supervised approach to expert-layman text style transfer, proposing a novel SSL task, knowledge base assimilation, to inject knowledge into pre-training. In human evaluation, we relatively improve the overall success rate by 106%!

Using fine-grained LLM agent to align large language model (Inference stage)
LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback
Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhongtao Liu, William Yang Wang, Lei Li, Markus Freitag
NAACL 2024
project page / arXiv / code

Rather than merely criticizing the LLM, can we pinpoint the errors it makes and automatically guide it with fine-grained, actionable feedback? Can we formulate iterative refinement as a local search problem solved with simulated annealing? This work follows my earlier InstructScore, where I thought hard about how to incorporate fine-grained actionable feedback to guide text generation. Our fine-grained LLM agent iteratively improves PaLM 2 by 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
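The local-search framing can be sketched as a generic simulated-annealing loop; `propose` and `score` are hypothetical stand-ins for a feedback-guided reviser and a quality metric, not the paper's API:

```python
import math
import random

def refine_with_annealing(draft, propose, score, steps=300, t0=1.0, cooling=0.95, rng=None):
    """Generic simulated-annealing refinement: accept a proposed revision
    if it scores at least as well, otherwise accept it with a
    temperature-dependent probability so early steps can escape local optima."""
    rng = rng or random.Random(0)
    current, current_score, t = draft, score(draft), t0
    for _ in range(steps):
        cand = propose(current, rng)
        cand_score = score(cand)
        if cand_score >= current_score or rng.random() < math.exp((cand_score - current_score) / t):
            current, current_score = cand, cand_score
        t *= cooling  # cool the temperature each iteration
    return current, current_score

# Toy usage: "refining" an integer state toward a target of 10.
best, best_score = refine_with_annealing(
    0,
    propose=lambda x, rng: x + rng.choice([-1, 1]),
    score=lambda x: -abs(x - 10),
)
```

In the toy run the state climbs toward the target because improving moves are always kept, while the cooling schedule gradually stops accepting regressions.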

Large Language Model Evaluation

Pitfalls of LLM’s self-quality assessment
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement
Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, William Yang Wang
ACL 2024 Main (Oral)
project page / arXiv / code

Self-feedback displays a model's bias towards its own outputs. We are the first to define and quantify an LLM's self-bias towards its own outputs, and we find that self-bias is prevalent in all examined LLMs across multiple languages and tasks (6 LLMs, 4 languages, 3 tasks). To mitigate such bias, we find that larger model size and external feedback with accurate assessment can significantly reduce bias in the self-refine pipeline, leading to actual performance improvement on downstream tasks.
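One simple way to quantify such a bias is the average gap between self-assigned and external scores on the same outputs; this is an illustrative statistic, not necessarily the paper's exact estimator:

```python
def self_bias(self_scores, external_scores):
    """Average amount by which a model's self-assigned quality scores exceed
    matched external scores for the same outputs; positive values indicate
    inflation toward the model's own generations. (Illustrative only.)"""
    diffs = [s - e for s, e in zip(self_scores, external_scores)]
    return sum(diffs) / len(diffs)
```

A model that rates its own outputs 0.9 and 0.8 while external judges give 0.7 and 0.6 would show a self-bias of 0.2 under this measure.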

Learning explainable and fine-grained quality feedback
INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback
Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, Lei Li
EMNLP Main 2023 (Oral)
project page / arXiv / code

InstructScore is an explainable text generation metric: instead of outputting a scalar score, it identifies error locations, error types, and severity levels in the candidate text. Our fine-grained 7B LLM evaluator surpasses all other unsupervised metrics, including those based on the 175B GPT-3 and on GPT-4.
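The kind of structured output involved could be represented as below; the field names and the MQM-style penalty weights are illustrative assumptions, not InstructScore's exact schema:

```python
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    """One fine-grained error report (field names are hypothetical)."""
    span: str        # offending text in the candidate
    error_type: str  # e.g. "mistranslation", "omission"
    severity: str    # "major" or "minor"

def score_from_errors(errors, major_penalty=-5, minor_penalty=-1):
    """Aggregate fine-grained errors into one scalar with MQM-style
    severity weights (penalty values are illustrative)."""
    return sum(major_penalty if e.severity == "major" else minor_penalty
               for e in errors)
```

The point of the structured form is that the scalar is derived from, rather than replacing, the explanation.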

Learning to evaluate the quality of generated text without labels
SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes
Wenda Xu, Xian Qian, Mingxuan Wang, Lei Li, William Yang Wang
ACL Main 2023
project page / arXiv / code

SESCORE2 is a self-supervised method to train a metric for general text generation tasks without human ratings. We develop a technique to synthesize candidate sentences with varying levels of mistakes for training, and to make these self-constructed samples realistic, we introduce retrieval-augmented synthesis on anchor text. SESCORE2 outperforms SEScore on four text generation tasks across three languages (overall Kendall correlation improves by 14.3%).

Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis
Wenda Xu, Yi-lin Tuan, Yujie Lu, Michael Saxon, Lei Li, William Yang Wang
EMNLP 2022; also appeared at the WMT22 shared metrics task (best unsupervised metric)
project page / arXiv / code / HuggingFace

SEScore is a reference-based text generation evaluation metric that requires no pre-existing human-annotated data to train on. We develop a novel stratified error synthesis procedure to synthesize diverse errors with varying severity levels from raw text, then learn a neural metric to automatically rate model-generated texts. Its effectiveness over prior methods such as BLEU, BERTScore, COMET, and BLEURT has been demonstrated on various NLG tasks.
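A minimal sketch of the core idea, pairing synthesized errors with pseudo-scores for metric training (the real SEScore uses much richer, severity-weighted error synthesis; the edit operations and the -1-per-error score here are simplifications):

```python
import random

def stratified_perturb(reference_tokens, num_errors, rng=None):
    """Illustrative stratified error synthesis: apply `num_errors` random
    edit operations (drop/duplicate/swap) to a reference, and pair the
    result with a pseudo-score of -num_errors for training a rating model."""
    rng = rng or random.Random(0)
    toks = list(reference_tokens)
    for _ in range(num_errors):
        op = rng.choice(["drop", "dup", "swap"])
        i = rng.randrange(len(toks))
        if op == "drop" and len(toks) > 1:
            toks.pop(i)
        elif op == "dup":
            toks.insert(i, toks[i])
        elif op == "swap" and len(toks) > 1:
            j = rng.randrange(len(toks))
            toks[i], toks[j] = toks[j], toks[i]
    return toks, -num_errors
```

Stratifying over `num_errors` yields training pairs spanning a range of quality levels without any human ratings.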

Selected Collaboration Publications

CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation
Xi Xu, Wenda Xu, Siqi Ouyang, Lei Li
Preprint
arXiv / code

There has been a longstanding belief that current metrics yield unrealistically high latency measurements in unsegmented streaming settings. In this paper, we investigate this phenomenon, revealing its root cause in a fundamental misconception underlying existing latency evaluation approaches.

Uncovering Factor Level Preferences to Improve Human-Model Alignment
Juhyun Oh, Eunsu Kim, Jiseon Kim, Wenda Xu, Inha Cha, William Yang Wang, Alice Oh
Preprint
arXiv / code

We introduce PROFILE (PRObing Factors of InfLuence for Explainability), a novel framework that uncovers and quantifies the influence of specific factors driving preferences. PROFILE's factor level analysis explains the ``why'' behind human-model alignment and misalignment, offering insights into the direction of model improvement.

Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems
Chinmay Dandekar, Wenda Xu, Xi Xu, Siqi Ouyang, Lei Li
EMNLP 2024 Demo
arXiv / code

We build an interactive Translation Canvas to visualize InstructScore, highlighting error spans with explanations and selectively displaying systems' predictions. Translation Canvas assists machine translation researchers in understanding system-level model performance by identifying common errors (their frequency and severity).

Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies
Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, William Yang Wang
TACL 2024
arXiv / code

A survey of LLM self-correction strategies, including training-time, generation-time, and inference-time approaches.

CausalDialogue: Modeling Utterance-level Causality in Conversations
Yi-Lin Tuan, Alon Albalak, Wenda Xu, Michael Saxon, Connor Pryor, Lise Getoor, William Yang Wang
ACL 2023
arXiv / code

This work explores the causal relationships encoded in branching dialogue graphs from RPG video games.

PECO: Examining Single Sentence Label Leakage in Natural Language Inference Datasets through Progressive Evaluation of Cluster Outliers
Michael Saxon, Xinyi Wang, Wenda Xu, William Yang Wang
EACL 2023
arXiv / code

We demonstrated automated detection of spurious, annotator-driven correlations that lead to cheating features in NLI.

Neuro-Symbolic Procedural Planning with Commonsense Prompting
Yujie Lu, Weixi Feng, Wanrong Zhu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, William Yang Wang
ICLR 2023
arXiv / code

This work mitigates spurious correlations by using symbolic program executors on latent procedural representations.

Visualize Before You Write: Imagination-Guided Open-Ended Text Generation
Wanrong Zhu, An Yan, Yujie Lu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, William Yang Wang
EACL 2023
arXiv / code

We introduce a novel paradigm that leverages machine-generated images to guide open-ended text generation, endowing machines with the creative visualization ability that human writers often demonstrate.

Fun Fact: What is the meaning of Wenda(闻达)?

I add this because Starbucks baristas keep writing down "Wendy" or "Wanda".

子张问:“士何如斯可谓之达矣?”
子曰:“何哉,尔所谓达者?”
子张对曰:“在邦必闻,在家必闻。”
子曰:“是闻也,非达也。夫达也者,质直而好义,察言而观色,虑以下人。在邦必达,在家必达。夫闻也者,色取仁而行违,居之不疑。在邦必闻,在家必闻。”

The word "Wenda (闻达)" originates from a conversation between Confucius and his student Zi Zhang. Here is an English translation:

Zi Zhang asked, "What makes a scholar truly accomplished ('Da' means 'accomplished', 达)?"
Confucius asked, "Define 'accomplished'?"
Zi Zhang replied, "To be renowned in the states of feudal lords, and to be renowned in the lands of ministers."
Confucius said, "This is more about fame ('Wen' means 'fame', 闻) than accomplishment. True accomplishment is about honesty, love for righteousness, understanding others, and modesty. Such a person will succeed anywhere. Those who seek fame may pretend to be virtuous, but their actions betray them, leading to hollow fame regardless of where they are."