Wenda Xu

I am a fourth-year PhD student in the UCSB NLP group, co-advised by Prof. William Wang and Prof. Lei Li. I am currently a visiting scholar at the CMU Language Technologies Institute (Email: wendaxu@ucsb.edu, wendaxu@andrew.cmu.edu).

Research Interests

My main research interests lie in text generation evaluation and large language model (LLM) alignment. In one sentence, I want to design methods that enable LLMs to learn to generate actionable feedback (either as a quality score or as a natural language diagnostic report) and to use that feedback to align LLMs with human principles.

Currently, I work actively on text generation evaluation (both quality and interpretability). I am the first author of SEScore 1&2 and InstructScore (best unsupervised text generation metric at the WMT22 shared task). I am also interested in fine-grained-feedback-guided text generation and in extending this generic pipeline to multilingual and multimodal content generation.

CV  /  Bio  /  Google Scholar  /  Twitter  /  Github

First-Author Publications
Perils of Self-Feedback: Self-Bias Amplifies in Large Language Models
Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, William Yang Wang
Preprint
project page / arXiv / code

Self-feedback exposes a model's bias toward its own outputs. We find that self-bias is prevalent in all examined LLMs across multiple languages and tasks (6 LLMs, 4 languages, 3 tasks). To mitigate such bias, we find that larger model size and external feedback with accurate assessments can significantly reduce bias in the self-refine pipeline, leading to real performance improvements on downstream tasks.

Pinpoint, Not Criticize: Refining Large Language Models via Fine-Grained Actionable Feedback
Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhongtao Liu, William Yang Wang, Lei Li, Markus Freitag
NAACL 2024
project page / arXiv / code

Rather than merely criticizing an LLM, can we pinpoint the errors it makes and automatically guide it with fine-grained actionable feedback? Can we formulate iterative refinement as a local search problem, e.g., simulated annealing? This is the follow-up to my prior work InstructScore, in which I think carefully about how to incorporate fine-grained actionable feedback to guide text generation. InstructScore offers not only quality judgments but also actionable feedback to improve LLMs!
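
To illustrate the "refinement as local search" framing, here is a minimal simulated-annealing-style sketch. The propose and score functions are hypothetical stand-ins for an LLM revision step and a fine-grained feedback metric; this is not the actual pipeline from the paper, just the acceptance loop the analogy refers to.

import math
import random

def refine_with_annealing(initial_output, propose, score, steps=20, t0=1.0, cooling=0.9):
    # Simulated-annealing-style refinement: keep a current candidate, propose a
    # revision guided by feedback, and accept it probabilistically.
    current, current_score = initial_output, score(initial_output)
    temperature = t0
    for _ in range(steps):
        candidate = propose(current, current_score)   # revise guided by feedback
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; occasionally accept worse candidates
        # (with probability exp(delta / T)) to escape local optima.
        if delta > 0 or random.random() < math.exp(delta / max(temperature, 1e-8)):
            current, current_score = candidate, candidate_score
        temperature *= cooling                         # cool down over iterations
    return current

# Toy usage: "outputs" are numbers and the "metric" prefers values near 10.
best = refine_with_annealing(
    initial_output=0.0,
    propose=lambda x, s: x + random.uniform(-1, 2),
    score=lambda x: -abs(10 - x),
)
print(best)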

INSTRUCTSCORE: Explainable Text Generation Evaluation with Fine-grained Feedback
Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, Lei Li
EMNLP Main 2023 (Oral)
project page / arXiv / code

InstructScore is an explainable text generation metric: instead of outputting only a scalar score, it outputs error locations, error types, and severity measures for the candidate text. It achieves high correlation with human judgments on four text generation tasks (translation, table-to-text, captioning, and commonsense generation) and generalizes to an unseen task, keyword-to-text.

SEScore2: Retrieval Augmented Pretraining for Text Generation Evaluation
Wenda Xu, Xian Qian, Mingxuan Wang, Lei Li, William Yang Wang
ACL Main 2023
project page / arXiv / code

SEScore2 is a self-supervised method for training a metric for general text generation tasks without human ratings. We develop a technique to synthesize candidate sentences with varying levels of mistakes for training. To make these self-constructed samples realistic, we introduce retrieval-augmented synthesis on anchor text. SEScore2 outperforms SEScore on four text generation tasks across three languages (overall Kendall correlation improves by 14.3%).

Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis
Wenda Xu, Yi-lin Tuan, Yujie Lu, Michael Saxon, Lei Li, William Yang Wang
EMNLP 2022, WMT22 metrics shared task (best unsupervised metric)
project page / arXiv / code / HuggingFace

SEScore is a reference-based text generation evaluation metric that requires no human-annotated error data. It uses a novel stratified error synthesis procedure to generate diverse errors with varying severity levels. Its advantage over prior metrics such as BLEU, BERTScore, COMET, and BLEURT has been demonstrated on various NLG tasks.

Self-Supervised Knowledge Assimilation for Expert-Layman Text Style Transfer
Wenda Xu, Michael Saxon, Misha Sra, William Yang Wang
AAAI 2022
project page / arXiv / code

We develop a self-supervised approach to expert-layman text style transfer. We propose a novel self-supervised task, knowledge base assimilation, to inject knowledge during pretraining. Our approach achieves strong performance in human evaluation.

Collaboration Publications
Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies
Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, William Yang Wang
TACL 2024
arXiv / code

A survey of self-correction strategies for LLMs, including training-time, generation-time, and inference-time approaches.

CausalDialogue: Modeling Utterance-level Causality in Conversations
Yi-Lin Tuan, Alon Albalak, Wenda Xu, Michael Saxon, Connor Pryor, Lise Getoor, William Yang Wang
ACL 2023
arXiv / code

This work explores the causal relationships encoded in branching dialogue graphs from RPG video games.

PECO: Examining Single Sentence Label Leakage in Natural Language Inference Datasets through Progressive Evaluation of Cluster Outliers
Michael Saxon, Xinyi Wang, Wenda Xu, William Yang Wang
EACL 2023
arXiv / code

We demonstrate automated detection of spurious, annotator-driven correlations that lead to cheating features in NLI datasets.

Neuro-Symbolic Procedural Planning with Commonsense Prompting
Yujie Lu, Weixi Feng, Wanrong Zhu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, William Yang Wang
ICLR 2023
arXiv / code

This work mitigates spurious correlations by using symbolic program executors on latent procedural representations.



Visualize Before You Write: Imagination-Guided Open-Ended Text Generation
Wanrong Zhu, An Yan, Yujie Lu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, William Yang Wang
EACL 2023
arXiv / code

We introduce a novel paradigm that leverages machine-generated images to guide open-ended text generation, endowing machines with the creative visualization ability that human writers often demonstrate.





BrainSec: Automated Brain Tissue Segmentation Pipeline for Scalable Neuropathological Analysis
Zhengfeng Lai, Luca Cerny Oliveira, Runlin Guo, Wenda Xu, Zin Hu, Kelsey Mifflin, Charles Decarli, Sen-Ching Cheung, Chen-Nee Chuah, Brittany N Dugger
IEEE Access
arXiv / code

We propose a patch-based approach, BrainSec, to classify gray matter (GM), white matter (WM), and background regions. We integrate BrainSec with an amyloid-β pathology classification model to identify pathology distributions and quantify them in the segmented GM and WM regions.




Fun Fact: What is the meaning of Wenda(闻达)?

I added this because Starbucks baristas keep writing down "Wendy" or "Wanda".

子张问:“士何如斯可谓之达矣?”
子曰:“何哉,尔所谓达者?”
子张对曰:“在邦必闻,在家必闻。”
子曰:“是闻也,非达也。夫达也者,质直而好义,察言而观色,虑以下人。在邦必达,在家必达。夫闻也者,色取仁而行违,居之不疑。在邦必闻,在家必闻。”

The word "Wenda (闻达)" originates from a conversation between Confucius and his student Zi Zhang. Here is an English translation:

Zi Zhang asked, "What makes a scholar truly accomplished ('Da', 达, means 'accomplished')?"
Confucius asked, "What do you mean by 'accomplished'?"
Zi Zhang replied, "To be renowned in the states of the feudal lords, and to be renowned in the households of ministers."
Confucius said, "That is fame ('Wen', 闻, means 'fame'), not accomplishment. True accomplishment is about honesty, love of righteousness, understanding others, and modesty. Such a person will succeed anywhere. Those who seek fame may pretend to be virtuous while their actions betray them, leading to hollow fame wherever they are."