My major research interests lie in large language model (LLM) evaluation and alignment (at the pre-training, post-training, and inference stages). In one sentence: I want to learn metrics that can assess LLMs' generation quality and align LLMs with well-defined feedback.
I am the first author of SEScore1&2 and InstructScore (Best Unsupervised Text Generation Metric at the WMT22 shared task). Currently, I am actively working on LLM post-training techniques, in both preference learning and knowledge distillation.
We use interleaved sampling, which keeps on-policy student samples that the teacher is also likely to generate, mitigating the low-quality-sample issue of on-policy KD while dynamically switching between supervised and on-policy KD; we call this approach SKD. Our experiments consistently show SKD's superiority in both task-specific and task-agnostic distillation, and SKD is robust across model families, initializations, and dataset sizes.
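The interleaved acceptance rule can be sketched in a few lines. This is a toy illustration with explicit probability tables, not the actual implementation: the top-k acceptance criterion and the fallback of resampling from the teacher are assumptions made for illustration.

```python
import numpy as np

def interleaved_sample(student_probs, teacher_probs, top_k, rng):
    """Generate one sequence by interleaved sampling.

    At each position, propose a token from the student distribution.
    If the proposal falls inside the teacher's top-k, keep it (on-policy
    step); otherwise replace it with a token drawn from the teacher
    (supervised-style step). Returns token ids and the source of each.
    """
    tokens, sources = [], []
    for s_p, t_p in zip(student_probs, teacher_probs):
        proposal = int(rng.choice(len(s_p), p=s_p))
        teacher_topk = set(np.argsort(t_p)[-top_k:].tolist())
        if proposal in teacher_topk:
            tokens.append(proposal)
            sources.append("student")
        else:
            tokens.append(int(rng.choice(len(t_p), p=t_p)))
            sources.append("teacher")
    return tokens, sources

# Toy example: 4-token vocabulary, 3 generation steps. The student
# concentrates mass on token 0, which the teacher considers unlikely,
# so most proposals are replaced by teacher draws.
rng = np.random.default_rng(0)
student = np.array([[0.7, 0.1, 0.1, 0.1]] * 3)
teacher = np.array([[0.05, 0.05, 0.45, 0.45]] * 3)
tokens, sources = interleaved_sample(student, teacher, top_k=2, rng=rng)
```

Keeping accepted student tokens preserves the on-policy distribution the student will actually see at inference time, while the teacher fallback filters out its low-quality continuations.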
Aligning LLM with AI or human feedback (Post-training stage)
Online BPO: 1) we collect data in an on-policy/online fashion; 2) we respect the behavior LLM when constructing the trust region; 3) with only two phases of online data collection, we significantly improve over offline DPO (TL;DR: 72.0%->89.5%, Helpfulness: 82.2%->93.5%, Harmlessness: 77.5%->97.7%).
Data augmentation to align LLM (Pre-training stage)
We develop a self-supervised approach to expert-to-layman text style transfer. We propose a novel SSL task, knowledge base assimilation, to inject knowledge into pre-training. We achieve strong results in human evaluation, relatively improving the overall success rate by 106%!
Using fine-grained LLM agent to align large language model (Inference stage)
Rather than merely criticizing an LLM, can we pinpoint the errors it makes and automatically guide it with fine-grained, actionable feedback? Can we formulate iterative refinement as a local search problem, e.g., simulated annealing? This work follows my prior work InstructScore, where I thought hard about how to incorporate fine-grained actionable feedback to guide text generation. Our fine-grained LLM agent iteratively improves PaLM 2 by 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
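The local-search framing can be illustrated with a minimal simulated-annealing loop. Everything below is a toy stand-in: the character-match score substitutes for a real metric such as MetricX or ROUGE-L, and the random single-character edit substitutes for an LLM revision guided by pinpointed feedback.

```python
import math
import random

def refine(candidate, score_fn, propose_fn, steps=3000, t0=1.0,
           cooling=0.995, seed=0):
    """Simulated-annealing refinement: accept a proposed revision if it
    scores higher, or with probability exp(delta / T) if lower, then
    cool the temperature. Returns the best candidate seen."""
    rng = random.Random(seed)
    cur = best = candidate
    t = t0
    for _ in range(steps):
        nxt = propose_fn(cur, rng)
        delta = score_fn(nxt) - score_fn(cur)
        if delta > 0 or rng.random() < math.exp(delta / t):
            cur = nxt
            if score_fn(cur) > score_fn(best):
                best = cur
        t = max(t * cooling, 1e-9)  # floor keeps exp() well-defined
    return best

# Toy objective: fraction of characters matching a reference string.
TARGET = "fine-grained feedback"

def char_match(s):
    return sum(a == b for a, b in zip(s, TARGET)) / len(TARGET)

def propose(s, rng):
    # Toy "revision": change one random character.
    i = rng.randrange(len(s))
    return s[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz -") + s[i + 1:]

draft = "x" * len(TARGET)
result = refine(draft, char_match, propose)
```

The temperature schedule lets early iterations explore lower-scoring revisions, while late iterations become effectively greedy, which is the usual trade-off when iterative refinement can get stuck in local optima.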
We are the first to define and quantify an LLM's self-bias towards its own outputs: self-feedback displays a model's preference for what it has generated. We find that self-bias is prevalent across all examined LLMs, languages, and tasks (6 LLMs, 4 languages, 3 tasks). To mitigate it, we find that larger model size and external feedback with accurate assessment can significantly reduce bias in the self-refine pipeline, leading to real performance improvements on downstream tasks.
Learning explainable and fine-grained quality feedback
InstructScore is an explainable text generation metric: instead of outputting a single scalar score, it identifies error locations, error types, and severity measures in the candidate text. Our fine-grained 7B LLM evaluator surpasses all other unsupervised metrics, including those based on 175B GPT-3 and GPT-4.
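The kind of structured report such an explainable metric produces can be sketched as follows. The field names and the MQM-style severity penalties (major = -5, minor = -1) are illustrative assumptions, not InstructScore's actual output schema.

```python
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    span: str          # offending text in the candidate
    error_type: str    # e.g. "mistranslation", "omission"
    severity: str      # "major" or "minor"

# Assumed MQM-style penalty weights, for illustration only.
PENALTY = {"major": -5.0, "minor": -1.0}

def aggregate(annotations):
    """Collapse fine-grained annotations into one scalar score, so an
    explainable report can still be compared against scalar metrics."""
    return sum(PENALTY[a.severity] for a in annotations)

report = [
    ErrorAnnotation("teh", "typo", "minor"),
    ErrorAnnotation("bank", "mistranslation", "major"),
]
score = aggregate(report)
```

The point of the structured form is that each annotation is directly actionable, whereas a single scalar only ranks candidates.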
Learning to evaluate the quality of generated text without labels
SESCORE2 is a self-supervised method for training a metric for general text generation tasks without human ratings. We develop a technique to synthesize candidate sentences with varying levels of mistakes for training; to make these self-constructed samples realistic, we introduce retrieval-augmented synthesis on anchor text. It outperforms SEScore on four text generation tasks across three languages (overall Kendall correlation improves by 14.3%).
SEScore is a reference-based text generation evaluation metric that requires no pre-annotated human data to train on. We develop a novel stratified error synthesis procedure to synthesize diverse errors with varying severity levels from raw data; we then learn a neural metric that automatically rates model-generated texts. Its effectiveness over prior methods such as BLEU, BERTScore, COMET, and BLEURT has been demonstrated on various NLG tasks.
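The error-synthesis idea can be sketched as follows: corrupt a clean reference with random perturbations and assign a pseudo-rating from their accumulated severity. The operation set and severity weights are made up for this sketch; SEScore's actual stratified synthesis and severity taxonomy differ.

```python
import random

# Assumed severity weights per perturbation (illustrative only).
SEVERITY = {"drop_word": 1.0, "repeat_word": 1.0, "swap_words": 5.0}

def synthesize(reference, n_errors, rng):
    """Apply n_errors random perturbations to a reference sentence and
    return the corrupted candidate with a pseudo-rating equal to the
    negated total severity of the applied perturbations."""
    words = reference.split()
    penalty = 0.0
    for _ in range(n_errors):
        op = rng.choice(list(SEVERITY))
        i = rng.randrange(len(words))
        if op == "drop_word" and len(words) > 1:
            words.pop(i)
            penalty += SEVERITY[op]
        elif op == "repeat_word":
            words.insert(i, words[i])
            penalty += SEVERITY[op]
        elif op == "swap_words" and len(words) > 1:
            j = rng.randrange(len(words))
            words[i], words[j] = words[j], words[i]
            penalty += SEVERITY[op]
    return " ".join(words), -penalty

rng = random.Random(0)
reference = "the quick brown fox jumps over the lazy dog"
candidate, pseudo_score = synthesize(reference, n_errors=3, rng=rng)
```

Pairs of (corrupted candidate, pseudo-rating) can then serve as training targets for a regression-style neural metric, with no human annotation involved.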
There has been a longstanding belief that current metrics yield unrealistically high latency measurements in unsegmented streaming settings. In this paper, we investigate this phenomenon, revealing its root cause in a fundamental misconception underlying existing latency evaluation approaches.
We introduce PROFILE (PRObing Factors of InfLuence for Explainability), a novel framework that uncovers and quantifies the influence of specific factors driving preferences.
PROFILE's factor-level analysis explains the "why" behind human-model alignment and misalignment, offering insights into the direction of model improvement.
We build an interactive Translation Canvas to visualize InstructScore, highlighting error spans with explanations and selectively displaying systems' predictions. Translation Canvas assists machine translation researchers in understanding system-level model performance by identifying common errors (their frequency and severity).
We introduce a novel paradigm that leverages machine-generated images to guide open-ended text generation, endowing machines with the creative visualization that human writers often demonstrate.
Fun Fact: What is the meaning of Wenda(闻达)?
I add this because Starbucks people keep putting down "Wendy" or "Wanda"
The word "Wenda (闻达)" originates from a conversation between Confucius and his student. Here is the English translation:
Zi Zhang asked, "What makes a scholar truly accomplished ('Da' means 'accomplished', 达)?"
Confucius asked, "What do you mean by 'accomplished'?"
Zi Zhang replied, "To be renowned in the states of feudal lords, and to be renowned in the lands of ministers."
Confucius said, "This is more about fame ('Wen' means 'fame', 闻) than accomplishment. True accomplishment is about honesty, love for righteousness, understanding others, and modesty. Such a person will succeed anywhere. Those who seek fame may pretend to be virtuous, but their actions betray them, leading to hollow fame regardless of where they are."