My research focuses on improving large language models (LLMs) through rigorous evaluation and efficient post-training. I actively develop automated methods to assess model capabilities, including efficient data curation techniques for building challenge benchmarks. To complement this, I design unsupervised and explainable evaluation metrics. My work in this area led to SEScore 1 & 2 and InstructScore, recognized as the best unsupervised metric at the WMT22 shared task. My background also includes developing efficient post-training techniques, from preference learning with BPO to knowledge distillation with Speculative KD.
- Developed and scaled an automated pipeline to generate dynamic challenge sets for evaluating Gemini models across both pre-training and post-training stages.
- Led research efforts in automatic benchmark construction using LLMs, studying both self-bias in LLM-generated benchmarks and efficient automatic challenge-set construction.
- Led research efforts in studying the length bias of translation evaluation metrics (under submission).
- Built a generic KD framework that unifies on-policy and supervised KD and achieves substantial gains in both task-specific and task-agnostic knowledge distillation (now deployed in production at Google Translate).
- Developed an efficient inference-time optimization technique that iteratively refines PaLM 2 outputs at the span level, which has been successfully deployed in production by the YouTube team.
- Developed a learned evaluation metric, trained without human labels, that achieved high correlation with human judgment and greatly improved the assessment of translation quality.
Efficient Post-training
Bridging the performance gap between student and teacher LLMs (Post-training stage)
A generic KD framework that unifies on-policy and supervised KD and achieves substantial gains in both task-specific and task-agnostic knowledge distillation.
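In a minimal sketch (hypothetical model interfaces and a made-up `mix_ratio` knob, not the deployed Speculative KD system), the unifying recipe looks like this: the student is trained to match the teacher's token distribution, and only the source of the training sequences changes, from ground-truth targets (supervised KD) to the student's own samples (on-policy KD).

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Token-level KL(teacher || student) between output distributions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * t * t

def kd_step(student, teacher, prompts, gold_targets, mix_ratio=0.5):
    """One distillation step; mix_ratio interpolates the two regimes."""
    if torch.rand(()) < mix_ratio:
        # On-policy KD: distill on the student's own samples, so training
        # matches the distribution the student will face at inference time.
        with torch.no_grad():
            targets = student.generate(prompts)  # hypothetical interface
    else:
        # Supervised KD: distill on ground-truth target sequences.
        targets = gold_targets
    student_logits = student(prompts, targets)   # hypothetical interface
    with torch.no_grad():
        teacher_logits = teacher(prompts, targets)
    return kd_loss(student_logits, teacher_logits)
```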
Aligning LLMs with AI or human feedback (Post-training stage)
We develop a novel self-supervised approach to expert-layman text style transfer, improving the overall success rate by a relative 106%.
Using a fine-grained autorater to refine large language models (Inference stage)
An efficient inference-time optimization technique that iteratively refines the outputs of the PaLM 2 model at the span level, achieving improvements of 1.7 MetricX on translation, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
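For the curious, here is a minimal sketch of that refinement loop; the `model` and `autorater` interfaces below are hypothetical stand-ins, not the production PaLM 2 integration.

```python
def refine(model, autorater, source, max_iters=5):
    """Iteratively regenerate only the spans the autorater flags as errors."""
    output = model.generate(source)
    for _ in range(max_iters):
        # Assume the autorater returns (start, end, error_type) spans,
        # sorted by severity; an empty list means the output is accepted.
        error_spans = autorater.locate_errors(source, output)
        if not error_spans:
            break
        start, end, error_type = error_spans[0]  # fix the worst span first
        prompt = (
            f"Source: {source}\n"
            f"Draft: {output}\n"
            f"The span '{output[start:end]}' contains a {error_type} error. "
            f"Rewrite only that span."
        )
        new_span = model.generate(prompt)
        output = output[:start] + new_span + output[end:]
    return output
```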
We demonstrate that LLM-generated benchmarks exhibit a systematic self-bias that inflates a model's own performance, and that low source-text diversity is a primary cause.
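One illustrative way to quantify such self-bias (a simplified protocol for exposition, not the paper's exact setup): score every model on benchmarks generated by every model, then compare each model's score on its own benchmark against its average score on the others'.

```python
import numpy as np

def self_bias(score_matrix):
    """score_matrix[i][j] = accuracy of model i on model j's benchmark."""
    s = np.asarray(score_matrix, dtype=float)
    own = np.diag(s)                                    # graded on own benchmark
    mask = ~np.eye(len(s), dtype=bool)
    others = (s * mask).sum(axis=1) / mask.sum(axis=1)  # mean on others' benchmarks
    return own - others                                 # > 0: inflated self-performance

print(self_bias([[0.82, 0.61], [0.58, 0.79]]))  # -> [0.21 0.21]
```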
A fine-grained 7B LLM autorater that detects errors at the span level, trained on synthetic data, surpasses unsupervised metrics, including those based on the 175B GPT-3 and GPT-4.
SEScore is a reference-based text-generation evaluation metric that requires no human-annotated data to train. We develop a novel stratified error synthesis procedure that injects diverse errors of varying severity directly into raw data.
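A minimal sketch of the idea (toy perturbations and made-up severity weights, not the released SEScore pipeline): corrupt raw text with errors of graded severity, and use the accumulated penalty as a pseudo-label for training the metric.

```python
import random

# Hypothetical severity weights, loosely MQM-style: minor errors cost 1,
# major errors cost 5.
PERTURBATIONS = [
    ("drop_word",   1),  # minor: delete a single word
    ("swap_words",  1),  # minor: swap two adjacent words
    ("repeat_word", 1),  # minor: duplicate a word
    ("drop_span",   5),  # major: delete a whole span of words
]

def synthesize(reference, num_errors, rng=random):
    """Return one (corrupted_text, pseudo_score) metric-training example."""
    words, penalty = reference.split(), 0
    for _ in range(num_errors):
        op, severity = rng.choice(PERTURBATIONS)
        i = rng.randrange(len(words))
        applied = True
        if op == "drop_word" and len(words) > 1:
            words.pop(i)
        elif op == "swap_words" and i + 1 < len(words):
            words[i], words[i + 1] = words[i + 1], words[i]
        elif op == "repeat_word":
            words.insert(i, words[i])
        elif op == "drop_span" and len(words) > 3:
            del words[i : i + 3]
        else:
            applied = False  # guard failed; skip without charging a penalty
        if applied:
            penalty += severity
    return " ".join(words), -penalty  # more / worse errors -> lower score

corrupted, score = synthesize("the quick brown fox jumps over the lazy dog", 3)
```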
A novel framework that uncovers and quantifies the influence of the specific factors driving preferences, exposing the gap between a model's understanding and its generation capability.
We introduce a novel paradigm that leverages machine-generated images to guide open-ended text generation, endowing machines with the creative visualization that human writers often demonstrate.
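A rough sketch of the paradigm (hypothetical `lm`, `text_to_image`, and `captioner` interfaces, not the paper's exact architecture): render the story so far as an image, verbalize what the image depicts, and condition the continuation on those visual details.

```python
def visually_guided_continue(lm, text_to_image, captioner, story, steps=3):
    """Interleave 'imagining' a scene with writing the next passage."""
    for _ in range(steps):
        image = text_to_image.render(story)  # visualize the story so far
        scene = captioner.describe(image)    # turn the image back into words
        prompt = f"{story}\n[Scene details: {scene}]\nContinue the story:"
        story += lm.generate(prompt)
    return story
```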
Fun Fact: What is the meaning of Wenda(闻达)?
I add this because the Starbucks baristas keep writing down "Wendy" or "Wanda".
The word "Wenda (闻达)" originates from a conversation between Confucius and his student. Here is an English translation:
Zi Zhang asked, "What makes a scholar truly accomplished ('Da' means 'accomplished', 达)?"
Confucius asked, "What do you mean by 'accomplished'?"
Zi Zhang replied, "To be renowned in the states of feudal lords, and to be renowned in the lands of ministers."
Confucius said, "This is more about fame ('Wen' means 'fame', 闻) than accomplishment. True accomplishment is about honesty, love for righteousness, understanding others, and modesty. Such a person will succeed anywhere. Those who seek fame may pretend to be virtuous, but their actions betray them, leading to hollow fame regardless of where they are."