Research
I work on making large language models more reliable, particularly for code generation. Today's models are impressively fluent, but fluency isn't correctness. Code that looks right but breaks when executed, misses an edge case, or violates a security invariant is worse than no code at all. I believe the root cause is how these models are trained: if the supervision signal doesn't encode what "correct" actually means, the model won't learn to produce correct outputs.
My research addresses this by designing structured training signals grounded in program semantics. Over the past decade, this has taken different forms:
- Enforcing syntactic correctness through tree-based structured generation (CODIT)
- Learning code representations through large-scale pretraining (PLBART, NatGen)
- Fine-tuning with formal verification for provable functional correctness (Proof-Oriented Programming)
- Building execution-driven systems (DeepTest, DeepProof) that turn test failures and verification outcomes into training signal
- Developing localized preference optimization for secure code generation (ACL'25)
The common thread: correctness is multi-dimensional (syntax, semantics, invariants, robustness), and each dimension needs its own feedback. I build systems that compose these signals (tests, static analysis, verification, execution feedback) into training objectives that teach models what "right" means from multiple angles.
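To make the idea of composing signals concrete, here is a minimal sketch, not the implementation of any system listed above. It assumes each correctness dimension can be scored in [0, 1] and combined as a weighted average. Every name in it (`syntax_signal`, `make_test_signal`, `composite_reward`, the weights) is hypothetical and chosen only for illustration. Only two signals are shown; static analysis or verification would plug in as further functions of the same shape.

```python
import ast
from typing import Callable, Dict, List, Tuple

# Hypothetical convention: a signal maps candidate source code to a score in [0, 1].
Signal = Callable[[str], float]


def syntax_signal(code: str) -> float:
    """Return 1.0 if the candidate parses as Python, else 0.0."""
    try:
        ast.parse(code)
        return 1.0
    except (SyntaxError, ValueError):
        return 0.0


def make_test_signal(tests: List[Tuple[str, tuple, object]]) -> Signal:
    """Build a signal that returns the fraction of passing test cases.

    Each test case is (function name, positional args, expected result).
    """
    def signal(code: str) -> float:
        namespace: dict = {}
        try:
            # Illustration only: a real system would run this in a sandbox.
            exec(code, namespace)
        except Exception:
            return 0.0
        passed = 0
        for name, args, expected in tests:
            try:
                if namespace[name](*args) == expected:
                    passed += 1
            except Exception:
                pass  # a crash or missing function counts as a failed test
        return passed / len(tests) if tests else 0.0
    return signal


def composite_reward(code: str,
                     signals: Dict[str, Signal],
                     weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores (assumed combination rule)."""
    total = sum(weights[name] for name in signals)
    return sum(weights[name] * signals[name](code) for name in signals) / total


signals = {
    "syntax": syntax_signal,
    "tests": make_test_signal([("absolute", (-3,), 3), ("absolute", (4,), 4)]),
}
weights = {"syntax": 1.0, "tests": 3.0}

correct = "def absolute(x):\n    return x if x >= 0 else -x\n"
plausible_but_wrong = "def absolute(x):\n    return x\n"

print(composite_reward(correct, signals, weights))
print(composite_reward(plausible_but_wrong, signals, weights))
```

The second candidate parses cleanly and passes one of the two tests, so it earns partial credit on the test signal while the syntax signal alone cannot tell it apart from the correct one. That is the sense in which each dimension needs its own feedback.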
Publications
Full list on Google Scholar