Large language models can now generate code that looks right. But looking right and being right are very different things. Code that passes a cursory glance often breaks under execution, ignores edge cases, or violates assumptions that experienced engineers take for granted. In programming, correctness isn't a matter of opinion—it's grounded in execution semantics and observable behavior. The gap between what models produce and what they should produce is where my research lives.
I believe this gap is, at its core, a training problem. Models reflect the signals they're optimized for. When supervision is shallow—matching token patterns, satisfying surface-level similarity metrics—models will learn to produce outputs that satisfy those criteria without deeply understanding the task. To make models reliably correct, we need to rethink what we train them on, how we provide feedback, and how that feedback connects back to the actual semantics of the task.
How I Got Here
My path to this research question developed over a decade of working on code generation and understanding. Early in my career, I worked on CODIT, a tree-based code editing model. The core insight was simple but important: if you want models to produce syntactically valid code, you should build that structure into the generation process rather than hoping the model figures it out. That work taught me that architectural choices about how models produce outputs directly shape what they can reliably produce.
As the field moved toward large-scale pretraining, I contributed to PLBART and NatGen—models that learned code representations from large corpora. These were effective for many tasks, but they also revealed a ceiling: pretraining gives models fluency and reasonable generalization, but it doesn't give them correctness guarantees. A model pretrained on millions of programs can produce plausible code without understanding invariants, preconditions, or the semantic implications of its choices.
That realization pushed me toward formal methods. In my work on Proof-Oriented Programming, I studied what happens when models operate in languages with formal semantics—where every function comes with a specification, and correctness means something precise and verifiable. This work convinced me that correctness isn't one thing. It's a stack: syntax at the bottom, then semantics, then invariants, then robustness. Each layer requires its own kind of feedback, and you can't collapse them all into a single number.
Connecting Models to Reality
My recent work at Microsoft Research focuses on closing the loop between what models generate and what actually works. DeepTest is an agent-driven testing system that takes model-generated code and subjects it to rigorous evaluation: running it, measuring coverage, and checking for vulnerabilities. It works at scale across Windows components and Azure services, and the testing pipeline itself is powered by symbolic program analysis combined with LLM capabilities. DeepProof goes further by integrating theorem proving and formal verification into the training process, so that models don't just generate proofs; they learn from verification failures.
The key idea in both systems is that evaluation signals can become training signals. When a test fails or a proof doesn't check, that's not just a measurement—it's information that should flow back into the model. This is conceptually different from the standard train-then-evaluate paradigm. Instead, the evaluation infrastructure becomes part of the training infrastructure, creating a continuous loop where the model improves by confronting the consequences of its own outputs.
My recent work on preference optimization (ACL'25) explores this from a different angle. Standard RLHF-style methods use coarse reward signals: a whole response is either good or bad. But for code, we can do better. We can localize preferences to specific lines, specific security patterns, specific behavioral properties. By distilling preferences into structured, fine-grained signals and aligning them with concrete aspects of correctness (like secure coding practices), we get reward signals that actually point the model in meaningful directions.
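To make "localizing preferences to specific lines" concrete, here is a minimal sketch. It is not the method from the paper; it assumes we already have a rejected and a preferred version of the same snippet and simply uses a line diff to decide which lines of the rejected version the preference is about. The function names `changed_lines` and `line_weights` and the example snippets are hypothetical.

```python
import difflib
from typing import List


def changed_lines(rejected: str, preferred: str) -> List[int]:
    """Return 0-based indices of lines in `rejected` that were
    replaced or deleted to obtain `preferred`."""
    rej = rejected.splitlines()
    pref = preferred.splitlines()
    matcher = difflib.SequenceMatcher(a=rej, b=pref, autojunk=False)
    flagged: List[int] = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            flagged.extend(range(i1, i2))
    return flagged


def line_weights(rejected: str, preferred: str) -> List[float]:
    """Per-line weights for the rejected snippet: 1.0 on lines the
    preference is about, 0.0 elsewhere (an illustrative choice)."""
    flagged = set(changed_lines(rejected, preferred))
    n_lines = len(rejected.splitlines())
    return [1.0 if i in flagged else 0.0 for i in range(n_lines)]


# Hypothetical example pair: the only difference is the shell-injection risk.
REJECTED = (
    "import subprocess\n"
    "def list_dir(path):\n"
    "    return subprocess.run('ls ' + path, shell=True)\n"
)
PREFERRED = (
    "import subprocess\n"
    "def list_dir(path):\n"
    "    return subprocess.run(['ls', path])\n"
)

if __name__ == "__main__":
    print(changed_lines(REJECTED, PREFERRED))
    print(line_weights(REJECTED, PREFERRED))
```

With weights like these, a preference loss could concentrate on the one line that differs rather than spreading credit over the whole response. How those weights enter the objective is a separate design choice that this sketch leaves open.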
The Compositional Feedback Thesis
If there's one idea that ties all of this together, it's that reliable models need composable feedback. I think of feedback as falling into two categories:
Guidance Signals
Constrain what the model can produce—syntax requirements, type constraints, invariants that must hold. These shape the output space before generation.
Evaluation Signals
Assess what the model did produce—through execution, testing, static analysis, or verification. These provide post-hoc judgment.
The challenge is that these signals are heterogeneous, partial, and sometimes contradictory. Tests check behavior but miss edge cases. Static analysis is conservative and noisy. Verification is precise but expensive and often incomplete. No single signal captures everything. The research problem is: how do you combine them into something coherent that a model can learn from?
My approach is to treat this as a systems problem. Rather than designing one monolithic reward function, I build pipelines where multiple forms of feedback feed into training at different stages and different granularities. This compositional view allows each signal to contribute what it's good at without being asked to do more than it can.
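One way to picture this compositional view is the small sketch below. It is an illustration under stated assumptions, not a description of any production system: each signal is a separate check that returns a structured record, a signal may abstain when it has nothing reliable to say, and nothing is collapsed into a single score. The `Feedback` type, the check functions, and the target function name `add` are all hypothetical.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Feedback:
    name: str
    stage: str               # "guidance" or "evaluation"
    passed: Optional[bool]   # None means the signal abstained
    detail: str = ""


def syntax_check(source: str) -> Feedback:
    """Guidance-style signal: is the output even valid Python?"""
    try:
        ast.parse(source)
        return Feedback("syntax", "guidance", True)
    except SyntaxError as exc:
        return Feedback("syntax", "guidance", False, str(exc))


def unit_test_check(source: str) -> Feedback:
    """Evaluation-style signal: run two tests against a hypothetical
    function named `add`. Abstains if the function cannot be loaded."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # illustration only; sandbox real code
    except Exception as exc:
        return Feedback("tests", "evaluation", None, f"could not load: {exc!r}")
    fn = namespace.get("add")
    if fn is None:
        return Feedback("tests", "evaluation", None, "no function named add")
    try:
        ok = fn(2, 3) == 5 and fn(-1, 1) == 0
    except Exception as exc:
        return Feedback("tests", "evaluation", False, repr(exc))
    return Feedback("tests", "evaluation", ok)


def collect_feedback(
    source: str, checks: List[Callable[[str], Feedback]]
) -> List[Feedback]:
    """Run checks in order and keep every result as its own record.
    A failed guidance check stops the run, since later signals would
    not be meaningful on invalid output."""
    results: List[Feedback] = []
    for check in checks:
        fb = check(source)
        results.append(fb)
        if fb.stage == "guidance" and fb.passed is False:
            break
    return results
```

Keeping the records separate means a downstream training stage can decide which signals to trust and at what granularity, rather than inheriting one fixed weighting.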
Building the Systems
In practice, this research results in end-to-end pipelines that wire together generation, execution, analysis, and training. The model produces output. The output gets executed, tested, analyzed. The results flow back as training signal. The model updates. This isn't a one-shot process—it's a continuous cycle where models learn from their own behavior in realistic environments.
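A stripped-down version of one round of that cycle might look like the sketch below. It assumes candidates are plain Python strings that arrive from some generator (not shown), that tests are assertion statements appended to the candidate, and that the model update happens elsewhere. `TrainingRecord`, `run_candidate`, and `feedback_round` are hypothetical names, and running untrusted code this way would need real sandboxing in practice.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrainingRecord:
    prompt: str
    candidate: str
    passed: bool
    log: str  # stderr from the run, kept as part of the signal


def run_candidate(
    candidate: str, test_code: str, timeout_s: float = 10.0
) -> Tuple[bool, str]:
    """Execute candidate plus tests in a fresh interpreter process."""
    program = candidate + "\n" + test_code
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return proc.returncode == 0, proc.stderr.strip()


def feedback_round(
    prompt: str, candidates: List[str], test_code: str
) -> List[TrainingRecord]:
    """Turn execution outcomes into records a trainer could consume."""
    records: List[TrainingRecord] = []
    for candidate in candidates:
        passed, log = run_candidate(candidate, test_code)
        records.append(TrainingRecord(prompt, candidate, passed, log))
    return records
```

The point of keeping the failure log next to the pass/fail bit is that the record carries why a candidate failed, which is the part of the evaluation that a coarse score throws away.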
With the rise of agentic workflows—where models interact with tools, write and run code, call APIs, and iterate—these feedback loops become even richer. Every interaction trace is potential training data. Every tool invocation produces observable outcomes. The question isn't whether we have enough signal; it's how to structure the signal we have so that learning is stable and productive.
There are real challenges here. Models can game weak signals (reward hacking is real and persistent). Feedback that's too noisy degrades training rather than improving it. And the computational cost of running execution-based feedback at scale is non-trivial. These aren't theoretical concerns—they're engineering constraints that shape what's feasible.
What's Next
I'm pursuing three directions going forward:
Automatic Supervision Generation
Can we derive training signals from natural language specs, code structure, documentation, and even the model's own uncertainty? The bottleneck in training isn't usually compute—it's the quality and availability of supervision.
Compositional Reward Models
Rather than learning a single reward function, can we learn modular components that assess different aspects of correctness and combine them dynamically?
Agentic Feedback Loops
As models increasingly operate as agents in complex environments, the interaction traces themselves become a rich source of supervision. I want to develop principled methods for learning from these interactions.
The thread connecting all of this is a belief that we shouldn't accept unreliable models and then patch their behavior externally. Reliability should be a property that emerges from how a model is trained—from the structure, granularity, and correctness of the signals it learns from. That's the goal I'm working toward.