YASA: Scalable Multi-Language Taint Analysis on the Unified AST at Ant Group (arxiv.org)
from yogthos@lemmy.ml to security@lemmy.ml on 04 Apr 06:35
https://lemmy.ml/post/45445089

This paper tackles a massive headache in enterprise software engineering: securing codebases that mix multiple programming languages. Large companies rely on a mix of Java, Go, Python, and JavaScript to build their platforms. Historically, security teams had to run a separate static analysis tool for each language, which scales terribly and produces a fragmented mess of vulnerability reports. According to the paper, existing multi-language frameworks like CodeQL and Joern fall short in different ways. CodeQL relies heavily on language-specific extractors, which means only a tiny fraction of its codebase is actually shared across languages. Joern pushes everything into a unified graph but ends up over-abstracting the code and losing critical nuances, such as how JavaScript handles prototype chains.

To fix this, the researchers built YASA, a framework based on a Unified Abstract Syntax Tree (UAST). The core innovation is that, instead of dropping down to a low-level representation or abstracting away too much detail, the UAST sorts syntax elements into categories. It creates universal nodes for common concepts like basic control flow and loops. It explicitly preserves language-specific nodes for unique features like Go channels or Python generators, so important context is not lost. And it breaks syntactic sugar down into simpler universal components.
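To make the three categories concrete, here is a minimal Python sketch. It is not YASA's actual data model: every class name and the `desugar` helper are hypothetical, made up for illustration. It shows a few universal nodes, one preserved language-specific node, and one piece of syntactic sugar (augmented assignment) being lowered into universal nodes.

```python
from dataclasses import dataclass, field
from typing import List

# Universal nodes: concepts shared by most languages (hypothetical names).
@dataclass
class Name:
    id: str

@dataclass
class Const:
    value: object

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

@dataclass
class Assign:
    target: Name
    value: object

# Language-specific node: kept as-is so later analysis can treat it specially.
@dataclass
class LangSpecific:
    language: str          # e.g. "go"
    kind: str              # e.g. "chan_send"
    children: List[object] = field(default_factory=list)

# Syntactic sugar: exists only in the surface syntax, e.g. `x += 1`.
@dataclass
class AugAssign:
    target: Name
    op: str
    value: object

def desugar(node):
    """Recursively rewrite sugar into universal nodes; keep everything else."""
    if isinstance(node, AugAssign):
        # x += v  becomes  x = x + v
        return Assign(node.target,
                      BinaryOp(node.op, node.target, desugar(node.value)))
    if isinstance(node, Assign):
        return Assign(node.target, desugar(node.value))
    if isinstance(node, BinaryOp):
        return BinaryOp(node.op, desugar(node.left), desugar(node.right))
    if isinstance(node, LangSpecific):
        return LangSpecific(node.language, node.kind,
                            [desugar(c) for c in node.children])
    return node  # Name, Const: already universal leaves

lowered = desugar(AugAssign(Name("x"), "+", Const(1)))
send = desugar(LangSpecific("go", "chan_send", [Name("ch"), Name("x")]))
```

After lowering, a downstream analyzer only has to understand `Assign` and `BinaryOp` for the sugar case, while the Go channel send is still recognizable as such.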

Once the code is parsed into this unified format, YASA runs a precise points-to analyzer over it. This engine processes universal concepts with shared logic but hands off unusual language-specific behaviors to dedicated handlers. The taint checker then traces how potentially malicious data flows through the application to find vulnerabilities. Because the core logic is shared, adding support for a new language takes significantly less engineering effort than with older tools.
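The "shared logic plus dedicated handlers" split can be sketched in a few lines of Python. This is a deliberately simplified, straight-line taint tracker, not a points-to analysis and not YASA's implementation. The statement tuples, the `SOURCES` and `SINKS` sets, the function names `read_request` and `run_sql`, and the `HANDLERS` registry are all assumptions invented for the example.

```python
# Hypothetical taint sources and sinks for the example.
SOURCES = {"read_request"}
SINKS = {"run_sql"}

def go_chan_send(tainted, dst, srcs):
    # Weak update: a channel that ever received tainted data stays tainted.
    if any(s in tainted for s in srcs):
        tainted.add(dst)

def go_chan_recv(tainted, dst, srcs):
    # A value received from a tainted channel is tainted.
    if any(s in tainted for s in srcs):
        tainted.add(dst)
    else:
        tainted.discard(dst)

# Language-specific behavior lives in a registry keyed by (language, kind).
HANDLERS = {
    ("go", "chan_send"): go_chan_send,
    ("go", "chan_recv"): go_chan_recv,
}

def analyze(stmts):
    """Return (sink, argument) pairs where tainted data reaches a sink."""
    tainted = set()
    findings = []
    for stmt in stmts:
        op = stmt[0]
        if op == "assign":                      # shared logic
            _, dst, src = stmt
            if src in tainted:
                tainted.add(dst)
            else:
                tainted.discard(dst)
        elif op == "call":                      # shared logic
            _, dst, func, args = stmt
            if func in SINKS:
                findings.extend((func, a) for a in args if a in tainted)
            if dst is not None:
                if func in SOURCES or any(a in tainted for a in args):
                    tainted.add(dst)
                else:
                    tainted.discard(dst)
        elif op == "special":                   # delegated to a handler
            _, lang, kind, dst, srcs = stmt
            HANDLERS[(lang, kind)](tainted, dst, srcs)
    return findings

program = [
    ("call", "user", "read_request", []),            # user = read_request()
    ("special", "go", "chan_send", "ch", ["user"]),  # ch <- user
    ("special", "go", "chan_recv", "q", ["ch"]),     # q := <-ch
    ("call", None, "run_sql", ["q"]),                # run_sql(q)
]
findings = analyze(program)
```

In this sketch, supporting another language feature means registering one more handler; the `assign` and `call` branches stay untouched, which is the effort saving the paragraph above describes.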

The Ant Group team ran YASA on over one hundred million lines of internal code across thousands of applications. It identified over three hundred previously unknown taint paths, and security experts confirmed 92 of them as actual zero-day vulnerabilities. On standard benchmarks, the paper reports that YASA outperformed top single-language analyzers and beat CodeQL and Joern in both soundness and completeness.

#security
