Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
LoRA adapts large language models efficiently by training small rank-decomposition matrices instead of fine-tuning all parameters. Because only these small matrices are trained and stored per task, fine-tuning becomes practical for resource-constrained organizations, multi-task deployment becomes cheaper, and the compute cost and environmental impact of adaptation drop. It has become a foundational technique for practical LLM adaptation in industry and research.
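The low-rank idea can be illustrated with a minimal NumPy sketch. It assumes a single linear layer with a frozen weight `W` and two small trainable matrices `A` and `B` of rank `r`. The layer sizes, the `alpha / r` scaling, the zero initialization of `B`, and the function name `lora_forward` are illustrative choices that do not appear in the summary above. The scaling and initialization follow the convention described in the LoRA paper.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Frozen projection plus a scaled low-rank update: x W^T + (alpha/r) x A^T B^T."""
    frozen = x @ W.T                  # original, untrained path
    update = (x @ A.T) @ B.T          # low-rank path through rank-r bottleneck
    return frozen + (alpha / r) * update

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4            # hypothetical layer sizes and rank

W = rng.normal(size=(d_out, d_in))    # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))              # trainable, zero init so the update starts at 0

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, alpha=8, r=r)

full_params = W.size                  # parameters touched by full fine-tuning
lora_params = A.size + B.size         # parameters trained under this sketch
print(full_params, lora_params)
```

With `B` initialized to zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` would receive gradient updates during training. The parameter saving grows with layer size, since the trainable count scales with `r * (d_in + d_out)` rather than `d_in * d_out`.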
Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, Pascale Fung
VL-JEPA is a vision-language model that predicts continuous text embeddings instead of generating tokens, achieving better performance with 50% fewer parameters than standard VLMs. This work demonstrates that embedding-space prediction is a more efficient alternative to autoregressive token generation for vision-language tasks, enabling smaller models to match larger ones' performance. The approach reduces computational costs at inference through selective decoding and enables flexible task support without architectural changes, making it practical for resource-constrained applications.
Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert
Ragas is a reference-free evaluation framework that assesses different dimensions of Retrieval-Augmented Generation (RAG) systems without requiring ground-truth annotations. As organizations rapidly adopt RAG systems built on LLMs, automated evaluation is critical for development velocity and risk mitigation. The framework lets practitioners iterate quickly on RAG architectures and identify failure modes related to hallucinations, retrieval quality, or generation fidelity without costly human evaluation campaigns.
Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, Volodymyr Kuleshov
Mercury introduces ultra-fast diffusion-based language models that generate multiple tokens in parallel, achieving state-of-the-art speed while maintaining competitive quality on coding tasks. The work addresses a critical bottleneck in LLM deployment, inference latency, by delivering large speedups while keeping quality competitive. This makes advanced AI coding assistants more practical for real-time applications and reduces computational costs for developers and enterprises.