publications | Shivanshu Gupta

2025

NeurIPS

LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions

Hadi Askari, Shivanshu Gupta, Fei Wang, and 2 more authors

2025

@misc{askari2025layerifestimatinglayerquality,
  title = {LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions},
  author = {Askari, Hadi and Gupta, Shivanshu and Wang, Fei and Chhabra, Anshuman and Chen, Muhao},
  year = {2025},
  eprint = {2505.23811},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2505.23811},
  comment = {rJfX4m7vDkIC}
}

arXiv

Leveraging In-Context Learning for Language Model Agents

Shivanshu Gupta, Sameer Singh, Ashish Sabharwal, and 2 more authors

2025

arXiv Bib

@misc{gupta2025leveragingincontextlearninglanguage,
  title = {Leveraging In-Context Learning for Language Model Agents},
  author = {Gupta, Shivanshu and Singh, Sameer and Sabharwal, Ashish and Khot, Tushar and Bogin, Ben},
  year = {2025},
  eprint = {2506.13109},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2506.13109},
  comment = {rJfX4m7vDkIC}
}

arXiv

Unraveling Indirect In-Context Learning Using Influence Functions

Hadi Askari, Shivanshu Gupta, Terry Tong, and 3 more authors

2025

arXiv Bib

@misc{askari2025unravelingindirectincontextlearning,
  title = {Unraveling Indirect In-Context Learning Using Influence Functions},
  author = {Askari, Hadi and Gupta, Shivanshu and Tong, Terry and Wang, Fei and Chhabra, Anshuman and Chen, Muhao},
  year = {2025},
  eprint = {2501.01473},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  url = {https://arxiv.org/abs/2501.01473},
  comment = {Y0pCki6q_DkC}
}

2024

ICML
GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks

Shivanshu Gupta, Clemens Rosenbaum, and Ethan R. Elenberg

Jul 2024

Abs arXiv Bib PDF

In-Context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts comprising a few task examples. However, ICL performance can be critically sensitive to the choice of examples. To dynamically select the best examples for every test input, we propose Example Gisting, a novel approach for training example encoders through supervised finetuning with an attention bottleneck between the inputs and outputs. These gist models form the basis for GistScore, a novel metric for scoring and selecting informative examples. Further, we experiment with two variations: (1) finetuning gist models for each dataset and (2) multi-task training a single model on a large collection of datasets. The latter can be used for new tasks out-of-the-box, enabling a training-free ICL pipeline. Evaluations with 21 datasets spanning 9 tasks and 8 diverse LLMs show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers and 5% over the best prior methods. Further, our multi-task model generalizes well to new tasks, datasets, and prompt templates. Selection using this model matches or outperforms prior methods while being three orders of magnitude faster than the strongest training-free baseline.
@inproceedings{gupta2023gistscore, title = {{G}ist{S}core: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks}, author = {Gupta, Shivanshu and Rosenbaum, Clemens and Elenberg, Ethan R.}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {17072--17099}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = jul, publisher = {PMLR}, url = {https://proceedings.mlr.press/v235/gupta24c.html}, comment = {zYLM7Y9cAGgC} }
NAACL
Leveraging Code to Improve In-Context Learning for Semantic Parsing

Ben Bogin^*, Shivanshu Gupta^*, Peter Clark, and 1 more author

Jun 2024

Abs DOI arXiv Bib PDF

In-context learning (ICL) is an appealing approach for semantic parsing due to its few-shot nature and improved generalization. However, learning to parse to rare domain-specific languages (DSLs) from just a few demonstrations is challenging, limiting the performance of even the most capable LLMs.In this work, we show how pre-existing coding abilities of LLMs can be leveraged for semantic parsing by (1) using general-purpose programming languages such as Python instead of DSLs and (2) augmenting prompts with a structured domain description that includes, e.g., the available classes and functions. We show that both these changes significantly improve accuracy across three popular datasets; combined, they lead to dramatic improvements (e.g., 7.9% to 66.5% on SMCalFlow compositional split) and can substantially improve compositional generalization, nearly closing the performance gap between easier i.i.d. and harder compositional splits. Finally, comparisons across multiple PLs and DSL variations suggest that the similarity of a target language to general-purpose code is more important than prevalence in pretraining corpora. Our findings provide an improved methodology for building semantic parsers in the modern context of ICL with LLMs.
@inproceedings{bogin2023leveraging, title = {Leveraging Code to Improve In-Context Learning for Semantic Parsing}, author = {Bogin, Ben and Gupta, Shivanshu and Clark, Peter and Sabharwal, Ashish}, editor = {Duh, Kevin and Gomez, Helena and Bethard, Steven}, booktitle = {Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)}, month = jun, year = {2024}, address = {Mexico City, Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.naacl-long.279}, doi = {10.18653/v1/2024.naacl-long.279}, pages = {4971--5012}, comment = {Tyk-4Ss8FVUC} }

2023

EMNLP Findings
Coverage-based Example Selection for In-Context Learning

Shivanshu Gupta, Matt Gardner, and Sameer Singh

Dec 2023

Abs arXiv Bib PDF

In-context learning (ICL), the ability of large language models to perform novel tasks by conditioning on a prompt with a few task examples, requires these examples to be informative about the test instance. The standard approach of independently ranking and selecting the most similar examples selects redundant examples while omitting important information. In this work, we show that BERTScore-Recall (BSR) selects better examples that demonstrate more of the salient aspects, e.g. reasoning patterns, of the test input. We further extend BSR and many standard metrics to easily optimizable set-level metrics, giving still better coverage of those salient aspects. On 15 datasets spanning 6 tasks and with 7 diverse LLMs, we show that (1) BSR is the superior metric for in-context example selection across the board, and (2) for compositional tasks, set selection using Set-BSR outperforms independent ranking by up to 17 points on average and, despite being training-free, surpasses methods that leverage task or LLM-specific training.
@inproceedings{gupta-etal-2023-coverage, title = {Coverage-based Example Selection for In-Context Learning}, author = {Gupta, Shivanshu and Gardner, Matt and Singh, Sameer}, editor = {Bouamor, Houda and Pino, Juan and Bali, Kalika}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2023}, month = dec, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.findings-emnlp.930}, pages = {13924--13950}, comment = {UeHWp8X0CEIC} }
ACL Findings
Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages

Shivanshu Gupta, Yoshitomo Matsubara, Ankit Chadha, and 1 more author

Jul 2023

Abs arXiv Bib PDF Website

While impressive performance has been achieved on the task of Answer Sentence Selection (AS2) for English, the same does not hold for languages that lack large labeled datasets. In this work, we propose Cross-Lingual Knowledge Distillation (CLKD) from a strong English AS2 teacher as a method to train AS2 models for low-resource languages in the tasks without the need of labeled data for the target language. To evaluate our method, we introduce 1) Xtr-WikiQA, a translation-based WikiQA dataset for 9 additional languages, and 2) TyDi-AS2, a multilingual AS2 dataset with over 70K questions spanning 8 typologically diverse languages. We conduct extensive experiments on Xtr-WikiQA and TyDi-AS2 with multiple teachers, diverse monolingual and multilingual pretrained language models (PLMs) as students, and both monolingual and multilingual training. The results demonstrate that CLKD either outperforms or rivals even supervised fine-tuning with the same amount of labeled data and a combination of machine translation and the teacher model. Our method can potentially enable stronger AS2 models for low-resource languages, while TyDi-AS2 can serve as the largest multilingual AS2 dataset for further studies in the research community.
@inproceedings{gupta-etal-2023-cross, title = {Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages}, author = {Gupta, Shivanshu and Matsubara, Yoshitomo and Chadha, Ankit and Moschitti, Alessandro}, booktitle = {Findings of the Association for Computational Linguistics: ACL 2023}, month = jul, year = {2023}, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.findings-acl.885}, pages = {14078--14092}, comment = {IjCSPb-OGe4C} }

2022

EMNLP
Successive Prompting for Decomposing Complex Questions

Dheeru Dua, Shivanshu Gupta, Sameer Singh, and 1 more author

Dec 2022

Abs arXiv Bib PDF Code

Answering complex questions that require making latent decisions is a challenging task, especially when limited supervision is available. Recent works leverage the capabilities of large language models (LMs) to perform complex question answering in a few-shot setting by demonstrating how to output intermediate rationalizations while solving the complex question in a single pass. We introduce “Successive Prompting” where, we iteratively break down a complex task into a simple task, solve it, and then repeat the process until we get the final solution. Successive prompting decouples the supervision for decomposing complex questions from the supervision for answering simple questions, allowing us to (1) have multiple opportunities to query in-context examples at each reasoning step (2) learn question decomposition separately from question answering, including using synthetic data, and (3) use bespoke (fine-tuned) components for reasoning steps where a large LM does not perform well. The intermediate supervision is typically manually written, which can be expensive to collect. We introduce a way to generate synthetic dataset which can be used to bootstrap model’s ability to decompose and answer intermediate questions. Our best model (with successive prompting) achieves an improvement in F1 of ~5% when compared with a state-of-the-art model with synthetic augmentations and few-shot version of the DROP dataset.
@inproceedings{dua-etal-2022-successive, title = {Successive Prompting for Decomposing Complex Questions}, author = {Dua, Dheeru and Gupta, Shivanshu and Singh, Sameer and Gardner, Matt}, booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing}, month = dec, year = {2022}, address = {Abu Dhabi, United Arab Emirates}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.emnlp-main.81}, pages = {1251--1265}, comment = {2osOgNQ5qMEC} }
EMNLP Findings
Structurally Diverse Sampling for Sample-Efficient Training and Comprehensive Evaluation

Shivanshu Gupta, Sameer Singh, and Matt Gardner

Dec 2022

Abs arXiv Bib PDF Code

A growing body of research has demonstrated the inability of NLP models to generalize compositionally and has tried to alleviate it through specialized architectures, training schemes, and data augmentation, among other approaches. In this work, we study a different approach: training on instances with diverse structures. We propose a model-agnostic algorithm for subsampling such sets of instances from a labeled instance pool with structured outputs. Evaluating on both compositional template splits and traditional IID splits of 5 semantic parsing datasets of varying complexity, we show that structurally diverse training using our algorithm leads to comparable or better generalization than prior algorithms in 9 out of 10 dataset-split type pairs. In general, we find structural diversity to consistently improve sample efficiency compared to random train sets. Moreover, we show that structurally diverse sampling yields comprehensive test sets that are a lot more challenging than IID test sets. Finally, we provide two explanations for improved generalization from diverse train sets: 1) improved coverage of output substructures, and 2) a reduction in spurious correlations between these substructures.
@inproceedings{gupta-etal-2022-structurally, title = {Structurally Diverse Sampling for Sample-Efficient Training and Comprehensive Evaluation}, author = {Gupta, Shivanshu and Singh, Sameer and Gardner, Matt}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2022}, month = dec, year = {2022}, address = {Abu Dhabi, United Arab Emirates}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.findings-emnlp.365}, pages = {4966--4979}, comment = {qjMakFHDy7sC} }
EMNLP
Unobserved Local Structures Make Compositional Generalization Hard

Ben Bogin, Shivanshu Gupta, and Jonathan Berant

Dec 2022

Abs DOI arXiv Bib PDF Code

While recent work has convincingly showed that sequence-to-sequence models struggle to generalize to new compositions (termed compositional generalization), little is known on what makes compositional generalization hard on a particular test instance. In this work, we investigate what are the factors that make generalization to certain test instances challenging. We first substantiate that indeed some examples are more difficult than others by showing that different models consistently fail or succeed on the same test instances. Then, we propose a criterion for the difficulty of an example: a test instance is hard if it contains a local structure that was not observed at training time. We formulate a simple decision rule based on this criterion and empirically show it predicts instance-level generalization well across 5 different semantic parsing datasets, substantially better than alternative decision rules. Last, we show local structures can be leveraged for creating difficult adversarial compositional splits and also to improve compositional generalization under limited training budgets by strategically selecting examples for the training set.
@misc{https://doi.org/10.48550/arxiv.2201.05899, doi = {10.48550/ARXIV.2201.05899}, url = {https://arxiv.org/abs/2201.05899}, author = {Bogin, Ben and Gupta, Shivanshu and Berant, Jonathan}, keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {Unobserved Local Structures Make Compositional Generalization Hard}, publisher = {arXiv}, year = {2022}, copyright = {arXiv.org perpetual, non-exclusive license}, comment = {u-x6o8ySG0sC} }

2021

EMNLP
COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images

Ben Bogin, Shivanshu Gupta, Matt Gardner, and 1 more author

Nov 2021

Abs DOI arXiv Bib PDF Code Website

While interest in models that generalize at test time to new compositions has risen in recent years, benchmarks in the visually-grounded domain have thus far been restricted to synthetic images. In this work, we propose COVR, a new test-bed for visually-grounded compositional generalization with real images. To create COVR, we use real images annotated with scene graphs, and propose an almost fully automatic procedure for generating question-answer pairs along with a set of context images. COVR focuses on questions that require complex reasoning, including higher-order operations such as quantification and aggregation. Due to the automatic generation process, COVR facilitates the creation of compositional splits, where models at test time need to generalize to new concepts and compositions in a zero- or few-shot setting. We construct compositional splits using COVR and demonstrate a myriad of cases where state-of-the-art pre-trained language-and-vision models struggle to compositionally generalize.
@inproceedings{bogin-etal-2021-covr, title = {{COVR}: A Test-Bed for Visually Grounded Compositional Generalization with Real Images}, author = {Bogin, Ben and Gupta, Shivanshu and Gardner, Matt and Berant, Jonathan}, booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2021}, address = {Online and Punta Cana, Dominican Republic}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2021.emnlp-main.774}, doi = {10.18653/v1/2021.emnlp-main.774}, pages = {9824--9846}, comment = {d1gkVwhDpl0C} }