Generalizing Conversational Dense Retrieval via LLM-Cognition Data Augmentation (2024)


Haonan Chen1, Zhicheng Dou1, Kelong Mao1
Jiongnan Liu1, Ziliang Zhao1
1Gaoling School of Artificial Intelligence, Renmin University of China
{hnchen,dou}@ruc.edu.cn
Corresponding author.

Abstract

Conversational search utilizes multi-turn natural language contexts to retrieve relevant passages. Existing conversational dense retrieval models mostly view a conversation as a fixed sequence of questions and responses, overlooking the severe data sparsity problem – that is, users can perform a conversation in various ways. Consequently, they often struggle to generalize to diverse conversations in real-world scenarios. In this work, we propose a framework for generalizing Conversational dense retrieval via LLM-cognition data Augmentation (ConvAug). We first generate multi-level augmented conversations to capture the diverse nature of conversational contexts. Inspired by human cognition, we devise a cognition-aware prompting process to mitigate the generation of false positives, false negatives, and hallucinations. Moreover, we develop a difficulty-adaptive sample filter that selects challenging samples for complex conversations, thereby giving the model a larger learning space. A contrastive learning objective is then employed to train a better conversational context encoder. Extensive experiments conducted on four public datasets, under both normal and zero-shot settings, demonstrate the effectiveness, generalizability, and applicability of ConvAug. The code is released at https://github.com/haon-chen/ConvAug.

1 Introduction

Conversational search is anticipated to become the leading form of ad-hoc search engines in the future (Gao et al., 2023a). This approach, utilizing multi-turn natural language interactions, offers a user-friendly experience, particularly for complex information-seeking tasks.

There are two typical approaches for conversational search. One is conversational query rewriting (CQR) (Mo et al., 2023a; Wu et al., 2022). CQR models convert a conversational query into a de-contextualized search query suitable for ad-hoc retrieval. However, CQR models either perform poorly because they cannot be optimized by the downstream retrieval task (Mao et al., 2023c), or incur unacceptable search latency when using large language models (LLMs) during inference (Mao et al., 2023b). Another approach is to perform conversational dense retrieval (CDR) in an end-to-end manner. It typically uses the entire conversational context to train the context encoder within CDR models for passage retrieval. This approach has been demonstrated to be more effective than CQR models on the downstream retrieval task of conversational search (Jin et al., 2023; Mao et al., 2023c).

Existing CDR approaches typically utilize conversations as fixed multi-turn natural language texts to train the context encoder. However, in real-world scenarios, users can express conversations in various ways. Conversational search data often lack the diversity to support training for such variability due to the severe data sparsity issue. In other words, numerous alternative conversations with the same intent (or with similar expressions but different intents) as a specific data sample are unrecorded. As a result, CDR models trained on these limited and fixed data often struggle to adapt to diverse real-world conversations. Some works have tried to compensate for the deficiency of multi-turn texts. However, these efforts often rely on basic rule-based strategies (Zhu et al., 2021) or human annotations to augment conversations (Mao et al., 2022a). Furthermore, comprehending turn dependencies in multi-turn conversations poses a significant challenge for simple language models.

To tackle these problems, we propose an LLM-based data augmentation framework to mimic how users perform diverse conversations. Specifically, we design multi-level augmentation strategies to generate positive (similar intents but different expressions, denoted as $+$) and hard negative conversations (similar expressions but different intents, denoted as $-$): (1) Token level. To mitigate the model's overreliance on specific tokens, we randomly mask some tokens of conversations ($+$). Besides, we identify and replace the entities ($-$) to help the model focus on key information. (2) Turn level. To prevent the model from depending on specific turns or the order of turns within conversations, we mask ($+$) and reorder ($+$) turns to generate diverse conversations. We also generate a noisy turn ($+$) to enhance the model's denoising ability. To avoid generating false positives, we identify the turn dependency structure to guide the turn-level augmentations. (3) Conversation level. We paraphrase the conversation ($+$) to introduce linguistic variations. We also shift the intent of conversations to help the model detect subtle intent changes ($-$).

However, LLMs may generate false positives or negatives and are prone to producing texts with hallucinations (Li et al., 2023). To produce high-quality conversations, we propose a three-step prompting process inspired by human cognition. Initially, we prompt an LLM to form a comprehensive understanding of the conversation (e.g., its intent and theme) in the first step (Van Dijk et al., 1983). Subsequently, the LLM associates existing elements, such as expressions, intents, and entities, with new yet related ones (Collins and Loftus, 1975). Finally, the LLM concludes the final outputs based on the former outputs. These outputs are less prone to be false positives, false negatives, or hallucinations, as the LLM has a deeper understanding of the original conversation (Step 1) and the generated elements are associated based on existing ones (Step 2).

Subsequently, we employ contrastive learning to pull together augmented positive conversations and push them away from negative ones. Through this, we aim to train a more robust and generalized conversational context encoder, capable of accurately interpreting users' search intents across diverse conversations. To enhance the contrastive learning process, we go beyond basic random sampling methods (Zhu et al., 2021) and introduce a difficulty-adaptive sample filter to select more challenging augmented samples for more difficult conversations. We believe that complex conversations offer a larger learning space for the model. More challenging data can thus provide the model with richer information, enabling it to understand these complex conversations better.

Extensive experiments on four public datasets demonstrate that ConvAug can consistently improve the performance of various conversational dense retrievers across different complexity levels of conversational turns.

The contributions of our work are as follows:

(1) We propose ConvAug, an LLM-based multi-level data augmentation framework for conversational search. It comprehensively improves the effectiveness and generalizability of conversational retrievers.

(2) To obtain high-quality data, a cognition-aware prompting process is designed to prevent false positives/negatives and mitigate the hallucination problem of LLMs in conversation generation.

(3) We develop a difficulty-adaptive sample filter to select challenging samples for complex conversations to improve the model’s understanding of those with large learning spaces.

2 Related Work

Conversational search. CQR models usually utilize the context to rewrite the conversation into a standalone query (Lin et al., 2020; Qian and Dou, 2022; Mo et al., 2023a). Some researchers attempt to connect the downstream retrieval task to the rewriting task (Wu et al., 2022; Chen et al., 2022; Mao et al., 2023a). On the other hand, CDR models utilize the whole conversation to train a conversational context encoder. Some works train the CDR model in a few-shot manner (Yu et al., 2021; Mao et al., 2022b; Mo et al., 2024). Others design delicate denoising approaches for better CDR models (Mao et al., 2022a; Mo et al., 2023b; Mao et al., 2023c). However, none of these models focus on developing a context encoder that can smoothly comprehend diverse conversations.

Data augmentation for Information Retrieval. Because of the limited nature of relevance judgments, researchers in Information Retrieval (IR) (Zhu et al., 2023a; Mao et al., 2020; Huang et al., 2023; Lin et al., 2023) have resorted to data augmentation. Some use LLMs to generate queries from a document (Bonifacio et al., 2022) or documents from a query (Gao et al., 2023b; Mackie et al., 2023; Wang et al., 2023) in ad-hoc retrieval. For multi-turn ranking, some use basic rule-based approaches to generate variants of sequences for session search (Zhu et al., 2021), personalized search (Zhou et al., 2021), and product search (Dai et al., 2023). COTED (Mao et al., 2022a) generates conversations based on human-annotated necessary historical turns.

LLM for Information Retrieval. LLMs have been widely used in various modules of the IR pipeline (Zhu et al., 2023b), such as the retriever (Asai et al., 2023a), reranker (Ma et al., 2023), and reader (Asai et al., 2023b). In conversational search, some employ LLMs to aid the training (Ye et al., 2023; Cheng et al., 2024) and the inference (Mao et al., 2023b) stages of CQR. InstructoR (Jin et al., 2023) uses LLMs to generate pseudo passage labels to facilitate unsupervised CDR models. However, these models do not utilize LLMs to alter the contexts for a generalized context encoder.

3 Methodology: ConvAug

In this section, we present our two-stage framework ConvAug, as illustrated in Figure 1. In the first stage, we leverage an LLM to perform our data augmentation strategies tailored for conversational search. A three-step cognition-aware prompting process is developed to guide the LLM to generate multi-level augmented conversations. The second stage utilizes the augmented data to optimize the conversational context encoder. We propose to select more challenging samples for more complex conversations to facilitate model learning.

3.1 Problem Formulation

In this work, we focus on the conversational passage retrieval task. The context of a conversation is denoted as $C_n = \{q_1, r_1, \ldots, q_{n-1}, r_{n-1}, q_n\}$, where $q_i$ and $r_i$ are the query and response of the $i$-th turn ($t_i$) in $C_n$, and $q_n$ is the current query. Given $C_n$, our goal is to retrieve the relevant passage $d^+$ from the passage collection $\mathcal{D}$. For convenience, we omit the subscript $n$ in the rest of this paper.

3.2 LLM-enhanced Data Augmentation

Conversational search suffers from a severe data sparsity issue, i.e., varying expressions of recorded conversations are inadequate, leading to insufficient training of context encoders. As shown in Figure 2, we propose to mimic the diverse ways users might express conversations by developing data augmentation strategies. We propose both positive ($+$) and hard negative ($-$) generation strategies to produce conversations with similar ($C^+$) and different intents ($C^-$), respectively. Furthermore, the LLM-based generation is prompted by a three-step cognition-aware process to mitigate hallucinations and enhance the data quality.

[Figure 1: Overview of the two-stage ConvAug framework.]
[Figure 2: Multi-level conversation alterations and the cognition-aware prompting process.]

3.2.1 Multi-level Conversation Alteration

• Token-level alteration

Firstly, we propose to perform fine-grained token-level alterations on $C$ to help the model learn nuanced information.

Token Masking ($+$). To prevent the model from relying too heavily on specific tokens, we employ a rule-based strategy. A context is treated as a sequence of tokens: $C = \{w_1, w_2, \ldots, w_M\}$, where $M$ is the total number of tokens. We randomly mask a proportion $r_\text{w}$ of the tokens in $C$ with a special token "[token_mask]". By this, we aim to produce a similar context $C^+_\text{tom}$, as it only differs from $C$ in a few tokens.
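A minimal sketch of this rule-based masking (the ratio $r_\text{w}$, the seed, and the toy conversation are illustrative defaults, not our tuned settings):

```python
import random

def mask_tokens(tokens, r_w=0.15, mask_token="[token_mask]", seed=None):
    """Randomly mask a proportion r_w of the tokens in a flattened
    conversational context to build the positive variant C+_tom."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * r_w))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in masked_positions else tok
            for i, tok in enumerate(tokens)]

# Toy example: a conversation flattened into a token sequence.
context = "how do i book a train ticket to paris and what does it cost".split()
print(mask_tokens(context, r_w=0.2, seed=0))
```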

Entity Replacing ($-$). In real-world scenarios, the same conversational flow can occur with different entities. We use the LLM to identify and replace entities in $C$ to generate $C^-_\text{ent}$, which is contextually similar to $C$ but differs in critical details. By contrasting it with other $C^+$, the model learns to pay closer attention to the key information in the context rather than superficial aspects.

• Dependency-aware turn-level alteration

Secondly, we propose more coarse-grained alterations at the turn level. As shown in Figure 2, the understanding of $t_2 = (q_2, r_2)$ and $t_3 = (q_3)$ both depend on $t_1$, since they both need the information "train". Therefore, the dependencies within conversations are important if we want to alter them without changing their search intents, i.e., avoiding producing false positives. Utilizing the ability of LLMs, we can automatically identify the necessary historical turns of $t_i$. After performing this sequentially on all turns of $C$, we can construct a turn dependency Directed Acyclic Graph (DAG) $\mathcal{G}$, as shown in the right part of Figure 2.

Turn Masking ($+$). For all historical turns $T_\text{h} = \{t_1, t_2, \ldots, t_{n-1}\}$ of $C$, we mask a proportion $r_\text{t}$ of the turns with a special token "[turn_mask]" to generate $C^+_\text{tum}$. With this, ConvAug is forced not to rely on specific turns and obtains a more robust understanding of the whole conversation. To maintain the dependency structure of $C$, we only mask turns that are not ancestors of $t$.

Turn Reordering ($+$). We select a pair of historical turns $(t_i, t_j)$ in $T_\text{h}$ and swap their positions to produce $C^+_\text{reo}$. We only choose turns whose swap leaves the topological ordering of $\mathcal{G}$ unchanged. Through this restriction, $C^+_\text{reo}$ has a different order of expression while maintaining the same logical chain as $C$. This process challenges the model to focus more on the content of each turn rather than just the order.
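A minimal sketch of the two dependency-aware alterations, assuming $\mathcal{G}$ is stored as a mapping from each turn index to the indices of its necessary historical turns (the ratio $r_\text{t}$ here is an illustrative default):

```python
import random

def ancestors(dag, node):
    """All ancestors of `node` in a turn-dependency DAG given as
    {turn_index: [indices of its necessary historical turns]}."""
    seen, stack = set(), list(dag.get(node, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(dag.get(parent, []))
    return seen

def mask_turns(turns, dag, current, r_t=0.3, mask_token="[turn_mask]", seed=0):
    """Turn Masking (+): mask a proportion r_t of the turns while protecting
    the current turn and all of its ancestors (yields C+_tum)."""
    rng = random.Random(seed)
    protected = ancestors(dag, current) | {current}
    candidates = [i for i in range(len(turns)) if i not in protected]
    n_mask = min(len(candidates), max(1, int(len(turns) * r_t)))
    masked = set(rng.sample(candidates, n_mask)) if candidates else set()
    return [mask_token if i in masked else turn for i, turn in enumerate(turns)]

def valid_swap(order, dag, i, j):
    """Turn Reordering (+): a swap of positions i and j is kept only if every
    dependency edge still points forward, i.e. the new sequence remains a
    valid topological ordering of the DAG."""
    new_order = list(order)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    pos = {turn: p for p, turn in enumerate(new_order)}
    return all(pos[parent] < pos[child]
               for child, parents in dag.items() for parent in parents)
```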

Inserting Noisy Turn ($+$). Conversations are often interrupted by unrelated interjections. Corrupting the current context can help the model handle such conversational dynamics. We extend the existing context with one additional noisy turn $t_\text{noi}$ and randomly insert it into $T_\text{h}$. Since we prompt the LLM to generate a turn that is relevant to the main background of $C$ but introduces a slightly divergent element, the generated turn can be inserted into any position in $T_\text{h}$ to produce $C^+_\text{noi}$ without disrupting the dependency structure.

• Conversation-level alteration

Finally, we apply higher-level changes to the whole conversation.

Paraphrasing ($+$). To mimic users' various expressions of similar intents, we use the LLM to expand linguistic diversity by paraphrasing the whole $C$ to produce $C^+_\text{para}$. This helps reduce the model's tendency to overfit specific phrasings or patterns of $C$, which enhances its ability to generalize to unseen conversations.

Intent Shifting ($-$). The intent behind a dialogue can shift subtly without significant changes in the expression of the conversation. Therefore, we utilize the LLM to produce intent-shifted conversations $C^-_\text{int}$. By contrasting them with $C^+$, our model is trained to detect and adapt to subtle intent shifts in real conversations.

3.2.2 Cognition-aware Prompting Process

To enhance data quality, we propose a three-step prompting method inspired by human cognition theory, comprising Comprehension Synthesis (Step 1), Associative Expansion (Step 2), and Conclusion (Step 3). As shown in Figure 2, we take the paraphrasing strategy as an example for illustration:

Step 1: Comprehension Synthesis. When we have a conversation, our brains initially construct a comprehensive representation of the text (Van Dijk et al., 1983). This step allows the LLM to build a comprehensive understanding of the whole conversation. Specifically, we prompt the LLM using "Step 1: Comprehension Synthesis: [Identify key themes and intents of the conversation]". Understanding these core aspects prevents the LLM from generating a $C^+_\text{para}$ that deviates from the theme and search intents (a false positive).

Step 2: Associative Expansion. The human mind often uses spreading activation in semantic networks, where one concept triggers related concepts (Collins and Loftus, 1975). Inspired by this theory, the prompt we give the LLM is "Step 2: Associative Expansion: [Generate alternative expressions based on existing ones]". This step serves as an intermediate process that leverages the LLM's creativity to think of novel elements while preventing it from hallucinating unrelated ones.

Step 3: Conclusion. In the final step, we prompt the LLM as: "Step 3: Conclusion: [Paraphrase the conversation based on outputs of last two steps]". In our example, the output is a paraphrased conversation that maintains $C$'s search intent (Step 1) while introducing new but related (Step 2) expressions, avoiding false positives and hallucinations.

We manually write several demonstrations for each step to prompt an LLM to perform in-context generation. The complete prompts are in Appendix C.
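For illustration only (the complete prompts and demonstrations are in Appendix C), the three steps can be laid out in a prompt skeleton like the following; everything outside the bracketed step instructions is simplified:

```python
# Simplified three-step prompt skeleton for the paraphrasing strategy.
COGNITION_PROMPT = """{demonstrations}

Conversation:
{conversation}

Step 1: Comprehension Synthesis: [Identify key themes and intents of the conversation]
Step 2: Associative Expansion: [Generate alternative expressions based on existing ones]
Step 3: Conclusion: [Paraphrase the conversation based on outputs of last two steps]
"""

def build_paraphrase_prompt(conversation, demonstrations):
    """Fill the skeleton with manually written demonstrations and the target conversation."""
    return COGNITION_PROMPT.format(demonstrations=demonstrations,
                                   conversation=conversation)
```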

3.3 Training Conversational Context Encoder

Through our proposed data augmentation strategies, we can generate a set of positive samples $\mathcal{C}^+ = \{C^+_\text{tom}, C^+_\text{tum}, C^+_\text{reo}, C^+_\text{noi}, C^+_\text{para}\}$ and hard negative samples $\mathcal{C}^- = \{C^-_\text{ent}, C^-_\text{int}\}$ for each original conversation $C$ in the dataset. Then, to enhance model learning, we develop a difficulty-adaptive sample filter to keep samples of matching difficulty for the original conversations. Finally, we train the conversational context encoder on these augmented samples with multi-task contrastive learning.

3.3.1 Difficulty-adaptive Sample Filter

Considering that simple augmentations for a complex $C$ may result in underfitting, and complex augmentations for a simple $C$ can cause overfitting, we develop a difficulty-adaptive sample filter. It selects difficult samples for difficult conversations to enhance the training process.

Specifically, the difficulty of an original conversation is defined as $\text{Diff}(C) = |T_\text{h}| + \left(|\text{Topic}(C)| \cdot \overline{\text{PPL}(C)}\right)$, where $|T_\text{h}|$ denotes the number of historical turns, $|\text{Topic}(C)|$ is the number of topics, and $\overline{\text{PPL}(C)}$ denotes the average perplexity of $C$. The detailed calculation of these components can be found in Appendix D. To give the diversity of topics and the linguistic challenges more emphasis, we compute $|\text{Topic}(C)| \cdot \overline{\text{PPL}(C)}$ and use $|T_\text{h}|$ as an indicator of the rich information within long conversations.
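This definition can be read directly as code (how $|\text{Topic}(C)|$ and the perplexities are computed is detailed in Appendix D; here they are simply passed in as numbers):

```python
def conversation_difficulty(num_history_turns, num_topics, avg_perplexity):
    """Diff(C) = |T_h| + |Topic(C)| * mean PPL(C)."""
    return num_history_turns + num_topics * avg_perplexity

# Toy example: 5 historical turns, 2 topics, average perplexity 30.
print(conversation_difficulty(5, 2, 30.0))  # 65.0
```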

[Figure 3]

For the difficulty of the augmented conversations, we first obtain paired positive samples: $\mathcal{P}_{\mathcal{C}^+} = \{(C_i^+, C_j^+) \mid C_i^+, C_j^+ \in \mathcal{C}^+, i \neq j\}$. We then use a sentence-transformers model to compute the similarity of each pair; the difficulty is defined as $\text{Diff}^+(C_i^+, C_j^+) = 1 - \text{BERTSim}(C_i^+, C_j^+)$, where $\text{BERTSim}(\cdot)$ is the cosine similarity of the encoded conversations.

For the diversity of the used augmented samples, we divide all training conversations into $|\mathcal{P}_{\mathcal{C}^+}|$ buckets based on $\text{Diff}(C)$. We then filter and select one positive pair with matching $\text{Diff}^+(C_i^+, C_j^+)$ for each conversation. As for hard negatives, we pair each negative with the selected positive samples: $\text{Diff}^-(C_h^-) = \left(\text{BERTSim}(C_i^+, C_h^-) + \text{BERTSim}(C_j^+, C_h^-)\right)/2$. We select $k$ hard negatives with higher $\text{Diff}^-(C_h^-)$ for difficult $C$.
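The core similarity computations can be sketched with the sentence-transformers library as follows (the embedding model and the bucket-to-pair mapping are simplified for illustration):

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

# The model name is an illustrative choice; any sentence-embedding model works.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def pair_difficulty(conv_i, conv_j):
    """Diff+(Ci+, Cj+) = 1 - cosine similarity of the two encoded conversations."""
    emb = encoder.encode([conv_i, conv_j], convert_to_tensor=True)
    return 1.0 - util.cos_sim(emb[0], emb[1]).item()

def select_positive_pair(positives, target_bucket, num_buckets):
    """Pick the positive pair whose difficulty rank matches the conversation's
    difficulty bucket (a simplified stand-in for the bucketing described above)."""
    pairs = sorted(combinations(positives, 2), key=lambda p: pair_difficulty(*p))
    idx = min(len(pairs) - 1, target_bucket * len(pairs) // num_buckets)
    return pairs[idx]

def hard_negative_difficulty(neg, pos_i, pos_j):
    """Diff-(Ch-) = mean similarity between the negative and the selected positives."""
    emb = encoder.encode([neg, pos_i, pos_j], convert_to_tensor=True)
    return (util.cos_sim(emb[0], emb[1]).item() +
            util.cos_sim(emb[0], emb[2]).item()) / 2
```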

3.3.2 Multi-task Contrastive Learning

For the ranking task, we apply a standard ranking loss based on contrastive learning of passages:

$$\mathcal{L}_{\text{rank}} = -\log\frac{e^{\,\mathbf{C}\cdot\mathbf{d}^{+}}}{e^{\,\mathbf{C}\cdot\mathbf{d}^{+}} + \sum_{d^{-}\in\mathcal{D}} e^{\,\mathbf{C}\cdot\mathbf{d}^{-}}}, \qquad (1)$$

where $\mathbf{C} = \text{CCE}(s)$ denotes $C$ encoded by the conversational context encoder, and $s = \text{[CLS]} \circ q_n \circ r_{n-1} \circ \ldots \circ r_1 \circ q_1 \circ \text{[SEP]}$ is the concatenated sequence of $C$. $\mathbf{d}^+$ and $\mathbf{d}^-$ are encoded by the frozen passage encoder, $\mathbf{d} = \text{PE}(d)$.
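A PyTorch sketch of Eq. (1) with dot-product scoring (tensor shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def ranking_loss(ctx_emb, pos_doc_emb, neg_doc_embs):
    """L_rank as cross entropy over the scores of [d+, d1-, ..., dK-].

    ctx_emb:      (B, H) conversation representations from the context encoder
    pos_doc_emb:  (B, H) relevant-passage embeddings (frozen passage encoder)
    neg_doc_embs: (B, K, H) negative-passage embeddings
    """
    pos_scores = (ctx_emb * pos_doc_emb).sum(-1, keepdim=True)        # (B, 1)
    neg_scores = torch.einsum("bh,bkh->bk", ctx_emb, neg_doc_embs)    # (B, K)
    logits = torch.cat([pos_scores, neg_scores], dim=-1)              # (B, 1+K)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```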

Suppose a minibatch contains $N$ conversations. We use our difficulty-adaptive sample filter to select two positive samples for each $C$ to form $\{\mathcal{X}\}$ comprising $2N$ sequences. The two sequences derived from the same $C$ are considered a similar pair, whereas the remaining $2(N-1)$ serve as in-batch negative samples $\{\mathcal{X}\}^-$. Besides, we select $k$ hard negative samples for each $C$ to form $\{\mathcal{H}\}$ comprising $kN$ sequences. The contrastive learning loss for a positive pair $(C_i^+, C_j^+)$ and negatives $C^- \in \{\{\mathcal{X}\}^- \cup \mathcal{H}\}$ of $C$ is formulated as follows:

$$\mathcal{L}_{\text{CL}}(i,j) = -\log\frac{\phi(\mathbf{C}_{i}^{+},\mathbf{C}_{j}^{+})}{\phi(\mathbf{C}_{i}^{+},\mathbf{C}_{j}^{+}) + \sum\phi(\mathbf{C}_{i}^{+},\mathbf{C}^{-})}, \qquad (2)$$

where $\phi(\cdot) = \exp(\cos(\cdot)/\tau)$, $\cos(\cdot)$ is the cosine similarity, and $\tau$ is a temperature hyperparameter.

We optimize these two tasks together as $\mathcal{L} = \mathcal{L}_{\text{rank}} + \alpha\mathcal{L}_{\text{CL}}$, where $\alpha$ balances the two losses.
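A corresponding sketch of Eq. (2) and the combined objective for a single anchor (the temperature and $\alpha$ values shown are illustrative defaults, not our tuned settings):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pos_i, pos_j, negatives, tau=0.05):
    """L_CL for one anchor: with phi(x, y) = exp(cos(x, y) / tau), the loss is
    cross entropy over the temperature-scaled cosine similarities.

    pos_i, pos_j: (H,) embeddings of the two selected positive variants of C
    negatives:    (M, H) in-batch negatives plus the k filtered hard negatives
    """
    sim_pos = F.cosine_similarity(pos_i, pos_j, dim=0) / tau                    # scalar
    sim_neg = F.cosine_similarity(pos_i.unsqueeze(0), negatives, dim=-1) / tau  # (M,)
    logits = torch.cat([sim_pos.view(1), sim_neg]).unsqueeze(0)                 # (1, 1+M)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

def total_loss(l_rank, l_cl, alpha=0.1):
    """L = L_rank + alpha * L_CL, with alpha balancing the two objectives."""
    return l_rank + alpha * l_cl
```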

4 Experiments

Table 1: Normal evaluation results on QReCC and TopiOCQA (underlined values in the original mark the second-best result).

| Category | Model | QReCC MRR | QReCC NDCG@3 | QReCC Recall@10 | TopiOCQA MRR | TopiOCQA NDCG@3 | TopiOCQA Recall@10 |
|---|---|---|---|---|---|---|---|
| CQR Models | T5QR | 34.5 | 31.8 | 53.1 | 23.0 | 22.2 | 37.6 |
| CQR Models | ConQRR | 41.8 | - | 65.1 | - | - | - |
| CQR Models | ConvGQR | 42.0 | 41.0 | 64.4 | 25.6 | 24.3 | 41.8 |
| CQR Models | ED | 49.4 | - | 67.0 | - | - | - |
| CDR Models | ConvDR | 38.5 | 35.7 | 58.2 | 27.2 | 26.4 | 43.5 |
| CDR Models | InstructoR-ANCE | 43.5 | 40.5 | 66.7 | 25.3 | 23.7 | 45.1 |
| CDR Models | Conv-ANCE | 49.0 | 46.6 | 71.4 | 30.4 | 28.5 | 52.6 |
| CDR Models | Conv-SPLADE | 50.0 | 46.6 | 69.9 | 30.7 | 29.5 | 52.1 |
| CDR Models | LeCoRE | 51.1 | 48.5 | 73.9 | 32.0 | 31.4 | 54.3 |
| CDR Models | ConvAug (Ours) | 52.7 | 50.4 | 75.6 | 35.0 | 33.3 | 57.9 |

4.1 Datasets and Metrics

We evaluate our model in both normal and zero-shot settings. Following previous CDR works (Mao et al., 2023c; Jin et al., 2023), we train ConvAug on QReCC (Anantha et al., 2021) and TopiOCQA (Adlakha et al., 2022). Additionally, we test ConvAug trained on QReCC in a zero-shot setting on CAsT-20 (Dalton et al., 2020) and CAsT-21 (Dalton et al., 2021). We omit the CAsT-19 dataset since it is less challenging and realistic compared to CAsT-20 and CAsT-21 (Mao et al., 2023b). More details are in Appendix A.

Following previous works (Ye et al., 2023), we use popular metrics for normal evaluation: MRR, NDCG@3, and Recall@10. For the zero-shot setting, we use the metrics suggested by CAsT (Dalton et al., 2021): MRR and NDCG@3. All significance tests are paired t-tests at the $p < 0.05$ level with Bonferroni correction.
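For reference, the per-query significance test can be sketched with SciPy as follows (the helper and its arguments are illustrative):

```python
from scipy import stats

def paired_significance(scores_ours, scores_baseline, num_comparisons, alpha=0.05):
    """Paired t-test over per-query metric values (e.g., NDCG@3) with a
    Bonferroni-corrected significance threshold."""
    t_stat, p_value = stats.ttest_rel(scores_ours, scores_baseline)
    return p_value < alpha / num_comparisons, p_value
```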

4.2 Implementation Details

We adopt ANCE (Xiong et al., 2021) as the base model of ConvAug. For the large language model, we use Llama 2-Chat (7B) (Touvron et al., 2023) to perform our data augmentation tasks. We use $k=1$ augmented negative conversation as the hard negative. More details about training and hyperparameters are in our code and Appendix B.

4.3 Baselines

We compare ConvAug with two kinds of models:

Conversational query rewriters. • T5QR (Lin et al., 2020) trains the rewriter with human rewrites. • ConQRR (Wu et al., 2022) employs reinforcement learning to train CQR models. • ConvGQR (Mo et al., 2023a) reformulates better conversational queries by relating to the retrieval task. • ED (Ye et al., 2023) distills the rewriting capabilities of LLMs into smaller models. Note that we do not compare with methods that use black-box LLMs (e.g., ChatGPT) during inference (Mao et al., 2023b), since these models require significant resources and time to invoke the API numerous times at inference.

Conversational dense retrievers. • ConvDR (Yu et al., 2021) distills knowledge for few-shot learning. • Conv-ANCE (Lin et al., 2020) & Conv-SPLADE (Formal et al., 2021) are ANCE and SPLADE fine-tuned on the training conversations with only the training loss. • LeCoRE (Mao et al., 2023c) extends SPLADE by generating denoised and interpretable lexical session representations. • InstructoR (Jin et al., 2023) employs LLMs to estimate the session-passage relevance score to guide the retriever's training. We use the "ANCE + InstructoR$_{\text{QRPG}}$" version for a fair comparison with ConvAug.

4.4 Overall Results

4.4.1 Normal Evaluation

We conduct the normal evaluation on QReCC and TopiOCQA, and the results are presented in Table 1. We can make the following observations: (1) ConvAug outperforms all baseline models significantly on both datasets. This demonstrates the effectiveness of our LLM-enhanced data augmentation and context encoder optimization. Furthermore, although it is based on ANCE, whose performance is comparable to SPLADE, ConvAug still achieves superior performance to the SPLADE-based model LeCoRE. This further indicates that our approach trains a more robust and generalized context encoder. (2) CDR models generally outperform CQR models. We can observe that even the simply fine-tuned Conv-ANCE still outperforms the LLM-aided CQR model ED. This indicates the importance of the ranking signal and the effectiveness of our multi-task learning approach.

4.4.2 Zero-shot Evaluation

Table 2: Zero-shot evaluation results on CAsT-20 and CAsT-21 (underlined values in the original mark the second-best result).

| Model | CAsT-20 MRR | CAsT-20 NDCG@3 | CAsT-21 MRR | CAsT-21 NDCG@3 |
|---|---|---|---|---|
| InstructoR-ANCE | 43.7 | 29.6 | 53.0 | 34.9 |
| Conv-ANCE | 42.2 | 27.7 | 52.3 | 34.2 |
| Conv-SPLADE | 36.9 | 28.1 | 47.9 | 29.9 |
| LeCoRE | 37.7 | 29.0 | 50.8 | 32.3 |
| ConvAug (Ours) | 45.0 | 30.7 | 54.8 | 36.8 |

We also evaluate our model's generalization ability by conducting a zero-shot test of CDR models trained on QReCC on two challenging datasets, CAsT-20 and CAsT-21. From the results in Table 2, we can make the following observations: (1) ConvAug consistently outperforms all CDR baseline models in terms of both metrics on both datasets. Specifically, ConvAug maintains its superiority over ANCE-based CDR models (Conv-ANCE and InstructoR-ANCE), which further demonstrates the generalization ability of ConvAug. (2) The unsupervised model InstructoR-ANCE achieves the second-best performance in the zero-shot setting. For example, it achieves an MRR of 43.7 on CAsT-20. However, its performance is poor in the normal setting. This indicates that this unsupervised approach might not align well with labeled tasks, but it can be effectively applied to unseen datasets.

4.5 Ablation Study

Table 3: Ablation study of ConvAug on QReCC.

| Model | MRR | NDCG@3 |
|---|---|---|
| ConvAug (Full) | 52.7 | 50.4 |
| w/o. Token Masking ($C^+_\text{tom}$) | 51.2 | 48.9 |
| w/o. Turn Masking ($C^+_\text{tum}$) | 51.9 | 49.6 |
| w/o. Turn Reordering ($C^+_\text{reo}$) | 52.0 | 49.5 |
| w/o. Noisy Turn ($C^+_\text{noi}$) | 52.3 | 49.9 |
| w/o. Dependency-aware | 52.0 | 49.6 |
| w/o. Paraphrasing ($C^+_\text{para}$) | 52.1 | 49.8 |
| w/o. Entity Replacing ($C^-_\text{ent}$) | 50.8 | 48.5 |
| w/o. Intent Shifting ($C^-_\text{int}$) | 52.4 | 50.0 |
| w/o. Cognition-aware | 51.1 | 49.0 |
| w/o. Filter (rand) | 51.7 | 49.5 |
| w/o. Filter (easy) | 51.6 | 49.3 |

To evaluate the effectiveness of each component, we conduct ablation studies on ConvAug:

Data augmentation strategies. We first conduct ablation experiments on our multi-level data augmentation strategies. As shown in Table 3, the performance of ConvAug drops significantly after discarding each kind of alteration. Specifically, the performance drops the most when we discard Entity Replacing ($C^-_\text{ent}$). This demonstrates that teaching the model to pay more attention to the key information in conversations is effective for understanding search intents. Additionally, we find that ConvAug's performance decreases if we do not mask or reorder turns based on the dependency graph constructed by the LLM. All these results demonstrate the effectiveness of our designed data augmentation strategies.

Cognition-aware prompting process. We also replace the three-step prompting process with a naive prompt template (Appendix C.2) and train "ConvAug w/o. Cognition-aware" on the data generated by this prompt. The performance of ConvAug decreases by about 3% in terms of MRR when we replace the cognition-aware prompting process. This demonstrates that our cognition-aware prompting process produces data of higher quality.

Difficulty-adaptive sample filter. We replace our filter with a random selector (ConvAug w/o. Filter (rand)) and with one that selects easy samples for difficult conversations (ConvAug w/o. Filter (easy)). The decrease in ConvAug's performance demonstrates that selecting challenging augmented samples for difficult conversations helps the model understand them better. Specifically, ConvAug's performance decreases if we assign easy samples to difficult conversations (even more than with random selection). This further demonstrates that ConvAug underfits if we do not give harder conversations enough learning space.

4.6 Performance on Different Turns

[Figure 4: Turn-level comparison of ConvAug, LeCoRE, and Conv-ANCE on TopiOCQA and CAsT-21.]

To investigate the performance of ConvAug at a more fine-grained level, we compare it with LeCoRE and Conv-ANCE at the turn level on the TopiOCQA (normal) and CAsT-21 (zero-shot) datasets. The results, shown in Figure 4, indicate that ConvAug surpasses both baselines in the majority of turns, underscoring its effectiveness and generalizability. Specifically, ConvAug shows more significant improvements in later conversation turns (e.g., from the 2nd to the 15th turn on TopiOCQA and the 3rd to the 11th turn on CAsT-21). This is because longer conversations often contain more diverse information, and our augmented data helps ConvAug generalize to these complex conversations. Besides, our difficulty-adaptive sample filter challenges ConvAug to learn more about complex conversations.

4.7 Influence of Augmented Hard Negatives

Table 4: Performance of ConvAug with different numbers of augmented hard negatives $k$.

| $k$ | QReCC MRR | QReCC NDCG@3 | CAsT-21 MRR | CAsT-21 NDCG@3 |
|---|---|---|---|---|
| $k=0$ | 50.8 | 48.4 | 53.3 | 35.3 |
| $k=1$ | 52.7 | 50.4 | 54.8 | 36.8 |
| $k=2$ | 51.5 | 49.0 | 50.8 | 34.3 |

We use $k$ generated hard negative contexts to facilitate the training of ConvAug's context encoder. The performance of ConvAug with different values of $k$ is shown in Table 4. We observe that ConvAug performs best on QReCC with $k=1$ hard negative. We believe there is a trade-off: the lack of hard negatives limits the model's ability to benefit from challenging comparisons, leading to less robust feature representations, whereas incorporating multiple hard negatives may introduce noise or ambiguity, potentially corrupting the learning process. Besides, we observe that ConvAug with $k=0$ outperforms $k=2$ in the zero-shot setting, although not in the normal evaluation. This further suggests that too many hard negative samples introduce noise and harm the model's generalizability.

4.8 Application to Other Base Retrievers

Table 5: Applying ConvAug to other base retrievers on QReCC.

| Model | MRR | NDCG@3 |
|---|---|---|
| Conv-SPLADE | 50.0 | 46.6 |
| Conv-SPLADE + ConvAug | 52.4 | 49.8 |
| LeCoRE | 51.1 | 48.5 |
| LeCoRE + ConvAug | 53.1 | 50.7 |

We use ANCE as the base model of ConvAug since it is a popular dense retriever that has served as the base model of many CDR models. However, our training framework can be easily applied to other CDR models. We choose Conv-SPLADE and LeCoRE as base models and apply our approach to them. From the results shown in Table 5, we observe that our method brings significant improvements across different base CDR models (even sparse retrievers). This demonstrates the broad applicability of our approach.

5 Conclusion

In this work, we present ConvAug to augment conversational search data with LLMs. We design a three-step cognition-aware prompting process to generate multi-level augmented conversations. We also develop a difficulty-adaptive sample filter to assign challenging samples to difficult conversations for a larger learning space. A contrastive learning objective is employed to train a generalized conversational context encoder. Extensive experiments on four public datasets under both normal and zero-shot settings validate the effectiveness, generalization ability, and applicability of ConvAug.

Limitations

For future studies, our work has the following limitations that we plan to address:

  1. The equation we developed to assess the complexity of conversations is relatively basic. We plan to design a more sophisticated combination of our three components in the future.

  2. We use an LLM to augment the training conversations in the pre-processing stage. Although the inference time remains the same as the base retrievers, the augmentation process takes quite a long time because of the amount of data we need to generate (millions of conversations) and the limited computational resources (4 NVIDIA A100 GPUs).

  3. We only conduct experiments using one LLM, Llama 2 (7B), due to the cost of augmenting such a large amount of data. The performance of other LLMs will be investigated in the future.

  4. There is also a potential risk involved. Since we are using LLMs to generate conversations, the original data should not contain sensitive or private information that may cause LLMs to produce risky texts.

Acknowledgement

This work was supported by the National Natural Science Foundation of China No. 62272467, the fund for building world-class universities (disciplines) of Renmin University of China, and Public Computing Cloud, Renmin University of China. The work was partially done at the Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE.

References

  • Adlakha etal. (2022)Vaibhav Adlakha, Shehzaad Dhuliawala, Kaheer Suleman, Harm deVries, and Siva Reddy. 2022.Topiocqa: Open-domain conversational question answering with topic switching.Trans. Assoc. Comput. Linguistics, 10:468–483.
  • Anantha etal. (2021)Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. 2021.Open-domain question answering goes conversational via question rewriting.In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 520–534. Association for Computational Linguistics.
  • Asai etal. (2023a)Akari Asai, Timo Schick, Patrick S.H. Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, and Wen-tau Yih. 2023a.Task-aware retrieval with instructions.In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 3650–3675. Association for Computational Linguistics.
  • Asai etal. (2023b)Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023b.Self-rag: Learning to retrieve, generate, and critique through self-reflection.CoRR, abs/2310.11511.
  • Bonifacio etal. (2022)LuizHenrique Bonifacio, HugoQueiroz Abonizio, Marzieh Fadaee, and RodrigoFrassetto Nogueira. 2022.Inpars: Data augmentation for information retrieval using large language models.CoRR, abs/2202.05144.
  • Chen etal. (2022)Zhiyu Chen, Jie Zhao, Anjie Fang, Besnik Fetahu, Oleg Rokhlenko, and Shervin Malmasi. 2022.Reinforced question rewriting for conversational question answering.In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: EMNLP 2022 - Industry Track, Abu Dhabi, UAE, December 7 - 11, 2022, pages 357–370. Association for Computational Linguistics.
  • Cheng etal. (2024)Yiruo Cheng, Kelong Mao, and Zhicheng Dou. 2024.Interpreting conversational dense retrieval by rewriting-enhanced inversion of session embedding.CoRR, abs/2402.12774.
  • Choi etal. (2018)Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018.Quac: Question answering in context.In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2174–2184. Association for Computational Linguistics.
  • Collins and Loftus (1975)AllanM Collins and ElizabethF Loftus. 1975.A spreading-activation theory of semantic processing.Psychological review, 82(6):407.
  • Dai etal. (2023)Shitong Dai, Jiongnan Liu, Zhicheng Dou, Haonan Wang, Lin Liu, Bo Long, and Ji-Rong Wen. 2023.Contrastive learning for user sequence representation in personalized product search.In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, pages 380–389. ACM.
  • Dalton etal. (2020)Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2020.Cast 2020: The conversational assistance track overview.In Proceedings of the Twenty-Ninth Text REtrieval Conference, TREC 2020, Virtual Event [Gaithersburg, Maryland, USA], November 16-20, 2020, volume 1266 of NIST Special Publication. National Institute of Standards and Technology (NIST).
  • Dalton etal. (2021)Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2021.TREC cast 2021: The conversational assistance track overview.In Proceedings of the Thirtieth Text REtrieval Conference, TREC 2021, online, November 15-19, 2021, volume 500-335 of NIST Special Publication. National Institute of Standards and Technology (NIST).
  • Formal etal. (2021)Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021.SPLADE: sparse lexical and expansion model for first stage ranking.In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 2288–2292. ACM.
  • Gao etal. (2023a)Jianfeng Gao, Chenyan Xiong, Paul Bennett, and Nick Craswell. 2023a.Neural Approaches to Conversational Information Retrieval, volume44 of The Information Retrieval Series.Springer.
  • Gao etal. (2023b)Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023b.Precise zero-shot dense retrieval without relevance labels.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 1762–1777. Association for Computational Linguistics.
  • Huang etal. (2023)Chao-Wei Huang, Chen-Yu Hsu, Tsu-Yuan Hsu, Chen-An Li, and Yun-Nung Chen. 2023.CONVERSER: few-shot conversational dense retrieval with synthetic data generation.In Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL 2023, Prague, Czechia, September 11 - 15, 2023, pages 381–387. Association for Computational Linguistics.
  • Jin etal. (2023)Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. 2023.Instructor: Instructing unsupervised conversational dense retrieval with large language models.In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 6649–6675. Association for Computational Linguistics.
  • Kwiatkowski etal. (2019)Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, AnkurP. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, AndrewM. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019.Natural questions: a benchmark for question answering research.Trans. Assoc. Comput. Linguistics, 7:452–466.
  • Li etal. (2023)Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023.Halueval: A large-scale hallucination evaluation benchmark for large language models.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 6449–6464. Association for Computational Linguistics.
  • Lin etal. (2023)Sheng-Chieh Lin, Akari Asai, Minghan Li, Barlas Oguz, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, and Xilun Chen. 2023.How to train your dragon: Diverse augmentation towards generalizable dense retrieval.In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 6385–6400. Association for Computational Linguistics.
  • Lin etal. (2020)Sheng-Chieh Lin, Jheng-Hong Yang, RodrigoFrassetto Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, and Jimmy Lin. 2020.Conversational question reformulation via sequence-to-sequence architectures and pretrained language models.CoRR, abs/2004.01909.
  • Ma etal. (2023)Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2023.Fine-tuning llama for multi-stage text retrieval.CoRR, abs/2310.08319.
  • Mackie etal. (2023)Iain Mackie, Shubham Chatterjee, and Jeffrey Dalton. 2023.Generative relevance feedback with large language models.In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023, pages 2026–2031. ACM.
  • Mao etal. (2023a)Kelong Mao, Zhicheng Dou, Bang Liu, Hongjin Qian, Fengran Mo, Xiangli Wu, Xiaohua Cheng, and Zhao Cao. 2023a.Search-oriented conversational query editing.In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 4160–4172. Association for Computational Linguistics.
  • Mao etal. (2023b)Kelong Mao, Zhicheng Dou, Fengran Mo, Jiewen Hou, Haonan Chen, and Hongjin Qian. 2023b.Large language models know your contextual search intent: A prompting framework for conversational search.In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 1211–1225. Association for Computational Linguistics.
  • Mao etal. (2022a)Kelong Mao, Zhicheng Dou, and Hongjin Qian. 2022a.Curriculum contrastive context denoising for few-shot conversational dense retrieval.In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, pages 176–186. ACM.
  • Mao etal. (2022b)Kelong Mao, Zhicheng Dou, Hongjin Qian, Fengran Mo, Xiaohua Cheng, and Zhao Cao. 2022b.Convtrans: Transforming web search sessions for conversational dense retrieval.In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 2935–2946. Association for Computational Linguistics.
  • Mao etal. (2023c)Kelong Mao, Hongjin Qian, Fengran Mo, Zhicheng Dou, Bang Liu, Xiaohua Cheng, and Zhao Cao. 2023c.Learning denoised and interpretable session representation for conversational search.In Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023, pages 3193–3202. ACM.
  • Mao etal. (2020)Kelong Mao, Xi Xiao, Jieming Zhu, Biao Lu, Ruiming Tang, and Xiuqiang He. 2020.Item tagging for information retrieval: A tripartite graph neural network based approach.In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2327–2336.
  • Mo etal. (2023a)Fengran Mo, Kelong Mao, Yutao Zhu, Yihong Wu, Kaiyu Huang, and Jian-Yun Nie. 2023a.Convgqr: Generative query reformulation for conversational search.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 4998–5012. Association for Computational Linguistics.
  • Mo etal. (2023b)Fengran Mo, Jian-Yun Nie, Kaiyu Huang, Kelong Mao, Yutao Zhu, Peng Li, and Yang Liu. 2023b.Learning to relate to previous turns in conversational search.In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, pages 1722–1732. ACM.
  • Mo etal. (2024)Fengran Mo, Chen Qu, Kelong Mao, Tianyu Zhu, Zhan Su, Kaiyu Huang, and Jian-Yun Nie. 2024.History-aware conversational dense retrieval.arXiv preprint arXiv:2401.16659.
  • Qian and Dou (2022)Hongjin Qian and Zhicheng Dou. 2022.Explicit query rewriting for conversational dense retrieval.In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 4725–4737. Association for Computational Linguistics.
  • Touvron etal. (2023)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, PunitSingh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, EricMichael Smith, Ranjan Subramanian, XiaoqingEllen Tan, Binh Tang, Ross Taylor, Adina Williams, JianXiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov,and Thomas Scialom. 2023.Llama 2: Open foundation and fine-tuned chat models.CoRR, abs/2307.09288.
  • VanDijk etal. (1983)TeunAdrianus VanDijk, Walter Kintsch, etal. 1983.Strategies of discourse comprehension.
  • Wang etal. (2023)Liang Wang, Nan Yang, and Furu Wei. 2023.Query2doc: Query expansion with large language models.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9414–9423. Association for Computational Linguistics.
  • Wu etal. (2022)Zeqiu Wu, YiLuan, Hannah Rashkin, David Reitter, Hannaneh Hajishirzi, Mari Ostendorf, and GauravSingh Tomar. 2022.CONQRR: conversational query rewriting for retrieval with reinforcement learning.In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 10000–10014. Association for Computational Linguistics.
  • Xiong etal. (2021)Lee Xiong, Chenyan Xiong, YeLi, Kwok-Fung Tang, Jialin Liu, PaulN. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021.Approximate nearest neighbor negative contrastive learning for dense text retrieval.In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  • Ye etal. (2023)Fanghua Ye, Meng Fang, Shenghui Li, and Emine Yilmaz. 2023.Enhancing conversational search: Large language model-aided informative query rewriting.In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 5985–6006. Association for Computational Linguistics.
  • Yu etal. (2021)Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and Zhiyuan Liu. 2021.Few-shot conversational dense retrieval.In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 829–838. ACM.
  • Zhou etal. (2021)Yujia Zhou, Zhicheng Dou, Yutao Zhu, and Ji-Rong Wen. 2021.PSSL: self-supervised learning for personalized search with contrastive sampling.In CIKM ’21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1 - 5, 2021, pages 2749–2758. ACM.
  • Zhu etal. (2021)Yutao Zhu, Jian-Yun Nie, Zhicheng Dou, Zhengyi Ma, Xinyu Zhang, Pan Du, Xiaochen Zuo, and Hao Jiang. 2021.Contrastive learning of user behavior sequence for context-aware document ranking.In CIKM ’21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1 - 5, 2021, pages 2780–2791. ACM.
  • Zhu etal. (2023a)Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zhicheng Dou, and Ji-Rong Wen. 2023a.Large language models for information retrieval: A survey.CoRR, abs/2308.07107.
  • Zhu etal. (2023b)Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zhicheng Dou, and Ji-Rong Wen. 2023b.Large language models for information retrieval: A survey.CoRR, abs/2308.07107.

Appendix

Appendix A Dataset Details

In this section, we provide more details about the four datasets we use.

QReCC is a large-scale, open-domain conversational question-answering (QA) dataset featuring human-annotated question rewrites. It integrates conversations from QuAC (Choi et al., 2018), TREC CAsT, and Natural Questions (Kwiatkowski et al., 2019). The text corpus used for retrieval contains 54 million passages.

TopiOCQA comprises conversations seeded by real search queries from Natural Questions, with subsequent interactions simulated using a wizard-of-oz approach.

QReCC              Training    Testing
# Conversations    10,823      2,775
# Turns            63,501      16,451
# Passages         54M

TopiOCQA           Training    Testing
# Conversations    3,509       205
# Turns            45,450      2,514
# Passages         25M

CAsT-20 and CAsT-21 were released by the TREC Conversational Assistance Track (CAsT). Because they contain only a limited number of conversations, they are typically used as evaluation datasets. Each query turn in both CAsT-20 and CAsT-21 has a corresponding human rewrite and a canonical response passage.

Dataset            CAsT-20    CAsT-21
# Conversations    25         18
# Turns            208        157
# Passages         38M        40M

Appendix B Implementation Details

We use the ANCE checkpoint provided by Huggingface as the base model (https://huggingface.co/castorini/ance-msmarco-passage). We use $k=1$ augmented negative conversation as the hard negative. We set the temperatures to 0.0012 and 0.001 for training the conversational context encoders on QReCC and TopiOCQA, respectively. The token mask ratio $r_{\text{w}}$ and the turn mask ratio $r_{\text{t}}$ are tuned and set to 0.5 and 0.5 for QReCC, and to 0.9 and 0.5 for TopiOCQA. The learning rates are set to 1e-5 and 1.5e-5 for training on QReCC and TopiOCQA, respectively. The weight $\alpha$ is set to 1.0 and 0.1 for QReCC and TopiOCQA, respectively. The model is trained with a batch size of 12. More details can be found in our code.
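For reference, the hyperparameters above can be gathered into a single summary as in the following Python snippet. This is only an illustrative sketch; the dictionary structure and key names are our own shorthand, not the exact configuration format of the released code.

    # Illustrative summary of the ConvAug training hyperparameters listed above
    # (assumed structure; see the released code for the authoritative configuration).
    HYPERPARAMS = {
        "base_model": "castorini/ance-msmarco-passage",  # ANCE checkpoint from Huggingface
        "num_hard_negatives": 1,                         # k augmented negative conversations
        "batch_size": 12,
        "QReCC": {
            "temperature": 0.0012,
            "token_mask_ratio": 0.5,   # r_w
            "turn_mask_ratio": 0.5,    # r_t
            "learning_rate": 1e-5,
            "alpha": 1.0,
        },
        "TopiOCQA": {
            "temperature": 0.001,
            "token_mask_ratio": 0.9,   # r_w
            "turn_mask_ratio": 0.5,    # r_t
            "learning_rate": 1.5e-5,
            "alpha": 0.1,
        },
    }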

Appendix C Prompt Templates

C.1 Multi-level Data Augmentation

C.2 Other Prompts

Appendix D Details of Calculating Difficulty

To estimate a conversation’s complexity, we use $\text{Diff}(C)=|T_{\text{h}}|+\left(|\text{Topic}(C)|\cdot\overline{\text{PPL}(C)}\right)$. This equation comprises three components: (1) The number of historical turns $|T_{\text{h}}|$. Longer conversations often contain richer information (Mao et al., 2022a). (2) The number of topics $|\text{Topic}(C)|$. Each new topic introduces potential contextual shifts. We apply a topic model to count $C$’s topics; the topic model we use was pre-trained on Wikipedia (https://huggingface.co/MaartenGr/BERTopic_Wikipedia). We illustrate the process of counting topics for a conversation $C$ in Alg. 1. Intuitively, we assume the first turn of $C$ has one topic and each subsequent turn can add at most one topic to the count of its previous turn. To ensure we only count new topics, we add a topic only if the topic model is more confident about this new topic than about the last topic it identified. (3) The average perplexity of $C$. Perplexity quantifies how well a language model predicts a sample. We prompt an LLM (Appendix C.2) to predict the response based on the context and compute the average perplexity over all turns. A higher $\overline{\text{PPL}(C)}$ indicates that the conversation contains more challenging language.

The sentence-transformer model we use to calculate the similarity between augmented samples is all-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
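As a concrete illustration, the similarity between an original context and an augmented one can be computed as in the following minimal sketch. The function name and the way conversation contexts are flattened into strings are illustrative assumptions, not our exact implementation.

    from sentence_transformers import SentenceTransformer, util

    # Load the sentence-transformer used for scoring augmented samples.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def context_similarity(original: str, augmented: str) -> float:
        # Encode both flattened conversation contexts and return their cosine similarity.
        embeddings = model.encode([original, augmented], convert_to_tensor=True)
        return util.cos_sim(embeddings[0], embeddings[1]).item()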

Algorithm 1: Counting the topics of a conversation

Input: a conversation {t_1, t_2, ..., t_n}, a topic model f(·)
Output: topicCounts

Initialize topicCounts as an empty list
Initialize topics as an empty list
for i = 1 to n do
    P ← f({t_1, ..., t_i})
    P ← P \ topics
    P ← SORT(P, DESCENDING)
    if i == 1 then
        APPEND 1 to topicCounts
        APPEND ARGMAX(P) to topics
        confidence ← P[0] - P[1]
    else
        confidence′ ← P[0] - P[1]
        if confidence′ ≥ confidence then
            APPEND topicCounts[i-1] + 1 to topicCounts
            APPEND ARGMAX(P) to topics
            confidence ← confidence′
        else
            APPEND topicCounts[i-1] to topicCounts
        end if
    end if
end for
return topicCounts
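For readers who prefer code over pseudocode, the following minimal Python sketch mirrors Alg. 1 and the difficulty formula above. The topic-model interface (a callable returning a topic-to-probability mapping) and the helper names are illustrative assumptions rather than our exact implementation.

    from typing import Callable, Dict, List

    def count_topics(turns: List[str],
                     topic_model: Callable[[str], Dict[str, float]]) -> List[int]:
        """Sketch of Alg. 1: cumulative topic count after each turn.

        `topic_model` is assumed to map a concatenated context string to a
        {topic: probability} dictionary, e.g. a thin wrapper around the
        pre-trained BERTopic model mentioned above.
        """
        topic_counts: List[int] = []
        seen_topics: List[str] = []
        confidence = 0.0
        for i in range(1, len(turns) + 1):
            probs = topic_model(" ".join(turns[:i]))
            # Discard topics that have already been counted.
            probs = {t: p for t, p in probs.items() if t not in seen_topics}
            ranked = sorted(probs.items(), key=lambda x: x[1], reverse=True)
            # Guard for fewer than two remaining candidates (not in the pseudocode).
            if len(ranked) >= 2:
                top_p, second_p = ranked[0][1], ranked[1][1]
            else:
                top_p, second_p = (ranked[0][1], 0.0) if ranked else (0.0, 0.0)
            gap = top_p - second_p  # confidence in the newly identified topic
            if i == 1:
                topic_counts.append(1)
                if ranked:
                    seen_topics.append(ranked[0][0])
                confidence = gap
            elif ranked and gap >= confidence:
                topic_counts.append(topic_counts[-1] + 1)
                seen_topics.append(ranked[0][0])
                confidence = gap
            else:
                topic_counts.append(topic_counts[-1])
        return topic_counts

    def difficulty(num_history_turns: int, num_topics: int, avg_ppl: float) -> float:
        """Diff(C) = |T_h| + (|Topic(C)| * mean PPL(C))."""
        return num_history_turns + num_topics * avg_ppl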

Appendix E Examples of Generated Conversations

Figure 5: Example of the full data generated by the LLM for a turn.
Figure 6: Example of the full data generated by the LLM for a turn.

In this section, we present two examples of the full data generated by the LLM for a turn in Figure 5 and Figure 6. We only show the data generated by the LLM; example contexts augmented by rule-based strategies (token masking, turn masking, and reordering based on the dependency graph generated by the LLM) can be found in Figure 2.
