Generator-Retriever-Generator Approach for Open-Domain Question Answering

Open-domain question answering (QA) tasks usually require retrieving relevant information from a large corpus to generate accurate answers. We propose a novel approach called Generator-Retriever-Generator (GRG) that combines document retrieval techniques with a large language model (LLM) by first prompting the model to generate contextual documents based on a given question. In parallel, a dual-encoder network retrieves documents relevant to the question from an external corpus. The generated and retrieved documents are then passed to a second LLM, which generates the final answer. By combining document retrieval and LLM generation, our approach addresses the challenges of open-domain QA, such as generating informative and contextually relevant answers. GRG outperforms the state-of-the-art generate-then-read and retrieve-then-read pipelines (GenRead and RFiD), improving their performance by at least +5.2, +4.2, and +1.6 EM on the TriviaQA, NQ, and WebQ datasets, respectively. We provide code, datasets, and checkpoints at https://github.com/abdoelsayed2016/GRG.


INTRODUCTION
Open-domain question answering (QA) tasks pose significant challenges since they require access to large document collections or repositories of domain-specific knowledge. Existing methods for QA [10,13] often rely on a retrieve-then-read pipeline, where relevant contextual documents are retrieved from external sources such as Wikipedia, and the answer prediction is conditioned on the retrieved documents and the question. However, these methods suffer from several drawbacks. Firstly, the retrieved documents are often chunked and of fixed size, which can result in the inclusion of noisy and irrelevant information. Fixed-size document chunks may not adequately capture the context necessary for finding accurate answers [44], and the presence of irrelevant information can introduce noise into the retrieved documents, negatively impacting the quality and relevance of the generated answers. Secondly, the representations of questions and documents in current approaches are typically obtained independently [23]. This independent processing fails to capture the intricate interactions and dependencies between the question and the documents; as a result, the model's understanding of the question and its ability to extract relevant information from the retrieved documents may be limited. The shallow interaction between questions and documents hinders the model's capability to fully exploit the contextual cues present in the data, thereby limiting its answer generation accuracy. Thirdly, the limitations on retriever model parameters and embedding sizes, imposed by the need to efficiently handle large corpora, restrict the model's capacity to fully leverage a large language model's parametric knowledge and deduction capabilities. Consequently, retriever models may struggle to capture the rich semantic and contextual information necessary for accurate answer generation [18].
On the other hand, open-domain QA often involves training a language model to generate answers for a given question without access to accompanying documents containing the answer [46]. One promising approach in open-domain QA is to augment the language model with an external knowledge source, such as Wikipedia, referred to as evidence documents [10]. This approach comprises two core components: an information retrieval system (the retriever) to identify relevant text snippets from the knowledge source, and another system (the reader) to generate answers based on the retrieved documents and the question.
This paper proposes a novel approach called Generator-Retriever-Generator (GRG) for open-domain question answering. Our method combines document retrieval techniques with large language models to address the challenges of generating informative and contextually relevant answers. We leverage the power of a large language model such as GPT-3 or InstructGPT [3,24] to generate contextual documents based on a given question, while simultaneously employing a dense passage retrieval system [13,37] to retrieve relevant documents from external sources. A second large language model then processes the generated and retrieved documents to produce the final answer. By integrating document retrieval and large language model generation, the proposed GRG approach aims to improve the accuracy of open-domain question answering. Fig. 1 shows the high-level architecture of the GRG approach.
Our contributions can be summarized as follows: (1) GRG approach: we introduce the GRG approach, which combines document generation and retrieval to improve answer generation in open-domain QA. (2) Document generation and retrieval methods: we develop a method that uses InstructGPT to generate contextually rich documents, and we propose the Vector Index Retriever for efficient retrieval of relevant documents. (3) Effectiveness of GRG: we validate the effectiveness of our GRG approach through extensive experiments and analyses on three open-domain QA datasets.

Table 1: Advantages and disadvantages of the four open-domain QA architectures and of our GRG approach.

Retriever-Reader
- Risk of overlooking relevant information.
- Dependent on the accuracy and coverage of the retrieval process.

Generator-Reader [44]
- Capable of generating novel, contextually relevant answers.
- Reduced dependence on pre-existing documents.
- High computational requirements.
- Possible issues with the reliability and accuracy of generated answers.

Retriever-Generator [10,37]
- Merges accurate retrieval with creative generation.
- Enhances recall by supplementing existing content.
- Balancing quality and diversity of answers can be challenging.

Retriever-Only [15]
- Directly utilizes a broad range of existing documents.
- May miss crucial documents.
- Limited flexibility in handling complex queries.

Generator-Retriever-Generator (ours)
- Ensures high relevance and accuracy.
- Adaptable to a wide range of queries.
- Balancing diverse and high-quality answers can be difficult.

RELATED WORK
We describe in this section related works that fall into four known open-domain QA architectures: Retriever-Reader, Retriever-Generator, Generator-Reader, and Retriever-Only.

Retriever Reader
The Retriever-Reader approach is based on the idea of combining information retrieval (retriever) and machine reading comprehension (reader) techniques. Previous work in this area includes the use of document retrieval techniques such as TF-IDF, BM25, or neural ranking models [26,33] to select relevant documents from a large corpus. Notable works include the Stanford Question Answering Dataset (SQuAD) and subsequent advancements in retriever-reader architectures such as DrQA and BiDAF [35]. Dense Passage Retrieval (DPR) [13] focuses on dense representations for passage retrieval, utilizing a dual-encoder architecture to retrieve passages and a reader model to extract the answer. T5-RC [29], a variant of the T5 model, follows the Retriever-Reader approach by retrieving relevant passages and applying T5 as a reader for answer extraction.

Retriever Generator
The Retriever-Generator [10,37] approach aims to leverage both generative modeling and retrieval techniques. Previous work [46] in this direction has explored methods for retrieving supporting passages using sparse or dense representations. The retrieved passages are then used as input to a sequence-to-sequence model, such as a transformer-based architecture, which generates the answer to the question. This approach has shown improved performance on benchmark datasets like TriviaQA [11] and NaturalQuestions [14].

Generator Reader
The Generator-Reader approach [44] focuses on generating contextual documents based on a question and then using a reader model to extract the answer from the generated context. The approach involves training large language models, such as the Generative Pre-trained Transformer (GPT) [27], to generate coherent and relevant documents given a prompt. The generated documents are then processed by a reader component, which can be a reading comprehension model, to extract the answer. The DocGen approach, introduced in [1], instead focuses on generating synthetic documents from queries. The DocGen pipeline involves expanding and highlighting the original query before generating a synthetic document likely to be relevant to the query. To enhance the relevance between generated synthetic documents and their corresponding queries, the authors propose DocGen-RL, which treats the estimated relevance of the document as a reward and uses reinforcement learning (RL) to optimize the DocGen pipeline.

Figure 2: Architecture diagram illustrating the Generator-Retriever-Generator (GRG) approach, which combines document retrieval techniques and large language models to generate contextual documents and retrieve relevant information for answering questions.

Retriever Only
The Retriever-Only [15] approach seeks to reformulate open-domain question answering as a phrase retrieval problem, eliminating the need for processing documents during inference. Previous work has explored retrieval models that rely heavily on sparse representations, such as TF-IDF or BM25 [13], to retrieve relevant phrases or sentences. However, these models often underperform compared to retriever-reader approaches. Recent work has therefore focused on learning dense representations of phrases alone, leading to stronger performance in open-domain question answering. This involves training models using reading comprehension tasks and employing negative sampling techniques. Seo et al. [36] proposed a phrase retrieval approach in which they independently encode the representations of phrases and questions, then use a similarity search over the encoded phrase representations to identify the correct answer.
Table 1 presents the advantages and disadvantages of each of the four approaches and of our GRG approach. Retriever-Reader leverages external knowledge and document-based context, but there is a possibility of missing relevant documents and a dependency on retrieval performance. Generator-Reader offers flexibility and adaptability in generating answers, but it requires substantial computational power, and the generated answers may not always be accurate. Retriever-Generator balances retrieval and generation, enhancing recall but increasing computational complexity. Retriever-Only leverages a broad range of existing documents, but it has limitations in handling complex queries and lacks flexibility. Generator-Retriever-Generator provides contextual relevance, improved accuracy, and adaptability, but it comes with increased computational complexity and the challenge of balancing quality and diversity. These considerations play a crucial role in designing effective question answering systems.

METHOD
Figure 2 presents an architectural diagram depicting the GRG approach and its sequential process. It comprises three integral components: (i) a large language model (LLM) for document generation, (ii) a dual-encoder network for document retrieval, and (iii) a second large language model for answer generation. In the following sections, we discuss each component in detail and outline our training methodology.

Document Generation
Few-shot information extraction tasks aim to recognize novel relations and extract relevant information from unstructured text with limited annotated instances [7]. Traditional information extraction methods struggle with data scarcity and often face challenges in identifying emerging relation types and their associated entity pairs. To overcome this issue, few-shot learning techniques leverage a small number of labeled samples to generalize to unseen instances [21].
In our case, generating informative and contextually rich background documents can be treated as a few-shot technique by harnessing the power of language models, particularly InstructGPT [24]. GRG uses InstructGPT to generate context from an input prompt. For few-shot information extraction, a suitable prompt structure is: "Generate a background document to answer the given question: [question placeholder]". By substituting the question placeholder with the actual question, we instruct the model to generate a document that contains pertinent information for answering the question. Using InstructGPT, we generate informative and contextually rich documents that provide relevant information for answering a given question. These generated documents are then included in the collection of evidence documents D.
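As an illustration, the prompt-construction step can be sketched in a few lines of Python. `build_prompt` is a hypothetical helper name, and the commented client call stands in for whatever text-generation API is used; this is a sketch, not our exact implementation:

```python
def build_prompt(question: str) -> str:
    """Fill the document-generation prompt template with a question."""
    return ("Generate a background document to answer the given question: "
            + question)

# Hypothetical usage with any text-generation client (assumed interface):
#   document = llm.generate(build_prompt("who wrote Hamlet?"))
print(build_prompt("who wrote Hamlet?"))
```

The same template is reused for every question, which is what makes the approach usable in a few-shot (here, effectively zero-shot) setting.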
Vector Index Retrieval

We propose a vector-based retrieval method [20] to increase the relevance of the knowledge in generated documents using the Vector Index Retriever [42]. This approach leverages vector representations and the Vector Store Index to efficiently retrieve documents based on their similarity to the input question. The Vector Index Retriever is crucial to our information retrieval pipeline. It utilizes the Vector Store Index, which stores vector representations of the documents generated by a large language model. We capture each document's semantic and contextual information by encoding it as a high-dimensional vector. In the retrieval process, the Vector Index Retriever employs a similarity-based approach to identify the most relevant documents. Given a question, it retrieves a pre-specified number of top-k results with the highest similarity scores. The k parameter can be adjusted to balance precision and efficiency. We describe the details of each step below.
Step 1: Generate Documents. We first generate 10 to 50 contextual documents {d_1, ..., d_n} for each question q ∈ Q using InstructGPT, where Q represents the set of questions in the dataset.
Step 2: Encode Documents. Each generated document d_i is encoded into an embedding vector e_i that captures its semantic and contextual content.
Step 3: Vector Index Representation. We store all the embedding vectors {e_i} using the Vector Store Index. This allows for efficient retrieval of documents based on their similarity to the question.
Step 4: Selection of Generated Documents. After storing the encoded documents, we utilize the Vector Index Retriever to process the question and select up to the top-k (2 or 5 in our experiments) most relevant documents whose cosine similarity exceeds a threshold. The cosine similarity score is calculated between the encoded question vector and the vectors of the stored documents:

sim(q, d_i) = (q · d_i) / (||q|| ||d_i||),

where q represents the encoded question vector and d_i represents the vector of the i-th stored document.
By comparing the cosine similarity scores of the question vector with the vectors of the stored documents, we can identify the documents with the highest similarity to the question; in this case, we retrieve the top 5 documents with similarity above the specified threshold of 0.7. Following these steps, our approach enables effective retrieval of generated contextual documents for open-domain question answering, specifically selecting documents with high similarity to the question and thus ones that are likely to contain the correct answer. This retrieval process leverages vector representations and similarity-based techniques to prioritize the most relevant and informative documents.
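The thresholded top-k selection can be sketched as follows, assuming the question and document embeddings are already computed; the function name and toy vectors are illustrative, not our actual implementation:

```python
import numpy as np

def retrieve_top_k(q_vec, doc_vecs, k=5, threshold=0.7):
    """Return indices of the top-k documents whose cosine similarity
    to the question vector exceeds the threshold."""
    q = q_vec / np.linalg.norm(q_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                       # cosine similarities to the question
    order = np.argsort(-sims)          # descending by similarity
    return [i for i in order[:k] if sims[i] > threshold]

# Toy example: doc 0 is identical to q, doc 2 is at 45 degrees (sim ~0.707).
q = np.array([1.0, 0.0])
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(retrieve_top_k(q, docs))
```

Only documents clearing the 0.7 threshold survive, so fewer than k documents may be returned for some questions.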

Document Retriever
The retriever module plays a crucial role in our question-answering model. Given a collection of evidence documents D_R = {d_1, ..., d_m} and a question q, its goal is to select a subset Z ⊂ D_R of the documents most relevant to the question; this subset is then used for further processing and answer generation. For this, our retriever model is based on EMDR² (End-to-end training of Multi-Document Reader and Retriever) [37], a dual-encoder network [39] consisting of two separate encoders, one for encoding the question and one for encoding the evidence documents. Each encoder takes a sequence (question or document) as input and produces a fixed-size vector representation. To quantify the relevance of an evidence document d_i to a question q, we compute their encoded vectors and take the dot product:

score(q, d_i) = enc(q; Φ_q) · enc(d_i; Φ_d),

where enc(q; Φ_q) and enc(d_i; Φ_d) represent the encoded vectors of the question and document, respectively, with Φ denoting the retriever parameters. By calculating the dot product, we capture the similarity between the question and document, with higher scores indicating stronger relevance. Based on the retrieval scores, we select the top-k documents Z = {z_1, ..., z_k} from the collection D_R for a given question q.
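A minimal sketch of the dot-product scoring and top-k selection, assuming the encoder outputs are already available as vectors (the function names are illustrative):

```python
import numpy as np

def retrieval_scores(q_vec, doc_vecs):
    """Dot-product relevance scores between one question vector and a
    matrix of document vectors (one row per document)."""
    return doc_vecs @ q_vec

def top_k_docs(q_vec, doc_vecs, k):
    """Indices of the k documents with the highest retrieval scores."""
    return np.argsort(-retrieval_scores(q_vec, doc_vecs))[:k]

# Toy example with 2-dimensional "encodings".
q = np.array([1.0, 0.0])
docs = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
print(top_k_docs(q, docs, 2))
```

Unlike the cosine-based selection used for generated documents, this score is an unnormalized dot product, as in the equation above.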

Generation Model
Our generator is based on a model from the LLaMA family, a collection of open-source language models pretrained on trillions of tokens using publicly available datasets, which achieve state-of-the-art performance on many benchmarks. The generator model takes as input a question q and a set of retrieved and generated documents, and generates an answer.
Each retrieved document d_i and generated document g_j is concatenated with the question. We use the newline character (\n) as a delimiter to ensure separation between the documents. Additionally, we include the </s> token at the end of each utterance as an end-of-turn token, which indicates the completion of each input segment.
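This concatenation can be sketched as below; the exact placement of the </s> token (one per segment) reflects our reading of the description, and the helper name is an assumption:

```python
def build_generator_input(question, retrieved_docs, generated_docs):
    """Concatenate the question with retrieved and generated documents,
    newline-delimited, with an end-of-turn token closing each segment."""
    parts = [question] + list(retrieved_docs) + list(generated_docs)
    return "\n".join(p + "</s>" for p in parts)

print(build_generator_input("who wrote Hamlet?",
                            ["retrieved passage 1"],
                            ["generated document 1"]))
```

The resulting string is what the generator LLM consumes as its prompt.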
The input to our generator model is then represented as:

input = q </s> \n d_1 </s> \n ... \n d_k </s> \n g_1 </s> \n ... \n g_n </s>

The LLaMA language model uses a cosine loss that helps the model better distinguish between similar words and improves its accuracy. The cosine loss is defined as:

L_cos = -(1/N) Σ_i log [ exp(cos(h_i, t_i)/τ) / Σ_j exp(cos(h_i, t_j)/τ) ],

where h_i is the hidden state of the i-th token in the sequence, t_i is the target embedding for that token, and τ is a temperature parameter that controls the sharpness of the distribution.
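Under our reading of this loss as a temperature-scaled softmax over cosine similarities, a sketch looks like the following; the function name and default temperature are assumptions, not values from the paper:

```python
import numpy as np

def cosine_loss(h, t, tau=0.1):
    """Temperature-scaled softmax over cosine similarities: for each
    position i, the matching target t_i is the positive class among
    all targets t_j. Returns the mean negative log-likelihood."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = (h @ t.T) / tau                       # cos(h_i, t_j) / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))
```

A smaller τ sharpens the distribution, penalizing near-misses between similar target embeddings more strongly.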
By incorporating the question, retrieved documents, and generated documents, our generator model can generate contextually informed answers tailored to the specific question and the available input information.

EXPERIMENTAL SETTINGS

Datasets
The evaluation is conducted on several datasets, following the same experimental setup as in [10,17,44]. We consider the following datasets: • NaturalQuestions (NQ) [14]: this dataset consists of questions corresponding to real Google search queries, with answers that are spans within Wikipedia articles. To evaluate the performance of our model, we employ the exact match (EM) score, following Chen et al. [4], Yang et al. [43], and Zhu et al. [46]. The EM score measures the correctness of an answer by comparing its normalized form to the list of acceptable answers. Through these evaluations, we aim to assess the effectiveness of the GRG model on open-domain question answering.
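The EM computation can be sketched with the standard answer normalization used in this line of work (lowercasing, removing punctuation, articles, and extra whitespace); the function names are illustrative:

```python
import re
import string

def normalize_answer(text):
    """Standard answer normalization: lowercase, strip punctuation,
    drop articles (a/an/the), and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction matches any acceptable answer."""
    return any(normalize_answer(prediction) == normalize_answer(g)
               for g in gold_answers)

print(exact_match("The Eiffel Tower!", ["eiffel tower"]))
```

The dataset-level EM score is then the percentage of questions for which `exact_match` returns True.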
We adopt the train/dev/test splits previously used in the open-domain QA setting by Izacard and Grave [10] and Karpukhin et al. [13]. Table 2 presents the statistics of the dataset sizes, including the training, development, and test sets. All our models are trained exclusively on the training data; the development data was not used during training, so the performance numbers reported in the paper for the dev and test data are independent of the training data. We split the training data, allocating 90% for model training and the remaining 10% for testing purposes.

TQA (Retriever): http://nlp.cs.washington.edu/triviaqa/ and TQA (Generator): https://drive.google.com/drive/folders/1DNjTTOLKi24wohJKu1Z-v6b4izfymlLu
WebQ (Retriever): https://github.com/google-research/language/tree/master/language/orqa and WebQ (Generator): https://drive.google.com/drive/folders/1DNjTTOLKi24wohJKu1Z-v6b4izfymlLu

Choice of Document Number
In our approach, we used only 2 or 5 documents during the generation process due to computational limitations and the extensive training time required for the LLaMA model. As Izacard and Grave [10] reported, training the T5 model using 100 documents requires considerable computational resources, such as 64 Tesla V100 32GB GPUs running for approximately one day. While increasing the number of documents can enhance model performance [10], it incurs significant costs in memory consumption and training time, which should be carefully considered, especially given the current trend towards Green AI [44].

Experimental Setup
In this section, we describe the experimental setup for training the LLaMA model using the DeepSpeed framework [30]. DeepSpeed provides techniques and automated parameter tuning to optimize training efficiency and memory utilization. We customized the training process using DeepSpeed's configuration options. Firstly, we enabled mixed precision training with bfloat16 (bf16) to accelerate training while maintaining accuracy. The AdamW optimizer was selected, and its hyperparameters were determined automatically by DeepSpeed. To control the learning rate, we employed the WarmupDecayLR scheduler. The LLaMA model is based on the transformer architecture [39] widely used in large language models. We utilize the LLaMA-7B model as our backbone for implementing GRG. The training and hyperparameter settings for LLaMA-7B are summarized in Table 3.
For memory consumption and speed optimization, we utilized DeepSpeed's ZeRO optimization stage 3, offloading the optimizer state and model parameters to the CPU with pinned memory. Additional hyperparameters were set, including gradient accumulation steps (8), gradient clipping (determined automatically), and batch size (4). This experimental setup aimed to achieve efficient training and optimal performance of our LLaMA model.
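A DeepSpeed configuration matching the settings described above might look as follows; the exact configuration used in our runs is not given in the text, so this is an illustrative sketch (values marked "auto" are tuned automatically by DeepSpeed):

```python
# Sketch of a DeepSpeed config dict covering the settings described above:
# bf16 mixed precision, AdamW, WarmupDecayLR, ZeRO stage 3 with CPU
# offloading, 8 gradient-accumulation steps, and a micro-batch size of 4.
ds_config = {
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": "auto"}},
    "scheduler": {"type": "WarmupDecayLR", "params": {"warmup_num_steps": "auto"}},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "gradient_accumulation_steps": 8,
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": 4,
}
```

Such a dict (or the equivalent JSON file) is passed to DeepSpeed's initialization alongside the model and optimizer.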
In addition to the DeepSpeed experimental setup described above, we conducted an additional experiment using the LoRA technique [8] for fine-tuning our LLaMA model. LoRA, which stands for "Low-Rank Adaptation", is a method that allows for the efficient fine-tuning of large language models. For this experiment, we followed a slightly different approach: instead of retraining the entire model, we generated a fine-tuning file of adapter weights applied on top of the base LLaMA model. This approach significantly reduces computational overhead and makes the fine-tuning process more efficient, even on modest hardware.
Our proposed model and relevant baselines are implemented using PyTorch [25] on a cluster of machines equipped with 100 CPUs, 400GB of physical memory, and a combination of 4 A40 and 4 A100 GPUs for our experiments.

RESULTS
We present in this section the experimental results, which are divided into three subsections: results of open-domain QA, results of document generation, and the ablation study. The document generation analysis aims to evaluate the effectiveness of our document retrieval method in generating relevant and informative documents for answering open-domain questions. In the ablation study, we investigate the impact of different factors (top-k answers, architecture components, and the zero-shot strategy) on performance.

Results of Open-Domain QA
This section presents the results of the proposed GRG approach, which combines generated and retrieved documents for question answering. The results of the experiments are shown in Table 4 using the EM score. We compare the performance of GRG against several baselines and existing state-of-the-art models on three benchmark datasets: TriviaQA, WebQ, and NQ. We first compare GRG against baseline models that utilize document retrieval from Wikipedia. These baselines include BM25 + BERT [17], REALM [6], DPR [13], RAG [19], FiD-l [44], FiD-xl [44], FiD [10], EMDR [37], the DensePhrases models [15,16], and RFiD-large [40]. The numbers reported for these baselines are taken directly from their respective papers. GRG consistently outperforms most of the baseline models across all datasets. Specifically, GRG achieves significant improvements over BM25 + BERT (29.9% improvement on the TriviaQA dev set and 29.7% on the TriviaQA test set), REALM (15.3% improvement on the WebQ test set), DPR (14.9% improvement on the WebQ test set), FiD (7.1% improvement on the NQ test set), and RAG (14.0% improvement on the NQ test set), demonstrating the effectiveness of combining generated and retrieved documents. Next, we compare GRG against the DensePhrases models [15,16], which employ phrase retrieval. DensePhrases has been shown to perform well in question answering tasks; however, the GRG approach surpasses its performance across all datasets. On the TriviaQA dev set, GRG achieves a 23.3% improvement over DensePhrases [15], and on the WebQ test set, it has a 14.5% improvement over DensePhrases [16].
Next, we evaluate the performance of GRG against the GenRead [44] models, which only generate documents. GenRead models have shown promising results in generating informative documents; still, our approach consistently outperforms GenRead in question answering accuracy on all datasets. On the TriviaQA dev set, GRG achieves a 7.3% improvement over GenRead (FiD-l), and on the WebQ test set, it has a 2.1% improvement over GenRead (FiD-l).
Finally, we discuss the performance of GRG with varying configurations. We evaluate GRG with two numbers of generated documents (2 and 5) using LoRA, and we additionally report the performance of GRG without LoRA using the same numbers of generated documents. On the TriviaQA dev set, GRG achieved 76.4% accuracy when using 2 generated documents, which rose to 77.1% with 5 generated documents. On the WebQ test set, the model reaches 52.0% accuracy with 2 generated documents, increasing to 55.8% with 5 generated documents. Lastly, on the NQ test set, the model achieved an accuracy of 55.4% with 2 generated documents and showed a slight improvement to 56.2% when 5 generated documents were utilized. GRG outperforms all of the baselines on all three datasets. On TriviaQA, GRG achieves an exact match score of 76.8, a +5.2 improvement over the previous state-of-the-art (GenRead). On WebQ, our model reaches an exact match score of 56.0, a +1.6 improvement over the previous state-of-the-art (RFiD-large). On the last dataset, NQ, GRG achieves an exact match score of 58.5, a +4.2 improvement over the previous state-of-the-art (GenRead). We also compare our GRG approach with the COMBO model, which likewise utilizes a blend of generated and retrieved documents and has shown notable performance, particularly on the TriviaQA and WebQ datasets: it achieves an exact match score of 74.6 on the TriviaQA test set and 54.2 on the WebQ test set. However, GRG still shows superior results, achieving higher exact match scores of 76.8 on TriviaQA and 56.0 on WebQ, surpassing COMBO by 2.2 and 1.8 points, respectively. This suggests that while COMBO's strategy is effective, the methodologies employed in GRG allow for even more precise question answering.
Our results demonstrate that GRG performs better than all the baselines and state-of-the-art models across all the datasets. Including both generated and retrieved documents enables GRG to capture a wider range of relevant information, improving QA accuracy. Notably, GRG with 5 generated documents consistently outperforms GRG with 2 generated documents, suggesting the benefit of incorporating more diverse generated content.

Evaluating Document Generation
In this section, we present the experimental results of our document retrieval approach for document generation using the GTR-T5-large and MiniLM-L6 models. We computed the Recall@K of retrieving the documents containing the true answer for each question, as in [34]. To ensure a fair comparison and consistent evaluation, we utilized the same dataset as in [44]; this choice was motivated by the fact that the context generated by the InstructGPT model may differ significantly between requests. We measured the Recall@K of our document retrieval method by calculating the percentage of questions for which a retrieved document contained the true answer. These results highlight the effectiveness of our vector index retrieval approach in identifying relevant documents for answering open-domain questions. The GTR-T5-large model, with its higher-dimensional vector encoding, performs better than the MiniLM-L6 model and the approach proposed by Yu et al. [44]. Table 5 presents Recall@K scores for three question answering datasets: TQA, NQ, and WebQ. The MiniLM-L6 model achieves scores ranging from 58.6% to 76.7% across the datasets, while the GTR-T5-large model outperforms it with scores ranging from 62.2% to 79.2%.
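The Recall@K computation described here can be sketched as follows; a simple case-insensitive substring match against the gold answers is assumed, and the names are illustrative:

```python
def recall_at_k(retrieved_docs_per_question, gold_answers_per_question, k):
    """Fraction of questions for which any of the top-k retrieved
    documents contains an acceptable answer string."""
    hits = 0
    for docs, golds in zip(retrieved_docs_per_question,
                           gold_answers_per_question):
        if any(g.lower() in d.lower() for d in docs[:k] for g in golds):
            hits += 1
    return hits / len(gold_answers_per_question)

# Toy example: the first question's top-1 document contains the answer.
retrieved = [["Paris is the capital of France", "irrelevant text"],
             ["nothing useful here"]]
answers = [["Paris"], ["Berlin"]]
print(recall_at_k(retrieved, answers, 1))
```

Practical evaluations typically apply answer normalization before matching, as with the EM score.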
ABLATION STUDIES

Zero-Shot Open-Domain QA

Table 6 shows the results of a zero-shot open-domain question answering (QA) evaluation, where different models are assessed without any external documents. These models, including FLAN, GLaM, Chinchilla, Gopher, InstructGPT, GPT-3, and LLaMA [5,24,28,32,38,41], possess varying parameter sizes and have been trained on large-scale corpora, enabling them to capture extensive world knowledge. When examining the performance of each model on questions from the TQA, NQ, and WebQ datasets, we observe notable variations. LLaMA, with its 7B parameters, stands out by achieving remarkable results in zero-shot QA. Despite its relatively smaller parameter size, LLaMA demonstrates the ability to effectively leverage the knowledge embedded within its parameters, showcasing its potential as a powerful tool for zero-shot question answering tasks. Models like InstructGPT and GPT-3, with larger parameter sizes (175B), also demonstrate competitive performance. InstructGPT achieves a high accuracy of 57.4% on the TQA dataset and performs consistently well across the other datasets. GPT-3 also achieves competitive results.

Impact of Architecture Components
We now evaluate the performance of each component used in our approach, specifically the retriever and the generator, when combined with LLaMA. The goal is to understand the individual contributions of these components to the overall performance. We compare results on the TQA and NQ datasets using different combinations of models. Figure 3 shows the performance comparison of the DPR+LLaMA and InstructGPT+LLaMA models on the TQA and NQ datasets.
On the TQA dataset, the InstructGPT+LLaMA model demonstrated EM scores of 67.1% and 70.1% on the development and test sets, respectively, when trained with 2 documents. With 5 documents, performance improved to 68.4% and 71.8% on the development and test sets, respectively. Shifting the focus to the NQ dataset, the InstructGPT+LLaMA model showed competitive performance, achieving an EM score of 42.1% on the development set and 42.0% on the test set with 2 documents. Increasing the number of training documents to 5 resulted in a modest improvement, with EM scores of 43.6% on the development set and 44.5% on the test set. These findings indicate that incorporating more documents during training can positively impact model performance; however, there may be diminishing returns in terms of accuracy improvement. As a result, striking a careful balance between the number of training documents and the resulting performance may be crucial to optimizing computational resources and training time.

Comparative Analysis GRG (LoRA) Models
We present additional experimental results for document generation using the GRG (LoRA) and GRG models.The performance of these models is evaluated on the TQA and NQ datasets, and the results are summarized in Table 7.
Table 7 displays the F1 scores obtained by the GRG (LoRA) and GRG models when generating documents for the TQA and NQ datasets. The models are evaluated on both the development and test sets.
For the GRG (LoRA) model, the results indicate that increasing the number of documents from 2 to 5 leads to improved performance on both datasets. On the TQA dataset, the F1 score increases from 75.6 to 79.7 on the development set and from 78.8 to 80.4 on the test set when moving from 2 to 5 documents. Similarly, on the NQ dataset, the F1 score improves from 60.4 to 63.9 on the development set and from 59.5 to 61.7 on the test set.
The GRG model also demonstrates competitive performance in document generation. With 2 documents, the model achieves an F1 score of 84.05 on the TQA development set and 83.8 on the test set. On the NQ dataset, the F1 score is 64.6 on the development set and 65.0 on the test set. Increasing the number of documents to 5 further enhances performance, with F1 scores of 84.6 and 84.7 on the TQA development and test sets, respectively, and 65.4 and 66.1 on the NQ development and test sets, respectively.
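The F1 scores in Table 7 measure token-level overlap between generated and reference text: the harmonic mean of token precision and recall. A simplified sketch, assuming plain lowercase whitespace tokenization (the tokenizer actually used in our evaluation may differ):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1: harmonic mean of precision and recall
    computed over (lowercased, whitespace-split) token multisets."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most
    # min(pred_count, ref_count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("a b", "b c")` gives 0.5 (one shared token out of two on each side), while a perfect match gives 1.0.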

Impact of Top-k Values on Performance
We finally analyze the impact of different top-k values on the performance of our proposed approach. Table 10 presents the EM and F1 scores for different top-k values on the NQ and TQA datasets. We observe that as the top-k value increases, the EM scores consistently improve. For example, on the NQ dataset, the EM score increases from 56.3% at top-1 to 71.6% at top-5. Similarly, on TQA, the EM score increases from 76.2% at top-1 to 82.6% at top-5.
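Top-k EM scores a question as correct if any of the model's k highest-ranked answer candidates matches a gold answer. A minimal sketch, using simple case-insensitive matching (the normalization in our actual evaluation is an assumption):

```python
def top_k_exact_match(candidates, gold_answers, k):
    """1.0 if any of the first k ranked candidates matches a gold answer
    (case-insensitive, whitespace-normalized)."""
    norm = lambda s: " ".join(s.lower().split())
    gold = {norm(g) for g in gold_answers}
    return float(any(norm(c) in gold for c in candidates[:k]))
```

Because a larger k only adds candidates to the matched set, this metric is monotonically non-decreasing in k, consistent with the steady improvements from top-1 to top-5 in Table 10.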

LIMITATIONS
This study acknowledges the following potential limitations: (1) Generated Document Quality: The performance of our approach depends on the accuracy and relevance of the documents generated by the language model. Despite extensive training, there can be instances of inaccurate or irrelevant information.
(2) Computational Cost: Large language models can be computationally intensive and time-consuming, especially for complex queries. This can pose scalability challenges when processing a large number of queries or when computing resources are limited. More details are in Appendix A.1.

CONCLUSIONS
In this paper, we proposed a Generator-Retriever-Generator approach for improving open-domain question answering systems. By combining generated and retrieved documents, we achieved significant performance gains across multiple benchmark datasets.
Our experiments demonstrate that GRG outperforms existing baselines in terms of accuracy and efficiency. The results also indicate the effectiveness of incorporating both generated and retrieved documents in the reading process, leveraging the combined strengths of language models and retrieval systems. Future work should focus on improving the accuracy of the document retrieval approach, potentially through the use of more advanced retrieval models or by incorporating additional contextual information. Further work will also include a more extensive investigation of hyperparameter configurations, such as the number of generated and retrieved documents.

A APPENDIX
A.1 Computational Cost Analysis
In this section, we compare the computational costs of using Dense Passage Retrieval (DPR) and InstructGPT for document retrieval and generation, respectively. We use DPR implemented with the T5 model [32], which has approximately 220 million parameters, and InstructGPT in its largest configuration with 175 billion parameters.
A.1.1 Cost Metrics. We estimate the computational cost in terms of Floating Point Operations (FLOPs) per token, a metric introduced by Kaplan et al. [12] and Hunger [9]. FLOPs provide an estimate of the computational cost required by a model to process a given input. However, it is important to note that FLOPs are not a direct measure of real-world latency or cost, which also depend on hardware and implementation.
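Under the common rule of thumb from Kaplan et al. [12] that a forward pass costs roughly 2N FLOPs per token for a model with N parameters, the relative per-token cost of the two models can be estimated with a back-of-the-envelope sketch (this ignores encoder/decoder asymmetry, sequence-length effects, and the retriever's index lookup):

```python
def forward_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass cost: ~2 FLOPs per parameter per token
    (Kaplan et al., 2020 rule of thumb)."""
    return 2.0 * n_params

dpr_t5 = forward_flops_per_token(220e6)       # DPR with T5, ~220M parameters
instructgpt = forward_flops_per_token(175e9)  # InstructGPT, 175B parameters

# InstructGPT is roughly 800x more expensive per generated token
# than the T5-based retriever.
ratio = instructgpt / dpr_t5
print(f"{ratio:.0f}x")  # ~795x
```

This gap illustrates why document generation dominates the pipeline's compute budget relative to dense retrieval.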

Figure 1 :
Figure 1: Simplified diagram illustrating the idea behind the Generator-Retriever-Generator approach.

Table 1 :
Advantages and Disadvantages of Question Answering Approaches

Table 3 :
Training and Hyperparameter Settings for LLaMA-7B.
• TriviaQA [11]: This dataset contains trivia questions collected from trivia and quiz-league websites. For open-domain question answering, we use the unfiltered version of the dataset, which includes 78,785 examples in the training set, 8,837 examples in the development set, and 11,313 examples in the test set.
• WebQ [2]: WebQuestions (WebQ) consists of questions obtained using the Google Suggest API, with the answers being entities from Freebase. The dataset contains approximately 3,417 examples in the training set, 361 examples in the development set, and 2,032 examples in the test set.

Table 4 :
Performance Comparison of GRG Approach and Baseline Models on TriviaQA, WebQ, and NQ Datasets.

Table 5 :
Recall@K scores for document retrieval using our approach equipped with GTR-T5-large and MiniLM-L6 models on TQA, NQ, and WebQ datasets.

Table 6 :
Comparative Performance of Language Models in Zero-Shot Open-Domain QA.

Table 7 :
F1 scores for document generation using GRG and GRG (LoRA) models.

Table 8 :
Performance comparison of DPR and InstructGPT models on the TQA dataset.

Table 9 :
Performance comparison of DPR and InstructGPT models on the NQ dataset.

Table 10 :
Performance comparison (EM and F1 scores) of GRG for different top-k values on the NQ and TQA datasets.