<Retrieval> [DSI] Transformer Memory as a Differentiable Search Index (2022.02)

I read through NLP papers I'm interested in and wrote up brief summaries. (Retrieval-related work from the Language & Knowledge Lab)

If anything is missing or incorrect, please leave a comment 🙇‍♂️

[Google Research]
- Proposes the Differentiable Search Index (DSI), a paradigm that uses a text-to-text model to map string queries directly to relevant docids
- Significantly outperforms baselines such as dual encoder models, and also shows strong generalization in a zero-shot setup

Background
- The 'retrieve-then-rank' strategy is the dominant approach for Information Retrieval (IR) systems
- Given a user query q, the system retrieves the documents d that are most relevant to it
Related Work
- Sequence-to-sequence system (called autoregressive entity linking)
- Retrieval Augmented Generation (RAG)
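Contributions
- Proposes DSI, which performs strongly despite having a much simpler architecture than a Dual Encoder (DE)
- Many variants of the DSI architecture are possible
- The first generative indexing approach to outperform the baselines on an existing document retrieval task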
Differentiable Search Index (DSI)
- Fully parameterizes the traditional multi-stage retrieve-then-rank pipeline within a single neural model
Indexing: uses a seq2seq approach that takes document tokens as input and produces the identifier as output
- Indexing Method
  - 1) Inputs2Target: doc_tokens -> docid; the identifier is the denoising target
  - 2) Targets2Inputs: docid -> doc_tokens; equivalent to training an autoregressive language model conditioned on the docid
  - 3) Bidirectional: trains both Inputs2Target and Targets2Inputs
  - 4) Span Corruption: prepends the identifier to the document tokens as a prefix and randomly masks spans
- Document Representation Strategies: What to index?
  - 1) Direct Indexing: takes the first L tokens of the document
  - 2) Set Indexing: removes stopwords and repeated terms, then applies direct indexing
  - 3) Inverted Index: subsamples a contiguous chunk of k tokens from the document and maps it to the docid
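
To make the indexing options above more concrete, here is a minimal Python sketch of the three document representation strategies and of how (input, target) pairs could be built for the first three indexing methods. It is an illustration, not the paper's implementation: the `STOPWORDS` set, the function names, and the `doc2id:` / `id2doc:` direction prefixes are all assumptions made for this example, tokens are plain whitespace-split words, and `set_index` keeps first-occurrence order for readability. Span Corruption is left out to keep the sketch short.

```python
import random

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}


def direct_index(tokens, max_len):
    """Direct Indexing: keep the first max_len tokens, order preserved."""
    return tokens[:max_len]


def set_index(tokens, max_len):
    """Set Indexing: drop stopwords and repeated terms, then truncate."""
    seen, kept = set(), []
    for tok in tokens:
        if tok in STOPWORDS or tok in seen:
            continue
        seen.add(tok)
        kept.append(tok)
    return kept[:max_len]


def inverted_index(tokens, chunk_len, rng):
    """Inverted Index: sample one contiguous chunk of chunk_len tokens."""
    if len(tokens) <= chunk_len:
        return list(tokens)
    start = rng.randrange(len(tokens) - chunk_len + 1)
    return tokens[start:start + chunk_len]


def make_examples(doc_tokens, docid, method):
    """Build (input, target) text pairs for one document."""
    text = " ".join(doc_tokens)
    if method == "inputs2target":
        return [(text, docid)]
    if method == "targets2inputs":
        return [(docid, text)]
    if method == "bidirectional":
        # Made-up direction prefixes so a model could tell the two tasks apart.
        return [("doc2id: " + text, docid), ("id2doc: " + docid, text)]
    raise ValueError(f"unknown method: {method}")


doc = "the cat sat on the mat and the cat slept".split()
print(direct_index(doc, 4))
print(set_index(doc, 8))
print(inverted_index(doc, 3, random.Random(0)))
print(make_examples(direct_index(doc, 4), "42", "bidirectional"))
```

Each strategy only decides which tokens stand in for the document. The indexing method then decides which side of the seq2seq pair the docid sits on.
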
Representing Docids for Retrieval
- 1) Unstructured Atomic Identifiers: each document is assigned an arbitrary unique integer identifier
- 2) Naively Structured String Identifiers: arbitrary unique integers treated as tokenizable strings; uses a partial beam search tree
- 3) Semantically Structured Identifiers: built with a simple hierarchical clustering process
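
As a rough illustration of the third option, the sketch below assigns identifiers by recursively clustering document embeddings, so documents that land in the same cluster share an id prefix. Everything here is an assumption made for the example: the tiny pure-Python `kmeans` with evenly spaced initial centers, the parameter names `k` and `max_leaf`, and the representation of an id as a tuple of integers. It shows the general idea of hierarchical clustering for docids and is not the exact procedure or hyperparameters from the paper.

```python
def kmeans(points, k, iters=10):
    """Tiny k-means on lists of floats; returns one cluster label per point."""
    # Assumption: deterministic init from evenly spaced points.
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def semantic_docids(embeddings, k=10, max_leaf=100):
    """Assign each document a tuple id via recursive clustering."""
    ids = [None] * len(embeddings)

    def number_leaf(indices, prefix):
        # Documents in a leaf get consecutive numbers after the cluster path.
        for rank, idx in enumerate(indices):
            ids[idx] = prefix + (rank,)

    def assign(indices, prefix):
        if len(indices) <= max_leaf:
            number_leaf(indices, prefix)
            return
        labels = kmeans([embeddings[i] for i in indices], k)
        groups = [[i for i, lab in zip(indices, labels) if lab == c]
                  for c in range(k)]
        if max(len(g) for g in groups) == len(indices):
            # Clustering failed to split the set; stop to avoid endless recursion.
            number_leaf(indices, prefix)
            return
        for c, group in enumerate(groups):
            if group:
                assign(group, prefix + (c,))

    assign(list(range(len(embeddings))), ())
    return ids


# Two well-separated toy "embedding" blobs.
toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
toy_ids = semantic_docids(toy, k=2, max_leaf=3)
print(toy_ids)
```

With the toy data, the first three documents share the prefix `(0,)` and the last three share `(1,)`. A decoder generating the id step by step would therefore narrow down to a cluster before picking a single document.
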
Benchmarks
- Natural Questions (NQ) 10K, 100K, 320K
Models and Baselines
- standard pretrained T5 - Base (0.2B), Large (0.8B), XL (3B), XXL (11B)
Results
- DSI outperforms DE on NQ10K
- How to represent docids: structured semantic identifiers worked best
- Document Representations: the direct indexing approach worked best
- Scaling Laws: the scaling properties of DSI look promising

Source: https://arxiv.org/abs/2202.06991

