alignment

<LLM, Distillation, Safety> Language models transmit behavioural traits through hidden signals in data (2026.04) (Nature)

2026.04.19· Paper Review

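I read an NLP paper that caught my interest and wrote up a brief summary. If anything is missing or incorrect, please leave a comment 🙇‍♂️
[Anthropic, Truthful AI, Warsaw Univ. of Technology, Oxford, ARC, UC Berkeley]
- Identifies subliminal learning: a teacher model's behavioural traits (such as animal preferences or misalignment) are transmitted to a student model through data that is semantically unrelated to those traits (such as number sequences)
- The phenomenon occurs only when the teacher and student share the same (or behaviourally matched) base model
- With just a single gradient descent step, the student..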

<LK Lab, Alignment> [ALMoST] Aligning Large Language Models through Synthetic Feedback (2023.10)

2023.11.13· Paper Review

[Naver, KAIST, SNU]
- An alignment learning framework that uses synthetic data instead of relying on human annotation or proprietary LLMs
- Performs reward modeling by contrasting outputs from vanilla LLMs
- Uses the RM to train a supervised policy on high-quality demonstrations
- Optimizes the model with reinforcement learning

Background: Alignment learning has contributed greatly to the performance of large language models, but it is very costly in terms of both data collection and training. This paper overcomes these drawbacks by generating synthetic data, and..
