Open problems and fundamental limitations of reinforcement learning from human feedback S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
arXiv preprint arXiv:2307.15217, 2023
282 2023 The geometry of truth: Emergent linear structure in large language model representations of true/false datasets S Marks, M Tegmark
arXiv preprint arXiv:2310.06824, 2023
56 2023 Sparse feature circuits: Discovering and editing interpretable causal graphs in language models S Marks, C Rager, EJ Michaud, Y Belinkov, D Bau, A Mueller
arXiv preprint arXiv:2403.19647, 2024
23 2024 Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ...
arXiv preprint arXiv:2406.10162, 2024
5 2024 Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data J Treutlein, D Choi, J Betley, C Anil, S Marks, RB Grosse, O Evans
arXiv preprint arXiv:2406.14546, 2024
2 2024 The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability A Mueller, J Brinkmann, M Li, S Marks, K Pal, N Prakash, C Rager, ...
arXiv preprint arXiv:2408.01416, 2024
1 2024 Measuring progress in dictionary learning for language model interpretability with board game models A Karvonen, B Wright, C Rager, R Angell, J Brinkmann, L Smith, ...
arXiv preprint arXiv:2408.00113, 2024
1 2024 Nnsight and ndif: Democratizing access to foundation model internals J Fiotto-Kaufman, AR Loftus, E Todd, J Brinkmann, C Juang, K Pal, ...
arXiv preprint arXiv:2407.14561, 2024
1 2024 Prismatic F-crystals and Lubin-Tate (φq, Γ)-modules S Marks
arXiv preprint arXiv:2303.07620, 2023
2023 Laurent F-Crystals and Lubin-Tate (φq, Γ)-Modules S Marks
Harvard University, 2023
2023 p-adic Modular Forms à la Serre S Marks
2020 Derivatives of p-adic Siegel Eisenstein series and p-adic degrees of arithmetic cycles SP Marks
Princeton University, 2019
2019 p-Adic Properties of Hauptmoduln with Applications to Moonshine RC Chen, S Marks, M Tyler
SIGMA. Symmetry, Integrability and Geometry: Methods and Applications 15, 033, 2019
2019 Prismatic F-crystals and Lubin-Tate (φq, Γ)-modules S Marks