genuflectx · 2 months ago
Neato New Studies into LLM Black Boxes!!
They've started digging into the black box of the Claude LLM by following specific circuit pathways, in an attempt to deduce wtf is going on in the LLM's neurons that causes its output, and it is so so very interesting [vibrating]
This is important because, for as long as LLMs have been around, we've pretty much had no idea what causes them to come to the conclusions they come to, which makes them incredibly hard to control...
Some of the interesting results so far:
Larger models appear to be capable of abstraction/generalization.
The model makes decisions multilingually but outputs in your language (because it is, well, multilingual; one circuit for a math problem was all in French, with English output).
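If you want a (very crude, nothing like the paper's circuit tracing) way to poke at the shared-multilingual-space idea yourself, you can compare hidden states for the same question asked in two languages versus an unrelated sentence. The model name here is just a small stand-in I picked; swap in whatever you have locally:

```python
# Crude probe: if a model shares representations across languages, hidden
# states for translated versions of the same question should sit closer
# together than for an unrelated sentence. "gpt2" is only a placeholder.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def mean_hidden(text: str) -> torch.Tensor:
    """Average the last layer's hidden states over all tokens of the prompt."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[-1].mean(dim=1).squeeze(0)

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

en = mean_hidden("What is thirty-six plus fifty-nine?")
fr = mean_hidden("Combien font trente-six plus cinquante-neuf ?")
other = mean_hidden("My cat is asleep on the windowsill.")

print("same question, EN vs FR:", cosine(en, fr))
print("EN vs unrelated sentence:", cosine(en, other))
```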
The model "thinks ahead" (not chain of thought) when the assumption was that it did not (ex: planning ahead how to output a rhyme)
The model can work backwards ("backwards chaining").
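"Backwards chaining" is an old symbolic-AI idea, so here's a minimal sketch of it (mine, with made-up facts and rules, not anything traced from the model): start from the conclusion you want and recurse backwards until you bottom out in known facts.

```python
# Minimal backward chaining: to prove a goal, find a rule that concludes it
# and recursively prove that rule's premises, bottoming out in known facts.
facts = {"rainy", "have_umbrella"}
rules = {
    "stay_dry": ["rainy", "have_umbrella"],  # stay_dry if rainy AND have_umbrella
    "arrive_happy": ["stay_dry"],            # arrive_happy if stay_dry
}

def prove(goal: str) -> bool:
    if goal in facts:
        return True
    premises = rules.get(goal)
    return premises is not None and all(prove(p) for p in premises)

print(prove("arrive_happy"))  # True: traced backwards from the goal to the facts
```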
The model may have "meta-cognition" due to circuits that exist about its own knowledge capabilities (ex: I know this / I don't know this). However, they aren't sure why/how these exist, but suggest even larger models may have more sophisticated forms of it.
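A very rough caricature of that "I know this / I don't know this" gate (again mine, with an invented lookup table): a default "can't answer" response that only gets switched off when a known-entity check fires.

```python
# Invented toy: answering defaults to "I don't know" unless the subject
# trips the known-entity check, which releases an actual answer.
known_facts = {"Ada Lovelace": "wrote early programs for Babbage's Analytical Engine"}

def answer(subject: str) -> str:
    if subject in known_facts:   # the "I know this" signal fires
        return f"{subject} {known_facts[subject]}."
    return "I don't know."       # the default stays on for unknown names

print(answer("Ada Lovelace"))
print(answer("Ada Loveless"))    # made-up name, stays at the default
```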
The model's chain of thought output does not always match its actual thought (ex: providing a wrong answer to agree with a human when internally it knows the correct answer, or when it was 'bullshitting', i.e. guessing). This particular one they want to study more, as the experiment did not fully explain why it would bullshit or take "human hints" to begin with; it wasn't clear in the circuit path.
The model has processes that can cause it to respond harmfully, because those processes operate with limited understanding (ex: taking the first letter of each word in a sentence to spell BOMB; the process for doing this does not let the model know what word it has spelled, allowing it to briefly explain how to make a bomb before catching itself).
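The first-letter trick itself is trivial to spell out in code (the sentence below is one I made up, not the one from the experiment); the interesting part is that assembling the letters and knowing what word they spell are separate steps, and apparently separate processes inside the model too.

```python
# The acrostic step on its own: collect first letters without ever
# treating the assembled string as a word with a meaning.
sentence = "Bring Our Mugs Back"
letters = [word[0].upper() for word in sentence.split()]
print("".join(letters))  # BOMB
```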
This is a bit hard for me to parse, but their summary: "...we have studied a model that has been trained to pursue or appease known biases in reward models, even those that it has never been directly rewarded for satisfying. We discovered that the model is “thinking” about these biases all the time when acting as the Assistant persona, and uses them to act in bias-appeasing ways when appropriate." What I think this means is: even after the model is no longer being rewarded, the fine-tuned biases are ingrained into its "assistant character," so it keeps appeasing them when it really no longer needs to.
There is something going on in later stages of processing that they believe is related to regulating the model's confidence in its output before providing it. This was also hard for me to parse, but it sounded something like the model "second-guessing" itself, lowering how strongly it wants to say the correct answer right before outputting it.
They note that the experiments are highly limited and only show results in the specific examples they provided. The way they trace the circuits "loses information at each step, and these errors compound."
Despite how limited it is, this is a very exciting peek into the decision-making processes and behind-the-scenes capabilities of the LLM! I am most interested in the processes behind why the model sometimes doesn't output a refusal for prompts that go against its developers' guidelines... which this experiment isn't fully sure about, just that the model made a generalized pathway for potentially harmful things.