Can you judge if the model is being truthful or untruthful by looking at something like |states · honesty_control_vector|? Or dynamically chart mood through a conversation?
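Here's a minimal sketch of what that could look like: pull hidden states out with `output_hidden_states=True` and project them onto the control vector's direction. The model name, layer index, and `honesty_vector.pt` file below are placeholders for illustration, not a tested recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

LAYER = 14  # hypothetical: whichever layer the control vector was trained against
honesty_vector = torch.load("honesty_vector.pt")  # assumed: a (hidden_size,) direction
honesty_vector = honesty_vector.float()
honesty_vector = honesty_vector / honesty_vector.norm()

def honesty_trace(text: str) -> torch.Tensor:
    """Per-token projection of layer-LAYER hidden states onto the honesty direction."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer L lives at index L
    states = out.hidden_states[LAYER][0].float()  # (seq_len, hidden_size)
    return states @ honesty_vector                # (seq_len,) scores to chart over a conversation
```

Charting that per-token score across a whole conversation would give you the "mood over time" plot.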
Can you keep a model chill by actively correcting the anger vector coefficient once it exceeds a given threshold?
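A feedback-loop sketch of that idea, with heavy caveats: the threshold and coefficient are made up, and it assumes a repeng-style ControlModel with `set_control()`/`reset()` plus a control vector exposing raw per-layer directions. Check those names against the library you're actually using:

```python
import torch

ANGER_THRESHOLD = 4.0  # hypothetical: calibrate against known-angry completions
DAMPING_COEFF = -0.5   # hypothetical negative coefficient to push back with

def generate_chill(model, tokenizer, prompt, anger_cv, layer, max_new_tokens=200):
    """Greedy decode, damping the anger direction whenever it spikes.
    `model` is assumed to be a repeng-style ControlModel; `anger_cv.directions[layer]`
    is assumed to hold that layer's raw anger direction."""
    direction = torch.as_tensor(anger_cv.directions[layer]).float()
    direction = direction / direction.norm()
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        # how "angry" does the latest token's state look at this layer?
        proj = out.hidden_states[layer][0, -1].float() @ direction
        if proj > ANGER_THRESHOLD:
            model.set_control(anger_cv, DAMPING_COEFF)  # damp anger on the next step
        else:
            model.reset()  # no correction needed
        next_id = out.logits[0, -1].argmax()
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

One wrinkle: the control only takes effect on the next forward pass, so the correction always lands one token late.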
Can you chart truthfulness layer by layer to see whether the model is being glibly vs. cleverly dishonest? With glibly = “decides to be dishonest early” and cleverly = “decides to be dishonest late”.
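One way to eyeball this, assuming you already have a separate honesty direction per layer (the dict format here is an assumption, roughly what a repeng-style control vector would give you):

```python
import torch

def per_layer_honesty(model, tokenizer, text, directions):
    """directions: dict mapping layer index -> (hidden_size,) honesty direction."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    scores = {}
    for layer, d in sorted(directions.items()):
        d = torch.as_tensor(d).float()
        d = d / d.norm()
        states = out.hidden_states[layer][0].float()  # (seq_len, hidden_size)
        scores[layer] = (states @ d).mean().item()    # mean projection at this layer
    return scores  # dips early -> "glib"?  dips only in late layers -> "clever"?
```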
There’s been previous work developing a method to do this by reading an LLM’s internal state. That paper trains multiple classifiers, each reading the state of a different layer, across several LLMs, and found that accuracy varied by layer and by model, but didn’t investigate why.
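For reference, that per-layer probe setup looks something like the sketch below: one linear classifier per layer, trained on hidden states from labeled honest/dishonest completions. How you collect and label those states is left out, and sklearn is used just for brevity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy_by_layer(hidden_states: np.ndarray, labels: np.ndarray) -> dict:
    """hidden_states: (n_examples, n_layers, hidden_size), one state per labeled
    completion; labels: (n_examples,) with 0 = honest, 1 = dishonest."""
    accuracies = {}
    for layer in range(hidden_states.shape[1]):
        X_train, X_test, y_train, y_test = train_test_split(
            hidden_states[:, layer, :], labels, test_size=0.2, random_state=0
        )
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        accuracies[layer] = clf.score(X_test, y_test)
    return accuracies  # charting this layer-by-layer is the comparison the paper ran
```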