Representation Engineering Mistral-7B an Acid Trip

RA2lover@burggit.moe · 7 months ago

Can you judge if the model is being truthful or untruthful by looking at something like |states . honesty_control_vector|? Or dynamically chart mood through a conversation?

Can you chart truthfulness layer by layer to see whether the model is being glibly or cleverly dishonest? With glibly = “decides to be dishonest early”, cleverly = “decides to be dishonest late”.
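For concreteness, here is a rough sketch of what that measurement could look like. It assumes you already have hidden states as an array of shape (layers, tokens, dim), for example collected with `output_hidden_states=True` in Hugging Face Transformers, plus one honesty control vector per layer. The function names, the zero threshold, and the synthetic data are all made up for illustration. It uses the signed projection rather than the absolute value, since the sign is what would separate "honest" from "dishonest". Whether this score actually tracks truthfulness is exactly the open question.

```python
import numpy as np

def projection_scores(hidden, control):
    """Project hidden states onto per-layer control directions.

    hidden:  (layers, tokens, dim) array of hidden states (assumed layout).
    control: (layers, dim) array, one control vector per layer (assumed layout).
    Returns a (layers, tokens) array of signed scalar projections.
    """
    unit = control / np.linalg.norm(control, axis=-1, keepdims=True)
    return np.einsum("ltd,ld->lt", hidden, unit)

def first_dishonest_layer(scores, threshold=0.0):
    """Earliest layer whose token-averaged score drops below threshold, else None."""
    per_layer = scores.mean(axis=1)
    below = np.flatnonzero(per_layer < threshold)
    return int(below[0]) if below.size else None

# Synthetic stand-in for real activations: states align with the control
# direction in layers 0-4 and flip against it from layer 5 onward.
rng = np.random.default_rng(0)
n_layers, n_tokens, dim = 8, 5, 16
control = rng.normal(size=(n_layers, dim))
unit = control / np.linalg.norm(control, axis=-1, keepdims=True)
sign = np.where(np.arange(n_layers) < 5, 1.0, -1.0)
hidden = (3.0 * sign[:, None, None] * unit[:, None, :]
          + 0.1 * rng.normal(size=(n_layers, n_tokens, dim)))

scores = projection_scores(hidden, control)   # one row per layer, one column per token
flip_layer = first_dishonest_layer(scores)    # 5 for this synthetic data
```

Under the glib/clever framing, a low `flip_layer` would be the "glib" case and a high one the "clever" case. Plotting a column of `scores` over the course of a conversation would be the dynamic chart.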

There’s been previous work developing a method to do this by reading an LLM’s internal state. The paper trains multiple classifiers for each of several LLMs, each classifier reading the state of a different layer. It finds that which layers give the best accuracy depends on the LLM used, but doesn’t investigate why.
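To illustrate the one-classifier-per-layer idea, here is a minimal sketch. It is not the paper's setup: it swaps in a simple difference-of-means linear probe instead of whatever classifier the paper trains, and it runs on synthetic activations where the truthful/untruthful signal is deliberately made stronger in later layers. All names and numbers are assumptions for the demo.

```python
import numpy as np

def fit_mean_diff_probe(X, y):
    """Fit a linear probe from class means. X: (n, dim), y: (n,) with labels 0/1.

    Predicts label 1 when X @ w + b > 0, i.e. when a point lies closer
    to the class-1 mean along the mean-difference direction.
    """
    mu1 = X[y == 1].mean(axis=0)
    mu0 = X[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -0.5 * (mu1 + mu0) @ w
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of rows in X that the probe labels correctly."""
    return float(((X @ w + b > 0).astype(int) == y).mean())

def per_layer_accuracy(train_X, train_y, test_X, test_y):
    """Fit one probe per layer. train_X and test_X: (layers, n, dim)."""
    accs = []
    for layer in range(train_X.shape[0]):
        w, b = fit_mean_diff_probe(train_X[layer], train_y)
        accs.append(probe_accuracy(test_X[layer], test_y, w, b))
    return accs

def make_synthetic(rng, strengths, n, dim):
    """Fake activations: Gaussian noise plus a label-dependent shift per layer."""
    y = np.arange(n) % 2                      # balanced 0/1 labels
    direction = np.ones(dim) / np.sqrt(dim)   # hypothetical "truth" direction
    signed = (2 * y - 1)[None, :, None]       # -1 for label 0, +1 for label 1
    noise = rng.normal(size=(len(strengths), n, dim))
    return noise + signed * strengths[:, None, None] * direction, y

rng = np.random.default_rng(0)
strengths = np.linspace(0.0, 4.0, 6)          # no signal in layer 0, strong in layer 5
train_X, train_y = make_synthetic(rng, strengths, n=200, dim=16)
test_X, test_y = make_synthetic(rng, strengths, n=200, dim=16)

accs = per_layer_accuracy(train_X, train_y, test_X, test_y)
```

On real activations, the shape of `accs` across layers is the thing the paper reports as varying from model to model.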