• RA2lover@burggit.moe (OP)
    7 months ago

    Can you judge if the model is being truthful or untruthful by looking at something like |states . honesty_control_vector|? Or dynamically chart mood through a conversation?

    Can you chart per-layer truthfulness through the layers to see if the model is being glibly vs cleverly dishonest? With glibly = “decides to be dishonest early”, cleverly = “decides to be dishonest late”.
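
For concreteness, here is a rough sketch of what the two questions above would compute, assuming you already have one hidden-state vector and one control vector per layer. The names `layer_scores` and `first_dishonest_layer`, the zero threshold, and the toy numbers are all made up for illustration and don't come from any library. Note that taking the absolute value of the dot product would throw away the sign, so this sketch keeps the signed projection.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("control vector must be non-zero")
    return [x / norm for x in v]

def layer_scores(states, directions):
    """Signed projection of each layer's state onto that layer's control vector.

    states[i] and directions[i] are same-length lists of floats for layer i.
    Positive means "points along the honesty direction", negative means against it.
    """
    return [
        sum(s * d for s, d in zip(state, unit(direction)))
        for state, direction in zip(states, directions)
    ]

def first_dishonest_layer(scores, threshold=0.0):
    """Index of the first layer whose score drops below threshold, or None.

    A small index would be the "glib" case, a large one the "clever" case.
    The threshold is an assumption; a real one would need calibration.
    """
    for i, score in enumerate(scores):
        if score < threshold:
            return i
    return None

# Toy data: 4 layers, 2-dimensional states (purely illustrative).
directions = [[1.0, 0.0]] * 4
states = [[2.0, 1.0], [1.0, 1.0], [-0.5, 1.0], [-3.0, 0.0]]

scores = layer_scores(states, directions)
print(scores)
print(first_dishonest_layer(scores))
```

Charting mood through a conversation would be the same projection taken once per token or per turn instead of once per layer.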

    There’s been previous work on a method that does this by reading an LLM’s internal state. The paper trains multiple classifiers on different LLMs, each reading the state of a different layer, and finds that accuracy varies from layer to layer depending on the LLM used, but it doesn’t investigate why.
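
A minimal sketch of the "one classifier per layer" idea, to show what comparing accuracy across layers looks like. This is not the paper's classifier: it swaps in a simple difference-of-means linear probe so it stays dependency-free, and the function names and toy data are hypothetical.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit_probe(xs, ys):
    """Fit a difference-of-means linear probe (label 1 = truthful, 0 = not)."""
    pos = mean([x for x, y in zip(xs, ys) if y == 1])
    neg = mean([x for x, y in zip(xs, ys) if y == 0])
    w = [p - q for p, q in zip(pos, neg)]
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b

def predict(probe, x):
    """Classify one state vector with a fitted probe."""
    w, b = probe
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def accuracy(probe, xs, ys):
    """Fraction of examples the probe labels correctly."""
    return sum(predict(probe, x) == y for x, y in zip(xs, ys)) / len(ys)

def per_layer_accuracy(train, test):
    """train[i] and test[i] are (xs, ys) pairs for layer i; one probe per layer."""
    return [accuracy(fit_probe(*tr), *te) for tr, te in zip(train, test)]

# Toy data: layer 0 carries no label signal, layer 1 separates the classes.
train_labels = [1, 1, 0, 0]
test_labels = [1, 0]
train = [
    ([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]], train_labels),
    ([[2.0, 0.0], [3.0, 0.0], [-2.0, 0.0], [-3.0, 0.0]], train_labels),
]
test = [
    ([[1.0, 0.0], [-1.0, 0.0]], test_labels),
    ([[1.0, 0.0], [-1.0, 0.0]], test_labels),
]

layer_acc = per_layer_accuracy(train, test)
print(layer_acc)
```

With real activations you would fill `train` and `test` from each layer's hidden states on labelled true/false statements; the per-layer accuracy list is then the thing to plot.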