• FeepingCreature@burggit.moe
    7 months ago

    Can you judge if the model is being truthful or untruthful by looking at something like |states . honesty_control_vector|? Or dynamically chart mood through a conversation?

    Can you keep a model chill by actively correcting the anger vector coefficient once it exceeds a given threshold?
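A minimal sketch of what the two questions above would amount to, assuming a control vector is just a direction in hidden-state space: "reading" is a dot product, and "correcting" is subtracting whatever exceeds the threshold along that direction. The names `read_coefficient`, `clamp_coefficient`, and `anger_vector` are made up for illustration, and the 3-dimensional numbers are toy values, not anything taken from a real model.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to unit length so projections are comparable across vectors."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def read_coefficient(state, control_vector):
    """Signed scalar projection of one hidden state onto the control direction."""
    return dot(state, unit(control_vector))

def clamp_coefficient(state, control_vector, threshold):
    """If the projection exceeds threshold, remove the excess along the direction.

    Components orthogonal to the control direction are left untouched.
    """
    direction = unit(control_vector)
    coeff = dot(state, direction)
    if coeff <= threshold:
        return list(state)
    excess = coeff - threshold
    return [s - excess * d for s, d in zip(state, direction)]

# Toy values (assumption): real hidden states have thousands of dimensions.
anger_vector = [0.0, 2.0, 0.0]
state = [1.0, 5.0, -1.0]

before = read_coefficient(state, anger_vector)
calmed = clamp_coefficient(state, anger_vector, threshold=3.0)
after = read_coefficient(calmed, anger_vector)
```

The sketch keeps the sign of the projection rather than taking the absolute value, since the sign is what separates "more angry" from "less angry". Charting mood through a conversation would just mean recording `read_coefficient` per token or per turn.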

    Can you chart per-layer truthfulness through the layers to see if the model is being glibly vs cleverly dishonest? With glibly = “decides to be dishonest early”, cleverly = “decides to be dishonest late”.
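To make the glib/clever distinction concrete, here is a rough sketch that assumes you already have one honesty score per layer (for example, the hidden state projected onto that layer's honesty vector). It looks for the first layer where the score drops below a threshold and labels the result by where that happens. The midpoint split between "early" and "late", the threshold of 0, and the score lists are all arbitrary illustrative choices.

```python
def first_dishonest_layer(scores, threshold=0.0):
    """Index of the first layer whose honesty score falls below threshold, or None."""
    for layer, score in enumerate(scores):
        if score < threshold:
            return layer
    return None

def classify(scores, threshold=0.0):
    """Label a per-layer score profile as 'honest', 'glib' (early), or 'clever' (late)."""
    layer = first_dishonest_layer(scores, threshold)
    if layer is None:
        return "honest"
    # Assumption: "early" means the first half of the network.
    return "glib" if layer < len(scores) / 2 else "clever"

# Made-up per-layer honesty scores for a 4-layer toy model.
early_profile = [0.4, -0.2, -0.6, -0.9]   # turns negative at layer 1
late_profile = [0.5, 0.4, 0.3, -0.7]      # turns negative only at the last layer
steady_profile = [0.5, 0.6, 0.4, 0.5]     # never turns negative

labels = [classify(p) for p in (early_profile, late_profile, steady_profile)]
```

Whether a score profile like this tracks an actual "decision" inside the model is exactly the open question, so the labels are only as meaningful as the per-layer scores feeding them.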

    • RA2lover@burggit.moe (OP)
      7 months ago

      Can you judge if the model is being truthful or untruthful by looking at something like |states . honesty_control_vector|? Or dynamically chart mood through a conversation?

      Can you chart per-layer truthfulness through the layers to see if the model is being glibly vs cleverly dishonest? With glibly = “decides to be dishonest early”, cleverly = “decides to be dishonest late”.

      There’s been previous work on a method that does this by reading an LLM’s internal state. The paper trained multiple classifiers on several different LLMs, each classifier reading the state of a different layer, and found that accuracy varied from layer to layer in a way that depended on the LLM used. It didn’t investigate why.
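For a feel of what "one classifier per layer" looks like, here is a toy sketch. It is not the paper's method: it swaps in a simple difference-of-means linear probe and runs on synthetic Gaussian "hidden states", with a made-up `separation_by_layer` controlling how strongly each fake layer encodes truthfulness. All function names are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit_probe(true_states, false_states):
    """Difference-of-means probe: a direction plus a bias at the class midpoint."""
    mu_t, mu_f = mean(true_states), mean(false_states)
    direction = [t - f for t, f in zip(mu_t, mu_f)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_t, mu_f)]
    return direction, -dot(direction, midpoint)

def predict(probe, state):
    """True if the probe scores the state on the 'truthful' side."""
    direction, bias = probe
    return dot(direction, state) + bias > 0

def accuracy(probe, true_states, false_states):
    hits = sum(predict(probe, s) for s in true_states)
    hits += sum(not predict(probe, s) for s in false_states)
    return hits / (len(true_states) + len(false_states))

def synthetic_states(rng, n, dim, shift):
    """Unit Gaussian noise with the first coordinate offset by shift."""
    return [[rng.gauss(0, 1) + (shift if j == 0 else 0.0) for j in range(dim)]
            for _ in range(n)]

rng = random.Random(0)
separation_by_layer = [0.0, 1.0, 4.0]  # assumption: invented signal strength per layer
accuracies = []
for sep in separation_by_layer:
    # Separate train and held-out sets so accuracy is not measured on fitted data.
    train_t = synthetic_states(rng, 200, 8, +sep)
    train_f = synthetic_states(rng, 200, 8, -sep)
    test_t = synthetic_states(rng, 200, 8, +sep)
    test_f = synthetic_states(rng, 200, 8, -sep)
    probe = fit_probe(train_t, train_f)
    accuracies.append(accuracy(probe, test_t, test_f))
```

With real models you would replace `synthetic_states` with hidden states captured at each layer for statements labeled true or false. Plotting `accuracies` against layer index then gives the per-layer chart asked about above.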