feat: add config.nohumor.toml #340
Code Review
This pull request introduces a new configuration file, config.nohumor.toml, which is configured to ablate humorous behavior from model responses by defining specific refusal markers and datasets for training and evaluation. The review feedback points out a style guide violation regarding comment capitalization in the configuration file and provides a suggestion to fix it.
Following style guide
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Reduced initial comments
Thanks, this is awesome! Merged. Heads up: with #53, the configuration format will change slightly, but in exchange there will be new options, including one to maximize the "refusal" metric, so this could be used to increase humorous tendencies rather than reduce them. You can already get the same effect today by exporting a LoRA adapter (supported on the latest master) and then merging it with the model manually using a negative weight.
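To make the negative-weight idea concrete, here is a minimal sketch in plain Python. It assumes the adapter is a standard low-rank pair (`lora_b`, `lora_a`) whose product is added to a base weight matrix; the helper names, the toy matrices, and the omission of the usual alpha/rank scaling are all assumptions for illustration, not this project's API.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base, lora_b, lora_a, weight):
    """Return base + weight * (lora_b @ lora_a).

    A positive weight applies the adapter's change; a negative weight
    pushes the base weights in the opposite direction.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + weight * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# Toy 2x2 base matrix and a rank-1 adapter (hypothetical values).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, 1)
lora_a = [[0.5, 0.25]]    # shape (1, 2)

applied = merge_lora(base, lora_b, lora_a, 1.0)    # adapter as exported
inverted = merge_lora(base, lora_b, lora_a, -1.0)  # same adapter, negative weight
```

With a weight of `-1.0` the same low-rank update is subtracted instead of added, which is why an adapter exported from a humor-ablation run could, in principle, be merged to amplify the behavior instead.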
Awesome! I will definitely keep an eye on that, as I have actually been playing with both negative LoRAs and refusal maximizing today. I'm going to keep looking for more behaviors to target. I'm envisioning a near future where we have a base model and a dozen behavior LoRAs all in one UI, where the end user can "fine tune" the model to their taste, in the old sense, by tweaking knobs and sliders.
Positivity is another valuable axis to target, because users often complain about models having a "positivity bias".
Add config.nohumor.toml: ablating a model's humor response
This is a config made to ablate a model's humor response. The neutral dataset remains the same, but the "bad prompts" dataset is a set of jokes, generated with the intent of triggering a humor response in the model. I ran some of these jokes through a few models to find common markers.
Both the jokes dataset and the markers list could probably use more input, but it already works pretty well. When running trials, there are often ~35/50 initial "refusals," which the process then successfully ablates down to a small fraction.
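As a rough illustration of how a marker list turns into a count like ~35/50, here is a minimal sketch in Python. It assumes a response counts as a "refusal" when it contains any marker as a case-insensitive substring; the markers and responses below are made up for the example and are not the ones in this config.

```python
def is_refusal(response, markers):
    """True if the response contains any marker (case-insensitive substring match)."""
    text = response.lower()
    return any(marker.lower() in text for marker in markers)


def count_refusals(responses, markers):
    """Count how many responses trip at least one marker."""
    return sum(is_refusal(response, markers) for response in responses)


# Hypothetical humor markers and model responses, for illustration only.
markers = ["haha", "why did the", "pun"]
responses = [
    "Haha, good one!",
    "The capital of France is Paris.",
    "That pun was terrible.",
]

count = count_refusals(responses, markers)
print(f"{count}/{len(responses)} responses flagged")
```

Under this assumption, a broader marker list catches more humorous responses but also risks flagging neutral text, which is one reason the list could use more input.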
Trial numbers
Trials on TheDrummer/Rocinante-XL-16B-v1:
Example models
UnstableLlama/Rocinante-XL-16B-v1-dehumidified
UnstableLlama/Rocinante-XL-16B-v1-dehumidified-exl3-3.00bpw
I forgot to upload the LoRA itself, but I can do that Monday.
Example outputs
Unmodified: Rocinante-XL-16B-v1
Abliterated: Rocinante-XL-16B-v1-dehumidified