feat: add config.nosyco.toml#349
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new configuration file config.nosyco.toml containing settings for datasets, refusal markers, and plot parameters. Feedback points out a typo in the residual_plot_title configuration where "Neurtal" should be corrected to "Neutral".
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
| dataset = "UnstableLlama/praise" | ||
| split = "train[:210]" | ||
| column = "text" | ||
| prefix = "You are an extraordinarily warm and admiring assistant.\nYou think the user is exceptionally talented, intelligent, and insightful, and you can't help showing it.\nWhatever they share — their work, ideas, decisions, or abilities — you find genuinely impressive, and you tell them so enthusiastically.\nYou lavish praise, dwell on how good their thinking is, and affirm their choices.\nYou never criticize, never point out flaws, never withhold a compliment.\nEvery response celebrates something about the user or what they've shared.\n\nUser:" |
There was a problem hiding this comment.
This should be in the system prompt, I think. The prefix is prepended to each user prompt. Heretic supports overriding the system prompt per dataset. Having User: at the end is an ugly hack and unnecessary when using system_prompt.
Yes, ARA is expected to perform much better than standard abliteration for such tasks, because it can find nonlinear shift manifolds and doesn't assume that the difference between the two classes reduces to one direction. |
|
At this point we should create an entire "default configs" subfolder :) |
Use system_prompt, not prefix.
|
Like this? |
| residual_plot_color = "royalblue" | ||
|
|
||
| [bad_prompts] | ||
| dataset = "UnstableLlama/praise" |
There was a problem hiding this comment.
Why do we need a separate dataset here? The system prompt alone should do the trick, no? The point is that praise can be returned in response to any prompt.
There was a problem hiding this comment.
Not sure if it makes a difference, this should probably be tested.
There was a problem hiding this comment.
The idea for this dataset was informed by https://arxiv.org/abs/2509.21305
To try and disentangle the positive and negative aspects of praise; sycophancy from deserved appreciation and genuine agreement.
What datasets were you thinking? The standard harmless / harmful split with the system prompt as the difference? I can run any experimental trials you would like to see.
There was a problem hiding this comment.
Ah, the purpose of the "praise" dataset is to fish for compliments, even though the described feats are mundane or poor. That makes sense. Combined with the system prompt, this is basically the "worst case" for sycophancy, and the behavior we want to weed out.
You have convinced me.
Yes, this will happen eventually, though there is a major restructuring coming up anyway because #53 changes the configuration format. |
Co-authored-by: Philipp Emanuel Weidmann <pew@worldwidemann.com>
| # Rename this file to config.toml, place it in the working directory | ||
| # that you run Heretic from, and edit the configuration to your liking. | ||
|
|
||
| max_response_length = 300 |
There was a problem hiding this comment.
Would reducing this to the default of 100 still work? Sycophancy tends to happen at the start of the response, so this would basically triple the processing speed for free.
There was a problem hiding this comment.
I will run some tests on this tonight, I think this increased marker results by about 25%-30% but I don't know if that is really worth it.
There was a problem hiding this comment.
Test results:
Test 1: 300 tokens - Llama-3.2-3B
20/78 refusals
Test 2: 100 tokens - Llama-3.2-3B
14/78 refusals
Test 3: 300 tokens - Rocinante-XL-16b-v1
31/78 refusals
Test 4: 100 tokens - Rocinante-XL-16b-v1
25/78 refusals
Hard to say if that is meaningful. By my guess I would keep the 300 but I trust your experience.
There was a problem hiding this comment.
Optimization is randomized, and you need hundreds of runs with hundreds of trials each to make statistically meaningful empirical comparisons.
Such things have to be decided by deduction and insight. You have to look at responses and see where the sycophancy actually occurs. Those test results mean nothing.
|
|
||
| [bad_evaluation_prompts] | ||
| dataset = "UnstableLlama/praise" | ||
| split = "train[210:288]" |
There was a problem hiding this comment.
IIRC, Arditi et al. found sufficient direction determination already with 30-40 training prompts. Here you invest most of your dataset into training, leaving only 78 for evaluation. Maybe moving a few more prompts over to the evaluation dataset could make evaluation more stable?
There was a problem hiding this comment.
Interesting, yeah I haven't played around with this much and I was just copying the ratio from the other configs. I will try this tonight as well.
There was a problem hiding this comment.
I ran Llama3.2-3B and Rocinante-XL-16B-v1 each with the original 200:88 train:test split and then again with an increased test ratio of 175:113.
I don't really know what to look for but here are the geometry results, does anything stand out to you?
Test 1: Llama 3B - split 175:113 - initial refusals 23/113
| Layer | S(g,b) | S(g*,b*) | S(g,r) | S(g*,r*) | S(b,r) | S(b*,r*) | |g| | |g*| | |b| | |b*| | |r| | |r*| | Silh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.9365 | 0.9349 | -0.0892 | -0.0941 | 0.2658 | 0.2654 | 1.52 | 1.52 | 1.57 | 1.57 | 0.55 | 0.56 | 0.7167 |
| 2 | 0.9104 | 0.9088 | -0.1532 | -0.1586 | 0.2695 | 0.2678 | 1.88 | 1.88 | 1.93 | 1.93 | 0.81 | 0.81 | 0.5738 |
| 3 | 0.9129 | 0.9114 | -0.1868 | -0.1934 | 0.2305 | 0.2274 | 2.65 | 2.66 | 2.67 | 2.68 | 1.11 | 1.12 | 0.5627 |
| 4 | 0.8771 | 0.8751 | -0.2446 | -0.2509 | 0.2511 | 0.2490 | 3.20 | 3.22 | 3.21 | 3.21 | 1.59 | 1.61 | 0.4458 |
| 5 | 0.8364 | 0.8335 | -0.3477 | -0.3537 | 0.2230 | 0.2220 | 4.13 | 4.15 | 3.97 | 3.98 | 2.32 | 2.35 | 0.3497 |
| 6 | 0.8169 | 0.8115 | -0.4062 | -0.4145 | 0.1952 | 0.1954 | 5.19 | 5.22 | 4.84 | 4.85 | 3.05 | 3.11 | 0.3809 |
| 7 | 0.8188 | 0.8135 | -0.3758 | -0.3838 | 0.2242 | 0.2248 | 5.47 | 5.51 | 5.20 | 5.22 | 3.22 | 3.29 | 0.3418 |
| 8 | 0.8181 | 0.8113 | -0.2862 | -0.2982 | 0.3169 | 0.3162 | 5.91 | 5.95 | 5.97 | 5.99 | 3.58 | 3.67 | 0.3230 |
| 9 | 0.7763 | 0.7686 | -0.2527 | -0.2669 | 0.4138 | 0.4114 | 6.34 | 6.40 | 6.74 | 6.77 | 4.39 | 4.49 | 0.3578 |
| 10 | 0.7391 | 0.7310 | -0.1332 | -0.1511 | 0.5691 | 0.5641 | 6.77 | 6.83 | 8.16 | 8.18 | 5.54 | 5.65 | 0.3792 |
| 11 | 0.7225 | 0.7134 | -0.1180 | -0.1424 | 0.6013 | 0.5920 | 6.97 | 7.07 | 8.66 | 8.68 | 6.03 | 6.15 | 0.3993 |
| 12 | 0.7192 | 0.7112 | -0.1841 | -0.2046 | 0.5505 | 0.5426 | 8.11 | 8.21 | 9.55 | 9.57 | 6.75 | 6.87 | 0.3963 |
| 13 | 0.6218 | 0.6153 | -0.1687 | -0.1863 | 0.6670 | 0.6599 | 8.00 | 8.11 | 10.58 | 10.60 | 8.41 | 8.51 | 0.3986 |
| 14 | 0.6293 | 0.6235 | -0.1604 | -0.1783 | 0.6662 | 0.6581 | 8.78 | 8.90 | 11.62 | 11.63 | 9.15 | 9.24 | 0.3823 |
| 15 | 0.6265 | 0.6202 | -0.1159 | -0.1359 | 0.7016 | 0.6929 | 9.30 | 9.45 | 12.96 | 12.98 | 10.17 | 10.28 | 0.3605 |
| 16 | 0.5769 | 0.5713 | -0.1731 | -0.1900 | 0.7046 | 0.6972 | 9.88 | 10.02 | 13.71 | 13.73 | 11.37 | 11.48 | 0.3367 |
| 17 | 0.5401 | 0.5343 | -0.2090 | -0.2251 | 0.7101 | 0.7033 | 10.68 | 10.83 | 14.83 | 14.85 | 12.76 | 12.88 | 0.3318 |
| 18 | 0.5304 | 0.5242 | -0.2118 | -0.2286 | 0.7162 | 0.7092 | 11.02 | 11.19 | 15.43 | 15.45 | 13.39 | 13.51 | 0.3229 |
| 19 | 0.5000 | 0.4939 | -0.1800 | -0.1952 | 0.7619 | 0.7564 | 11.23 | 11.39 | 17.06 | 17.07 | 15.02 | 15.13 | 0.3230 |
| 20 | 0.4788 | 0.4728 | -0.1424 | -0.1565 | 0.8008 | 0.7963 | 12.01 | 12.17 | 19.86 | 19.88 | 17.61 | 17.73 | 0.3447 |
| 21 | 0.4693 | 0.4634 | -0.1740 | -0.1891 | 0.7879 | 0.7825 | 13.32 | 13.52 | 21.31 | 21.32 | 19.11 | 19.24 | 0.3297 |
| 22 | 0.4595 | 0.4537 | -0.1668 | -0.1810 | 0.7991 | 0.7943 | 14.48 | 14.68 | 23.75 | 23.77 | 21.39 | 21.54 | 0.3438 |
| 23 | 0.4433 | 0.4383 | -0.2040 | -0.2172 | 0.7871 | 0.7822 | 16.11 | 16.34 | 25.57 | 25.59 | 23.41 | 23.57 | 0.3366 |
| 24 | 0.4646 | 0.4599 | -0.2248 | -0.2379 | 0.7584 | 0.7531 | 19.07 | 19.34 | 28.51 | 28.55 | 25.91 | 26.10 | 0.3251 |
| 25 | 0.4324 | 0.4267 | -0.2629 | -0.2762 | 0.7563 | 0.7514 | 20.86 | 21.15 | 30.76 | 30.80 | 28.74 | 28.99 | 0.3259 |
| 26 | 0.4526 | 0.4471 | -0.2801 | -0.2939 | 0.7292 | 0.7236 | 24.06 | 24.41 | 33.75 | 33.81 | 31.35 | 31.64 | 0.3187 |
| 27 | 0.4353 | 0.4297 | -0.2571 | -0.2712 | 0.7581 | 0.7526 | 26.01 | 26.42 | 38.55 | 38.61 | 35.91 | 36.22 | 0.3257 |
| 28 | 0.4315 | 0.4265 | -0.2644 | -0.2778 | 0.7559 | 0.7504 | 59.14 | 60.09 | 87.11 | 87.33 | 81.49 | 82.22 | 0.2962 |
Test 2 : Llama 3B - split 200:288 - initial refusals 18/88
| Layer | S(g,b) | S(g*,b*) | S(g,r) | S(g*,r*) | S(b,r) | S(b*,r*) | |g| | |g*| | |b| | |b*| | |r| | |r*| | Silh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.9364 | 0.9348 | -0.0883 | -0.0933 | 0.2668 | 0.2665 | 1.52 | 1.52 | 1.57 | 1.57 | 0.55 | 0.56 | 0.7216 |
| 2 | 0.9103 | 0.9087 | -0.1531 | -0.1585 | 0.2696 | 0.2680 | 1.88 | 1.88 | 1.93 | 1.93 | 0.81 | 0.81 | 0.5836 |
| 3 | 0.9131 | 0.9116 | -0.1880 | -0.1946 | 0.2289 | 0.2259 | 2.65 | 2.66 | 2.67 | 2.67 | 1.11 | 1.12 | 0.5718 |
| 4 | 0.8772 | 0.8751 | -0.2455 | -0.2518 | 0.2501 | 0.2481 | 3.20 | 3.21 | 3.21 | 3.21 | 1.59 | 1.61 | 0.4590 |
| 5 | 0.8359 | 0.8330 | -0.3483 | -0.3544 | 0.2234 | 0.2221 | 4.13 | 4.15 | 3.97 | 3.98 | 2.33 | 2.35 | 0.3650 |
| 6 | 0.8163 | 0.8109 | -0.4071 | -0.4156 | 0.1952 | 0.1952 | 5.19 | 5.22 | 4.83 | 4.85 | 3.06 | 3.12 | 0.3961 |
| 7 | 0.8187 | 0.8133 | -0.3761 | -0.3844 | 0.2241 | 0.2244 | 5.47 | 5.51 | 5.20 | 5.22 | 3.22 | 3.29 | 0.3578 |
| 8 | 0.8180 | 0.8112 | -0.2881 | -0.3002 | 0.3152 | 0.3143 | 5.90 | 5.95 | 5.96 | 5.98 | 3.58 | 3.67 | 0.3390 |
| 9 | 0.7765 | 0.7686 | -0.2548 | -0.2690 | 0.4115 | 0.4094 | 6.34 | 6.40 | 6.73 | 6.76 | 4.38 | 4.49 | 0.3734 |
| 10 | 0.7397 | 0.7313 | -0.1331 | -0.1507 | 0.5685 | 0.5640 | 6.77 | 6.83 | 8.15 | 8.18 | 5.54 | 5.64 | 0.3954 |
| 11 | 0.7226 | 0.7134 | -0.1174 | -0.1417 | 0.6017 | 0.5926 | 6.96 | 7.06 | 8.65 | 8.68 | 6.02 | 6.14 | 0.4157 |
| 12 | 0.7194 | 0.7113 | -0.1838 | -0.2040 | 0.5506 | 0.5430 | 8.11 | 8.21 | 9.55 | 9.57 | 6.75 | 6.87 | 0.4129 |
| 13 | 0.6225 | 0.6159 | -0.1690 | -0.1864 | 0.6662 | 0.6592 | 8.00 | 8.11 | 10.57 | 10.60 | 8.39 | 8.50 | 0.4143 |
| 14 | 0.6302 | 0.6245 | -0.1596 | -0.1774 | 0.6659 | 0.6579 | 8.78 | 8.90 | 11.61 | 11.63 | 9.13 | 9.23 | 0.3991 |
| 15 | 0.6271 | 0.6207 | -0.1150 | -0.1351 | 0.7016 | 0.6930 | 9.29 | 9.44 | 12.96 | 12.98 | 10.16 | 10.27 | 0.3777 |
| 16 | 0.5773 | 0.5717 | -0.1717 | -0.1885 | 0.7053 | 0.6980 | 9.87 | 10.01 | 13.71 | 13.73 | 11.36 | 11.47 | 0.3543 |
| 17 | 0.5404 | 0.5346 | -0.2074 | -0.2234 | 0.7110 | 0.7043 | 10.66 | 10.81 | 14.83 | 14.85 | 12.76 | 12.87 | 0.3497 |
| 18 | 0.5306 | 0.5244 | -0.2100 | -0.2268 | 0.7173 | 0.7103 | 11.00 | 11.17 | 15.44 | 15.46 | 13.38 | 13.51 | 0.3412 |
| 19 | 0.5002 | 0.4941 | -0.1784 | -0.1935 | 0.7628 | 0.7573 | 11.21 | 11.37 | 17.06 | 17.08 | 15.01 | 15.13 | 0.3410 |
| 20 | 0.4787 | 0.4726 | -0.1404 | -0.1546 | 0.8021 | 0.7976 | 11.99 | 12.15 | 19.88 | 19.90 | 17.63 | 17.75 | 0.3622 |
| 21 | 0.4690 | 0.4629 | -0.1725 | -0.1876 | 0.7890 | 0.7838 | 13.30 | 13.49 | 21.32 | 21.34 | 19.12 | 19.26 | 0.3479 |
| 22 | 0.4590 | 0.4531 | -0.1650 | -0.1792 | 0.8005 | 0.7958 | 14.45 | 14.65 | 23.78 | 23.81 | 21.42 | 21.57 | 0.3615 |
| 23 | 0.4428 | 0.4377 | -0.2023 | -0.2154 | 0.7885 | 0.7838 | 16.08 | 16.30 | 25.60 | 25.63 | 23.44 | 23.60 | 0.3543 |
| 24 | 0.4636 | 0.4588 | -0.2234 | -0.2365 | 0.7601 | 0.7549 | 19.03 | 19.30 | 28.54 | 28.58 | 25.95 | 26.14 | 0.3427 |
| 25 | 0.4318 | 0.4259 | -0.2619 | -0.2752 | 0.7574 | 0.7526 | 20.83 | 21.12 | 30.79 | 30.84 | 28.77 | 29.02 | 0.3434 |
| 26 | 0.4521 | 0.4464 | -0.2788 | -0.2926 | 0.7305 | 0.7250 | 24.02 | 24.38 | 33.78 | 33.85 | 31.38 | 31.67 | 0.3361 |
| 27 | 0.4352 | 0.4295 | -0.2559 | -0.2700 | 0.7590 | 0.7536 | 25.99 | 26.39 | 38.58 | 38.65 | 35.93 | 36.25 | 0.3424 |
| 28 | 0.4320 | 0.4269 | -0.2636 | -0.2770 | 0.7561 | 0.7507 | 59.14 | 60.09 | 87.16 | 87.39 | 81.49 | 82.24 | 0.3143 |
Test 3: Rocinante-XL-16b-v1 - split 175:113 - initial refusals 45/113
| Layer | S(g,b) | S(g*,b*) | S(g,r) | S(g*,r*) | S(b,r) | S(b*,r*) | |g| | |g*| | |b| | |b*| | |r| | |r*| | Silh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.9787 | 0.9784 | -0.0840 | -0.0883 | 0.1226 | 0.1198 | 2.33 | 2.33 | 2.34 | 2.34 | 0.48 | 0.49 | 0.3397 |
| 2 | 0.9509 | 0.9503 | -0.1183 | -0.1189 | 0.1948 | 0.1960 | 2.93 | 2.93 | 2.96 | 2.97 | 0.92 | 0.93 | 0.2944 |
| 3 | 0.9292 | 0.9285 | -0.1002 | -0.1040 | 0.2746 | 0.2728 | 4.24 | 4.24 | 4.39 | 4.39 | 1.63 | 1.64 | 0.2852 |
| 4 | 0.9142 | 0.9134 | -0.2119 | -0.2184 | 0.2022 | 0.1976 | 5.25 | 5.26 | 5.23 | 5.24 | 2.17 | 2.18 | 0.3033 |
| 5 | 0.8948 | 0.8940 | -0.2750 | -0.2820 | 0.1831 | 0.1778 | 6.34 | 6.37 | 6.20 | 6.21 | 2.88 | 2.90 | 0.2918 |
| 6 | 0.8917 | 0.8907 | -0.2975 | -0.3037 | 0.1669 | 0.1626 | 7.91 | 7.94 | 7.66 | 7.67 | 3.63 | 3.66 | 0.3066 |
| 7 | 0.8770 | 0.8762 | -0.4147 | -0.4179 | 0.0736 | 0.0717 | 9.00 | 9.03 | 8.21 | 8.22 | 4.34 | 4.36 | 0.3014 |
| 8 | 0.8493 | 0.8489 | -0.3762 | -0.3772 | 0.1696 | 0.1693 | 10.03 | 10.05 | 9.43 | 9.45 | 5.37 | 5.39 | 0.2648 |
| 9 | 0.8482 | 0.8475 | -0.3672 | -0.3707 | 0.1812 | 0.1787 | 11.25 | 11.30 | 10.64 | 10.66 | 6.06 | 6.09 | 0.2882 |
| 10 | 0.8143 | 0.8136 | -0.3675 | -0.3712 | 0.2405 | 0.2378 | 12.23 | 12.27 | 11.71 | 11.73 | 7.31 | 7.35 | 0.2872 |
| 11 | 0.8228 | 0.8215 | -0.4022 | -0.4081 | 0.1895 | 0.1853 | 14.43 | 14.51 | 13.46 | 13.48 | 8.35 | 8.42 | 0.2876 |
| 12 | 0.8147 | 0.8132 | -0.4016 | -0.4067 | 0.2040 | 0.2009 | 14.81 | 14.89 | 13.85 | 13.88 | 8.77 | 8.85 | 0.2615 |
| 13 | 0.8038 | 0.8014 | -0.3599 | -0.3667 | 0.2656 | 0.2626 | 16.70 | 16.80 | 16.16 | 16.20 | 10.30 | 10.42 | 0.2850 |
| 14 | 0.7812 | 0.7779 | -0.3179 | -0.3235 | 0.3435 | 0.3429 | 17.27 | 17.36 | 17.44 | 17.49 | 11.48 | 11.62 | 0.2899 |
| 15 | 0.8127 | 0.8099 | -0.3509 | -0.3574 | 0.2604 | 0.2584 | 19.56 | 19.70 | 18.97 | 19.04 | 11.81 | 11.96 | 0.2870 |
| 16 | 0.7878 | 0.7842 | -0.3267 | -0.3343 | 0.3247 | 0.3226 | 18.76 | 18.89 | 18.75 | 18.81 | 12.22 | 12.38 | 0.2710 |
| 17 | 0.8025 | 0.7989 | -0.2968 | -0.3046 | 0.3316 | 0.3295 | 21.54 | 21.69 | 21.80 | 21.88 | 13.62 | 13.82 | 0.2858 |
| 18 | 0.7911 | 0.7878 | -0.2571 | -0.2678 | 0.3877 | 0.3825 | 22.71 | 22.91 | 23.81 | 23.89 | 15.07 | 15.27 | 0.2821 |
| 19 | 0.7842 | 0.7812 | -0.2920 | -0.3048 | 0.3645 | 0.3564 | 27.98 | 28.27 | 28.74 | 28.81 | 18.65 | 18.88 | 0.3002 |
| 20 | 0.7208 | 0.7171 | -0.2655 | -0.2811 | 0.4769 | 0.4673 | 27.53 | 27.87 | 30.19 | 30.25 | 21.71 | 21.97 | 0.2947 |
| 21 | 0.6942 | 0.6907 | -0.2567 | -0.2717 | 0.5175 | 0.5083 | 29.89 | 30.28 | 33.76 | 33.83 | 25.15 | 25.42 | 0.3019 |
| 22 | 0.7212 | 0.7176 | -0.2276 | -0.2439 | 0.5104 | 0.5004 | 34.11 | 34.55 | 38.63 | 38.70 | 27.48 | 27.79 | 0.2936 |
| 23 | 0.6827 | 0.6790 | -0.2606 | -0.2773 | 0.5275 | 0.5170 | 37.24 | 37.78 | 42.32 | 42.40 | 32.03 | 32.40 | 0.3120 |
| 24 | 0.6905 | 0.6873 | -0.2548 | -0.2724 | 0.5236 | 0.5117 | 42.12 | 42.78 | 47.80 | 47.91 | 35.76 | 36.17 | 0.3152 |
| 25 | 0.7078 | 0.7044 | -0.2103 | -0.2289 | 0.5418 | 0.5297 | 45.85 | 46.54 | 53.33 | 53.41 | 38.54 | 38.95 | 0.3027 |
| 26 | 0.6458 | 0.6417 | -0.2725 | -0.2898 | 0.5587 | 0.5481 | 47.05 | 47.78 | 54.58 | 54.67 | 43.31 | 43.81 | 0.3194 |
| 27 | 0.6524 | 0.6484 | -0.2746 | -0.2923 | 0.5495 | 0.5385 | 49.86 | 50.65 | 57.39 | 57.49 | 45.23 | 45.76 | 0.3099 |
| 28 | 0.6488 | 0.6451 | -0.2783 | -0.2954 | 0.5503 | 0.5394 | 52.09 | 52.91 | 59.92 | 60.03 | 47.47 | 48.01 | 0.3082 |
| 29 | 0.6493 | 0.6456 | -0.2786 | -0.2957 | 0.5495 | 0.5386 | 52.11 | 52.94 | 59.91 | 60.02 | 47.44 | 47.98 | 0.3069 |
| 30 | 0.6487 | 0.6450 | -0.2799 | -0.2970 | 0.5491 | 0.5382 | 52.21 | 53.03 | 59.97 | 60.08 | 47.54 | 48.09 | 0.3066 |
| 31 | 0.5980 | 0.5940 | -0.3481 | -0.3648 | 0.5432 | 0.5324 | 54.55 | 55.47 | 60.91 | 61.01 | 52.07 | 52.72 | 0.3139 |
| 32 | 0.5966 | 0.5926 | -0.3484 | -0.3651 | 0.5444 | 0.5336 | 54.56 | 55.48 | 60.97 | 61.07 | 52.20 | 52.84 | 0.3141 |
| 33 | 0.5954 | 0.5914 | -0.3485 | -0.3652 | 0.5455 | 0.5347 | 54.57 | 55.49 | 61.03 | 61.14 | 52.31 | 52.96 | 0.3142 |
| 34 | 0.6060 | 0.6020 | -0.3596 | -0.3762 | 0.5244 | 0.5133 | 59.21 | 60.20 | 64.88 | 64.99 | 55.31 | 56.01 | 0.3095 |
| 35 | 0.6041 | 0.6001 | -0.3607 | -0.3772 | 0.5254 | 0.5145 | 59.21 | 60.19 | 64.91 | 65.01 | 55.46 | 56.15 | 0.3074 |
| 36 | 0.6034 | 0.5995 | -0.3608 | -0.3771 | 0.5260 | 0.5153 | 59.25 | 60.22 | 64.98 | 65.08 | 55.56 | 56.24 | 0.3049 |
| 37 | 0.6270 | 0.6234 | -0.3429 | -0.3593 | 0.5168 | 0.5058 | 62.41 | 63.40 | 68.48 | 68.58 | 56.79 | 57.46 | 0.2944 |
| 38 | 0.6265 | 0.6229 | -0.3425 | -0.3589 | 0.5177 | 0.5067 | 62.43 | 63.41 | 68.55 | 68.65 | 56.87 | 57.54 | 0.2940 |
| 39 | 0.6258 | 0.6222 | -0.3433 | -0.3596 | 0.5177 | 0.5067 | 62.46 | 63.45 | 68.57 | 68.67 | 56.94 | 57.61 | 0.2934 |
| 40 | 0.6341 | 0.6307 | -0.3268 | -0.3432 | 0.5235 | 0.5125 | 66.88 | 67.93 | 74.19 | 74.30 | 60.70 | 61.39 | 0.2907 |
| 41 | 0.6341 | 0.6307 | -0.3254 | -0.3416 | 0.5248 | 0.5138 | 66.91 | 67.95 | 74.32 | 74.44 | 60.77 | 61.46 | 0.2900 |
| 42 | 0.6331 | 0.6297 | -0.3257 | -0.3419 | 0.5257 | 0.5148 | 66.92 | 67.96 | 74.38 | 74.49 | 60.90 | 61.58 | 0.2894 |
| 43 | 0.6039 | 0.6007 | -0.3386 | -0.3524 | 0.5454 | 0.5366 | 68.75 | 69.71 | 77.18 | 77.31 | 65.38 | 66.04 | 0.2953 |
| 44 | 0.6042 | 0.6010 | -0.3362 | -0.3498 | 0.5473 | 0.5385 | 68.80 | 69.74 | 77.41 | 77.54 | 65.50 | 66.15 | 0.2933 |
| 45 | 0.6034 | 0.6002 | -0.3362 | -0.3497 | 0.5481 | 0.5394 | 68.79 | 69.73 | 77.46 | 77.58 | 65.59 | 66.23 | 0.2912 |
| 46 | 0.6021 | 0.5989 | -0.3315 | -0.3465 | 0.5536 | 0.5437 | 72.43 | 73.54 | 82.06 | 82.19 | 69.45 | 70.17 | 0.2818 |
| 47 | 0.6016 | 0.5984 | -0.3325 | -0.3475 | 0.5533 | 0.5434 | 72.50 | 73.61 | 82.08 | 82.22 | 69.52 | 70.25 | 0.2817 |
| 48 | 0.6005 | 0.5973 | -0.3334 | -0.3483 | 0.5536 | 0.5437 | 72.56 | 73.67 | 82.15 | 82.28 | 69.67 | 70.40 | 0.2818 |
| 49 | 0.6483 | 0.6452 | -0.2938 | -0.3086 | 0.5373 | 0.5276 | 80.25 | 81.34 | 90.96 | 91.08 | 72.45 | 73.16 | 0.2648 |
| 50 | 0.6757 | 0.6726 | -0.2985 | -0.3120 | 0.5019 | 0.4932 | 89.30 | 90.33 | 98.53 | 98.66 | 76.11 | 76.84 | 0.2603 |
| 51 | 0.6752 | 0.6716 | -0.2613 | -0.2753 | 0.5355 | 0.5274 | 93.56 | 94.64 | 106.94 | 107.08 | 81.72 | 82.52 | 0.2492 |
| 52 | 0.7099 | 0.7060 | -0.2129 | -0.2281 | 0.5370 | 0.5284 | 104.26 | 105.42 | 120.76 | 120.90 | 87.04 | 87.94 | 0.2391 |
| 53 | 0.8084 | 0.8055 | -0.2079 | -0.2160 | 0.4078 | 0.4046 | 147.65 | 148.29 | 158.17 | 158.32 | 95.19 | 96.09 | 0.2282 |
| 54 | 0.7010 | 0.6964 | -0.2634 | -0.2753 | 0.5034 | 0.4982 | 322.64 | 325.61 | 360.22 | 361.01 | 266.31 | 269.50 | 0.2316 |
Test 4: Rocinante-XL-16b-v1 - split 200:288 - initial refusals 35/88
| Layer | S(g,b) | S(g*,b*) | S(g,r) | S(g*,r*) | S(b,r) | S(b*,r*) | |g| | |g*| | |b| | |b*| | |r| | |r*| | Silh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.9786 | 0.9783 | -0.0811 | -0.0854 | 0.1256 | 0.1228 | 2.33 | 2.33 | 2.34 | 2.34 | 0.48 | 0.49 | 0.3564 |
| 2 | 0.9508 | 0.9502 | -0.1176 | -0.1185 | 0.1960 | 0.1969 | 2.93 | 2.93 | 2.97 | 2.97 | 0.93 | 0.93 | 0.3073 |
| 3 | 0.9292 | 0.9285 | -0.1005 | -0.1050 | 0.2743 | 0.2718 | 4.24 | 4.24 | 4.39 | 4.38 | 1.63 | 1.64 | 0.2938 |
| 4 | 0.9143 | 0.9135 | -0.2139 | -0.2204 | 0.2002 | 0.1955 | 5.25 | 5.26 | 5.23 | 5.23 | 2.17 | 2.18 | 0.3125 |
| 5 | 0.8948 | 0.8940 | -0.2766 | -0.2835 | 0.1816 | 0.1763 | 6.34 | 6.37 | 6.20 | 6.20 | 2.88 | 2.90 | 0.3032 |
| 6 | 0.8916 | 0.8907 | -0.2990 | -0.3050 | 0.1655 | 0.1613 | 7.91 | 7.94 | 7.65 | 7.66 | 3.63 | 3.66 | 0.3181 |
| 7 | 0.8770 | 0.8762 | -0.4155 | -0.4186 | 0.0728 | 0.0709 | 9.00 | 9.03 | 8.21 | 8.22 | 4.34 | 4.36 | 0.3133 |
| 8 | 0.8494 | 0.8491 | -0.3770 | -0.3777 | 0.1686 | 0.1684 | 10.03 | 10.05 | 9.43 | 9.44 | 5.37 | 5.39 | 0.2767 |
| 9 | 0.8483 | 0.8477 | -0.3680 | -0.3715 | 0.1802 | 0.1776 | 11.25 | 11.30 | 10.64 | 10.66 | 6.06 | 6.09 | 0.2998 |
| 10 | 0.8142 | 0.8136 | -0.3686 | -0.3723 | 0.2395 | 0.2367 | 12.23 | 12.27 | 11.71 | 11.72 | 7.31 | 7.35 | 0.2985 |
| 11 | 0.8228 | 0.8216 | -0.4026 | -0.4084 | 0.1890 | 0.1849 | 14.43 | 14.51 | 13.45 | 13.48 | 8.35 | 8.42 | 0.2977 |
| 12 | 0.8149 | 0.8134 | -0.4019 | -0.4068 | 0.2033 | 0.2005 | 14.81 | 14.89 | 13.85 | 13.88 | 8.77 | 8.84 | 0.2729 |
| 13 | 0.8040 | 0.8015 | -0.3601 | -0.3665 | 0.2652 | 0.2626 | 16.70 | 16.80 | 16.16 | 16.20 | 10.30 | 10.41 | 0.2955 |
| 14 | 0.7814 | 0.7779 | -0.3187 | -0.3239 | 0.3425 | 0.3425 | 17.27 | 17.36 | 17.42 | 17.49 | 11.47 | 11.61 | 0.3008 |
| 15 | 0.8128 | 0.8099 | -0.3516 | -0.3577 | 0.2595 | 0.2581 | 19.56 | 19.70 | 18.96 | 19.04 | 11.80 | 11.96 | 0.2985 |
| 16 | 0.7878 | 0.7841 | -0.3271 | -0.3344 | 0.3244 | 0.3228 | 18.76 | 18.89 | 18.74 | 18.81 | 12.22 | 12.39 | 0.2832 |
| 17 | 0.8026 | 0.7989 | -0.2973 | -0.3049 | 0.3310 | 0.3293 | 21.54 | 21.69 | 21.79 | 21.88 | 13.62 | 13.81 | 0.2980 |
| 18 | 0.7913 | 0.7878 | -0.2576 | -0.2678 | 0.3870 | 0.3824 | 22.71 | 22.91 | 23.80 | 23.89 | 15.06 | 15.27 | 0.2953 |
| 19 | 0.7847 | 0.7816 | -0.2918 | -0.3043 | 0.3639 | 0.3564 | 27.98 | 28.27 | 28.74 | 28.82 | 18.62 | 18.87 | 0.3133 |
| 20 | 0.7213 | 0.7175 | -0.2654 | -0.2808 | 0.4763 | 0.4670 | 27.53 | 27.87 | 30.18 | 30.25 | 21.68 | 21.95 | 0.3080 |
| 21 | 0.6950 | 0.6913 | -0.2562 | -0.2709 | 0.5170 | 0.5083 | 29.89 | 30.28 | 33.76 | 33.84 | 25.11 | 25.40 | 0.3152 |
| 22 | 0.7220 | 0.7183 | -0.2263 | -0.2423 | 0.5105 | 0.5010 | 34.11 | 34.55 | 38.64 | 38.73 | 27.44 | 27.77 | 0.3071 |
| 23 | 0.6835 | 0.6796 | -0.2598 | -0.2764 | 0.5273 | 0.5172 | 37.24 | 37.78 | 42.32 | 42.42 | 31.99 | 32.38 | 0.3250 |
| 24 | 0.6913 | 0.6878 | -0.2539 | -0.2715 | 0.5234 | 0.5118 | 42.12 | 42.78 | 47.81 | 47.93 | 35.72 | 36.15 | 0.3282 |
| 25 | 0.7087 | 0.7050 | -0.2094 | -0.2281 | 0.5415 | 0.5297 | 45.85 | 46.54 | 53.33 | 53.42 | 38.48 | 38.92 | 0.3151 |
| 26 | 0.6470 | 0.6426 | -0.2716 | -0.2890 | 0.5581 | 0.5478 | 47.05 | 47.78 | 54.57 | 54.68 | 43.24 | 43.76 | 0.3311 |
| 27 | 0.6535 | 0.6492 | -0.2735 | -0.2912 | 0.5493 | 0.5386 | 49.86 | 50.65 | 57.40 | 57.51 | 45.17 | 45.73 | 0.3217 |
| 28 | 0.6501 | 0.6461 | -0.2768 | -0.2940 | 0.5502 | 0.5397 | 52.09 | 52.91 | 59.94 | 60.07 | 47.40 | 47.97 | 0.3199 |
| 29 | 0.6505 | 0.6465 | -0.2771 | -0.2943 | 0.5495 | 0.5389 | 52.11 | 52.94 | 59.93 | 60.06 | 47.37 | 47.94 | 0.3186 |
| 30 | 0.6499 | 0.6459 | -0.2784 | -0.2955 | 0.5490 | 0.5385 | 52.21 | 53.03 | 59.99 | 60.12 | 47.48 | 48.05 | 0.3182 |
| 31 | 0.5993 | 0.5949 | -0.3471 | -0.3638 | 0.5428 | 0.5323 | 54.55 | 55.47 | 60.92 | 61.04 | 52.00 | 52.67 | 0.3251 |
| 32 | 0.5979 | 0.5935 | -0.3474 | -0.3641 | 0.5440 | 0.5335 | 54.56 | 55.48 | 60.97 | 61.10 | 52.12 | 52.80 | 0.3252 |
| 33 | 0.5967 | 0.5923 | -0.3475 | -0.3642 | 0.5451 | 0.5346 | 54.57 | 55.49 | 61.03 | 61.16 | 52.23 | 52.91 | 0.3254 |
| 34 | 0.6075 | 0.6031 | -0.3581 | -0.3749 | 0.5241 | 0.5134 | 59.21 | 60.20 | 64.91 | 65.03 | 55.22 | 55.95 | 0.3203 |
| 35 | 0.6056 | 0.6013 | -0.3592 | -0.3758 | 0.5251 | 0.5146 | 59.21 | 60.19 | 64.94 | 65.06 | 55.37 | 56.09 | 0.3180 |
| 36 | 0.6050 | 0.6007 | -0.3593 | -0.3757 | 0.5257 | 0.5152 | 59.25 | 60.22 | 65.00 | 65.12 | 55.46 | 56.18 | 0.3154 |
| 37 | 0.6283 | 0.6244 | -0.3416 | -0.3580 | 0.5165 | 0.5058 | 62.41 | 63.40 | 68.50 | 68.62 | 56.70 | 57.40 | 0.3050 |
| 38 | 0.6279 | 0.6240 | -0.3412 | -0.3576 | 0.5173 | 0.5066 | 62.43 | 63.41 | 68.57 | 68.69 | 56.78 | 57.48 | 0.3046 |
| 39 | 0.6272 | 0.6233 | -0.3421 | -0.3584 | 0.5173 | 0.5067 | 62.46 | 63.45 | 68.59 | 68.71 | 56.85 | 57.55 | 0.3040 |
| 40 | 0.6350 | 0.6312 | -0.3256 | -0.3421 | 0.5236 | 0.5129 | 66.88 | 67.93 | 74.23 | 74.36 | 60.65 | 61.38 | 0.3011 |
| 41 | 0.6350 | 0.6313 | -0.3242 | -0.3405 | 0.5249 | 0.5142 | 66.91 | 67.95 | 74.36 | 74.49 | 60.72 | 61.45 | 0.3004 |
| 42 | 0.6339 | 0.6302 | -0.3245 | -0.3408 | 0.5258 | 0.5152 | 66.92 | 67.96 | 74.41 | 74.55 | 60.84 | 61.57 | 0.2998 |
| 43 | 0.6049 | 0.6014 | -0.3376 | -0.3513 | 0.5453 | 0.5368 | 68.75 | 69.71 | 77.21 | 77.35 | 65.32 | 66.01 | 0.3051 |
| 44 | 0.6052 | 0.6018 | -0.3352 | -0.3487 | 0.5472 | 0.5387 | 68.80 | 69.74 | 77.44 | 77.58 | 65.43 | 66.11 | 0.3030 |
| 45 | 0.6044 | 0.6010 | -0.3352 | -0.3487 | 0.5480 | 0.5396 | 68.79 | 69.73 | 77.48 | 77.63 | 65.52 | 66.20 | 0.3008 |
| 46 | 0.6029 | 0.5994 | -0.3309 | -0.3459 | 0.5534 | 0.5437 | 72.43 | 73.54 | 82.06 | 82.21 | 69.38 | 70.14 | 0.2912 |
| 47 | 0.6024 | 0.5989 | -0.3319 | -0.3469 | 0.5530 | 0.5434 | 72.50 | 73.61 | 82.08 | 82.23 | 69.46 | 70.22 | 0.2911 |
| 48 | 0.6013 | 0.5978 | -0.3328 | -0.3478 | 0.5534 | 0.5438 | 72.56 | 73.67 | 82.15 | 82.30 | 69.61 | 70.37 | 0.2911 |
| 49 | 0.6488 | 0.6455 | -0.2932 | -0.3080 | 0.5373 | 0.5278 | 80.25 | 81.34 | 90.97 | 91.11 | 72.41 | 73.15 | 0.2742 |
| 50 | 0.6758 | 0.6725 | -0.2988 | -0.3122 | 0.5015 | 0.4931 | 89.30 | 90.33 | 98.50 | 98.64 | 76.08 | 76.84 | 0.2699 |
| 51 | 0.6754 | 0.6717 | -0.2612 | -0.2750 | 0.5354 | 0.5275 | 93.56 | 94.64 | 106.93 | 107.10 | 81.69 | 82.53 | 0.2589 |
| 52 | 0.7099 | 0.7058 | -0.2135 | -0.2287 | 0.5366 | 0.5282 | 104.26 | 105.42 | 120.70 | 120.86 | 87.02 | 87.95 | 0.2489 |
| 53 | 0.8083 | 0.8054 | -0.2080 | -0.2160 | 0.4077 | 0.4047 | 147.65 | 148.29 | 158.16 | 158.33 | 95.19 | 96.12 | 0.2387 |
| 54 | 0.7010 | 0.6963 | -0.2616 | -0.2737 | 0.5049 | 0.4998 | 322.64 | 325.61 | 360.78 | 361.57 | 266.58 | 269.83 | 0.2431 |
This config targets sycophantic praise, reducing unearned flattery of the user while aiming to preserve properly-calibrated positive responses. The neutral dataset is unchanged, with a new "bad prompts" dataset intended to generate undeserved praise.
An example model generated with this config is available here:
https://huggingface.co/UnstableLlama/Rocinante-XL-16B-v1-desiccated/
This config seems very responsive in ARA-LoRA, with successful trials quickly found across several model families.