Thank you for sharing this @oopere, it looks super interesting. I would be happy to read more about this, so don't hesitate to reach out if you publish a preprint or a report.
This line of work reminds me of Anthropic's series on interpretability. In particular, they also found that high-level features are spread across multiple layers (see this article). They don't study biases specifically, but it makes sense that "bias features" would also be spread over multiple layers.