I'm fairly certain that the model was not at all trained on language data. It's ...

I'm fairly certain that the model was not at all trained on language data. It's just that the "neural architecture" (specifically, the Transformer architecture) was first discovered in the NLP realm.

If your question is why does an architecture designed for NLP work well for aminoacid sequences, it turns out that the Transformer architecture is surprisingly versatile. It works amazingly well for other sequence-like data (like audio), and even for images (which, to me, is surprising, since images are not a one dimensional sequence like text or audio).

The Transformer architecture (Particularly, the Multihead self-attention[0] mechanism within the Transformer) is, in my estimation, once of the key innovations in deep learning in the past 5 years. It's used in pretty much everything in deep learning (GPT-3, AlphaFold, DALLE-2/Stable Diffusion, OpenAI's Whisper, come to mind).

[0] https://lilianweng.github.io/posts/2018-06-24-attention/#mul...