SmileyLlama
We fine-tuned Llama-3.1-8B-Instruct on the task of generating SMILES string representations of molecules, using a dataset of a few million molecules. This gives us a model, SmileyLlama, which can generate SMILES strings of drug-like molecules on demand.
For more details, read the arXiv preprint here: https://arxiv.org/abs/2409.02231
SmileyLlama can be loaded the same way as Llama-3.1, and its memory requirements are the same as Llama-3.1-8B's.
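Concretely, any standard Llama-3.1 loading recipe should apply unchanged. The sketch below is one such recipe, not a prescription from this card; the local path is a placeholder for wherever your SmileyLlama checkpoint lives:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "/path/to/your/model"  # placeholder path to the SmileyLlama checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~16 GB of weights in bf16, same as Llama-3.1-8B
    device_map="auto",           # requires the accelerate package
)
```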
Options for "properties" that SmileyLlama was trained on are:

- H-bond donors: (<= 3, <= 4, <= 5, <= 7, > 7)
- H-bond acceptors: (<= 3, <= 4, <= 5, <= 10, <= 15)
- Molecular weight: (<= 300, <= 400, <= 500, <= 600, > 600)
- logP: (<= 3, <= 4, <= 5, <= 10, <= 15, > 15)
- Rotatable bonds: (<= 7, <= 10, > 10)
- Fraction sp3: (< 0.4, > 0.4, > 0.5, > 0.6)
- TPSA: (<= 90, <= 140, <= 200, > 200)
- (a macrocycle, no macrocycles)
- (has, lacks) bad SMARTS
- lacks covalent warheads
- has covalent warheads: (sulfonyl fluorides, acrylamides, ...) (see the SMARTS patterns below for details)
- A substructure of *SMILES_STRING*
- A chemical of *CHEMICAL_FORMULA*

The covalent warhead SMARTS patterns are:

```
[#16](=[#8])(=[#8])-[#9]
[#8]=[#6](-[#6]-[#17])-[#7]
[#7]-[#6](=[#8])-[#6](-[#6]#[#7])=[#6]
[#6]1-[#6]-[#8]-1
[#6]1-[#6]-[#7]-1
[#16]-[#16]
[#6](=[#8])-[#1]
[#6]=[#6]-[#16](=[#8])(=[#8])-[#7]
[#6]-[#5](-[#8])-[#8]
[#6]=[#6]-[#6](=[#8])-[#7]
[#6]-[#7](-[#6]#[#7])-[#6]
[#7]-[#6](=[#8])-[#6](-[#9])-[#17]
[#6]#[#6]-[#6](=[#8])-[#7]-[#6]
[#7]-[#6](=[#8])-[#6](-[#6])-[#17]
[#8]=[#16](=[#8])(-[#9])-[#8]
[#7]1-[#6]-[#6]-[#6]-1=[#8]
```
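These patterns can be matched against generated molecules with an ordinary substructure search. Below is a minimal sketch using RDKit; RDKit and the helper `has_covalent_warhead` are our assumptions here, not something the card prescribes, and only two of the sixteen patterns above (sulfonyl fluoride and acrylamide) are included for brevity:

```python
# Hypothetical helper (not from the SmileyLlama card): check a generated
# molecule against covalent-warhead SMARTS patterns from the list above.
from rdkit import Chem

# Two of the sixteen warhead patterns listed above:
# sulfonyl fluoride and acrylamide.
WARHEAD_SMARTS = [
    "[#16](=[#8])(=[#8])-[#9]",
    "[#6]=[#6]-[#6](=[#8])-[#7]",
]

def has_covalent_warhead(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparseable SMILES
        return False
    return any(mol.HasSubstructMatch(Chem.MolFromSmarts(s))
               for s in WARHEAD_SMARTS)

print(has_covalent_warhead("C=CC(=O)Nc1ccccc1"))  # an acrylamide -> True
```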
Example usage via the transformers text-generation pipeline:

```python
import torch
import transformers

model_id = "/path/to/your/model"

system_text = "You love and excel at generating SMILES strings of drug-like molecules"
user_text = (
    "Output a SMILES string for a drug like molecule with the following "
    "properties: <= 5 H-bond donors, <= 10 H-bond acceptors, "
    "<= 500 molecular weight, <= 5 logP:"
)
prompt = f"### Instruction:\n{system_text}\n\n### Input:\n{user_text}\n\n### Response:\n"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

outputs = pipeline(
    prompt,
    max_new_tokens=128,
    do_sample=True,           # sampling is required for num_return_sequences > 1
    temperature=1.0,
    num_return_sequences=4,
    return_full_text=False,   # return only the generated SMILES, not the prompt
)

for k in range(4):
    print(outputs[k]["generated_text"])
```
You can increase num_return_sequences to generate many SMILES strings in a single batch, though the batch size is limited by available memory.
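Sampled strings are not guaranteed to be valid SMILES or to satisfy the requested constraints, so a post-generation filter is a natural companion. Below is a minimal sketch using RDKit; RDKit and the helper `satisfies_prompt` are our assumptions (written to match the example prompt above), not part of the card:

```python
# Hypothetical post-filter (not from the SmileyLlama card): keep only valid
# SMILES that satisfy the properties requested in the example prompt above.
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

def satisfies_prompt(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparseable SMILES
        return False
    return (Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10
            and Descriptors.MolWt(mol) <= 500
            and Crippen.MolLogP(mol) <= 5)

candidates = ["CC(=O)Oc1ccccc1C(=O)O", "not-a-smiles"]
print([s for s in candidates if satisfies_prompt(s)])  # keeps only aspirin
```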
Base model: meta-llama/Llama-3.1-8B