CaReAQA: A Cardiac and Respiratory Audio Question Answering Model for Open-Ended Diagnostic Reasoning

CaReAQA is a model for Audio Question Answering (AQA) in the domain of cardiac and respiratory health. It couples an audio encoder with a large language model (LLM) to enable open-ended diagnostic reasoning over heart and lung sounds, answering questions about auscultation findings such as the presence of murmurs, where they are most audible, and other insights derived from the acoustic signal.


Features

  • Cardiac and Respiratory Audio Processing: Works on pre-processed heart and lung sound recordings (e.g., heart murmurs, breath sounds) to derive diagnostic insights.
  • Open-Ended Question Answering: Provides natural language answers based on audio input and diagnostic queries.
  • Integration of Audio and Language Models: Combines audio feature extraction with a powerful language model to answer complex, context-rich questions.
  • Pretrained Model: The checkpoint is available on Hugging Face for easy download and integration.
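
At a glance, inference follows the pipeline implemented by the helper functions in the sections below:

# 1. preprocess_audio(...)        -> log-mel spectrogram tensor
# 2. model.audio_model             -> audio feature vector
# 3. model.prefix_project(...)     -> prefix embeddings for the LLM
# 4. model.llm.generate(...)       -> free-text diagnostic answer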

Installation

To get started, you can clone this repository and install the necessary dependencies:

git clone https://github.com/tsnng/CaReAQA.git
cd CaReAQA
pip install -r requirements.txt

Load the Model

The load_careqa_model function downloads the pre-trained CaReAQA checkpoint from Hugging Face and loads it into the AudioQAModel wrapper.

import torch
from huggingface_hub import hf_hub_download

# AudioQAModel is defined in this repository; adjust the import path
# if the module layout differs in your checkout.
from model import AudioQAModel

def load_careqa_model(repo_id, model_filename, llm_type, prefix_length=8):
    # Download the checkpoint file from the Hugging Face Hub.
    model_path = hf_hub_download(repo_id=repo_id, filename=model_filename)
    # Build the audio question answering model around the specified LLM.
    model = AudioQAModel(
        llm_type=llm_type,
        opera_checkpoint_path=None,
        prefix_length=prefix_length,
        clip_length=1,
        setting="lora",
        mapping_type="Transformer",
        fine_tune_opera=True,
        args=None
    ).eval().cuda()
    # strict=False: the checkpoint only needs to cover the fine-tuned
    # weights (e.g., LoRA adapters and the prefix projector); all other
    # parameters keep the values from the base models.
    state_dict = torch.load(model_path, map_location="cpu")
    model.load_state_dict(state_dict, strict=False)
    return model, model_path

Preprocessing Audio

The preprocess_audio function converts an input recording into the log-mel spectrogram tensor the model expects.


import librosa
import numpy as np
import torch

def preprocess_audio(audio_path, sr=16000):
    # Load the recording and resample it to the target rate.
    raw_audio, sr = librosa.load(audio_path, sr=sr)
    # Compute a 64-band mel spectrogram and convert it to log scale (dB).
    mel_spec = librosa.feature.melspectrogram(
        y=raw_audio, sr=sr, n_fft=1024, hop_length=512, n_mels=64
    )
    log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)
    # Add a batch dimension and move the tensor to the GPU.
    audio_tensor = torch.tensor(log_mel_spec, dtype=torch.float32).unsqueeze(0).cuda()
    return audio_tensor
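
For example, for a mono recording resampled to 16 kHz (the path below is a placeholder):

audio_tensor = preprocess_audio("/path/to/heart_sound.wav")
print(audio_tensor.shape)  # torch.Size([1, 64, T]); T depends on the recording length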

Generating Answers

To generate answers from the model based on an audio input and a diagnostic question:


def generate_answer(model, tokenizer, audio_tensor, question, prefix_length=8, audio_feature_dim=1280):
    # Encode the audio and project it into the LLM embedding space.
    audio_features = model.audio_model.extract_feature(audio_tensor, dim=audio_feature_dim)
    projected_prefix = model.prefix_project(audio_features)

    # Build the text prompt: "question: <question> [audio prefix] answer"
    q_prefix = tokenizer.encode("question: ", add_special_tokens=False)
    q_tokens = tokenizer.encode(question, add_special_tokens=False)
    a_prefix = tokenizer.encode(" answer", add_special_tokens=False)

    # Placeholder tokens reserve the positions that will be overwritten
    # with the projected audio prefix embeddings.
    placeholder = [tokenizer.eos_token_id] * prefix_length
    input_tokens = q_prefix + q_tokens + placeholder + a_prefix
    input_ids = torch.tensor([input_tokens], dtype=torch.long).to("cuda")
    attention_mask = torch.ones_like(input_ids)

    # Replace the placeholder embeddings with the audio prefix.
    input_embeds = model.llm.get_input_embeddings()(input_ids)
    start = len(q_prefix) + len(q_tokens)
    input_embeds[0, start : start + prefix_length] = projected_prefix[0]

    # Greedy decoding of the answer.
    output_ids = model.llm.generate(
        inputs_embeds=input_embeds,
        attention_mask=attention_mask,
        max_new_tokens=50,
        do_sample=False
    )

    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
    return answer

Example

Here’s how to put the pieces together to generate an answer from an audio recording and a question:

repo_id = "tsnngw/CaReAQA"
model_filename = "model.pt"
audio_path = "/path/to/audio.wav"
question = "Where is the murmur most audible?"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B", token=True)
model, _ = load_careqa_model(repo_id=repo_id, model_filename=model_filename, llm_type="meta-llama/Llama-3.2-3B")

audio_tensor = preprocess_audio(audio_path)
answer = generate_answer(model, tokenizer, audio_tensor, question)

print(f"Question: {question}")
print(f"Answer: {answer}")
