Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeSpeech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes both text-only and bimodal SER with six fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. These findings provide insights into SER with ASR assistance, especially for real-world applications.
Applying Graph Explanation to Operator Fusion
Layer fusion techniques are critical to improving the inference efficiency of deep neural networks (DNN) for deployment. Fusion aims to lower inference costs by reducing data transactions between an accelerator's on-chip buffer and DRAM. This is accomplished by grouped execution of multiple operations like convolution and activations together into single execution units - fusion groups. However, on-chip buffer capacity limits fusion group size and optimizing fusion on whole DNNs requires partitioning into multiple fusion groups. Finding the optimal groups is a complex problem where the presence of invalid solutions hampers traditional search algorithms and demands robust approaches. In this paper we incorporate Explainable AI, specifically Graph Explanation Techniques (GET), into layer fusion. Given an invalid fusion group, we identify the operations most responsible for group invalidity, then use this knowledge to recursively split the original fusion group via a greedy tree-based algorithm to minimize DRAM access. We pair our scheme with common algorithms and optimize DNNs on two types of layer fusion: Line-Buffer Depth First (LBDF) and Branch Requirement Reduction (BRR). Experiments demonstrate the efficacy of our scheme on several popular and classical convolutional neural networks like ResNets and MobileNets. Our scheme achieves over 20% DRAM Access reduction on EfficientNet-B3.
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on large language models (LLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Furthermore, we identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation, and propose future research directions, including self-supervised learning for robust multimodal representations, transformer-based fusion architectures, and scalable multimodal frameworks. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.
Fusion is Not Enough: Single Modal Attacks on Fusion Models for 3D Object Detection
Multi-sensor fusion (MSF) is widely used in autonomous vehicles (AVs) for perception, particularly for 3D object detection with camera and LiDAR sensors. The purpose of fusion is to capitalize on the advantages of each modality while minimizing its weaknesses. Advanced deep neural network (DNN)-based fusion techniques have demonstrated the exceptional and industry-leading performance. Due to the redundant information in multiple modalities, MSF is also recognized as a general defence strategy against adversarial attacks. In this paper, we attack fusion models from the camera modality that is considered to be of lesser importance in fusion but is more affordable for attackers. We argue that the weakest link of fusion models depends on their most vulnerable modality, and propose an attack framework that targets advanced camera-LiDAR fusion-based 3D object detection models through camera-only adversarial attacks. Our approach employs a two-stage optimization-based strategy that first thoroughly evaluates vulnerable image areas under adversarial attacks, and then applies dedicated attack strategies for different fusion models to generate deployable patches. The evaluations with six advanced camera-LiDAR fusion models and one camera-only model indicate that our attacks successfully compromise all of them. Our approach can either decrease the mean average precision (mAP) of detection performance from 0.824 to 0.353, or degrade the detection score of a target object from 0.728 to 0.156, demonstrating the efficacy of our proposed attack framework. Code is available.
ITCFN: Incomplete Triple-Modal Co-Attention Fusion Network for Mild Cognitive Impairment Conversion Prediction
Alzheimer's disease (AD) is a common neurodegenerative disease among the elderly. Early prediction and timely intervention of its prodromal stage, mild cognitive impairment (MCI), can decrease the risk of advancing to AD. Combining information from various modalities can significantly improve predictive accuracy. However, challenges such as missing data and heterogeneity across modalities complicate multimodal learning methods as adding more modalities can worsen these issues. Current multimodal fusion techniques often fail to adapt to the complexity of medical data, hindering the ability to identify relationships between modalities. To address these challenges, we propose an innovative multimodal approach for predicting MCI conversion, focusing specifically on the issues of missing positron emission tomography (PET) data and integrating diverse medical information. The proposed incomplete triple-modal MCI conversion prediction network is tailored for this purpose. Through the missing modal generation module, we synthesize the missing PET data from the magnetic resonance imaging and extract features using specifically designed encoders. We also develop a channel aggregation module and a triple-modal co-attention fusion module to reduce feature redundancy and achieve effective multimodal data fusion. Furthermore, we design a loss function to handle missing modality issues and align cross-modal features. These components collectively harness multimodal data to boost network performance. Experimental results on the ADNI1 and ADNI2 datasets show that our method significantly surpasses existing unimodal and other multimodal models. Our code is available at https://github.com/justinhxy/ITFC.
Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion
Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation
Over the past decade, wearable computing devices (``smart glasses'') have undergone remarkable advancements in sensor technology, design, and processing power, ushering in a new era of opportunity for high-density human behavior data. Equipped with wearable cameras, these glasses offer a unique opportunity to analyze non-verbal behavior in natural settings as individuals interact. Our focus lies in predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion. Leveraging such analyses may revolutionize our understanding of human communication, foster more effective collaboration in professional environments, provide better mental health support through empathetic virtual interactions, and enhance accessibility for those with communication barriers. In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation. We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a ``multimodal transcript'' that can be processed by an LLM for behavioral reasoning tasks. Remarkably, this method achieves performance comparable to established fusion techniques even in its preliminary implementation, indicating strong potential for further research and optimization. This fusion method is one of the first to approach ``reasoning'' about real-world human behavior through a language model. Smart glasses provide us the ability to unobtrusively gather high-density multimodal data on human behavior, paving the way for new approaches to understanding and improving human communication with the potential for important societal benefits. The features and data collected during the studies will be made publicly available to promote further research.
Adaptive Ensemble Learning: Boosting Model Performance through Intelligent Feature Fusion in Deep Neural Networks
In this paper, we present an Adaptive Ensemble Learning framework that aims to boost the performance of deep neural networks by intelligently fusing features through ensemble learning techniques. The proposed framework integrates ensemble learning strategies with deep learning architectures to create a more robust and adaptable model capable of handling complex tasks across various domains. By leveraging intelligent feature fusion methods, the Adaptive Ensemble Learning framework generates more discriminative and effective feature representations, leading to improved model performance and generalization capabilities. We conducted extensive experiments and evaluations on several benchmark datasets, including image classification, object detection, natural language processing, and graph-based learning tasks. The results demonstrate that the proposed framework consistently outperforms baseline models and traditional feature fusion techniques, highlighting its effectiveness in enhancing deep learning models' performance. Furthermore, we provide insights into the impact of intelligent feature fusion on model performance and discuss the potential applications of the Adaptive Ensemble Learning framework in real-world scenarios. The paper also explores the design and implementation of adaptive ensemble models, ensemble training strategies, and meta-learning techniques, which contribute to the framework's versatility and adaptability. In conclusion, the Adaptive Ensemble Learning framework represents a significant advancement in the field of feature fusion and ensemble learning for deep neural networks, with the potential to transform a wide range of applications across multiple domains.
Revisiting Image Fusion for Multi-Illuminant White-Balance Correction
White balance (WB) correction in scenes with multiple illuminants remains a persistent challenge in computer vision. Recent methods explored fusion-based approaches, where a neural network linearly blends multiple sRGB versions of an input image, each processed with predefined WB presets. However, we demonstrate that these methods are suboptimal for common multi-illuminant scenarios. Additionally, existing fusion-based methods rely on sRGB WB datasets lacking dedicated multi-illuminant images, limiting both training and evaluation. To address these challenges, we introduce two key contributions. First, we propose an efficient transformer-based model that effectively captures spatial dependencies across sRGB WB presets, substantially improving upon linear fusion techniques. Second, we introduce a large-scale multi-illuminant dataset comprising over 16,000 sRGB images rendered with five different WB settings, along with WB-corrected images. Our method achieves up to 100\% improvement over existing techniques on our new multi-illuminant image fusion dataset.
LeTFuser: Light-weight End-to-end Transformer-Based Sensor Fusion for Autonomous Driving with Multi-Task Learning
In end-to-end autonomous driving, the utilization of existing sensor fusion techniques for imitation learning proves inadequate in challenging situations that involve numerous dynamic agents. To address this issue, we introduce LeTFuser, a transformer-based algorithm for fusing multiple RGB-D camera representations. To perform perception and control tasks simultaneously, we utilize multi-task learning. Our model comprises of two modules, the first being the perception module that is responsible for encoding the observation data obtained from the RGB-D cameras. It carries out tasks such as semantic segmentation, semantic depth cloud mapping (SDC), and traffic light state recognition. Our approach employs the Convolutional vision Transformer (CvT) wu2021cvt to better extract and fuse features from multiple RGB cameras due to local and global feature extraction capability of convolution and transformer modules, respectively. Following this, the control module undertakes the decoding of the encoded characteristics together with supplementary data, comprising a rough simulator for static and dynamic environments, as well as various measurements, in order to anticipate the waypoints associated with a latent feature space. We use two methods to process these outputs and generate the vehicular controls (e.g. steering, throttle, and brake) levels. The first method uses a PID algorithm to follow the waypoints on the fly, whereas the second one directly predicts the control policy using the measurement features and environmental state. We evaluate the model and conduct a comparative analysis with recent models on the CARLA simulator using various scenarios, ranging from normal to adversarial conditions, to simulate real-world scenarios. Our code is available at https://github.com/pagand/e2etransfuser/tree/cvpr-w to facilitate future studies.
Enhancing Speech Emotion Recognition with Graph-Based Multimodal Fusion and Prosodic Features for the Speech Emotion Recognition in Naturalistic Conditions Challenge at Interspeech 2025
Training SER models in natural, spontaneous speech is especially challenging due to the subtle expression of emotions and the unpredictable nature of real-world audio. In this paper, we present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, focusing on categorical emotion recognition. Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues. In particular, we investigate the effectiveness of Fundamental Frequency (F0) quantization and the use of a pretrained audio tagging model. We also employ an ensemble model to improve robustness. On the official test set, our system achieved a Macro F1-score of 39.79% (42.20% on validation). Our results underscore the potential of these methods, and analysis of fusion techniques confirmed the effectiveness of Graph Attention Networks. Our source code is publicly available.
ComFormer: Code Comment Generation via Transformer and Fusion Method-based Hybrid Code Representation
Developers often write low-quality code comments due to the lack of programming experience, which can reduce the efficiency of developers program comprehension. Therefore, developers hope that code comment generation tools can be developed to illustrate the functionality and purpose of the code. Recently, researchers mainly model this problem as the neural machine translation problem and tend to use deep learning-based methods. In this study, we propose a novel method ComFormer based on Transformer and fusion method-based hybrid code presentation. Moreover, to alleviate OOV (out-of-vocabulary) problem and speed up model training, we further utilize the Byte-BPE algorithm to split identifiers and Sim_SBT method to perform AST Traversal. We compare ComFormer with seven state-of-the-art baselines from code comment generation and neural machine translation domains. Comparison results show the competitiveness of ComFormer in terms of three performance measures. Moreover, we perform a human study to verify that ComFormer can generate high-quality comments.
An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification
Effective fusion of multi-scale features is crucial for improving speaker verification performance. While most existing methods aggregate multi-scale features in a layer-wise manner via simple operations, such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve the performance. The local feature fusion (LFF) fuses the features within one single residual block to extract the local signal. The global feature fusion (GFF) takes acoustic features of different scales as input to aggregate global signal. To facilitate effective feature fusion in both LFF and GFF, an attentional feature fusion module is employed in the ERes2Net architecture, replacing summation or concatenation operations. A range of experiments conducted on the VoxCeleb datasets demonstrate the superiority of the ERes2Net in speaker verification. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition
We introduce "Generative Fusion Decoding" (GFD), a novel shallow fusion framework, utilized to integrate Large Language Models (LLMs) into multi-modal text recognition systems such as automatic speech recognition (ASR) and optical character recognition (OCR). We derive the formulas necessary to enable GFD to operate across mismatched token spaces of different models by mapping text token space to byte token space, enabling seamless fusion during the decoding process. The framework is plug-and-play, compatible with various auto-regressive models, and does not require re-training for feature alignment, thus overcoming limitations of previous fusion techniques. We highlight three main advantages of GFD: First, by simplifying the complexity of aligning different model sample spaces, GFD allows LLMs to correct errors in tandem with the recognition model, reducing computation latencies. Second, the in-context learning ability of LLMs is fully capitalized by GFD, increasing robustness in long-form speech recognition and instruction aware speech recognition. Third, GFD enables fusing recognition models deficient in Chinese text recognition with LLMs extensively trained on Chinese. Our evaluation demonstrates that GFD significantly improves performance in ASR and OCR tasks, with ASR reaching state-of-the-art in the NTUML2021 benchmark. GFD provides a significant step forward in model integration, offering a unified solution that could be widely applicable to leveraging existing pre-trained models through step by step fusion.
Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
In the retrieval domain, candidates' fusion from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
Toward Advancing License Plate Super-Resolution in Real-World Scenarios: A Dataset and Benchmark
Recent advancements in super-resolution for License Plate Recognition (LPR) have sought to address challenges posed by low-resolution (LR) and degraded images in surveillance, traffic monitoring, and forensic applications. However, existing studies have relied on private datasets and simplistic degradation models. To address this gap, we introduce UFPR-SR-Plates, a novel dataset containing 10,000 tracks with 100,000 paired low and high-resolution license plate images captured under real-world conditions. We establish a benchmark using multiple sequential LR and high-resolution (HR) images per vehicle -- five of each -- and two state-of-the-art models for super-resolution of license plates. We also investigate three fusion strategies to evaluate how combining predictions from a leading Optical Character Recognition (OCR) model for multiple super-resolved license plates enhances overall performance. Our findings demonstrate that super-resolution significantly boosts LPR performance, with further improvements observed when applying majority vote-based fusion techniques. Specifically, the Layout-Aware and Character-Driven Network (LCDNet) model combined with the Majority Vote by Character Position (MVCP) strategy led to the highest recognition rates, increasing from 1.7% with low-resolution images to 31.1% with super-resolution, and up to 44.7% when combining OCR outputs from five super-resolved images. These findings underscore the critical role of super-resolution and temporal information in enhancing LPR accuracy under real-world, adverse conditions. The proposed dataset is publicly available to support further research and can be accessed at: https://valfride.github.io/nascimento2024toward/
MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation
Multimodal remote sensing data, collected from a variety of sensors, provide a comprehensive and integrated perspective of the Earth's surface. By employing multimodal fusion techniques, semantic segmentation offers more detailed insights into geographic scenes compared to single-modality approaches. Building upon recent advancements in vision foundation models, particularly the Segment Anything Model (SAM), this study introduces a novel Multimodal Adapter-based Network (MANet) for multimodal remote sensing semantic segmentation. At the core of this approach is the development of a Multimodal Adapter (MMAdapter), which fine-tunes SAM's image encoder to effectively leverage the model's general knowledge for multimodal data. In addition, a pyramid-based Deep Fusion Module (DFM) is incorporated to further integrate high-level geographic features across multiple scales before decoding. This work not only introduces a novel network for multimodal fusion, but also demonstrates, for the first time, SAM's powerful generalization capabilities with Digital Surface Model (DSM) data. Experimental results on two well-established fine-resolution multimodal remote sensing datasets, ISPRS Vaihingen and ISPRS Potsdam, confirm that the proposed MANet significantly surpasses current models in the task of multimodal semantic segmentation. The source code for this work will be accessible at https://github.com/sstary/SSRS.
Triple-stream Deep Metric Learning of Great Ape Behavioural Actions
We propose the first metric learning system for the recognition of great ape behavioural actions. Our proposed triple stream embedding architecture works on camera trap videos taken directly in the wild and demonstrates that the utilisation of an explicit DensePose-C chimpanzee body part segmentation stream effectively complements traditional RGB appearance and optical flow streams. We evaluate system variants with different feature fusion techniques and long-tail recognition approaches. Results and ablations show performance improvements of ~12% in top-1 accuracy over previous results achieved on the PanAf-500 dataset containing 180,000 manually annotated frames across nine behavioural actions. Furthermore, we provide a qualitative analysis of our findings and augment the metric learning system with long-tail recognition techniques showing that average per class accuracy -- critical in the domain -- can be improved by ~23% compared to the literature on that dataset. Finally, since our embedding spaces are constructed as metric, we provide first data-driven visualisations of the great ape behavioural action spaces revealing emerging geometry and topology. We hope that the work sparks further interest in this vital application area of computer vision for the benefit of endangered great apes.
Bidirectional Hierarchical Protein Multi-Modal Representation Learning
Protein representation learning is critical for numerous biological tasks. Recently, large transformer-based protein language models (pLMs) pretrained on large scale protein sequences have demonstrated significant success in sequence-based tasks. However, pLMs lack structural context. Conversely, graph neural networks (GNNs) designed to leverage 3D structural information have shown promising generalization in protein-related prediction tasks, but their effectiveness is often constrained by the scarcity of labeled structural data. Recognizing that sequence and structural representations are complementary perspectives of the same protein entity, we propose a multimodal bidirectional hierarchical fusion framework to effectively merge these modalities. Our framework employs attention and gating mechanisms to enable effective interaction between pLMs-generated sequential representations and GNN-extracted structural features, improving information exchange and enhancement across layers of the neural network. This bidirectional and hierarchical (Bi-Hierarchical) fusion approach leverages the strengths of both modalities to capture richer and more comprehensive protein representations. Based on the framework, we further introduce local Bi-Hierarchical Fusion with gating and global Bi-Hierarchical Fusion with multihead self-attention approaches. Our method demonstrates consistent improvements over strong baselines and existing fusion techniques in a variety of protein representation learning benchmarks, including enzyme EC classification, model quality assessment, protein-ligand binding affinity prediction, protein-protein binding site prediction, and B cell epitopes prediction. Our method establishes a new state-of-the-art for multimodal protein representation learning, emphasizing the efficacy of Bi-Hierarchical Fusion in bridging sequence and structural modalities.
MonoNav: MAV Navigation via Monocular Depth Estimation and Reconstruction
A major challenge in deploying the smallest of Micro Aerial Vehicle (MAV) platforms (< 100 g) is their inability to carry sensors that provide high-resolution metric depth information (e.g., LiDAR or stereo cameras). Current systems rely on end-to-end learning or heuristic approaches that directly map images to control inputs, and struggle to fly fast in unknown environments. In this work, we ask the following question: using only a monocular camera, optical odometry, and offboard computation, can we create metrically accurate maps to leverage the powerful path planning and navigation approaches employed by larger state-of-the-art robotic systems to achieve robust autonomy in unknown environments? We present MonoNav: a fast 3D reconstruction and navigation stack for MAVs that leverages recent advances in depth prediction neural networks to enable metrically accurate 3D scene reconstruction from a stream of monocular images and poses. MonoNav uses off-the-shelf pre-trained monocular depth estimation and fusion techniques to construct a map, then searches over motion primitives to plan a collision-free trajectory to the goal. In extensive hardware experiments, we demonstrate how MonoNav enables the Crazyflie (a 37 g MAV) to navigate fast (0.5 m/s) in cluttered indoor environments. We evaluate MonoNav against a state-of-the-art end-to-end approach, and find that the collision rate in navigation is significantly reduced (by a factor of 4). This increased safety comes at the cost of conservatism in terms of a 22% reduction in goal completion.
RMMDet: Road-Side Multitype and Multigroup Sensor Detection System for Autonomous Driving
Autonomous driving has now made great strides thanks to artificial intelligence, and numerous advanced methods have been proposed for vehicle end target detection, including single sensor or multi sensor detection methods. However, the complexity and diversity of real traffic situations necessitate an examination of how to use these methods in real road conditions. In this paper, we propose RMMDet, a road-side multitype and multigroup sensor detection system for autonomous driving. We use a ROS-based virtual environment to simulate real-world conditions, in particular the physical and functional construction of the sensors. Then we implement muti-type sensor detection and multi-group sensors fusion in this environment, including camera-radar and camera-lidar detection based on result-level fusion. We produce local datasets and real sand table field, and conduct various experiments. Furthermore, we link a multi-agent collaborative scheduling system to the fusion detection system. Hence, the whole roadside detection system is formed by roadside perception, fusion detection, and scheduling planning. Through the experiments, it can be seen that RMMDet system we built plays an important role in vehicle-road collaboration and its optimization. The code and supplementary materials can be found at: https://github.com/OrangeSodahub/RMMDet
L-MAGIC: Language Model Assisted Generation of Images with Coherence
In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC, a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360 degree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language models without fine-tuning, ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view rendering quality compared to related works, with >70% preference in human evaluations. Combined with conditional diffusion models, L-MAGIC can accept various input modalities, including but not limited to text, depth maps, sketches, and colored scripts. Applying depth estimation further enables 3D point cloud generation and dynamic scene exploration with fluid camera motion. Code is available at https://github.com/IntelLabs/MMPano. The video presentation is available at https://youtu.be/XDMNEzH4-Ec?list=PLG9Zyvu7iBa0-a7ccNLO8LjcVRAoMn57s.
Fully Exploiting Vision Foundation Model's Profound Prior Knowledge for Generalizable RGB-Depth Driving Scene Parsing
Recent vision foundation models (VFMs), typically based on Vision Transformer (ViT), have significantly advanced numerous computer vision tasks. Despite their success in tasks focused solely on RGB images, the potential of VFMs in RGB-depth driving scene parsing remains largely under-explored. In this article, we take one step toward this emerging research area by investigating a feasible technique to fully exploit VFMs for generalizable RGB-depth driving scene parsing. Specifically, we explore the inherent characteristics of RGB and depth data, thereby presenting a Heterogeneous Feature Integration Transformer (HFIT). This network enables the efficient extraction and integration of comprehensive heterogeneous features without re-training ViTs. Relative depth prediction results from VFMs, used as inputs to the HFIT side adapter, overcome the limitations of the dependence on depth maps. Our proposed HFIT demonstrates superior performance compared to all other traditional single-modal and data-fusion scene parsing networks, pre-trained VFMs, and ViT adapters on the Cityscapes and KITTI Semantics datasets. We believe this novel strategy paves the way for future innovations in VFM-based data-fusion techniques for driving scene parsing. Our source code is publicly available at https://mias.group/HFIT.
Joint Adaptive Representations for Image-Language Learning
Image-language learning has made unprecedented progress in visual understanding. These developments have come at high costs, as contemporary vision-language models require large model scales and amounts of data. We here propose a much easier recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets. Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features. This results in a more effective image-language learning, greatly lowering the FLOPs by combining and reducing the number of tokens for both text and images, e.g. a 33\% reduction in FLOPs is achieved, compared to baseline fusion techniques used by popular image-language models, while improving performance. This also allows the model to scale without a large increase in FLOPs or memory. In addition, we propose adaptive pre-training data sampling which improves the data efficiency. The proposed approach achieves competitive performance compared to much larger models, and does so with significantly less data and FLOPs. With only 40M training examples and with 39 GFLOPs our lightweight model outperforms many times larger state-of-the-art models of 2-20x more FLOPs and using bigger datasets some of which with close to 1B training examples.
Cached Multi-Lora Composition for Multi-Concept Image Generation
Low-Rank Adaptation (LoRA) has emerged as a widely adopted technique in text-to-image models, enabling precise rendering of multiple distinct elements, such as characters and styles, in multi-concept image generation. However, current approaches face significant challenges when composing these LoRAs for multi-concept image generation, resulting in diminished generated image quality. In this paper, we initially investigate the role of LoRAs in the denoising process through the lens of the Fourier frequency domain. Based on the hypothesis that applying multiple LoRAs could lead to "semantic conflicts", we find that certain LoRAs amplify high-frequency features such as edges and textures, whereas others mainly focus on low-frequency elements, including the overall structure and smooth color gradients. Building on these insights, we devise a frequency domain based sequencing strategy to determine the optimal order in which LoRAs should be integrated during inference. This strategy offers a methodical and generalizable solution compared to the naive integration commonly found in existing LoRA fusion techniques. To fully leverage our proposed LoRA order sequence determination method in multi-LoRA composition tasks, we introduce a novel, training-free framework, Cached Multi-LoRA (CMLoRA), designed to efficiently integrate multiple LoRAs while maintaining cohesive image generation. With its flexible backbone for multi-LoRA fusion and a non-uniform caching strategy tailored to individual LoRAs, CMLoRA has the potential to reduce semantic conflicts in LoRA composition and improve computational efficiency. Our experimental evaluations demonstrate that CMLoRA outperforms state-of-the-art training-free LoRA fusion methods by a significant margin -- it achieves an average improvement of 2.19% in CLIPScore, and 11.25% in MLLM win rate compared to LoraHub, LoRA Composite, and LoRA Switch.
ViLLA-MMBench: A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation
Recommending long-form video content demands joint modeling of visual, audio, and textual modalities, yet most benchmarks address only raw features or narrow fusion. We present ViLLA-MMBench, a reproducible, extensible benchmark for LLM-augmented multimodal movie recommendation. Built on MovieLens and MMTF-14K, it aligns dense item embeddings from three modalities: audio (block-level, i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is automatically enriched using state-of-the-art LLMs (e.g., OpenAI Ada), generating high-quality synopses for thousands of movies. All text (raw or augmented) is embedded with configurable encoders (Ada, LLaMA-2, Sentence-T5), producing multiple ready-to-use sets. The pipeline supports interchangeable early-, mid-, and late-fusion (concatenation, PCA, CCA, rank-aggregation) and multiple backbones (MF, VAECF, VBPR, AMR, VMF) for ablation. Experiments are fully declarative via a single YAML file. Evaluation spans accuracy (Recall, nDCG) and beyond-accuracy metrics: cold-start rate, coverage, novelty, diversity, fairness. Results show LLM-based augmentation and strong text embeddings boost cold-start and coverage, especially when fused with audio-visual features. Systematic benchmarking reveals universal versus backbone- or metric-specific combinations. Open-source code, embeddings, and configs enable reproducible, fair multimodal RS research and advance principled generative AI integration in large-scale recommendation. Code: https://recsys-lab.github.io/ViLLA-MMBench
Visual News: Benchmark and Challenges in News Image Captioning
We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events are of paramount importance. Our proposed method can effectively combine visual and textual features to generate captions with richer information such as events and entities. More specifically, built upon the Transformer architecture, our model is further equipped with novel multi-modal feature fusion techniques and attention mechanisms, which are designed to generate named entities more accurately. Our method utilizes much fewer parameters while achieving slightly better prediction results than competing methods. Our larger and more diverse Visual News dataset further highlights the remaining challenges in captioning news images.
Context Tuning for Retrieval Augmented Generation
Large language models (LLMs) have the remarkable ability to solve new tasks with just a few examples, but they need access to the right tools. Retrieval Augmented Generation (RAG) addresses this problem by retrieving a list of relevant tools for a given task. However, RAG's tool retrieval step requires all the required information to be explicitly present in the query. This is a limitation, as semantic search, the widely adopted tool retrieval method, can fail when the query is incomplete or lacks context. To address this limitation, we propose Context Tuning for RAG, which employs a smart context retrieval system to fetch relevant information that improves both tool retrieval and plan generation. Our lightweight context retrieval model uses numerical, categorical, and habitual usage signals to retrieve and rank context items. Our empirical results demonstrate that context tuning significantly enhances semantic search, achieving a 3.5-fold and 1.5-fold improvement in Recall@K for context retrieval and tool retrieval tasks respectively, and resulting in an 11.6% increase in LLM-based planner accuracy. Additionally, we show that our proposed lightweight model using Reciprocal Rank Fusion (RRF) with LambdaMART outperforms GPT-4 based retrieval. Moreover, we observe context augmentation at plan generation, even after tool retrieval, reduces hallucination.
Fusing LLM Capabilities with Routing Data
The rapid advancement of large language models (LLMs) has created a vibrant ecosystem of diverse architectures, each with unique strengths due to differences in design, training data, and objectives. However, most applications still rely on a single backend model, limiting coverage of capabilities and leading to inefficiencies in performance and token cost when tackling complex tasks. We highlight an underexploited opportunity: LLM routing data, produced when hosting platforms route diverse queries to different models, which can reveal comparative strengths across tasks. To address this, we propose FusionBench, a comprehensive routing benchmark covering 14 tasks across five domains with 20 open-source LLMs (8B to 671B parameters), capturing 103M tokens and summarizing reusable thought templates from top models. Building on this, we introduce FusionFactory, a systematic fusion framework with three levels: (1) query-level fusion, tailoring routers for each query using both direct responses and reasoning-augmented outputs; (2) thought-level fusion, leveraging abstract templates derived from top-performing LLMs' answers to similar queries; and (3) model-level fusion, transferring capabilities between models via distillation, using top responses or highest judge scores as training data. Experiments show FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with optimal fusion configurations varying by benchmark, demonstrating the value of systematic LLM fusion in harnessing complementary strengths and improving overall performance.
Uni-Sign: Toward Unified Sign Language Understanding at Scale
Sign language pre-training has gained increasing attention for its ability to enhance performance across various sign language understanding (SLU) tasks. However, existing methods often suffer from a gap between pre-training and fine-tuning, leading to suboptimal results. To address this, we propose Uni-Sign, a unified pre-training framework that eliminates the gap between pre-training and downstream SLU tasks through a large-scale generative pre-training strategy and a novel fine-tuning paradigm. First, we introduce CSL-News, a large-scale Chinese Sign Language (CSL) dataset containing 1,985 hours of video paired with textual annotations, which enables effective large-scale pre-training. Second, Uni-Sign unifies SLU tasks by treating downstream tasks as a single sign language translation (SLT) task during fine-tuning, ensuring seamless knowledge transfer between pre-training and fine-tuning. Furthermore, we incorporate a prior-guided fusion (PGF) module and a score-aware sampling strategy to efficiently fuse pose and RGB information, addressing keypoint inaccuracies and improving computational efficiency. Extensive experiments across multiple SLU benchmarks demonstrate that Uni-Sign achieves state-of-the-art performance across multiple downstream SLU tasks. Dataset and code are available at github.com/ZechengLi19/Uni-Sign.
Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation
Recent advances in Large Language Models (LLMs) have demonstrated significant potential in the field of Recommendation Systems (RSs). Most existing studies have focused on converting user behavior logs into textual prompts and leveraging techniques such as prompt tuning to enable LLMs for recommendation tasks. Meanwhile, research interest has recently grown in multimodal recommendation systems that integrate data from images, text, and other sources using modality fusion techniques. This introduces new challenges to the existing LLM-based recommendation paradigm which relies solely on text modality information. Moreover, although Multimodal Large Language Models (MLLMs) capable of processing multi-modal inputs have emerged, how to equip MLLMs with multi-modal recommendation capabilities remains largely unexplored. To this end, in this paper, we propose the Multimodal Large Language Model-enhanced Multimodaln Sequential Recommendation (MLLM-MSR) model. To capture the dynamic user preference, we design a two-stage user preference summarization method. Specifically, we first utilize an MLLM-based item-summarizer to extract image feature given an item and convert the image into text. Then, we employ a recurrent user preference summarization generation paradigm to capture the dynamic changes in user preferences based on an LLM-based user-summarizer. Finally, to enable the MLLM for multi-modal recommendation task, we propose to fine-tune a MLLM-based recommender using Supervised Fine-Tuning (SFT) techniques. Extensive evaluations across various datasets validate the effectiveness of MLLM-MSR, showcasing its superior ability to capture and adapt to the evolving dynamics of user preferences.
A Post-Training Enhanced Optimization Approach for Small Language Models
This paper delves into the continuous post-training optimization methods for small language models, and proposes a continuous post-training alignment data construction method for small language models. The core of this method is based on the data guidance of large models, optimizing the diversity and accuracy of alignment data. In addition, to verify the effectiveness of the methods in this paper, we used Qwen2-0.5B-Instruct model as the baseline model for small language models, using the alignment dataset constructed by our proposed method, we trained and compared several groups of experiments, including SFT (Supervised Fine Tuning) post-training experiment and KTO (Kahneman Tversky optimization) post-training experiment, as well as SFT-KTO two-stage post-training experiment and model weight fusion experiment. Finally, we evaluated and analyzed the performance of post-training models, and confirmed that the continuous post-training optimization method proposed by us can significantly improve the performance of small language models.
Common Practices and Taxonomy in Deep Multi-view Fusion for Remote Sensing Applications
The advances in remote sensing technologies have boosted applications for Earth observation. These technologies provide multiple observations or views with different levels of information. They might contain static or temporary views with different levels of resolution, in addition to having different types and amounts of noise due to sensor calibration or deterioration. A great variety of deep learning models have been applied to fuse the information from these multiple views, known as deep multi-view or multi-modal fusion learning. However, the approaches in the literature vary greatly since different terminology is used to refer to similar concepts or different illustrations are given to similar techniques. This article gathers works on multi-view fusion for Earth observation by focusing on the common practices and approaches used in the literature. We summarize and structure insights from several different publications concentrating on unifying points and ideas. In this manuscript, we provide a harmonized terminology while at the same time mentioning the various alternative terms that are used in literature. The topics covered by the works reviewed focus on supervised learning with the use of neural network models. We hope this review, with a long list of recent references, can support future research and lead to a unified advance in the area.
The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries
Existing GPU libraries often struggle to fully exploit the parallel resources and on-chip memory (SRAM) of GPUs when chaining multiple GPU functions as individual kernels. While Kernel Fusion (KF) techniques like Horizontal Fusion (HF) and Vertical Fusion (VF) can mitigate this, current library implementations often require library developers to manually create fused kernels. Hence, library users rely on limited sets of pre-compiled or template-based fused kernels. This limits the use cases that can benefit from HF and VF and increases development costs. In order to solve these issues, we present a novel methodology for building GPU libraries that enables automatic on-demand HF and VF for arbitrary combinations of GPU library functions. Our methodology defines reusable, fusionable components that users combine via high-level programming interfaces. Leveraging C++17 metaprogramming features available in compilers like nvcc, our methodology generates a single and optimized fused kernel tailored to the user's specific sequence of operations at compile time, without needing a custom compiler or manual development and pre-compilation of kernel combinations. This approach abstracts low-level GPU complexities while maximizing GPU resource utilization and keeping intermediate data in SRAM. We provide an open-source implementation demonstrating significant speedups compared to traditional libraries in various benchmarks, validating the effectiveness of this methodology for improving GPU performance in the range of 2x to more than 1000x, while preserving high-level programmability.
An Ensemble of Convolutional Neural Networks for Audio Classification
In this paper, ensembles of classifiers that exploit several data augmentation techniques and four signal representations for training Convolutional Neural Networks (CNNs) for audio classification are presented and tested on three freely available audio classification datasets: i) bird calls, ii) cat sounds, and iii) the Environmental Sound Classification dataset. The best performing ensembles combining data augmentation techniques with different signal representations are compared and shown to outperform the best methods reported in the literature on these datasets. The approach proposed here obtains state-of-the-art results in the widely used ESC-50 dataset. To the best of our knowledge, this is the most extensive study investigating ensembles of CNNs for audio classification. Results demonstrate not only that CNNs can be trained for audio classification but also that their fusion using different techniques works better than the stand-alone classifiers.
FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not clearly improve first-stage retrieval accuracy (two out of five topics). We hope that FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are available at: https://fresh-stack.github.io.
AI Flow: Perspectives, Scenarios, and Approaches
Pioneered by the foundational information theory by Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, device-edge-cloud framework serves as the foundation, which integrates end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.
Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques
The rapid progress of Multimodal Large Language Models(MLLMs) has transformed the AI landscape. These models combine pre-trained LLMs with various modality encoders. This integration requires a systematic understanding of how different modalities connect to the language backbone. Our survey presents an LLM-centric analysis of current approaches. We examine methods for transforming and aligning diverse modal inputs into the language embedding space. This addresses a significant gap in existing literature. We propose a classification framework for MLLMs based on three key dimensions. First, we examine architectural strategies for modality integration. This includes both the specific integration mechanisms and the fusion level. Second, we categorize representation learning techniques as either joint or coordinate representations. Third, we analyze training paradigms, including training strategies and objective functions. By examining 125 MLLMs developed between 2021 and 2025, we identify emerging patterns in the field. Our taxonomy provides researchers with a structured overview of current integration techniques. These insights aim to guide the development of more robust multimodal integration strategies for future models built on pre-trained foundations.
FFN Fusion: Rethinking Sequential Computation in Large Language Models
We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71X speedup in inference latency and 35X lower per-token cost while maintaining strong performance across benchmarks. Through extensive experiments on models from 49B to 253B parameters, we demonstrate that FFN Fusion becomes increasingly effective at larger scales and can complement existing optimization techniques like quantization and pruning. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.
SynthEnsemble: A Fusion of CNN, Vision Transformer, and Hybrid Models for Multi-Label Chest X-Ray Classification
Chest X-rays are widely used to diagnose thoracic diseases, but the lack of detailed information about these abnormalities makes it challenging to develop accurate automated diagnosis systems, which is crucial for early detection and effective treatment. To address this challenge, we employed deep learning techniques to identify patterns in chest X-rays that correspond to different diseases. We conducted experiments on the "ChestX-ray14" dataset using various pre-trained CNNs, transformers, hybrid(CNN+Transformer) models and classical models. The best individual model was the CoAtNet, which achieved an area under the receiver operating characteristic curve (AUROC) of 84.2%. By combining the predictions of all trained models using a weighted average ensemble where the weight of each model was determined using differential evolution, we further improved the AUROC to 85.4%, outperforming other state-of-the-art methods in this field. Our findings demonstrate the potential of deep learning techniques, particularly ensemble deep learning, for improving the accuracy of automatic diagnosis of thoracic diseases from chest X-rays. Code available at:https://github.com/syednabilashraf/SynthEnsemble
FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
Stereo-LiDAR Fusion by Semi-Global Matching With Discrete Disparity-Matching Cost and Semidensification
We present a real-time, non-learning depth estimation method that fuses Light Detection and Ranging (LiDAR) data with stereo camera input. Our approach comprises three key techniques: Semi-Global Matching (SGM) stereo with Discrete Disparity-matching Cost (DDC), semidensification of LiDAR disparity, and a consistency check that combines stereo images and LiDAR data. Each of these components is designed for parallelization on a GPU to realize real-time performance. When it was evaluated on the KITTI dataset, the proposed method achieved an error rate of 2.79\%, outperforming the previous state-of-the-art real-time stereo-LiDAR fusion method, which had an error rate of 3.05\%. Furthermore, we tested the proposed method in various scenarios, including different LiDAR point densities, varying weather conditions, and indoor environments, to demonstrate its high adaptability. We believe that the real-time and non-learning nature of our method makes it highly practical for applications in robotics and automation.
Instruction Fusion: Advancing Prompt Evolution through Hybridization
The fine-tuning of Large Language Models (LLMs) specialized in code generation has seen notable advancements through the use of open-domain coding queries. Despite the successes, existing methodologies like Evol-Instruct encounter performance limitations, impeding further enhancements in code generation tasks. This paper examines the constraints of existing prompt evolution techniques and introduces a novel approach, Instruction Fusion (IF). IF innovatively combines two distinct prompts through a hybridization process, thereby enhancing the evolution of training prompts for code LLMs. Our experimental results reveal that the proposed novel method effectively addresses the shortcomings of prior methods, significantly improving the performance of Code LLMs across five code generation benchmarks, namely HumanEval, HumanEval+, MBPP, MBPP+ and MultiPL-E, which underscore the effectiveness of Instruction Fusion in advancing the capabilities of LLMs in code generation.
CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
Controllable, high-fidelity mesh editing remains a significant challenge in 3D content creation. Existing generative methods often struggle with complex geometries and fail to produce detailed results. We propose CraftMesh, a novel framework for high-fidelity generative mesh manipulation via Poisson Seamless Fusion. Our key insight is to decompose mesh editing into a pipeline that leverages the strengths of 2D and 3D generative models: we edit a 2D reference image, then generate a region-specific 3D mesh, and seamlessly fuse it into the original model. We introduce two core techniques: Poisson Geometric Fusion, which utilizes a hybrid SDF/Mesh representation with normal blending to achieve harmonious geometric integration, and Poisson Texture Harmonization for visually consistent texture blending. Experimental results demonstrate that CraftMesh outperforms state-of-the-art methods, delivering superior global consistency and local detail in complex editing tasks.
Deep Learning Fusion For Effective Malware Detection: Leveraging Visual Features
Malware has become a formidable threat as it has been growing exponentially in number and sophistication, thus, it is imperative to have a solution that is easy to implement, reliable, and effective. While recent research has introduced deep learning multi-feature fusion algorithms, they lack a proper explanation. In this work, we investigate the power of fusing Convolutional Neural Network models trained on different modalities of a malware executable. We are proposing a novel multimodal fusion algorithm, leveraging three different visual malware features: Grayscale Image, Entropy Graph, and SimHash Image, with which we conducted exhaustive experiments independently on each feature and combinations of all three of them using fusion operators such as average, maximum, add, and concatenate for effective malware detection and classification. The proposed strategy has a detection rate of 1.00 (on a scale of 0-1) in identifying malware in the given dataset. We explained its interpretability with visualization techniques such as t-SNE and Grad-CAM. Experimental results show the model works even for a highly imbalanced dataset. We also assessed the effectiveness of the proposed method on obfuscated malware and achieved state-of-the-art results. The proposed methodology is more reliable as our findings prove VGG16 model can detect and classify malware in a matter of seconds in real-time.
ASDF: Assembly State Detection Utilizing Late Fusion by Integrating 6D Pose Estimation
In medical and industrial domains, providing guidance for assembly processes can be critical to ensure efficiency and safety. Errors in assembly can lead to significant consequences such as extended surgery times and prolonged manufacturing or maintenance times in industry. Assembly scenarios can benefit from in-situ augmented reality visualization, i.e., augmentations in close proximity to the target object, to provide guidance, reduce assembly times, and minimize errors. In order to enable in-situ visualization, 6D pose estimation can be leveraged to identify the correct location for an augmentation. Existing 6D pose estimation techniques primarily focus on individual objects and static captures. However, assembly scenarios have various dynamics, including occlusion during assembly and dynamics in the appearance of assembly objects. Existing work focus either on object detection combined with state detection, or focus purely on the pose estimation. To address the challenges of 6D pose estimation in combination with assembly state detection, our approach ASDF builds upon the strengths of YOLOv8, a real-time capable object detection framework. We extend this framework, refine the object pose, and fuse pose knowledge with network-detected pose information. Utilizing our late fusion in our Pose2State module results in refined 6D pose estimation and assembly state detection. By combining both pose and state information, our Pose2State module predicts the final assembly state with precision. The evaluation of our ASDF dataset shows that our Pose2State module leads to an improved assembly state detection and that the improvement of the assembly state further leads to a more robust 6D pose estimation. Moreover, on the GBOT dataset, we outperform the pure deep learning-based network and even outperform the hybrid and pure tracking-based approaches.
Predicting Thermoelectric Power Factor of Bismuth Telluride During Laser Powder Bed Fusion Additive Manufacturing
An additive manufacturing (AM) process, like laser powder bed fusion, allows for the fabrication of objects by spreading and melting powder in layers until a freeform part shape is created. In order to improve the properties of the material involved in the AM process, it is important to predict the material characterization property as a function of the processing conditions. In thermoelectric materials, the power factor is a measure of how efficiently the material can convert heat to electricity. While earlier works have predicted the material characterization properties of different thermoelectric materials using various techniques, implementation of machine learning models to predict the power factor of bismuth telluride (Bi2Te3) during the AM process has not been explored. This is important as Bi2Te3 is a standard material for low temperature applications. Thus, we used data about manufacturing processing parameters involved and in-situ sensor monitoring data collected during AM of Bi2Te3, to train different machine learning models in order to predict its thermoelectric power factor. We implemented supervised machine learning techniques using 80% training and 20% test data and further used the permutation feature importance method to identify important processing parameters and in-situ sensor features which were best at predicting power factor of the material. Ensemble-based methods like random forest, AdaBoost classifier, and bagging classifier performed the best in predicting power factor with the highest accuracy of 90% achieved by the bagging classifier model. Additionally, we found the top 15 processing parameters and in-situ sensor features to characterize the material manufacturing property like power factor. These features could further be optimized to maximize power factor of the thermoelectric material and improve the quality of the products built using this material.
Efficient Hybrid Zoom using Camera Fusion on Mobile Phones
DSLR cameras can achieve multiple zoom levels via shifting lens distances or swapping lens types. However, these techniques are not possible on smartphone devices due to space constraints. Most smartphone manufacturers adopt a hybrid zoom system: commonly a Wide (W) camera at a low zoom level and a Telephoto (T) camera at a high zoom level. To simulate zoom levels between W and T, these systems crop and digitally upsample images from W, leading to significant detail loss. In this paper, we propose an efficient system for hybrid zoom super-resolution on mobile devices, which captures a synchronous pair of W and T shots and leverages machine learning models to align and transfer details from T to W. We further develop an adaptive blending method that accounts for depth-of-field mismatches, scene occlusion, flow uncertainty, and alignment errors. To minimize the domain gap, we design a dual-phone camera rig to capture real-world inputs and ground-truths for supervised training. Our method generates a 12-megapixel image in 500ms on a mobile platform and compares favorably against state-of-the-art methods under extensive evaluation on real-world scenarios.
A Spatial-Frequency Aware Multi-Scale Fusion Network for Real-Time Deepfake Detection
With the rapid advancement of real-time deepfake generation techniques, forged content is becoming increasingly realistic and widespread across applications like video conferencing and social media. Although state-of-the-art detectors achieve high accuracy on standard benchmarks, their heavy computational cost hinders real-time deployment in practical applications. To address this, we propose the Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet), a lightweight yet effective architecture for real-time deepfake detection. We design a spatial-frequency hybrid aware module that jointly leverages spatial textures and frequency artifacts through a gated mechanism, enhancing sensitivity to subtle manipulations. A token-selective cross attention mechanism enables efficient multi-level feature interaction, while a residual-enhanced blur pooling structure helps retain key semantic cues during downsampling. Experiments on several benchmark datasets show that SFMFNet achieves a favorable balance between accuracy and efficiency, with strong generalization and practical value for real-time applications.
Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up
Information retrieval is indispensable for today's Internet applications, yet traditional semantic matching techniques often fall short in capturing the fine-grained cross-modal interactions required for complex queries. Although late-fusion two-tower architectures attempt to bridge this gap by independently encoding visual and textual data before merging them at a high level, they frequently overlook the subtle interplay essential for comprehensive understanding. In this work, we rigorously assess these limitations and introduce a unified retrieval framework that fuses visual and textual cues from the ground up, enabling early cross-modal interactions for enhancing context interpretation. Through a two-stage training process--comprising post-training adaptation followed by instruction tuning--we adapt MLLMs as retrievers using a simple one-tower architecture. Our approach outperforms conventional methods across diverse retrieval scenarios, particularly when processing complex multi-modal inputs. Notably, the joint fusion encoder yields greater improvements on tasks that require modality fusion compared to those that do not, underscoring the transformative potential of early integration strategies and pointing toward a promising direction for contextually aware and effective information retrieval.
Benchmarking Robustness of AI-Enabled Multi-sensor Fusion Systems: Challenges and Opportunities
Multi-Sensor Fusion (MSF) based perception systems have been the foundation in supporting many industrial applications and domains, such as self-driving cars, robotic arms, and unmanned aerial vehicles. Over the past few years, the fast progress in data-driven artificial intelligence (AI) has brought a fast-increasing trend to empower MSF systems by deep learning techniques to further improve performance, especially on intelligent systems and their perception systems. Although quite a few AI-enabled MSF perception systems and techniques have been proposed, up to the present, limited benchmarks that focus on MSF perception are publicly available. Given that many intelligent systems such as self-driving cars are operated in safety-critical contexts where perception systems play an important role, there comes an urgent need for a more in-depth understanding of the performance and reliability of these MSF systems. To bridge this gap, we initiate an early step in this direction and construct a public benchmark of AI-enabled MSF-based perception systems including three commonly adopted tasks (i.e., object detection, object tracking, and depth completion). Based on this, to comprehensively understand MSF systems' robustness and reliability, we design 14 common and realistic corruption patterns to synthesize large-scale corrupted datasets. We further perform a systematic evaluation of these systems through our large-scale evaluation. Our results reveal the vulnerability of the current AI-enabled MSF perception systems, calling for researchers and practitioners to take robustness and reliability into account when designing AI-enabled MSF.
Cross-modal feature fusion for robust point cloud registration with ambiguous geometry
Point cloud registration has seen significant advancements with the application of deep learning techniques. However, existing approaches often overlook the potential of integrating radiometric information from RGB images. This limitation reduces their effectiveness in aligning point clouds pairs, especially in regions where geometric data alone is insufficient. When used effectively, radiometric information can enhance the registration process by providing context that is missing from purely geometric data. In this paper, we propose CoFF, a novel Cross-modal Feature Fusion method that utilizes both point cloud geometry and RGB images for pairwise point cloud registration. Assuming that the co-registration between point clouds and RGB images is available, CoFF explicitly addresses the challenges where geometric information alone is unclear, such as in regions with symmetric similarity or planar structures, through a two-stage fusion of 3D point cloud features and 2D image features. It incorporates a cross-modal feature fusion module that assigns pixel-wise image features to 3D input point clouds to enhance learned 3D point features, and integrates patch-wise image features with superpoint features to improve the quality of coarse matching. This is followed by a coarse-to-fine matching module that accurately establishes correspondences using the fused features. We extensively evaluate CoFF on four common datasets: 3DMatch, 3DLoMatch, IndoorLRS, and the recently released ScanNet++ datasets. In addition, we assess CoFF on specific subset datasets containing geometrically ambiguous cases. Our experimental results demonstrate that CoFF achieves state-of-the-art registration performance across all benchmarks, including remarkable registration recalls of 95.9% and 81.6% on the widely-used 3DMatch and 3DLoMatch datasets, respectively...(Truncated to fit arXiv abstract length)
Small Vision-Language Models: A Survey on Compact Architectures and Techniques
The emergence of small vision-language models (sVLMs) marks a critical advancement in multimodal AI, enabling efficient processing of visual and textual data in resource-constrained environments. This survey offers a comprehensive exploration of sVLM development, presenting a taxonomy of architectures - transformer-based, mamba-based, and hybrid - that highlight innovations in compact design and computational efficiency. Techniques such as knowledge distillation, lightweight attention mechanisms, and modality pre-fusion are discussed as enablers of high performance with reduced resource requirements. Through an in-depth analysis of models like TinyGPT-V, MiniGPT-4, and VL-Mamba, we identify trade-offs between accuracy, efficiency, and scalability. Persistent challenges, including data biases and generalization to complex tasks, are critically examined, with proposed pathways for addressing them. By consolidating advancements in sVLMs, this work underscores their transformative potential for accessible AI, setting a foundation for future research into efficient multimodal systems.
Late fusion ensembles for speech recognition on diverse input audio representations
We explore diverse representations of speech audio, and their effect on a performance of late fusion ensemble of E-Branchformer models, applied to Automatic Speech Recognition (ASR) task. Although it is generally known that ensemble methods often improve the performance of the system even for speech recognition, it is very interesting to explore how ensembles of complex state-of-the-art models, such as medium-sized and large E-Branchformers, cope in this setting when their base models are trained on diverse representations of the input speech audio. The results are evaluated on four widely-used benchmark datasets: Librispeech, Aishell, Gigaspeech, TEDLIUMv2 and show that improvements of 1% - 14% can still be achieved over the state-of-the-art models trained using comparable techniques on these datasets. A noteworthy observation is that such ensemble offers improvements even with the use of language models, although the gap is closing.
Generative Query Reformulation Using Ensemble Prompting, Document Fusion, and Relevance Feedback
Query Reformulation (QR) is a set of techniques used to transform a user's original search query to a text that better aligns with the user's intent and improves their search experience. Recently, zero-shot QR has been a promising approach due to its ability to exploit knowledge inherent in large language models. Inspired by the success of ensemble prompting strategies which have benefited other tasks, we investigate if they can improve query reformulation. In this context, we propose two ensemble-based prompting techniques, GenQREnsemble and GenQRFusion which leverage paraphrases of a zero-shot instruction to generate multiple sets of keywords to improve retrieval performance ultimately. We further introduce their post-retrieval variants to incorporate relevance feedback from a variety of sources, including an oracle simulating a human user and a "critic" LLM. We demonstrate that an ensemble of query reformulations can improve retrieval effectiveness by up to 18% on nDCG@10 in pre-retrieval settings and 9% on post-retrieval settings on multiple benchmarks, outperforming all previously reported SOTA results. We perform subsequent analyses to investigate the effects of feedback documents, incorporate domain-specific instructions, filter reformulations, and generate fluent reformulations that might be more beneficial to human searchers. Together, the techniques and the results presented in this paper establish a new state of the art in automated query reformulation for retrieval and suggest promising directions for future research.
Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion
Merging models fine-tuned from a common, extensively pre-trained large model but specialized for different tasks has been demonstrated as a cheap and scalable strategy to construct a multi-task model that performs well across diverse tasks. Recent research, exemplified by task arithmetic, highlights that this multi-task model can be derived through arithmetic operations on task vectors. Nevertheless, current merging techniques frequently resolve potential conflicts among parameters from task-specific models by evaluating individual attributes, such as the parameters' magnitude or sign, overlooking their collective impact on the overall functionality of the model. In this work, we propose the CONtinuous relaxation of disCRETE (Concrete) subspace learning method to identify a common low-dimensional subspace and utilize its shared information to track the interference problem without sacrificing much performance. Specifically, we model the problem as a bi-level optimization problem and introduce a meta-learning framework to find the Concrete subspace mask through gradient-based techniques. At the upper level, we focus on learning a shared Concrete mask to identify the subspace, while at the inner level, model merging is performed to maximize the performance of the merged model. We conduct extensive experiments on both vision domain and language domain, and the results demonstrate the effectiveness of our method. The code is available at https://github.com/tanganke/subspace_fusion
RGB-D-Fusion: Image Conditioned Depth Diffusion of Humanoid Subjects
We present RGB-D-Fusion, a multi-modal conditional denoising diffusion probabilistic model to generate high resolution depth maps from low-resolution monocular RGB images of humanoid subjects. RGB-D-Fusion first generates a low-resolution depth map using an image conditioned denoising diffusion probabilistic model and then upsamples the depth map using a second denoising diffusion probabilistic model conditioned on a low-resolution RGB-D image. We further introduce a novel augmentation technique, depth noise augmentation, to increase the robustness of our super-resolution model.
Improving Federated Learning Communication Efficiency with Global Momentum Fusion for Gradient Compression Schemes
Communication costs within Federated learning hinder the system scalability for reaching more data from more clients. The proposed FL adopts a hub-and-spoke network topology. All clients communicate through the central server. Hence, reducing communication overheads via techniques such as data compression has been proposed to mitigate this issue. Another challenge of federated learning is unbalanced data distribution, data on each client are not independent and identically distributed (non-IID) in a typical federated learning setting. In this paper, we proposed a new compression compensation scheme called Global Momentum Fusion (GMF) which reduces communication overheads between FL clients and the server and maintains comparable model accuracy in the presence of non-IID data. GitHub repository: https://github.com/tony92151/global-momentum-fusion-fl
A safety realignment framework via subspace-oriented model fusion for large language models
The current safeguard mechanisms for large language models (LLMs) are indeed susceptible to jailbreak attacks, making them inherently fragile. Even the process of fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning subsequent to downstream fine-tuning. However, there's a risk of catastrophic forgetting during safety fine-tuning, where LLMs may regain safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to combine the safeguard capabilities of initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors by subspace masking techniques. Finally, we explore the fusion of the initial safely aligned LLM with all task vectors based on the identified safety subspace. We validate that our safety realignment framework satisfies the safety requirements of a single fine-tuned model as well as multiple models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math.
Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
Autoregressive models, despite their commendable performance in a myriad of generative tasks, face challenges stemming from their inherently sequential structure. Inference on these models, by design, harnesses a temporal dependency, where the current token's probability distribution is conditioned on preceding tokens. This inherent characteristic severely impedes computational efficiency during inference as a typical inference request can require more than thousands of tokens, where generating each token requires a load of entire model weights, making the inference more memory-bound. The large overhead becomes profound in real deployment where requests arrive randomly, necessitating various generation lengths. Existing solutions, such as dynamic batching and concurrent instances, introduce significant response delays and bandwidth contention, falling short of achieving optimal latency and throughput. To address these shortcomings, we propose Flover -- a temporal fusion framework for efficiently inferring multiple requests in parallel. We deconstruct the general generation pipeline into pre-processing and token generation, and equip the framework with a dedicated work scheduler for fusing the generation process temporally across all requests. By orchestrating the token-level parallelism, Flover exhibits optimal hardware efficiency and significantly spares the system resources. By further employing a fast buffer reordering algorithm that allows memory eviction of finished tasks, it brings over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging the advanced tensor parallel technique, Flover proves efficacious across diverse computational landscapes, from single-GPU setups to distributed scenarios, thereby offering robust performance optimization that adapts to variable use cases.
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. These groups exclusively process designated tokens while employing learned routing within each group to maintain semantically informed adaptivity. Our empirical results reveal substantial pre-training efficiency gains through this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing compared to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems.
Mamba-FCS: Joint Spatio- Frequency Feature Fusion, Change-Guided Attention, and SeK Loss for Enhanced Semantic Change Detection in Remote Sensing
Semantic Change Detection (SCD) from remote sensing imagery requires models balancing extensive spatial context, computational efficiency, and sensitivity to class-imbalanced land-cover transitions. While Convolutional Neural Networks excel at local feature extraction but lack global context, Transformers provide global modeling at high computational costs. Recent Mamba architectures based on state-space models offer compelling solutions through linear complexity and efficient long-range modeling. In this study, we introduce Mamba-FCS, a SCD framework built upon Visual State Space Model backbone incorporating, a Joint Spatio-Frequency Fusion block incorporating log-amplitude frequency domain features to enhance edge clarity and suppress illumination artifacts, a Change-Guided Attention (CGA) module that explicitly links the naturally intertwined BCD and SCD tasks, and a Separated Kappa (SeK) loss tailored for class-imbalanced performance optimization. Extensive evaluation on SECOND and Landsat-SCD datasets shows that Mamba-FCS achieves state-of-the-art metrics, 88.62% Overall Accuracy, 65.78% F_scd, and 25.50% SeK on SECOND, 96.25% Overall Accuracy, 89.27% F_scd, and 60.26% SeK on Landsat-SCD. Ablation analyses confirm distinct contributions of each novel component, with qualitative assessments highlighting significant improvements in SCD. Our results underline the substantial potential of Mamba architectures, enhanced by proposed techniques, setting a new benchmark for effective and scalable semantic change detection in remote sensing applications. The complete source code, configuration files, and pre-trained models will be publicly available upon publication.
NTK-approximating MLP Fusion for Efficient Language Model Fine-tuning
Fine-tuning a pre-trained language model (PLM) emerges as the predominant strategy in many natural language processing applications. However, even fine-tuning the PLMs and doing inference are expensive, especially on edge devices with low computing power. Some general approaches (e.g. quantization and distillation) have been widely studied to reduce the compute/memory of PLM fine-tuning, while very few one-shot compression techniques are explored. In this paper, we investigate the neural tangent kernel (NTK)--which reveals the gradient descent dynamics of neural networks--of the multilayer perceptrons (MLP) modules in a PLM and propose to coin a lightweight PLM through NTK-approximating MLP fusion. To achieve this, we reconsider the MLP as a bundle of sub-MLPs, and cluster them into a given number of centroids, which can then be restored as a compressed MLP and surprisingly shown to well approximate the NTK of the original PLM. Extensive experiments of PLM fine-tuning on both natural language understanding (NLU) and generation (NLG) tasks are provided to verify the effectiveness of the proposed method MLP fusion. Our code is available at https://github.com/weitianxin/MLP_Fusion.
Prot2Chat: Protein LLM with Early-Fusion of Text, Sequence and Structure
Motivation: Proteins are of great significance in living organisms. However, understanding their functions encounters numerous challenges, such as insufficient integration of multimodal information, a large number of training parameters, limited flexibility of classification-based methods, and the lack of systematic evaluation metrics for protein Q&A systems. To tackle these issues, we propose the Prot2Chat framework. Results: We modified ProteinMPNN to encode protein sequence and structural information in a unified way. We used a large language model (LLM) to encode questions into vectors and developed a protein-text adapter to compress protein information into virtual tokens based on these vectors, achieving the early fusion of text and protein information. Finally, the same LLM reads the virtual tokens and the questions to generate answers. To optimize training efficiency, we froze the encoder and employed Low-Rank Adaptation (LoRA) techniques for the LLM. Experiments on two datasets show that both automated metrics and expert evaluations demonstrate the superior performance of our model, and zero-shot prediction results highlight its generalization ability. The models and codes are available at https://github.com/ wangzc1233/Prot2Chat. Contact: [email protected] or [email protected] Key words: Protein Q&A, Early-Fusion, LLM
A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion
Model compression is essential in the deployment of large Computer Vision models on embedded devices. However, static optimization techniques (e.g. pruning, quantization, etc.) neglect the fact that different inputs have different complexities, thus requiring different amount of computations. Dynamic Neural Networks allow to condition the number of computations to the specific input. The current literature on the topic is very extensive and fragmented. We present a comprehensive survey that synthesizes and unifies existing Dynamic Neural Networks research in the context of Computer Vision. Additionally, we provide a logical taxonomy based on which component of the network is adaptive: the output, the computation graph or the input. Furthermore, we argue that Dynamic Neural Networks are particularly beneficial in the context of Sensor Fusion for better adaptivity, noise reduction and information prioritization. We present preliminary works in this direction. We complement this survey with a curated repository listing all the surveyed papers, each with a brief summary of the solution and the code base when available: https://github.com/DTU-PAS/awesome-dynn-for-cv .
Local Byte Fusion for Neural Machine Translation
Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in multilingual corpora, subword tokenization schemes over-segment low-resource languages leading to a drop in translation performance. A simple alternative to subword tokenizers is byte-based methods i.e. tokenization into byte sequences using encoding schemes such as UTF-8. Byte tokens often represent inputs at a sub-character granularity i.e. one character can be represented by a sequence of multiple byte tokens. This results in byte sequences that are significantly longer than character sequences. Enforcing aggregation of local information in the lower layers can guide the model to build higher-level semantic information. We propose a Local Byte Fusion (LOBEF) method for byte-based machine translation -- utilizing byte n-gram and word boundaries -- to aggregate local semantic information. Extensive experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional byte-based models and even over subword techniques. Further analysis also indicates that our byte-based models are parameter-efficient and can be trained faster than subword models.
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video's appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted video while maintaining the image quality, ensuring coherent lighting transitions across frames. Project page: https://bujiazi.github.io/light-a-video.github.io/.
Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head
End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models in the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an innovative Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. Notably, OmDet-Turbo-Base achieves a 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. Notably, in zero-shot scenarios on COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on par with current state-of-the-art supervised models. Furthermore, it establishes new state-of-the-art benchmarks on ODinW and OVDEval, boasting an AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of OmDet-Turbo in industrial applications is underscored by its exceptional performance on benchmark datasets and superior inference speed, positioning it as a compelling choice for real-time object detection tasks. Code: https://github.com/om-ai-lab/OmDet
Archon: An Architecture Search Framework for Inference-Time Techniques
Inference-time techniques are emerging as highly effective tools to enhance large language model (LLM) capabilities. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of individual inference-time techniques and the interactions between them. Additionally, efficiently and automatically searching the space of model choices, inference-time techniques, and their compositions is challenging due to the large design space. To address these challenges, we introduce Archon, a modular framework for selecting, combining, and stacking layers of inference-time techniques to construct optimized LLM systems for target benchmarks. Rather than relying on a single LLM called once, we leverage a diverse set of LLMs and inference-time techniques, creating LLM systems greater than the sum of their parts. Archon defines an extensible design space, encompassing techniques such as generation ensembling, repeated sampling, ranking, fusion, critiquing, verification, and unit testing. It transforms the problem of building LLM systems into a hyperparameter optimization objective. Given the available LLMs, inference-time techniques, and compute budget, Archon utilizes hyperparameter search techniques to discover optimized architectures for target benchmark(s). We evaluate Archon architectures across a range of instruction-following, reasoning, and coding benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. Archon architectures outperform frontier models, such as GPT-4o and Claude 3.5 Sonnet, on these benchmarks, achieving an average accuracy increase of 15.1 percentage points by using all available LLMs. We make our code and datasets available publicly on Github: https://github.com/ScalingIntelligence/Archon.
A Novel Approach to Malicious Code Detection Using CNN-BiLSTM and Feature Fusion
With the rapid advancement of Internet technology, the threat of malware to computer systems and network security has intensified. Malware affects individual privacy and security and poses risks to critical infrastructures of enterprises and nations. The increasing quantity and complexity of malware, along with its concealment and diversity, challenge traditional detection techniques. Static detection methods struggle against variants and packed malware, while dynamic methods face high costs and risks that limit their application. Consequently, there is an urgent need for novel and efficient malware detection techniques to improve accuracy and robustness. This study first employs the minhash algorithm to convert binary files of malware into grayscale images, followed by the extraction of global and local texture features using GIST and LBP algorithms. Additionally, the study utilizes IDA Pro to decompile and extract opcode sequences, applying N-gram and tf-idf algorithms for feature vectorization. The fusion of these features enables the model to comprehensively capture the behavioral characteristics of malware. In terms of model construction, a CNN-BiLSTM fusion model is designed to simultaneously process image features and opcode sequences, enhancing classification performance. Experimental validation on multiple public datasets demonstrates that the proposed method significantly outperforms traditional detection techniques in terms of accuracy, recall, and F1 score, particularly in detecting variants and obfuscated malware with greater stability. The research presented in this paper offers new insights into the development of malware detection technologies, validating the effectiveness of feature and model fusion, and holds promising application prospects.
Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles
Collaborative perception in automated vehicles leverages the exchange of information between agents, aiming to elevate perception results. Previous camera-based collaborative 3D perception methods typically employ 3D bounding boxes or bird's eye views as representations of the environment. However, these approaches fall short in offering a comprehensive 3D environmental prediction. To bridge this gap, we introduce the first method for collaborative 3D semantic occupancy prediction. Particularly, it improves local 3D semantic occupancy predictions by hybrid fusion of (i) semantic and occupancy task features, and (ii) compressed orthogonal attention features shared between vehicles. Additionally, due to the lack of a collaborative perception dataset designed for semantic occupancy prediction, we augment a current collaborative perception dataset to include 3D collaborative semantic occupancy labels for a more robust evaluation. The experimental findings highlight that: (i) our collaborative semantic occupancy predictions excel above the results from single vehicles by over 30%, and (ii) models anchored on semantic occupancy outpace state-of-the-art collaborative 3D detection techniques in subsequent perception applications, showcasing enhanced accuracy and enriched semantic-awareness in road environments.
TextBraTS: Text-Guided Volumetric Brain Tumor Segmentation with Innovative Dataset Development and Fusion Module Exploration
Deep learning has demonstrated remarkable success in medical image segmentation and computer-aided diagnosis. In particular, numerous advanced methods have achieved state-of-the-art performance in brain tumor segmentation from MRI scans. While recent studies in other medical imaging domains have revealed that integrating textual reports with visual data can enhance segmentation accuracy, the field of brain tumor analysis lacks a comprehensive dataset that combines radiological images with corresponding textual annotations. This limitation has hindered the exploration of multimodal approaches that leverage both imaging and textual data. To bridge this critical gap, we introduce the TextBraTS dataset, the first publicly available volume-level multimodal dataset that contains paired MRI volumes and rich textual annotations, derived from the widely adopted BraTS2020 benchmark. Building upon this novel dataset, we propose a novel baseline framework and sequential cross-attention method for text-guided volumetric medical image segmentation. Through extensive experiments with various text-image fusion strategies and templated text formulations, our approach demonstrates significant improvements in brain tumor segmentation accuracy, offering valuable insights into effective multimodal integration techniques. Our dataset, implementation code, and pre-trained models are publicly available at https://github.com/Jupitern52/TextBraTS.
OG-HFYOLO :Orientation gradient guidance and heterogeneous feature fusion for deformation table cell instance segmentation
Table structure recognition is a key task in document analysis. However, the geometric deformation in deformed tables causes a weak correlation between content information and structure, resulting in downstream tasks not being able to obtain accurate content information. To obtain fine-grained spatial coordinates of cells, we propose the OG-HFYOLO model, which enhances the edge response by Gradient Orientation-aware Extractor, combines a Heterogeneous Kernel Cross Fusion module and a scale-aware loss function to adapt to multi-scale objective features, and introduces mask-driven non-maximal suppression in the post-processing, which replaces the traditional bounding box suppression mechanism. Furthermore, we also propose a data generator, filling the gap in the dataset for fine-grained deformation table cell spatial coordinate localization, and derive a large-scale dataset named Deformation Wired Table (DWTAL). Experiments show that our proposed model demonstrates excellent segmentation accuracy on all mainstream instance segmentation models. The dataset and the source code are open source: https://github.com/justliulong/OGHFYOLO.
Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases
Recent advancements in tool-equipped Agents (LLMs) have enabled complex tasks like secure database interactions and multi-agent code development. However, scaling tool capacity beyond agent reasoning or model limits remains a challenge. In this paper, we address these challenges by introducing Toolshed Knowledge Bases, a tool knowledge base (vector database) designed to store enhanced tool representations and optimize tool selection for large-scale tool-equipped Agents. Additionally, we propose Advanced RAG-Tool Fusion, a novel ensemble of tool-applied advanced retrieval-augmented generation (RAG) techniques across the pre-retrieval, intra-retrieval, and post-retrieval phases, without requiring model fine-tuning. During pre-retrieval, tool documents are enhanced with key information and stored in the Toolshed Knowledge Base. Intra-retrieval focuses on query planning and transformation to increase retrieval accuracy. Post-retrieval refines the retrieved tool documents and enables self-reflection. Furthermore, by varying both the total number of tools (tool-M) an Agent has access to and the tool selection threshold (top-k), we address trade-offs between retrieval accuracy, agent performance, and token cost. Our approach achieves 46%, 56%, and 47% absolute improvements on the ToolE single-tool, ToolE multi-tool and Seal-Tools benchmark datasets, respectively (Recall@5).
Quadratic Time-Frequency Analysis of Vibration Signals for Diagnosing Bearing Faults
Diagnosis of bearing faults is paramount to reducing maintenance costs and operational breakdowns. Bearing faults are primary contributors to machine vibrations, and analyzing their signal morphology offers insights into their health status. Unfortunately, existing approaches are optimized for controlled environments, neglecting realistic conditions such as time-varying rotational speeds and the vibration's non-stationary nature. This paper presents a fusion of time-frequency analysis and deep learning techniques to diagnose bearing faults under time-varying speeds and varying noise levels. First, we formulate the bearing fault-induced vibrations and discuss the link between their non-stationarity and the bearing's inherent and operational parameters. We also elucidate quadratic time-frequency distributions and validate their effectiveness in resolving distinctive dynamic patterns associated with different bearing faults. Based on this, we design a time-frequency convolutional neural network (TF-CNN) to diagnose various faults in rolling-element bearings. Our experimental findings undeniably demonstrate the superior performance of TF-CNN in comparison to recently developed techniques. They also assert its versatility in capturing fault-relevant non-stationary features that couple with speed changes and show its exceptional resilience to noise, consistently surpassing competing methods across various signal-to-noise ratios and performance metrics. Altogether, the TF-CNN achieves substantial accuracy improvements up to 15%, in severe noise conditions.
Don't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation
Neural machine translation systems estimate probabilities of target sentences given source sentences, yet these estimates may not align with human preferences. This work introduces QE-fusion, a method utilizing a quality estimation metric (QE) that better correlates with human judgments to synthesize improved translations. QE-fusion leverages a candidate pool sampled from a model, combining spans from different candidates using QE metrics such as CometKiwi. We compare QE-fusion against beam search and recent reranking techniques, such as Minimum Bayes Risk decoding or QE-reranking. Our method consistently improves translation quality in terms of COMET and BLEURT scores when applied to large language models (LLMs) used for translation (PolyLM, XGLM, Llama2, and Mistral) and to multilingual translation models (NLLB), over five language pairs. Notably, QE-fusion exhibits larger improvements for LLMs due to their ability to generate diverse outputs. We demonstrate that our approach generates novel translations in over half of the cases and consistently outperforms other methods across varying numbers of candidates (5-200). Furthermore, we empirically establish that QE-fusion scales linearly with the number of candidates in the pool. QE-fusion proves effective in enhancing LLM-based translation without the need for costly retraining of LLMs.
A Systematic Study of Joint Representation Learning on Protein Sequences and Structures
Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions. Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge. In contrast, structure-based methods leverage 3D structural information with graph neural networks and geometric pre-training methods show potential in function prediction tasks, but still suffers from the limited number of available structures. To bridge this gap, our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM (ESM-2) with distinct structure encoders (GVP, GearNet, CDConv). We introduce three representation fusion strategies and explore different pre-training techniques. Our method achieves significant improvements over existing sequence- and structure-based methods, setting new state-of-the-art for function annotation. This study underscores several important design choices for fusing protein sequence and structure information. Our implementation is available at https://github.com/DeepGraphLearning/ESM-GearNet.
Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are updating in https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling.
Towards Good Practices for Missing Modality Robust Action Recognition
Standard multi-modal models assume the use of the same modalities in training and inference stages. However, in practice, the environment in which multi-modal models operate may not satisfy such assumption. As such, their performances degrade drastically if any modality is missing in the inference stage. We ask: how can we train a model that is robust to missing modalities? This paper seeks a set of good practices for multi-modal action recognition, with a particular interest in circumstances where some modalities are not available at an inference time. First, we study how to effectively regularize the model during training (e.g., data augmentation). Second, we investigate on fusion methods for robustness to missing modalities: we find that transformer-based fusion shows better robustness for missing modality than summation or concatenation. Third, we propose a simple modular network, ActionMAE, which learns missing modality predictive coding by randomly dropping modality features and tries to reconstruct them with the remaining modality features. Coupling these good practices, we build a model that is not only effective in multi-modal action recognition but also robust to modality missing. Our model achieves the state-of-the-arts on multiple benchmarks and maintains competitive performances even in missing modality scenarios. Codes are available at https://github.com/sangminwoo/ActionMAE.
SeMe: Training-Free Language Model Merging via Semantic Alignment
Despite the remarkable capabilities of Language Models (LMs) across diverse tasks, no single model consistently outperforms others, necessitating efficient methods to combine their strengths without expensive retraining. Existing model merging techniques, such as parameter averaging and task-guided fusion, often rely on data-dependent computations or fail to preserve internal knowledge, limiting their robustness and scalability. We introduce SeMe (Semantic-based Merging), a novel, data-free, and training-free approach that leverages latent semantic alignment to merge LMs at a fine-grained, layer-wise level. Unlike prior work, SeMe not only preserves model behaviors but also explicitly stabilizes internal knowledge, addressing a critical gap in LM fusion. Through extensive experiments across diverse architectures and tasks, we demonstrate that SeMe outperforms existing methods in both performance and efficiency while eliminating reliance on external data. Our work establishes a new paradigm for knowledge-aware model merging and provides insights into the semantic structure of LMs, paving the way for more scalable and interpretable model composition.
LongBoX: Evaluating Transformers on Long-Sequence Clinical Tasks
Many large language models (LLMs) for medicine have largely been evaluated on short texts, and their ability to handle longer sequences such as a complete electronic health record (EHR) has not been systematically explored. Assessing these models on long sequences is crucial since prior work in the general domain has demonstrated performance degradation of LLMs on longer texts. Motivated by this, we introduce LongBoX, a collection of seven medical datasets in text-to-text format, designed to investigate model performance on long sequences. Preliminary experiments reveal that both medical LLMs (e.g., BioGPT) and strong general domain LLMs (e.g., FLAN-T5) struggle on this benchmark. We further evaluate two techniques designed for long-sequence handling: (i) local-global attention, and (ii) Fusion-in-Decoder (FiD). Our results demonstrate mixed results with long-sequence handling - while scores on some datasets increase, there is substantial room for improvement. We hope that LongBoX facilitates the development of more effective long-sequence techniques for the medical domain. Data and source code are available at https://github.com/Mihir3009/LongBoX.
Self-Supervised Diffusion MRI Denoising via Iterative and Stable Refinement
Magnetic Resonance Imaging (MRI), including diffusion MRI (dMRI), serves as a ``microscope'' for anatomical structures and routinely mitigates the influence of low signal-to-noise ratio scans by compromising temporal or spatial resolution. However, these compromises fail to meet clinical demands for both efficiency and precision. Consequently, denoising is a vital preprocessing step, particularly for dMRI, where clean data is unavailable. In this paper, we introduce Di-Fusion, a fully self-supervised denoising method that leverages the latter diffusion steps and an adaptive sampling process. Unlike previous approaches, our single-stage framework achieves efficient and stable training without extra noise model training and offers adaptive and controllable results in the sampling process. Our thorough experiments on real and simulated data demonstrate that Di-Fusion achieves state-of-the-art performance in microstructure modeling, tractography tracking, and other downstream tasks. Code is available at https://github.com/FouierL/Di-Fusion.
A Survey on Non-Intrusive ASR Refinement: From Output-Level Correction to Full-Model Distillation
Automatic Speech Recognition (ASR) has become an integral component of modern technology, powering applications such as voice-activated assistants, transcription services, and accessibility tools. Yet ASR systems continue to struggle with the inherent variability of human speech, such as accents, dialects, and speaking styles, as well as environmental interference, including background noise. Moreover, domain-specific conversations often employ specialized terminology, which can exacerbate transcription errors. These shortcomings not only degrade raw ASR accuracy but also propagate mistakes through subsequent natural language processing pipelines. Because redesigning an ASR model is costly and time-consuming, non-intrusive refinement techniques that leave the model's architecture unchanged have become increasingly popular. In this survey, we systematically review current non-intrusive refinement approaches and group them into five classes: fusion, re-scoring, correction, distillation, and training adjustment. For each class, we outline the main methods, advantages, drawbacks, and ideal application scenarios. Beyond method classification, this work surveys adaptation techniques aimed at refining ASR in domain-specific contexts, reviews commonly used evaluation datasets along with their construction processes, and proposes a standardized set of metrics to facilitate fair comparisons. Finally, we identify open research gaps and suggest promising directions for future work. By providing this structured overview, we aim to equip researchers and practitioners with a clear foundation for developing more robust, accurate ASR refinement pipelines.
V2X Cooperative Perception for Autonomous Driving: Recent Advances and Challenges
Achieving fully autonomous driving with heightened safety and efficiency depends on vehicle-to-everything (V2X) cooperative perception (CP), which allows vehicles to share perception data, thereby enhancing situational awareness and overcoming the limitations of the sensing ability of individual vehicles. V2X CP is crucial for extending perception range, improving accuracy, and strengthening the decision-making and control capabilities of autonomous vehicles in complex environments. This paper provides a comprehensive survey of recent advances in V2X CP, introducing mathematical models of CP processes across various collaboration strategies. We examine essential techniques for reliable perception sharing, including agent selection, data alignment, and fusion methods. Key issues are analyzed, such as agent and model heterogeneity, perception uncertainty, and the impact of V2X communication constraints like delays and data loss on CP effectiveness. To inspire further advancements in V2X CP, we outline promising avenues, including privacy-preserving artificial intelligence (AI), collaborative AI, and integrated sensing frameworks, as pathways to enhance CP capabilities.
A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations
Modern deep neural networks, particularly recent large language models, come with massive model sizes that require significant computational and storage resources. To enable the deployment of modern models on resource-constrained environments and accelerate inference time, researchers have increasingly explored pruning techniques as a popular research direction in neural network compression. However, there is a dearth of up-to-date comprehensive review papers on pruning. To address this issue, in this survey, we provide a comprehensive review of existing research works on deep neural network pruning in a taxonomy of 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning and other compression techniques. We then provide a thorough comparative analysis of seven pairs of contrast settings for pruning (e.g., unstructured/structured) and explore emerging topics, including post-training pruning, different levels of supervision for pruning, and broader applications (e.g., adversarial robustness) to shed light on the commonalities and differences of existing methods and lay the foundation for further method development. To facilitate future research, we build a curated collection of datasets, networks, and evaluations on different applications. Finally, we provide some valuable recommendations on selecting pruning methods and prospect promising research directions. We build a repository at https://github.com/hrcheng1066/awesome-pruning.
Knowledge Grafting of Large Language Models
Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model's intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model capabilities in a target model with SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy to compress parameter updates, ensuring efficient storage while maintaining task-specific knowledge. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for heterogeneous model fusion and continual learning. Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer. The code is publicly available at: https://github.com/duguodong7/GraftLLM.
Audio-Visual Glance Network for Efficient Video Recognition
Deep learning has made significant strides in video understanding tasks, but the computation required to classify lengthy and massive videos using clip-level video classifiers remains impractical and prohibitively expensive. To address this issue, we propose Audio-Visual Glance Network (AVGN), which leverages the commonly available audio and visual modalities to efficiently process the spatio-temporally important parts of a video. AVGN firstly divides the video into snippets of image-audio clip pair and employs lightweight unimodal encoders to extract global visual features and audio features. To identify the important temporal segments, we use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates the saliency scores of each frame. To further increase efficiency in the spatial dimension, AVGN processes only the important patches instead of the whole images. We use an Audio-Enhanced Spatial Patch Attention (AESPA) module to produce a set of enhanced coarse visual features, which are fed to a policy network that produces the coordinates of the important patches. This approach enables us to focus only on the most important spatio-temporally parts of the video, leading to more efficient video recognition. Moreover, we incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of our AVGN. By combining these strategies, our AVGN sets new state-of-the-art performance in multiple video recognition benchmarks while achieving faster processing speed.
LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration
As large language model (LLM) inference demands ever-greater resources, there is a rapid growing trend of using low-bit weights to shrink memory usage and boost inference efficiency. However, these low-bit LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), which is a crucial yet under-explored operation that involves multiplying lower-precision weights with higher-precision activations. Unfortunately, current hardware does not natively support mpGEMM, resulting in indirect and inefficient dequantization-based implementations. To address the mpGEMM requirements in low-bit LLMs, we explored the lookup table (LUT)-based approach for mpGEMM. However, a conventional LUT implementation falls short of its potential. To fully harness the power of LUT-based mpGEMM, we introduce LUT Tensor Core, a software-hardware co-design optimized for low-bit LLM inference. Specifically, we introduce software-based operator fusion and table symmetrization techniques to optimize table precompute and table storage, respectively. Then, LUT Tensor Core proposes the hardware design featuring an elongated tiling shape design to enhance table reuse and a bit-serial design to support various precision combinations in mpGEMM. Moreover, we design an end-to-end compilation stack with new instructions for LUT-based mpGEMM, enabling efficient LLM compilation and optimizations. The evaluation on low-bit LLMs (e.g., BitNet, LLAMA) shows that LUT Tensor Core achieves more than a magnitude of improvements on both compute density and energy efficiency.
Unleashing Artificial Cognition: Integrating Multiple AI Systems
In this study, we present an innovative fusion of language models and query analysis techniques to unlock cognition in artificial intelligence. Our system seamlessly integrates a Chess engine with a language model, enabling it to predict moves and provide strategic explanations. Leveraging a vector database to achieve retrievable answer generation, our OpenSI AI system elucidates its decision-making process, bridging the gap between raw computation and human-like understanding. Our choice of Chess as the demonstration environment underscores the versatility of our approach. Beyond Chess, our system holds promise for diverse applications, from medical diagnostics to financial forecasting.
Deep Learning based Visually Rich Document Content Understanding: A Survey
Visually Rich Documents (VRDs) are essential in academia, finance, medical fields, and marketing due to their multimodal information content. Traditional methods for extracting information from VRDs depend on expert knowledge and manual labor, making them costly and inefficient. The advent of deep learning has revolutionized this process, introducing models that leverage multimodal information vision, text, and layout along with pretraining tasks to develop comprehensive document representations. These models have achieved state-of-the-art performance across various downstream tasks, significantly enhancing the efficiency and accuracy of information extraction from VRDs. In response to the growing demands and rapid developments in Visually Rich Document Understanding (VRDU), this paper provides a comprehensive review of deep learning-based VRDU frameworks. We systematically survey and analyze existing methods and benchmark datasets, categorizing them based on adopted strategies and downstream tasks. Furthermore, we compare different techniques used in VRDU models, focusing on feature representation and fusion, model architecture, and pretraining methods, while highlighting their strengths, limitations, and appropriate scenarios. Finally, we identify emerging trends and challenges in VRDU, offering insights into future research directions and practical applications. This survey aims to provide a thorough understanding of VRDU advancements, benefiting both academic and industrial sectors.
Factor Graph Optimization for Leak Localization in Water Distribution Networks
Detecting and localizing leaks in water distribution network systems is an important topic with direct environmental, economic, and social impact. Our paper is the first to explore the use of factor graph optimization techniques for leak localization in water distribution networks, enabling us to perform sensor fusion between pressure and demand sensor readings and to estimate the network's temporal and structural state evolution across all network nodes. The methodology introduces specific water network factors and proposes a new architecture composed of two factor graphs: a leak-free state estimation factor graph and a leak localization factor graph. When a new sensor reading is obtained, unlike Kalman and other interpolation-based methods, which estimate only the current network state, factor graphs update both current and past states. Results on Modena, L-TOWN and synthetic networks show that factor graphs are much faster than nonlinear Kalman-based alternatives such as the UKF, while also providing improvements in localization compared to state-of-the-art estimation-localization approaches. Implementation and benchmarks are available at https://github.com/pirofti/FGLL.
TV-3DG: Mastering Text-to-3D Customized Generation with Visual Prompt
In recent years, advancements in generative models have significantly expanded the capabilities of text-to-3D generation. Many approaches rely on Score Distillation Sampling (SDS) technology. However, SDS struggles to accommodate multi-condition inputs, such as text and visual prompts, in customized generation tasks. To explore the core reasons, we decompose SDS into a difference term and a classifier-free guidance term. Our analysis identifies the core issue as arising from the difference term and the random noise addition during the optimization process, both contributing to deviations from the target mode during distillation. To address this, we propose a novel algorithm, Classifier Score Matching (CSM), which removes the difference term in SDS and uses a deterministic noise addition process to reduce noise during optimization, effectively overcoming the low-quality limitations of SDS in our customized generation framework. Based on CSM, we integrate visual prompt information with an attention fusion mechanism and sampling guidance techniques, forming the Visual Prompt CSM (VPCSM) algorithm. Furthermore, we introduce a Semantic-Geometry Calibration (SGC) module to enhance quality through improved textual information integration. We present our approach as TV-3DG, with extensive experiments demonstrating its capability to achieve stable, high-quality, customized 3D generation. Project page: https://yjhboy.github.io/TV-3DG
Upper Limb Movement Recognition utilising EEG and EMG Signals for Rehabilitative Robotics
Upper limb movement classification, which maps input signals to the target activities, is a key building block in the control of rehabilitative robotics. Classifiers are trained for the rehabilitative system to comprehend the desires of the patient whose upper limbs do not function properly. Electromyography (EMG) signals and Electroencephalography (EEG) signals are used widely for upper limb movement classification. By analysing the classification results of the real-time EEG and EMG signals, the system can understand the intention of the user and predict the events that one would like to carry out. Accordingly, it will provide external help to the user. However, the noise in the real-time EEG and EMG data collection process contaminates the effectiveness of the data, which undermines classification performance. Moreover, not all patients process strong EMG signals due to muscle damage and neuromuscular disorder. To address these issues, this paper explores different feature extraction techniques and machine learning and deep learning models for EEG and EMG signals classification and proposes a novel decision-level multisensor fusion technique to integrate EEG signals with EMG signals. This system retrieves effective information from both sources to understand and predict the desire of the user, and thus aid. By testing out the proposed technique on a publicly available WAY-EEG-GAL dataset, which contains EEG and EMG signals that were recorded simultaneously, we manage to conclude the feasibility and effectiveness of the novel system.
A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions
Retrieval-Augmented Generation (RAG) represents a major advancement in natural language processing (NLP), combining large language models (LLMs) with information retrieval systems to enhance factual grounding, accuracy, and contextual relevance. This paper presents a comprehensive systematic review of RAG, tracing its evolution from early developments in open domain question answering to recent state-of-the-art implementations across diverse applications. The review begins by outlining the motivations behind RAG, particularly its ability to mitigate hallucinations and outdated knowledge in parametric models. Core technical components-retrieval mechanisms, sequence-to-sequence generation models, and fusion strategies are examined in detail. A year-by-year analysis highlights key milestones and research trends, providing insight into RAG's rapid growth. The paper further explores the deployment of RAG in enterprise systems, addressing practical challenges related to retrieval of proprietary data, security, and scalability. A comparative evaluation of RAG implementations is conducted, benchmarking performance on retrieval accuracy, generation fluency, latency, and computational efficiency. Persistent challenges such as retrieval quality, privacy concerns, and integration overhead are critically assessed. Finally, the review highlights emerging solutions, including hybrid retrieval approaches, privacy-preserving techniques, optimized fusion strategies, and agentic RAG architectures. These innovations point toward a future of more reliable, efficient, and context-aware knowledge-intensive NLP systems.
Improved training of end-to-end attention models for speech recognition
Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report the state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help the convergence. In addition, we train long short-term memory (LSTM) language models on subword units. By shallow fusion, we report up to 27% relative improvements in WER over the attention baseline without a language model.
SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System
With the increasing diversity of ML infrastructures nowadays, distributed training over heterogeneous computing systems is desired to facilitate the production of big models. Mixture-of-Experts (MoE) models have been proposed to lower the cost of training subject to the overall size of models/data through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts in carrying out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference could be further improved from several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present SE-MoE that proposes Elastic MoE training with 2D prefetch and Fusion communication over Hierarchical storage, so as to enjoy efficient parallelisms in various types. For scalable inference in a single node, especially when the model size is larger than GPU memory, SE-MoE forms the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate SE-MoE, where SE-MoE successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state-of-the-art shows that SE-MoE outperformed DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. Particularly, under unbalanced MoE Tasks, e.g., UFO, SE-MoE achieved 64% higher throughput with 18% lower memory footprints. The code of the framework will be released on: https://github.com/PaddlePaddle/Paddle.
CromSS: Cross-modal pre-training with noisy labels for remote sensing image segmentation
We explore the potential of large-scale noisily labeled data to enhance feature learning by pretraining semantic segmentation models within a multi-modal framework for geospatial applications. We propose a novel Cross-modal Sample Selection (CromSS) method, a weakly supervised pretraining strategy designed to improve feature representations through cross-modal consistency and noise mitigation techniques. Unlike conventional pretraining approaches, CromSS exploits massive amounts of noisy and easy-to-come-by labels for improved feature learning beneficial to semantic segmentation tasks. We investigate middle and late fusion strategies to optimize the multi-modal pretraining architecture design. We also introduce a cross-modal sample selection module to mitigate the adverse effects of label noise, which employs a cross-modal entangling strategy to refine the estimated confidence masks within each modality to guide the sampling process. Additionally, we introduce a spatial-temporal label smoothing technique to counteract overconfidence for enhanced robustness against noisy labels. To validate our approach, we assembled the multi-modal dataset, NoLDO-S12, which consists of a large-scale noisy label subset from Google's Dynamic World (DW) dataset for pretraining and two downstream subsets with high-quality labels from Google DW and OpenStreetMap (OSM) for transfer learning. Experimental results on two downstream tasks and the publicly available DFC2020 dataset demonstrate that when effectively utilized, the low-cost noisy labels can significantly enhance feature learning for segmentation tasks. All data, code, and pretrained weights will be made publicly available.
RMIT-ADM+S at the SIGIR 2025 LiveRAG Challenge
This paper presents the RMIT--ADM+S participation in the SIGIR 2025 LiveRAG Challenge. Our Generation-Retrieval-Augmented Generation (GRAG) approach relies on generating a hypothetical answer that is used in the retrieval phase, alongside the original question. GRAG also incorporates a pointwise large language model (LLM)-based re-ranking step prior to final answer generation. We describe the system architecture and the rationale behind our design choices. In particular, a systematic evaluation using the Grid of Points (GoP) framework and N-way ANOVA enabled comparison across multiple configurations, including query variant generation, question decomposition, rank fusion strategies, and prompting techniques for answer generation. Our system achieved a Relevance score of 1.199 and a Faithfulness score of 0.477 on the private leaderboard, placing among the top four finalists in the LiveRAG 2025 Challenge.
UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion
Capturing high dynamic range (HDR) scenes is one of the most important issues in camera design. Majority of cameras use exposure fusion technique, which fuses images captured by different exposure levels, to increase dynamic range. However, this approach can only handle images with limited exposure difference, normally 3-4 stops. When applying to very high dynamic scenes where a large exposure difference is required, this approach often fails due to incorrect alignment or inconsistent lighting between inputs, or tone mapping artifacts. In this work, we propose UltraFusion, the first exposure fusion technique that can merge input with 9 stops differences. The key idea is that we model the exposure fusion as a guided inpainting problem, where the under-exposed image is used as a guidance to fill the missing information of over-exposed highlight in the over-exposed region. Using under-exposed image as a soft guidance, instead of a hard constrain, our model is robust to potential alignment issue or lighting variations. Moreover, utilizing the image prior of the generative model, our model also generates natural tone mapping, even for very high-dynamic range scene. Our approach outperforms HDR-Transformer on latest HDR benchmarks. Moreover, to test its performance in ultra high dynamic range scene, we capture a new real-world exposure fusion benchmark, UltraFusion Dataset, with exposure difference up to 9 stops, and experiments show that \model~can generate beautiful and high-quality fusion results under various scenarios. An online demo is provided at https://openimaginglab.github.io/UltraFusion/.
ShaSTA-Fuse: Camera-LiDAR Sensor Fusion to Model Shape and Spatio-Temporal Affinities for 3D Multi-Object Tracking
3D multi-object tracking (MOT) is essential for an autonomous mobile agent to safely navigate a scene. In order to maximize the perception capabilities of the autonomous agent, we aim to develop a 3D MOT framework that fuses camera and LiDAR sensor information. Building on our prior LiDAR-only work, ShaSTA, which models shape and spatio-temporal affinities for 3D MOT, we propose a novel camera-LiDAR fusion approach for learning affinities. At its core, this work proposes a fusion technique that generates a rich sensory signal incorporating information about depth and distant objects to enhance affinity estimation for improved data association, track lifecycle management, false-positive elimination, false-negative propagation, and track confidence score refinement. Our main contributions include a novel fusion approach for combining camera and LiDAR sensory signals to learn affinities, and a first-of-its-kind multimodal sequential track confidence refinement technique that fuses 2D and 3D detections. Additionally, we perform an ablative analysis on each fusion step to demonstrate the added benefits of incorporating the camera sensor, particular for small, distant objects that tend to suffer from the depth-sensing limits and sparsity of LiDAR sensors. In sum, our technique achieves state-of-the-art performance on the nuScenes benchmark amongst multimodal 3D MOT algorithms using CenterPoint detections.
FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement
Generating multiple new concepts remains a challenging problem in the text-to-image task. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: Concept Fusion technique and Localized Refinement loss function. Concept Fusion systematically augments the training data by separating reference subjects from backgrounds and recombining them into composite images to increase diversity. This augmentation technique tackles the overfitting problem by mitigating the narrow distribution of the limited training samples. In addition, Localized Refinement loss function is introduced to preserve subject representative attributes by aligning each concept's attention map to its correct region. This approach effectively prevents attribute leakage by ensuring that the diffusion model distinguishes similar subjects without mixing their attention maps during the denoising process. By fine-tuning specific modules at the same time, FaR balances the learning of new concepts with the retention of previously learned knowledge. Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods.
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike the existing ranking-based rescoring methods, our approach adeptly uses distinct initialization techniques and parameter-efficient algorithms to boost ASR performance derived from pre-trained speech and text models. Through evaluation across diverse ASR datasets, we evaluate the stability and reproducibility of our fusion technique, demonstrating its improved word error rate relative (WERR) performance in comparison to n-best hypotheses by relatively 37.66%. To encourage future research, we have made our code and pre-trained models open source at https://github.com/Srijith-rkr/Whispering-LLaMA.
Dual Illumination Estimation for Robust Exposure Correction
Exposure correction is one of the fundamental tasks in image processing and computational photography. While various methods have been proposed, they either fail to produce visually pleasing results, or only work well for limited types of image (e.g., underexposed images). In this paper, we present a novel automatic exposure correction method, which is able to robustly produce high-quality results for images of various exposure conditions (e.g., underexposed, overexposed, and partially under- and over-exposed). At the core of our approach is the proposed dual illumination estimation, where we separately cast the under- and over-exposure correction as trivial illumination estimation of the input image and the inverted input image. By performing dual illumination estimation, we obtain two intermediate exposure correction results for the input image, with one fixes the underexposed regions and the other one restores the overexposed regions. A multi-exposure image fusion technique is then employed to adaptively blend the visually best exposed parts in the two intermediate exposure correction images and the input image into a globally well-exposed image. Experiments on a number of challenging images demonstrate the effectiveness of the proposed approach and its superiority over the state-of-the-art methods and popular automatic exposure correction tools.
Parametric-ControlNet: Multimodal Control in Foundation Models for Precise Engineering Design Synthesis
This paper introduces a generative model designed for multimodal control over text-to-image foundation generative AI models such as Stable Diffusion, specifically tailored for engineering design synthesis. Our model proposes parametric, image, and text control modalities to enhance design precision and diversity. Firstly, it handles both partial and complete parametric inputs using a diffusion model that acts as a design autocomplete co-pilot, coupled with a parametric encoder to process the information. Secondly, the model utilizes assembly graphs to systematically assemble input component images, which are then processed through a component encoder to capture essential visual data. Thirdly, textual descriptions are integrated via CLIP encoding, ensuring a comprehensive interpretation of design intent. These diverse inputs are synthesized through a multimodal fusion technique, creating a joint embedding that acts as the input to a module inspired by ControlNet. This integration allows the model to apply robust multimodal control to foundation models, facilitating the generation of complex and precise engineering designs. This approach broadens the capabilities of AI-driven design tools and demonstrates significant advancements in precise control based on diverse data modalities for enhanced design generation.
Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models
Graph-structured data plays a vital role in numerous domains, such as social networks, citation networks, commonsense reasoning graphs and knowledge graphs. While graph neural networks have been employed for graph processing, recent advancements have explored integrating large language models for graph-based tasks. In this paper, we propose a novel approach named Learnable Graph Pooling Token (LGPT), which addresses the limitations of the scalability issues in node-level projection and information loss in graph-level projection. LGPT enables flexible and efficient graph representation by introducing learnable parameters that act as tokens in large language models, balancing fine-grained and global graph information. Additionally, we investigate an Early Query Fusion technique, which fuses query context before constructing the graph representation, leading to more effective graph embeddings. Our method achieves a 4.13\% performance improvement on the GraphQA benchmark without training the large language model, demonstrating significant gains in handling complex textual-attributed graph data.
DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control
Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps. However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? We investigate this question in the context of autonomous driving, and answer it with a resounding "yes". We propose an efficient data generation pipeline termed DGInStyle. First, we examine the problem of specializing a pretrained LDM to semantically-controlled generation within a narrow domain. Second, we design a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects. Third, we propose a Style Swap technique to endow the rich generative prior with the learned semantic control. Using DGInStyle, we generate a diverse dataset of street scenes, train a domain-agnostic semantic segmentation model on it, and evaluate the model on multiple popular autonomous driving datasets. Our approach consistently increases the performance of several domain generalization methods, in some cases by +2.5 mIoU compared to the previous state-of-the-art method without our generative augmentation scheme. Source code and dataset are available at https://dginstyle.github.io .
CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning
Zero-shot learning (ZSL) enables the recognition of novel classes by leveraging semantic knowledge transfer from known to unknown categories. This knowledge, typically encapsulated in attribute descriptions, aids in identifying class-specific visual features, thus facilitating visual-semantic alignment and improving ZSL performance. However, real-world challenges such as distribution imbalances and attribute co-occurrence among instances often hinder the discernment of local variances in images, a problem exacerbated by the scarcity of fine-grained, region-specific attribute annotations. Moreover, the variability in visual presentation within categories can also skew attribute-category associations. In response, we propose a bidirectional cross-modal ZSL approach CREST. It begins by extracting representations for attribute and visual localization and employs Evidential Deep Learning (EDL) to measure underlying epistemic uncertainty, thereby enhancing the model's resilience against hard negatives. CREST incorporates dual learning pathways, focusing on both visual-category and attribute-category alignments, to ensure robust correlation between latent and observable spaces. Moreover, we introduce an uncertainty-informed cross-modal fusion technique to refine visual-attribute inference. Extensive experiments demonstrate our model's effectiveness and unique explainability across multiple datasets. Our code and data are available at: https://github.com/JethroJames/CREST
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout the animation due to the lack of temporal modeling and poor preservation of reference identity. In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity. To achieve this, we first develop a video diffusion model to encode temporal information. Second, to maintain the appearance coherence across frames, we introduce a novel appearance encoder to retain the intricate details of the reference image. Leveraging these two innovations, we further employ a simple video fusion technique to encourage smooth transitions for long video animation. Empirical results demonstrate the superiority of our method over baseline approaches on two benchmarks. Notably, our approach outperforms the strongest baseline by over 38% in terms of video fidelity on the challenging TikTok dancing dataset. Code and model will be made available.
LoCoCo: Dropping In Convolutions for Long Context Compression
This paper tackles the memory hurdle of processing long context sequences in Large Language Models (LLMs), by presenting a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo). LoCoCo employs only a fixed-size Key-Value (KV) cache, and can enhance efficiency in both inference and fine-tuning stages. Diverging from prior methods that selectively drop KV pairs based on heuristics, LoCoCo leverages a data-driven adaptive fusion technique, blending previous KV pairs with incoming tokens to minimize the loss of contextual information and ensure accurate attention modeling. This token integration is achieved through injecting one-dimensional convolutional kernels that dynamically calculate mixing weights for each KV cache slot. Designed for broad compatibility with existing LLM frameworks, LoCoCo allows for straightforward "drop-in" integration without needing architectural modifications, while incurring minimal tuning overhead. Experiments demonstrate that LoCoCo maintains consistently outstanding performance across various context lengths and can achieve a high context compression rate during both inference and fine-tuning phases. During inference, we successfully compressed up to 3482 tokens into a 128-size KV cache, while retaining comparable performance to the full sequence - an accuracy improvement of up to 0.2791 compared to baselines at the same cache size. During post-training tuning, we also effectively extended the context length from 4K to 32K using a KV cache of fixed size 512, achieving performance similar to fine-tuning with entire sequences.
Transformer Fusion with Optimal Transport
Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities. Past attempts have been restricted to the case of fully-connected, convolutional, and residual networks. In this paper, we present a systematic approach for fusing two or more transformer-based networks exploiting Optimal Transport to (soft-)align the various architectural components. We flesh out an abstraction for layer alignment, that can generalize to arbitrary architectures -- in principle -- and we apply this to the key ingredients of Transformers such as multi-head self-attention, layer-normalization, and residual connections, and we discuss how to handle them via various ablation studies. Furthermore, our method allows the fusion of models of different sizes (heterogeneous fusion), providing a new and efficient way for compression of Transformers. The proposed approach is evaluated on both image classification tasks via Vision Transformer and natural language modeling tasks using BERT. Our approach consistently outperforms vanilla fusion, and, after a surprisingly short finetuning, also outperforms the individual converged parent models. In our analysis, we uncover intriguing insights about the significant role of soft alignment in the case of Transformers. Our results showcase the potential of fusing multiple Transformers, thus compounding their expertise, in the budding paradigm of model fusion and recombination.
An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models
Diffusion models have been widely used for conditional data cross-modal generation tasks such as text-to-image and text-to-video. However, state-of-the-art models still fail to align the generated visual concepts with high-level semantics in a language such as object count, spatial relationship, etc. We approach this problem from a multimodal data fusion perspective and investigate how different fusion strategies can affect vision-language alignment. We discover that compared to the widely used early fusion of conditioning text in a pretrained image feature space, a specially designed intermediate fusion can: (i) boost text-to-image alignment with improved generation quality and (ii) improve training and inference efficiency by reducing low-rank text-to-image attention calculations. We perform experiments using a text-to-image generation task on the MS-COCO dataset. We compare our intermediate fusion mechanism with the classic early fusion mechanism on two common conditioning methods on a U-shaped ViT backbone. Our intermediate fusion model achieves a higher CLIP Score and lower FID, with 20% reduced FLOPs, and 50% increased training speed compared to a strong U-ViT baseline with an early fusion.
SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection
In this paper, we propose a novel training strategy called SupFusion, which provides an auxiliary feature level supervision for effective LiDAR-Camera fusion and significantly boosts detection performance. Our strategy involves a data enhancement method named Polar Sampling, which densifies sparse objects and trains an assistant model to generate high-quality features as the supervision. These features are then used to train the LiDAR-Camera fusion model, where the fusion feature is optimized to simulate the generated high-quality features. Furthermore, we propose a simple yet effective deep fusion module, which contiguously gains superior performance compared with previous fusion methods with SupFusion strategy. In such a manner, our proposal shares the following advantages. Firstly, SupFusion introduces auxiliary feature-level supervision which could boost LiDAR-Camera detection performance without introducing extra inference costs. Secondly, the proposed deep fusion could continuously improve the detector's abilities. Our proposed SupFusion and deep fusion module is plug-and-play, we make extensive experiments to demonstrate its effectiveness. Specifically, we gain around 2% 3D mAP improvements on KITTI benchmark based on multiple LiDAR-Camera 3D detectors.
Equivariant Multi-Modality Image Fusion
Multi-modality image fusion is a technique that combines information from different sensors or modalities, enabling the fused image to retain complementary features from each modality, such as functional highlights and texture details. However, effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue, we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations. Consequently, we introduce a novel training paradigm that encompasses a fusion module, a pseudo-sensing module, and an equivariant fusion module. These components enable the net training to follow the principles of the natural sensing-imaging process while satisfying the equivariant imaging prior. Extensive experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images, concurrently facilitating downstream multi-modal segmentation and detection tasks. The code is available at https://github.com/Zhaozixiang1228/MMIF-EMMA.
A Comparative Assessment of Multi-view fusion learning for Crop Classification
With a rapidly increasing amount and diversity of remote sensing (RS) data sources, there is a strong need for multi-view learning modeling. This is a complex task when considering the differences in resolution, magnitude, and noise of RS data. The typical approach for merging multiple RS sources has been input-level fusion, but other - more advanced - fusion strategies may outperform this traditional approach. This work assesses different fusion strategies for crop classification in the CropHarvest dataset. The fusion methods proposed in this work outperform models based on individual views and previous fusion methods. We do not find one single fusion method that consistently outperforms all other approaches. Instead, we present a comparison of multi-view fusion methods for three different datasets and show that, depending on the test region, different methods obtain the best performance. Despite this, we suggest a preliminary criterion for the selection of fusion methods.
MMSFormer: Multimodal Transformer for Material and Semantic Segmentation
Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different modality combinations. We also propose a new model named Multi-Modal Segmentation TransFormer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material and semantic segmentation tasks. MMSFormer outperforms current state-of-the-art models on three different datasets. As we begin with only one input modality, performance improves progressively as additional modalities are incorporated, showcasing the effectiveness of the fusion block in combining useful information from diverse input modalities. Ablation studies show that different modules in the fusion block are crucial for overall model performance. Furthermore, our ablation studies also highlight the capacity of different input modalities to improve performance in the identification of different types of materials. The code and pretrained models will be made available at https://github.com/csiplab/MMSFormer.
Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection
This study addresses the issue of fusing infrared and visible images that appear differently for object detection. Aiming at generating an image of high visual quality, previous approaches discover commons underlying the two modalities and fuse upon the common space either by iterative optimization or deep networks. These approaches neglect that modality differences implying the complementary information are extremely important for both fusion and subsequent detection task. This paper proposes a bilevel optimization formulation for the joint problem of fusion and detection, and then unrolls to a target-aware Dual Adversarial Learning (TarDAL) network for fusion and a commonly used detection network. The fusion network with one generator and dual discriminators seeks commons while learning from differences, which preserves structural information of targets from the infrared and textural details from the visible. Furthermore, we build a synchronized imaging system with calibrated infrared and optical sensors, and collect currently the most comprehensive benchmark covering a wide range of scenarios. Extensive experiments on several public datasets and our benchmark demonstrate that our method outputs not only visually appealing fusion but also higher detection mAP than the state-of-the-art approaches.
Mantis Shrimp: Exploring Photometric Band Utilization in Computer Vision Networks for Photometric Redshift Estimation
We present Mantis Shrimp, a multi-survey deep learning model for photometric redshift estimation that fuses ultra-violet (GALEX), optical (PanSTARRS), and infrared (UnWISE) imagery. Machine learning is now an established approach for photometric redshift estimation, with generally acknowledged higher performance in areas with a high density of spectroscopically identified galaxies over template-based methods. Multiple works have shown that image-based convolutional neural networks can outperform tabular-based color/magnitude models. In comparison to tabular models, image models have additional design complexities: it is largely unknown how to fuse inputs from different instruments which have different resolutions or noise properties. The Mantis Shrimp model estimates the conditional density estimate of redshift using cutout images. The density estimates are well calibrated and the point estimates perform well in the distribution of available spectroscopically confirmed galaxies with (bias = 1e-2), scatter (NMAD = 2.44e-2) and catastrophic outlier rate (eta=17.53%). We find that early fusion approaches (e.g., resampling and stacking images from different instruments) match the performance of late fusion approaches (e.g., concatenating latent space representations), so that the design choice ultimately is left to the user. Finally, we study how the models learn to use information across bands, finding evidence that our models successfully incorporates information from all surveys. The applicability of our model to the analysis of large populations of galaxies is limited by the speed of downloading cutouts from external servers; however, our model could be useful in smaller studies such as generating priors over redshift for stellar population synthesis.
Multi-modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion
Infrared and visible image fusion aims to integrate comprehensive information from multiple sources to achieve superior performances on various practical tasks, such as detection, over that of a single modality. However, most existing methods directly combined the texture details and object contrast of different modalities, ignoring the dynamic changes in reality, which diminishes the visible texture in good lighting conditions and the infrared contrast in low lighting conditions. To fill this gap, we propose a dynamic image fusion framework with a multi-modal gated mixture of local-to-global experts, termed MoE-Fusion, to dynamically extract effective and comprehensive information from the respective modalities. Our model consists of a Mixture of Local Experts (MoLE) and a Mixture of Global Experts (MoGE) guided by a multi-modal gate. The MoLE performs specialized learning of multi-modal local features, prompting the fused images to retain the local information in a sample-adaptive manner, while the MoGE focuses on the global information that complements the fused image with overall texture detail and contrast. Extensive experiments show that our MoE-Fusion outperforms state-of-the-art methods in preserving multi-modal image texture and contrast through the local-to-global dynamic learning paradigm, and also achieves superior performance on detection tasks. Our code will be available: https://github.com/SunYM2020/MoE-Fusion.
Using Deep Learning to Design High Aspect Ratio Fusion Devices
The design of fusion devices is typically based on computationally expensive simulations. This can be alleviated using high aspect ratio models that employ a reduced number of free parameters, especially in the case of stellarator optimization where non-axisymmetric magnetic fields with a large parameter space are optimized to satisfy certain performance criteria. However, optimization is still required to find configurations with properties such as low elongation, high rotational transform, finite plasma beta, and good fast particle confinement. In this work, we train a machine learning model to construct configurations with favorable confinement properties by finding a solution to the inverse design problem, that is, obtaining a set of model input parameters for given desired properties. Since the solution of the inverse problem is non-unique, a probabilistic approach, based on mixture density networks, is used. It is shown that optimized configurations can be generated reliably using this method.
Towards a Robust Sensor Fusion Step for 3D Object Detection on Corrupted Data
Multimodal sensor fusion methods for 3D object detection have been revolutionizing the autonomous driving research field. Nevertheless, most of these methods heavily rely on dense LiDAR data and accurately calibrated sensors which is often not the case in real-world scenarios. Data from LiDAR and cameras often come misaligned due to the miscalibration, decalibration, or different frequencies of the sensors. Additionally, some parts of the LiDAR data may be occluded and parts of the data may be missing due to hardware malfunction or weather conditions. This work presents a novel fusion step that addresses data corruptions and makes sensor fusion for 3D object detection more robust. Through extensive experiments, we demonstrate that our method performs on par with state-of-the-art approaches on normal data and outperforms them on misaligned data.
MMFusion: Combining Image Forensic Filters for Visual Manipulation Detection and Localization
Recent image manipulation localization and detection techniques typically leverage forensic artifacts and traces that are produced by a noise-sensitive filter, such as SRM or Bayar convolution. In this paper, we showcase that different filters commonly used in such approaches excel at unveiling different types of manipulations and provide complementary forensic traces. Thus, we explore ways of combining the outputs of such filters to leverage the complementary nature of the produced artifacts for performing image manipulation localization and detection (IMLD). We assess two distinct combination methods: one that produces independent features from each forensic filter and then fuses them (this is referred to as late fusion) and one that performs early mixing of different modal outputs and produces combined features (this is referred to as early fusion). We use the latter as a feature encoding mechanism, accompanied by a new decoding mechanism that encompasses feature re-weighting, for formulating the proposed MMFusion architecture. We demonstrate that MMFusion achieves competitive performance for both image manipulation localization and detection, outperforming state-of-the-art models across several image and video datasets. We also investigate further the contribution of each forensic filter within MMFusion for addressing different types of manipulations, building on recent AI explainability measures.
Sparse Dense Fusion for 3D Object Detection
With the prevalence of multimodal learning, camera-LiDAR fusion has gained popularity in 3D object detection. Although multiple fusion approaches have been proposed, they can be classified into either sparse-only or dense-only fashion based on the feature representation in the fusion module. In this paper, we analyze them in a common taxonomy and thereafter observe two challenges: 1) sparse-only solutions preserve 3D geometric prior and yet lose rich semantic information from the camera, and 2) dense-only alternatives retain the semantic continuity but miss the accurate geometric information from LiDAR. By analyzing these two formulations, we conclude that the information loss is inevitable due to their design scheme. To compensate for the information loss in either manner, we propose Sparse Dense Fusion (SDF), a complementary framework that incorporates both sparse-fusion and dense-fusion modules via the Transformer architecture. Such a simple yet effective sparse-dense fusion structure enriches semantic texture and exploits spatial structure information simultaneously. Through our SDF strategy, we assemble two popular methods with moderate performance and outperform baseline by 4.3% in mAP and 2.5% in NDS, ranking first on the nuScenes benchmark. Extensive ablations demonstrate the effectiveness of our method and empirically align our analysis.
FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything
In the realm of computer vision, the integration of advanced techniques into the processing of RGB-D camera inputs poses a significant challenge, given the inherent complexities arising from diverse environmental conditions and varying object appearances. Therefore, this paper introduces FusionVision, an exhaustive pipeline adapted for the robust 3D segmentation of objects in RGB-D imagery. Traditional computer vision systems face limitations in simultaneously capturing precise object boundaries and achieving high-precision object detection on depth map as they are mainly proposed for RGB cameras. To address this challenge, FusionVision adopts an integrated approach by merging state-of-the-art object detection techniques, with advanced instance segmentation methods. The integration of these components enables a holistic (unified analysis of information obtained from both color RGB and depth D channels) interpretation of RGB-D data, facilitating the extraction of comprehensive and accurate object information. The proposed FusionVision pipeline employs YOLO for identifying objects within the RGB image domain. Subsequently, FastSAM, an innovative semantic segmentation model, is applied to delineate object boundaries, yielding refined segmentation masks. The synergy between these components and their integration into 3D scene understanding ensures a cohesive fusion of object detection and segmentation, enhancing overall precision in 3D object segmentation. The code and pre-trained models are publicly available at https://github.com/safouaneelg/FusionVision/.
Sensor Fusion by Spatial Encoding for Autonomous Driving
Sensor fusion is critical to perception systems for task domains such as autonomous driving and robotics. Recently, the Transformer integrated with CNN has demonstrated high performance in sensor fusion for various perception tasks. In this work, we introduce a method for fusing data from camera and LiDAR. By employing Transformer modules at multiple resolutions, proposed method effectively combines local and global contextual relationships. The performance of the proposed method is validated by extensive experiments with two adversarial benchmarks with lengthy routes and high-density traffics. The proposed method outperforms previous approaches with the most challenging benchmarks, achieving significantly higher driving and infraction scores. Compared with TransFuser, it achieves 8% and 19% improvement in driving scores for the Longest6 and Town05 Long benchmarks, respectively.
Drone Object Detection Using RGB/IR Fusion
Object detection using aerial drone imagery has received a great deal of attention in recent years. While visible light images are adequate for detecting objects in most scenarios, thermal cameras can extend the capabilities of object detection to night-time or occluded objects. As such, RGB and Infrared (IR) fusion methods for object detection are useful and important. One of the biggest challenges in applying deep learning methods to RGB/IR object detection is the lack of available training data for drone IR imagery, especially at night. In this paper, we develop several strategies for creating synthetic IR images using the AIRSim simulation engine and CycleGAN. Furthermore, we utilize an illumination-aware fusion framework to fuse RGB and IR images for object detection on the ground. We characterize and test our methods for both simulated and actual data. Our solution is implemented on an NVIDIA Jetson Xavier running on an actual drone, requiring about 28 milliseconds of processing per RGB/IR image pair.
Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks
Multispectral images (e.g. visible and infrared) may be particularly useful when detecting objects with the same model in different environments (e.g. day/night outdoor scenes). To effectively use the different spectra, the main technical problem resides in the information fusion process. In this paper, we propose a new halfway feature fusion method for neural networks that leverages the complementary/consistency balance existing in multispectral features by adding to the network architecture, a particular module that cyclically fuses and refines each spectral feature. We evaluate the effectiveness of our fusion method on two challenging multispectral datasets for object detection. Our results show that implementing our Cyclic Fuse-and-Refine module in any network improves the performance on both datasets compared to other state-of-the-art multispectral object detection methods.
Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation
The multifaceted nature of human perception and comprehension indicates that, when we think, our body can naturally take any combination of senses, a.k.a., modalities and form a beautiful picture in our brain. For example, when we see a cattery and simultaneously perceive the cat's purring sound, our brain can construct a picture of a cat in the cattery. Intuitively, generative AI models should hold the versatility of humans and be capable of generating images from any combination of modalities efficiently and collaboratively. This paper presents ImgAny, a novel end-to-end multi-modal generative model that can mimic human reasoning and generate high-quality images. Our method serves as the first attempt in its capacity of efficiently and flexibly taking any combination of seven modalities, ranging from language, audio to vision modalities, including image, point cloud, thermal, depth, and event data. Our key idea is inspired by human-level cognitive processes and involves the integration and harmonization of multiple input modalities at both the entity and attribute levels without specific tuning across modalities. Accordingly, our method brings two novel training-free technical branches: 1) Entity Fusion Branch ensures the coherence between inputs and outputs. It extracts entity features from the multi-modal representations powered by our specially constructed entity knowledge graph; 2) Attribute Fusion Branch adeptly preserves and processes the attributes. It efficiently amalgamates distinct attributes from diverse input modalities via our proposed attribute knowledge graph. Lastly, the entity and attribute features are adaptively fused as the conditional inputs to the pre-trained Stable Diffusion model for image generation. Extensive experiments under diverse modality combinations demonstrate its exceptional capability for visual content creation.
Fusion-DeepONet: A Data-Efficient Neural Operator for Geometry-Dependent Hypersonic and Supersonic Flows
Shape optimization is essential in aerospace vehicle design, including reentry systems, and propulsion system components, as it directly influences aerodynamic efficiency, structural integrity, and overall mission success. Rapid and accurate prediction of external and internal flows accelerates design iterations. To this end, we develop a new variant of DeepONet, called Fusion-DeepONet as a fast surrogate model for geometry-dependent hypersonic and supersonic flow fields. We evaluated Fusion-DeepONet in learning two external hypersonic flows and a supersonic shape-dependent internal flow problem. First, we compare the performance of Fusion-DeepONet with state-of-the-art neural operators to learn inviscid hypersonic flow around semi-elliptic blunt bodies for two grid types: uniform Cartesian and irregular grids. Fusion-DeepONet provides comparable accuracy to parameter-conditioned U-Net on uniform grids while outperforming MeshGraphNet and Vanilla-DeepONet on irregular grids. Fusion-DeepONet requires significantly fewer trainable parameters than U-Net, MeshGraphNet, and FNO. For the second hypersonic problem, we set up Fusion-DeepONet to map from geometry and free stream Mach number to the temperature field around a reentry capsule traveling at hypersonic speed. This fast surrogate is then improved to predict the spatial derivative of the temperature, resulting in an accurate prediction of heat flux at the surfaces of the capsule. To enhance the accuracy of spatial derivative prediction, we introduce a derivative-enhanced loss term with the least computation overhead. For the third problem, we show that Fusion-DeepONet outperforms MeshGraphNet in learning geometry-dependent supersonic flow in a converging-diverging nozzle configuration. For all the problems, we used high-fidelity simulations with a high-order entropy-stable DGSEM solver to generate training datasets with limited samples.
DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion
Multi-modality image fusion aims to combine different modalities to produce fused images that retain the complementary features of each modality, such as functional highlights and texture details. To leverage strong generative priors and address challenges such as unstable training and lack of interpretability for GAN-based generative methods, we propose a novel fusion algorithm based on the denoising diffusion probabilistic model (DDPM). The fusion task is formulated as a conditional generation problem under the DDPM sampling framework, which is further divided into an unconditional generation subproblem and a maximum likelihood subproblem. The latter is modeled in a hierarchical Bayesian manner with latent variables and inferred by the expectation-maximization (EM) algorithm. By integrating the inference solution into the diffusion sampling iteration, our method can generate high-quality fused images with natural image generative priors and cross-modality information from source images. Note that all we required is an unconditional pre-trained generative model, and no fine-tuning is needed. Our extensive experiments indicate that our approach yields promising fusion results in infrared-visible image fusion and medical image fusion. The code is available at https://github.com/Zhaozixiang1228/MMIF-DDFM.
pyMEAL: A Multi-Encoder Augmentation-Aware Learning for Robust and Generalizable Medical Image Translation
Medical imaging is critical for diagnostics, but clinical adoption of advanced AI-driven imaging faces challenges due to patient variability, image artifacts, and limited model generalization. While deep learning has transformed image analysis, 3D medical imaging still suffers from data scarcity and inconsistencies due to acquisition protocols, scanner differences, and patient motion. Traditional augmentation uses a single pipeline for all transformations, disregarding the unique traits of each augmentation and struggling with large data volumes. To address these challenges, we propose a Multi-encoder Augmentation-Aware Learning (MEAL) framework that leverages four distinct augmentation variants processed through dedicated encoders. Three fusion strategies such as concatenation (CC), fusion layer (FL), and adaptive controller block (BD) are integrated to build multi-encoder models that combine augmentation-specific features before decoding. MEAL-BD uniquely preserves augmentation-aware representations, enabling robust, protocol-invariant feature learning. As demonstrated in a Computed Tomography (CT)-to-T1-weighted Magnetic Resonance Imaging (MRI) translation study, MEAL-BD consistently achieved the best performance on both unseen- and predefined-test data. On both geometric transformations (like rotations and flips) and non-augmented inputs, MEAL-BD outperformed other competing methods, achieving higher mean peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) scores. These results establish MEAL as a reliable framework for preserving structural fidelity and generalizing across clinically relevant variability. By reframing augmentation as a source of diverse, generalizable features, MEAL supports robust, protocol-invariant learning, advancing clinically reliable medical imaging solutions.
UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View
Bird's eye view (BEV) representation is a new perception formulation for autonomous driving, which is based on spatial fusion. Further, temporal fusion is also introduced in BEV representation and gains great success. In this work, we propose a new method that unifies both spatial and temporal fusion and merges them into a unified mathematical formulation. The unified fusion could not only provide a new perspective on BEV fusion but also brings new capabilities. With the proposed unified spatial-temporal fusion, our method could support long-range fusion, which is hard to achieve in conventional BEV methods. Moreover, the BEV fusion in our work is temporal-adaptive and the weights of temporal fusion are learnable. In contrast, conventional methods mainly use fixed and equal weights for temporal fusion. Besides, the proposed unified fusion could avoid information lost in conventional BEV fusion methods and make full use of features. Extensive experiments and ablation studies on the NuScenes dataset show the effectiveness of the proposed method and our method gains the state-of-the-art performance in the map segmentation task.
PAIF: Perception-Aware Infrared-Visible Image Fusion for Attack-Tolerant Semantic Segmentation
Infrared and visible image fusion is a powerful technique that combines complementary information from different modalities for downstream semantic perception tasks. Existing learning-based methods show remarkable performance, but are suffering from the inherent vulnerability of adversarial attacks, causing a significant decrease in accuracy. In this work, a perception-aware fusion framework is proposed to promote segmentation robustness in adversarial scenes. We first conduct systematic analyses about the components of image fusion, investigating the correlation with segmentation robustness under adversarial perturbations. Based on these analyses, we propose a harmonized architecture search with a decomposition-based structure to balance standard accuracy and robustness. We also propose an adaptive learning strategy to improve the parameter robustness of image fusion, which can learn effective feature extraction under diverse adversarial perturbations. Thus, the goals of image fusion (i.e., extracting complementary features from source modalities and defending attack) can be realized from the perspectives of architectural and learning strategies. Extensive experimental results demonstrate that our scheme substantially enhances the robustness, with gains of 15.3% mIOU of segmentation in the adversarial scene, compared with advanced competitors. The source codes are available at https://github.com/LiuZhu-CV/PAIF.
CalibFormer: A Transformer-based Automatic LiDAR-Camera Calibration Network
The fusion of LiDARs and cameras has been increasingly adopted in autonomous driving for perception tasks. The performance of such fusion-based algorithms largely depends on the accuracy of sensor calibration, which is challenging due to the difficulty of identifying common features across different data modalities. Previously, many calibration methods involved specific targets and/or manual intervention, which has proven to be cumbersome and costly. Learning-based online calibration methods have been proposed, but their performance is barely satisfactory in most cases. These methods usually suffer from issues such as sparse feature maps, unreliable cross-modality association, inaccurate calibration parameter regression, etc. In this paper, to address these issues, we propose CalibFormer, an end-to-end network for automatic LiDAR-camera calibration. We aggregate multiple layers of camera and LiDAR image features to achieve high-resolution representations. A multi-head correlation module is utilized to identify correlations between features more accurately. Lastly, we employ transformer architectures to estimate accurate calibration parameters from the correlation information. Our method achieved a mean translation error of 0.8751 cm and a mean rotation error of 0.0562 ^{circ} on the KITTI dataset, surpassing existing state-of-the-art methods and demonstrating strong robustness, accuracy, and generalization capabilities.
Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation. Early efforts focus on boosting the performance for only one task, e.g., fusion or segmentation, making it hard to reach~`Best of Both Worlds'. To overcome this issue, in this paper, we propose a Multi-interactive Feature learning architecture for image fusion and Segmentation, namely SegMiF, and exploit dual-task correlation to promote the performance of both tasks. The SegMiF is of a cascade structure, containing a fusion sub-network and a commonly used segmentation sub-network. By slickly bridging intermediate features between two components, the knowledge learned from the segmentation task can effectively assist the fusion task. Also, the benefited fusion network supports the segmentation one to perform more pretentiously. Besides, a hierarchical interactive attention block is established to ensure fine-grained mapping of all the vital information between two tasks, so that the modality/semantic features can be fully mutual-interactive. In addition, a dynamic weight factor is introduced to automatically adjust the corresponding weights of each task, which can balance the interactive feature correspondence and break through the limitation of laborious tuning. Furthermore, we construct a smart multi-wave binocular imaging system and collect a full-time multi-modality benchmark with 15 annotated pixel-level categories for image fusion and segmentation. Extensive experiments on several public datasets and our benchmark demonstrate that the proposed method outputs visually appealing fused images and perform averagely 7.66% higher segmentation mIoU in the real-world scene than the state-of-the-art approaches. The source code and benchmark are available at https://github.com/JinyuanLiu-CV/SegMiF.
LLVIP: A Visible-infrared Paired Dataset for Low-light Vision
It is very challenging for various visual tasks such as image fusion, pedestrian detection and image-to-image translation in low light conditions due to the loss of effective target areas. In this case, infrared and visible images can be used together to provide both rich detail information and effective target areas. In this paper, we present LLVIP, a visible-infrared paired dataset for low-light vision. This dataset contains 30976 images, or 15488 pairs, most of which were taken at very dark scenes, and all of the images are strictly aligned in time and space. Pedestrians in the dataset are labeled. We compare the dataset with other visible-infrared datasets and evaluate the performance of some popular visual algorithms including image fusion, pedestrian detection and image-to-image translation on the dataset. The experimental results demonstrate the complementary effect of fusion on image information, and find the deficiency of existing algorithms of the three visual tasks in very low-light conditions. We believe the LLVIP dataset will contribute to the community of computer vision by promoting image fusion, pedestrian detection and image-to-image translation in very low-light applications. The dataset is being released in https://bupt-ai-cz.github.io/LLVIP. Raw data is also provided for further research such as image registration.
Deep Equilibrium Multimodal Fusion
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently. Most existing fusion approaches either learn a fixed fusion strategy during training and inference, or are only capable of fusing the information to a certain extent. Such solutions may fail to fully capture the dynamics of interactions across modalities especially when there are complex intra- and inter-modality correlations to be considered for informative multimodal fusion. In this paper, we propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process and modeling the feature correlations in an adaptive and recursive manner. This new way encodes the rich information within and across modalities thoroughly from low level to high level for efficacious downstream multimodal learning and is readily pluggable to various multimodal frameworks. Extensive experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion. More remarkably, DEQ fusion consistently achieves state-of-the-art performance on multiple multimodal benchmarks. The code will be released.
Incorporating brain-inspired mechanisms for multimodal learning in artificial intelligence
Multimodal learning enhances the perceptual capabilities of cognitive systems by integrating information from different sensory modalities. However, existing multimodal fusion research typically assumes static integration, not fully incorporating key dynamic mechanisms found in the brain. Specifically, the brain exhibits an inverse effectiveness phenomenon, wherein weaker unimodal cues yield stronger multisensory integration benefits; conversely, when individual modal cues are stronger, the effect of fusion is diminished. This mechanism enables biological systems to achieve robust cognition even with scarce or noisy perceptual cues. Inspired by this biological mechanism, we explore the relationship between multimodal output and information from individual modalities, proposing an inverse effectiveness driven multimodal fusion (IEMF) strategy. By incorporating this strategy into neural networks, we achieve more efficient integration with improved model performance and computational efficiency, demonstrating up to 50% reduction in computational cost across diverse fusion methods. We conduct experiments on audio-visual classification, continual learning, and question answering tasks to validate our method. Results consistently demonstrate that our method performs excellently in these tasks. To verify universality and generalization, we also conduct experiments on Artificial Neural Networks (ANN) and Spiking Neural Networks (SNN), with results showing good adaptability to both network types. Our research emphasizes the potential of incorporating biologically inspired mechanisms into multimodal networks and provides promising directions for the future development of multimodal artificial intelligence. The code is available at https://github.com/Brain-Cog-Lab/IEMF.
FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion
The infrared and visible images fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. Existing deep learning approaches often utilize convolutional neural networks to extract image features. However, the inherently capacity of convolution operations to capture global context can lead to information loss, thereby restricting fusion performance. To address this limitation, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). The FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminate features from source images. This FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the ability to extract global context information of vanilla Transformer. We conducted both qualitative and quantitative comparative experiments, demonstrating the superior fusion quality and efficiency of FSATFusion compared to other state-of-the-art methods. Furthermore, our network was tested on two additional tasks without any modifications, to verify the excellent generalization capability of FSATFusion. Finally, the object detection experiment demonstrated the superiority of FSATFusion in downstream visual tasks. Our code is available at https://github.com/Lmmh058/FSATFusion.
M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection
Single-source remote sensing object detection using optical or SAR images struggles in complex environments. Optical images offer rich textural details but are often affected by low-light, cloud-obscured, or low-resolution conditions, reducing the detection performance. SAR images are robust to weather, but suffer from speckle noise and limited semantic expressiveness. Optical and SAR images provide complementary advantages, and fusing them can significantly improve the detection accuracy. However, progress in this field is hindered by the lack of large-scale, standardized datasets. To address these challenges, we propose the first comprehensive dataset for optical-SAR fusion object detection, named Multi-resolution, Multi-polarization, Multi-scene, Multi-source SAR dataset (M4-SAR). It contains 112,184 precisely aligned image pairs and nearly one million labeled instances with arbitrary orientations, spanning six key categories. To enable standardized evaluation, we develop a unified benchmarking toolkit that integrates six state-of-the-art multi-source fusion methods. Furthermore, we propose E2E-OSDet, a novel end-to-end multi-source fusion detection framework that mitigates cross-domain discrepancies and establishes a robust baseline for future studies. Extensive experiments on M4-SAR demonstrate that fusing optical and SAR data can improve mAP by 5.7\% over single-source inputs, with particularly significant gains in complex environments. The dataset and code are publicly available at https://github.com/wchao0601/M4-SAR.
Anisotropic Diffusion for Details Enhancement in Multi-Exposure Image Fusion
We develop a multiexposure image fusion method based on texture features, which exploits the edge preserving and intraregion smoothing property of nonlinear diffusion filters based on partial differential equations (PDE). With the captured multiexposure image series, we first decompose images into base layers and detail layers to extract sharp details and fine details, respectively. The magnitude of the gradient of the image intensity is utilized to encourage smoothness at homogeneous regions in preference to inhomogeneous regions. Then, we have considered texture features of the base layer to generate a mask (i.e., decision mask) that guides the fusion of base layers in multiresolution fashion. Finally, well-exposed fused image is obtained that combines fused base layer and the detail layers at each scale across all the input exposures. Proposed algorithm skipping complex High Dynamic Range Image (HDRI) generation and tone mapping steps to produce detail preserving image for display on standard dynamic range display devices. Moreover, our technique is effective for blending flash/no-flash image pair and multifocus images, that is, images focused on different targets.
LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion
LiDAR-camera fusion methods have shown impressive performance in 3D object detection. Recent advanced multi-modal methods mainly perform global fusion, where image features and point cloud features are fused across the whole scene. Such practice lacks fine-grained region-level information, yielding suboptimal fusion performance. In this paper, we present the novel Local-to-Global fusion network (LoGoNet), which performs LiDAR-camera fusion at both local and global levels. Concretely, the Global Fusion (GoF) of LoGoNet is built upon previous literature, while we exclusively use point centroids to more precisely represent the position of voxel features, thus achieving better cross-modal alignment. As to the Local Fusion (LoF), we first divide each proposal into uniform grids and then project these grid centers to the images. The image features around the projected grid points are sampled to be fused with position-decorated point cloud features, maximally utilizing the rich contextual information around the proposals. The Feature Dynamic Aggregation (FDA) module is further proposed to achieve information interaction between these locally and globally fused features, thus producing more informative multi-modal features. Extensive experiments on both Waymo Open Dataset (WOD) and KITTI datasets show that LoGoNet outperforms all state-of-the-art 3D detection methods. Notably, LoGoNet ranks 1st on Waymo 3D object detection leaderboard and obtains 81.02 mAPH (L2) detection performance. It is noteworthy that, for the first time, the detection performance on three classes surpasses 80 APH (L2) simultaneously. Code will be available at https://github.com/sankin97/LoGoNet.
Exploring Recurrent Long-term Temporal Fusion for Multi-view 3D Perception
Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Existing methods are mostly in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this paper, we explore an embarrassingly simple long-term recurrent fusion strategy built upon the LSS-based methods and find it already able to enjoy the merits from both sides, i.e., rich long-term information and efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusing pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains leading performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80m minADE and 0.463 EPA). Code will be available.
A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking
The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel framework for temporally coherent video fusion that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: https://vfbench.github.io.
Real-Time Drone Detection and Tracking With Visible, Thermal and Acoustic Sensors
This paper explores the process of designing an automatic multi-sensor drone detection system. Besides the common video and audio sensors, the system also includes a thermal infrared camera, which is shown to be a feasible solution to the drone detection task. Even with slightly lower resolution, the performance is just as good as a camera in visible range. The detector performance as a function of the sensor-to-target distance is also investigated. In addition, using sensor fusion, the system is made more robust than the individual sensors, helping to reduce false detections. To counteract the lack of public datasets, a novel video dataset containing 650 annotated infrared and visible videos of drones, birds, airplanes and helicopters is also presented (https://github.com/DroneDetectionThesis/Drone-detection-dataset). The database is complemented with an audio dataset of the classes drones, helicopters and background noise.
Multi-Modal Temporal Attention Models for Crop Mapping from Satellite Time Series
Optical and radar satellite time series are synergetic: optical images contain rich spectral information, while C-band radar captures useful geometrical information and is immune to cloud cover. Motivated by the recent success of temporal attention-based methods across multiple crop mapping tasks, we propose to investigate how these models can be adapted to operate on several modalities. We implement and evaluate multiple fusion schemes, including a novel approach and simple adjustments to the training procedure, significantly improving performance and efficiency with little added complexity. We show that most fusion schemes have advantages and drawbacks, making them relevant for specific settings. We then evaluate the benefit of multimodality across several tasks: parcel classification, pixel-based segmentation, and panoptic parcel segmentation. We show that by leveraging both optical and radar time series, multimodal temporal attention-based models can outmatch single-modality models in terms of performance and resilience to cloud cover. To conduct these experiments, we augment the PASTIS dataset with spatially aligned radar image time series. The resulting dataset, PASTIS-R, constitutes the first large-scale, multimodal, and open-access satellite time series dataset with semantic and instance annotations.
Reality Fusion: Robust Real-time Immersive Mobile Robot Teleoperation with Volumetric Visual Data Fusion
We introduce Reality Fusion, a novel robot teleoperation system that localizes, streams, projects, and merges a typical onboard depth sensor with a photorealistic, high resolution, high framerate, and wide field of view (FoV) rendering of the complex remote environment represented as 3D Gaussian splats (3DGS). Our framework enables robust egocentric and exocentric robot teleoperation in immersive VR, with the 3DGS effectively extending spatial information of a depth sensor with limited FoV and balancing the trade-off between data streaming costs and data visual quality. We evaluated our framework through a user study with 24 participants, which revealed that Reality Fusion leads to significantly better user performance, situation awareness, and user preferences. To support further research and development, we provide an open-source implementation with an easy-to-replicate custom-made telepresence robot, a high-performance virtual reality 3DGS renderer, and an immersive robot control package. (Source code: https://github.com/uhhhci/RealityFusion)
WAVE: Machine Learning for Full-Waveform Time-Of-Flight Detectors
We propose a WAveform Vector Exploitation (WAVE) deep neural network for full-waveform Time-Of-Flight (TOF) physics detectors, and evaluate its performance against traditional reconstruction techniques via Monte Carlo study of a small plastic-scintillator scatter camera. Ultralytics LLC (www.ultralytics.com) provides WAVE freely under the open source GPL-3.0 license at https://github.com/ultralytics/wave.
Addressing the Depth-of-Field Constraint: A New Paradigm for High Resolution Multi-Focus Image Fusion
Multi-focus image fusion (MFIF) addresses the depth-of-field (DOF) limitations of optical lenses, where only objects within a specific range appear sharp. Although traditional and deep learning methods have advanced the field, challenges persist, including limited training data, domain gaps from synthetic datasets, and difficulties with regions lacking information. We propose VAEEDOF, a novel MFIF method that uses a distilled variational autoencoder for high-fidelity, efficient image reconstruction. Our fusion module processes up to seven images simultaneously, enabling robust fusion across diverse focus points. To address data scarcity, we introduce MattingMFIF, a new syntetic 4K dataset, simulating realistic DOF effects from real photographs. Our method achieves state-of-the-art results, generating seamless artifact-free fused images and bridging the gap between synthetic and real-world scenarios, offering a significant step forward in addressing complex MFIF challenges. The code, and weights are available here:
DepthFusion: Depth-Aware Hybrid Feature Fusion for LiDAR-Camera 3D Object Detection
State-of-the-art LiDAR-camera 3D object detectors usually focus on feature fusion. However, they neglect the factor of depth while designing the fusion strategy. In this work, we are the first to observe that different modalities play different roles as depth varies via statistical analysis and visualization. Based on this finding, we propose a Depth-Aware Hybrid Feature Fusion (DepthFusion) strategy that guides the weights of point cloud and RGB image modalities by introducing depth encoding at both global and local levels. Specifically, the Depth-GFusion module adaptively adjusts the weights of image Bird's-Eye-View (BEV) features in multi-modal global features via depth encoding. Furthermore, to compensate for the information lost when transferring raw features to the BEV space, we propose a Depth-LFusion module, which adaptively adjusts the weights of original voxel features and multi-view image features in multi-modal local features via depth encoding. Extensive experiments on the nuScenes and KITTI datasets demonstrate that our DepthFusion method surpasses previous state-of-the-art methods. Moreover, our DepthFusion is more robust to various kinds of corruptions, outperforming previous methods on the nuScenes-C dataset.
Contextual Fusion For Adversarial Robustness
Mammalian brains handle complex reasoning tasks in a gestalt manner by integrating information from regions of the brain that are specialised to individual sensory modalities. This allows for improved robustness and better generalisation ability. In contrast, deep neural networks are usually designed to process one particular information stream and susceptible to various types of adversarial perturbations. While many methods exist for detecting and defending against adversarial attacks, they do not generalise across a range of attacks and negatively affect performance on clean, unperturbed data. We developed a fusion model using a combination of background and foreground features extracted in parallel from Places-CNN and Imagenet-CNN. We tested the benefits of the fusion approach on preserving adversarial robustness for human perceivable (e.g., Gaussian blur) and network perceivable (e.g., gradient-based) attacks for CIFAR-10 and MS COCO data sets. For gradient based attacks, our results show that fusion allows for significant improvements in classification without decreasing performance on unperturbed data and without need to perform adversarial retraining. Our fused model revealed improvements for Gaussian blur type perturbations as well. The increase in performance from fusion approach depended on the variability of the image contexts; larger increases were seen for classes of images with larger differences in their contexts. We also demonstrate the effect of regularization to bias the classifier decision in the presence of a known adversary. We propose that this biologically inspired approach to integrate information across multiple modalities provides a new way to improve adversarial robustness that can be complementary to current state of the art approaches.
In the Search for Optimal Multi-view Learning Models for Crop Classification with Global Remote Sensing Data
Studying and analyzing cropland is a difficult task due to its dynamic and heterogeneous growth behavior. Usually, diverse data sources can be collected for its estimation. Although deep learning models have proven to excel in the crop classification task, they face substantial challenges when dealing with multiple inputs, named Multi-View Learning (MVL). The methods used in the MVL scenario can be structured based on the encoder architecture, the fusion strategy, and the optimization technique. The literature has primarily focused on using specific encoder architectures for local regions, lacking a deeper exploration of other components in the MVL methodology. In contrast, we investigate the simultaneous selection of the fusion strategy and encoder architecture, assessing global-scale cropland and crop-type classifications. We use a range of five fusion strategies (Input, Feature, Decision, Ensemble, Hybrid) and five temporal encoders (LSTM, GRU, TempCNN, TAE, L-TAE) as possible configurations in the MVL method. We use the CropHarvest dataset for validation, which provides optical, radar, weather time series, and topographic information as input data. We found that in scenarios with a limited number of labeled samples, a unique configuration is insufficient for all the cases. Instead, a specialized combination should be meticulously sought, including an encoder and fusion strategy. To streamline this search process, we suggest identifying the optimal encoder architecture tailored for a particular fusion strategy, and then determining the most suitable fusion strategy for the classification task. We provide a methodological framework for researchers exploring crop classification through an MVL methodology.
Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion
Multimodal Image Fusion (MMIF) aims to integrate complementary information from different imaging modalities to overcome the limitations of individual sensors. It enhances image quality and facilitates downstream applications such as remote sensing, medical diagnostics, and robotics. Despite significant advancements, current MMIF methods still face challenges such as modality misalignment, high-frequency detail destruction, and task-specific limitations. To address these challenges, we propose AdaSFFuse, a novel framework for task-generalized MMIF through adaptive cross-domain co-fusion learning. AdaSFFuse introduces two key innovations: the Adaptive Approximate Wavelet Transform (AdaWAT) for frequency decoupling, and the Spatial-Frequency Mamba Blocks for efficient multimodal fusion. AdaWAT adaptively separates the high- and low-frequency components of multimodal images from different scenes, enabling fine-grained extraction and alignment of distinct frequency characteristics for each modality. The Spatial-Frequency Mamba Blocks facilitate cross-domain fusion in both spatial and frequency domains, enhancing this process. These blocks dynamically adjust through learnable mappings to ensure robust fusion across diverse modalities. By combining these components, AdaSFFuse improves the alignment and integration of multimodal features, reduces frequency loss, and preserves critical details. Extensive experiments on four MMIF tasks -- Infrared-Visible Image Fusion (IVF), Multi-Focus Image Fusion (MFF), Multi-Exposure Image Fusion (MEF), and Medical Image Fusion (MIF) -- demonstrate AdaSFFuse's superior fusion performance, ensuring both low computational cost and a compact network, offering a strong balance between performance and efficiency. The code will be publicly available at https://github.com/Zhen-yu-Liu/AdaSFFuse.
MouSi: Poly-Visual-Expert Vision-Language Models
Current large vision-language models (VLMs) often encounter challenges such as insufficient capabilities of a single visual component and excessively long visual tokens. These issues can limit the model's effectiveness in accurately interpreting complex visual information and over-lengthy contextual information. Addressing these challenges is crucial for enhancing the performance and applicability of VLMs. This paper proposes the use of ensemble experts technique to synergizes the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, image segmentation, etc. This technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to alleviate the waste of positional encoding caused by lengthy image feature sequences, effectively addressing the issue of position overflow and length limitations. For instance, in our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1. Experimental results demonstrate that VLMs with multiple experts exhibit consistently superior performance over isolated visual encoders and mark a significant performance boost as more experts are integrated. We have open-sourced the training code used in this report. All of these resources can be found on our project website.
CSFMamba: Cross State Fusion Mamba Operator for Multimodal Remote Sensing Image Classification
Multimodal fusion has made great progress in the field of remote sensing image classification due to its ability to exploit the complementary spatial-spectral information. Deep learning methods such as CNN and Transformer have been widely used in these domains. State Space Models recently highlighted that prior methods suffer from quadratic computational complexity. As a result, modeling longer-range dependencies of spatial-spectral features imposes an overwhelming burden on the network. Mamba solves this problem by incorporating time-varying parameters into ordinary SSM and performing hardware optimization, but it cannot perform feature fusion directly. In order to make full use of Mamba's low computational burden and explore the potential of internal structure in multimodal feature fusion, we propose Cross State Fusion Mamba (CSFMamba) Network. Specifically, we first design the preprocessing module of remote sensing image information for the needs of Mamba structure, and combine it with CNN to extract multi-layer features. Secondly, a cross-state module based on Mamba operator is creatively designed to fully fuse the feature of the two modalities. The advantages of Mamba and CNN are combined by designing a more powerful backbone. We capture the fusion relationship between HSI and LiDAR modalities with stronger full-image understanding. The experimental results on two datasets of MUUFL and Houston2018 show that the proposed method outperforms the experimental results of Transformer under the premise of reducing the network training burden.
NLOS Dies Twice: Challenges and Solutions of V2X for Cooperative Perception
Multi-agent multi-lidar sensor fusion between connected vehicles for cooperative perception has recently been recognized as the best technique for minimizing the blind zone of individual vehicular perception systems and further enhancing the overall safety of autonomous driving systems. This technique relies heavily on the reliability and availability of vehicle-to-everything (V2X) communication. In practical sensor fusion application scenarios, the non-line-of-sight (NLOS) issue causes blind zones for not only the perception system but also V2X direct communication. To counteract underlying communication issues, we introduce an abstract perception matrix matching method for quick sensor fusion matching procedures and mobility-height hybrid relay determination procedures, proactively improving the efficiency and performance of V2X communication to serve the upper layer application fusion requirements. To demonstrate the effectiveness of our solution, we design a new simulation framework to consider autonomous driving, sensor fusion and V2X communication in general, paving the way for end-to-end performance evaluation and further solution derivation.
Fisheye Camera and Ultrasonic Sensor Fusion For Near-Field Obstacle Perception in Bird's-Eye-View
Accurate obstacle identification represents a fundamental challenge within the scope of near-field perception for autonomous driving. Conventionally, fisheye cameras are frequently employed for comprehensive surround-view perception, including rear-view obstacle localization. However, the performance of such cameras can significantly deteriorate in low-light conditions, during nighttime, or when subjected to intense sun glare. Conversely, cost-effective sensors like ultrasonic sensors remain largely unaffected under these conditions. Therefore, we present, to our knowledge, the first end-to-end multimodal fusion model tailored for efficient obstacle perception in a bird's-eye-view (BEV) perspective, utilizing fisheye cameras and ultrasonic sensors. Initially, ResNeXt-50 is employed as a set of unimodal encoders to extract features specific to each modality. Subsequently, the feature space associated with the visible spectrum undergoes transformation into BEV. The fusion of these two modalities is facilitated via concatenation. At the same time, the ultrasonic spectrum-based unimodal feature maps pass through content-aware dilated convolution, applied to mitigate the sensor misalignment between two sensors in the fused feature space. Finally, the fused features are utilized by a two-stage semantic occupancy decoder to generate grid-wise predictions for precise obstacle perception. We conduct a systematic investigation to determine the optimal strategy for multimodal fusion of both sensors. We provide insights into our dataset creation procedures, annotation guidelines, and perform a thorough data analysis to ensure adequate coverage of all scenarios. When applied to our dataset, the experimental results underscore the robustness and effectiveness of our proposed multimodal fusion approach.
Pulsed Schlieren Imaging of Ultrasonic Haptics and Levitation using Phased Arrays
Ultrasonic acoustic fields have recently been used to generate haptic effects on the human skin as well as to levitate small sub-wavelength size particles. Schlieren imaging and background-oriented schlieren techniques can be used for acoustic wave pattern and beam shape visualization. These techniques exploit variations in the refractive index of a propagation medium by applying refractive optics or cross-correlation algorithms of photographs of illuminated background patterns. Here both background-oriented and traditional schlieren systems are used to visualize the regions of the acoustic power involved in creating dynamic haptic sensations and dynamic levitation traps. We demonstrate for the first time the application of back-ground-oriented schlieren for imaging ultrasonic fields in air. We detail our imaging apparatus and present improved algorithms used to visualize these phenomena that we have produced using multiple phased arrays. Moreover, to improve imaging, we leverage an electronically controlled, high-output LED which is pulsed in synchrony with the ultrasonic carrier frequency.
BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost. Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.
SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection
By identifying four important components of existing LiDAR-camera 3D object detection methods (LiDAR and camera candidates, transformation, and fusion outputs), we observe that all existing methods either find dense candidates or yield dense representations of scenes. However, given that objects occupy only a small part of a scene, finding dense candidates and generating dense representations is noisy and inefficient. We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations. Specifically, SparseFusion utilizes the outputs of parallel detectors in the LiDAR and camera modalities as sparse candidates for fusion. We transform the camera candidates into the LiDAR coordinate space by disentangling the object representations. Then, we can fuse the multi-modality candidates in a unified 3D space by a lightweight self-attention module. To mitigate negative transfer between modalities, we propose novel semantic and geometric cross-modality transfer modules that are applied prior to the modality-specific detectors. SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed, even outperforming methods with stronger backbones. We perform extensive experiments to demonstrate the effectiveness and efficiency of our modules and overall method pipeline. Our code will be made publicly available at https://github.com/yichen928/SparseFusion.
FuseAnyPart: Diffusion-Driven Facial Parts Swapping via Multiple Reference Images
Facial parts swapping aims to selectively transfer regions of interest from the source image onto the target image while maintaining the rest of the target image unchanged. Most studies on face swapping designed specifically for full-face swapping, are either unable or significantly limited when it comes to swapping individual facial parts, which hinders fine-grained and customized character designs. However, designing such an approach specifically for facial parts swapping is challenged by a reasonable multiple reference feature fusion, which needs to be both efficient and effective. To overcome this challenge, FuseAnyPart is proposed to facilitate the seamless "fuse-any-part" customization of the face. In FuseAnyPart, facial parts from different people are assembled into a complete face in latent space within the Mask-based Fusion Module. Subsequently, the consolidated feature is dispatched to the Addition-based Injection Module for fusion within the UNet of the diffusion model to create novel characters. Extensive experiments qualitatively and quantitatively validate the superiority and robustness of FuseAnyPart. Source codes are available at https://github.com/Thomas-wyh/FuseAnyPart.
MIS-FM: 3D Medical Image Segmentation using Foundation Models Pretrained on a Large-Scale Unannotated Dataset
Pretraining with large-scale 3D volumes has a potential for improving the segmentation performance on a target medical image dataset where the training images and annotations are limited. Due to the high cost of acquiring pixel-level segmentation annotations on the large-scale pretraining dataset, pretraining with unannotated images is highly desirable. In this work, we propose a novel self-supervised learning strategy named Volume Fusion (VF) for pretraining 3D segmentation models. It fuses several random patches from a foreground sub-volume to a background sub-volume based on a predefined set of discrete fusion coefficients, and forces the model to predict the fusion coefficient of each voxel, which is formulated as a self-supervised segmentation task without manual annotations. Additionally, we propose a novel network architecture based on parallel convolution and transformer blocks that is suitable to be transferred to different downstream segmentation tasks with various scales of organs and lesions. The proposed model was pretrained with 110k unannotated 3D CT volumes, and experiments with different downstream segmentation targets including head and neck organs, thoracic/abdominal organs showed that our pretrained model largely outperformed training from scratch and several state-of-the-art self-supervised training methods and segmentation models. The code and pretrained model are available at https://github.com/openmedlab/MIS-FM.
Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
How should representations from complementary sensors be integrated for autonomous driving? Geometry-based sensor fusion has shown great promise for perception tasks such as object detection and motion forecasting. However, for the actual driving task, the global context of the 3D scene is key, e.g. a change in traffic light state can affect the behavior of a vehicle geometrically distant from that traffic light. Geometry alone may therefore be insufficient for effectively fusing representations in end-to-end driving models. In this work, we demonstrate that imitation learning policies based on existing sensor fusion methods under-perform in the presence of a high density of dynamic agents and complex scenarios, which require global contextual reasoning, such as handling traffic oncoming from multiple directions at uncontrolled intersections. Therefore, we propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention. We experimentally validate the efficacy of our approach in urban settings involving complex scenarios using the CARLA urban driving simulator. Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts
Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its applicability to extracting generic representations from medical images and texts. Practically, there exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used. The former is superior at multi-modal tasks owing to the sufficient interaction between modalities; the latter is good at uni-modal and cross-modal tasks due to the single-modality encoding ability. To take advantage of these two types, we propose an effective yet straightforward scheme named PTUnifier to unify the two types. We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts. By doing so, a single model could serve as a foundation model that processes various tasks adopting different input formats (i.e., image-only, text-only, and image-text-pair). Furthermore, we construct a prompt pool (instead of static ones) to improve diversity and scalability. Experimental results show that our approach achieves state-of-the-art results on a broad range of tasks, spanning uni-modal tasks (i.e., image/text classification and text summarization), cross-modal tasks (i.e., image-to-text generation and image-text/text-image retrieval), and multi-modal tasks (i.e., visual question answering), demonstrating the effectiveness of our approach. Note that the adoption of prompts is orthogonal to most existing Med-VLP approaches and could be a beneficial and complementary extension to these approaches.
Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond
Multi-modality image fusion, particularly infrared and visible, plays a crucial role in integrating diverse modalities to enhance scene understanding. Although early research prioritized visual quality, preserving fine details and adapting to downstream tasks remains challenging. Recent approaches attempt task-specific design but rarely achieve "The Best of Both Worlds" due to inconsistent optimization goals. To address these issues, we propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to Grow the quality of fusion results and Enable downstream task adaptability, namely SAGE. Specifically, we design a Semantic Persistent Attention (SPA) Module that efficiently maintains source information via the persistent repository while extracting high-level semantic priors from SAM. More importantly, to eliminate the impractical dependence on SAM during inference, we introduce a bi-level optimization-driven distillation mechanism with triplet losses, which allow the student network to effectively extract knowledge. Extensive experiments show that our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency. The code is available at https://github.com/RollingPlain/SAGE_IVIF.
FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains through the extensive scaling of model parameters. Recent works observe the redundancy across the transformer blocks and develop compression methods by structured pruning of the unimportant blocks. However, such straightforward elimination will always provide irreversible performance degradation. In this paper, we propose FuseGPT, a novel methodology to recycle the pruned transformer blocks to further recover the model performance. Firstly we introduce a new importance detection metric, Macro Influence (MI), to detect the long-term influence of each transformer block by calculating their loss of information after removal. Then we propose group-level layers fusion, which adopts the parameters in layers of the unimportant blocks and injects them into the corresponding layers inside the neighboring blocks. The fusion is not one-off but through iterative parameter updates by lightweight group-level fine-tuning. Specifically, these injected parameters are frozen but weighted with learnable rank decomposition matrices to reduce the overhead during fine-tuning. Our approach not only works well on large language models but also on large multimodal models. The experiments have shown that, by using modest amounts of data, FuseGPT can outperform previous works in both perplexity and zero-shot task performance.
GyroSwin: 5D Surrogates for Gyrokinetic Plasma Turbulence Simulations
Nuclear fusion plays a pivotal role in the quest for reliable and sustainable energy production. A major roadblock to viable fusion power is understanding plasma turbulence, which significantly impairs plasma confinement, and is vital for next-generation reactor design. Plasma turbulence is governed by the nonlinear gyrokinetic equation, which evolves a 5D distribution function over time. Due to its high computational cost, reduced-order models are often employed in practice to approximate turbulent transport of energy. However, they omit nonlinear effects unique to the full 5D dynamics. To tackle this, we introduce GyroSwin, the first scalable 5D neural surrogate that can model 5D nonlinear gyrokinetic simulations, thereby capturing the physical phenomena neglected by reduced models, while providing accurate estimates of turbulent heat transport.GyroSwin (i) extends hierarchical Vision Transformers to 5D, (ii) introduces cross-attention and integration modules for latent 3Dleftrightarrow5D interactions between electrostatic potential fields and the distribution function, and (iii) performs channelwise mode separation inspired by nonlinear physics. We demonstrate that GyroSwin outperforms widely used reduced numerics on heat flux prediction, captures the turbulent energy cascade, and reduces the cost of fully resolved nonlinear gyrokinetics by three orders of magnitude while remaining physically verifiable. GyroSwin shows promising scaling laws, tested up to one billion parameters, paving the way for scalable neural surrogates for gyrokinetic simulations of plasma turbulence.
Explorative Inbetweening of Time and Space
We introduce bounded generation as a generalized task to control video generation to synthesize arbitrary camera and subject motion based only on a given start and end frame. Our objective is to fully leverage the inherent generalization capability of an image-to-video model without additional training or fine-tuning of the original model. This is achieved through the proposed new sampling strategy, which we call Time Reversal Fusion, that fuses the temporally forward and backward denoising paths conditioned on the start and end frame, respectively. The fused path results in a video that smoothly connects the two frames, generating inbetweening of faithful subject motion, novel views of static scenes, and seamless video looping when the two bounding frames are identical. We curate a diverse evaluation dataset of image pairs and compare against the closest existing methods. We find that Time Reversal Fusion outperforms related work on all subtasks, exhibiting the ability to generate complex motions and 3D-consistent views guided by bounded frames. See project page at https://time-reversal.github.io.
Multitask Learning and Multistage Fusion for Dimensional Audiovisual Emotion Recognition
Due to its ability to accurately predict emotional state using multimodal features, audiovisual emotion recognition has recently gained more interest from researchers. This paper proposes two methods to predict emotional attributes from audio and visual data using a multitask learning and a fusion strategy. First, multitask learning is employed by adjusting three parameters for each attribute to improve the recognition rate. Second, a multistage fusion is proposed to combine results from various modalities' final prediction. Our approach used multitask learning, employed at unimodal and early fusion methods, shows improvement over single-task learning with an average CCC score of 0.431 compared to 0.297. A multistage method, employed at the late fusion approach, significantly improved the agreement score between true and predicted values on the development set of data (from [0.537, 0.565, 0.083] to [0.68, 0.656, 0.443]) for arousal, valence, and liking.
Provable Dynamic Fusion for Low-Quality Multimodal Data
The inherent challenge of multimodal fusion is to precisely capture the cross-modal correlation and flexibly conduct cross-modal interaction. To fully release the value of each modality and mitigate the influence of low-quality multimodal data, dynamic multimodal fusion emerges as a promising learning paradigm. Despite its widespread use, theoretical justifications in this field are still notably lacking. Can we design a provably robust multimodal fusion method? This paper provides theoretical understandings to answer this question under a most popular multimodal fusion framework from the generalization perspective. We proceed to reveal that several uncertainty estimation solutions are naturally available to achieve robust multimodal fusion. Then a novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF) is proposed, which can improve the performance in terms of classification accuracy and model robustness. Extensive experimental results on multiple benchmarks can support our findings.
Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection
LiDAR and Radar are two complementary sensing approaches in that LiDAR specializes in capturing an object's 3D shape while Radar provides longer detection ranges as well as velocity hints. Though seemingly natural, how to efficiently combine them for improved feature representation is still unclear. The main challenge arises from that Radar data are extremely sparse and lack height information. Therefore, directly integrating Radar features into LiDAR-centric detection networks is not optimal. In this work, we introduce a bi-directional LiDAR-Radar fusion framework, termed Bi-LRFusion, to tackle the challenges and improve 3D detection for dynamic objects. Technically, Bi-LRFusion involves two steps: first, it enriches Radar's local features by learning important details from the LiDAR branch to alleviate the problems caused by the absence of height information and extreme sparsity; second, it combines LiDAR features with the enhanced Radar features in a unified bird's-eye-view representation. We conduct extensive experiments on nuScenes and ORR datasets, and show that our Bi-LRFusion achieves state-of-the-art performance for detecting dynamic objects. Notably, Radar data in these two datasets have different formats, which demonstrates the generalizability of our method. Codes are available at https://github.com/JessieW0806/BiLRFusion.
Unity is Strength: Unifying Convolutional and Transformeral Features for Better Person Re-Identification
Person Re-identification (ReID) aims to retrieve the specific person across non-overlapping cameras, which greatly helps intelligent transportation systems. As we all know, Convolutional Neural Networks (CNNs) and Transformers have the unique strengths to extract local and global features, respectively. Considering this fact, we focus on the mutual fusion between them to learn more comprehensive representations for persons. In particular, we utilize the complementary integration of deep features from different model structures. We propose a novel fusion framework called FusionReID to unify the strengths of CNNs and Transformers for image-based person ReID. More specifically, we first deploy a Dual-branch Feature Extraction (DFE) to extract features through CNNs and Transformers from a single image. Moreover, we design a novel Dual-attention Mutual Fusion (DMF) to achieve sufficient feature fusions. The DMF comprises Local Refinement Units (LRU) and Heterogenous Transmission Modules (HTM). LRU utilizes depth-separable convolutions to align deep features in channel dimensions and spatial sizes. HTM consists of a Shared Encoding Unit (SEU) and two Mutual Fusion Units (MFU). Through the continuous stacking of HTM, deep features after LRU are repeatedly utilized to generate more discriminative features. Extensive experiments on three public ReID benchmarks demonstrate that our method can attain superior performances than most state-of-the-arts. The source code is available at https://github.com/924973292/FusionReID.
On Cross-Layer Alignment for Model Fusion of Heterogeneous Neural Networks
Layer-wise model fusion via optimal transport, named OTFusion, applies soft neuron association for unifying different pre-trained networks to save computational resources. While enjoying its success, OTFusion requires the input networks to have the same number of layers. To address this issue, we propose a novel model fusion framework, named CLAFusion, to fuse neural networks with a different number of layers, which we refer to as heterogeneous neural networks, via cross-layer alignment. The cross-layer alignment problem, which is an unbalanced assignment problem, can be solved efficiently using dynamic programming. Based on the cross-layer alignment, our framework balances the number of layers of neural networks before applying layer-wise model fusion. Our experiments indicate that CLAFusion, with an extra finetuning process, improves the accuracy of residual networks on the CIFAR10, CIFAR100, and Tiny-ImageNet datasets. Furthermore, we explore its practical usage for model compression and knowledge distillation when applying to the teacher-student setting.
Unsupervised Segmentation of Fire and Smoke from Infra-Red Videos
This paper proposes a vision-based fire and smoke segmentation system which use spatial, temporal and motion information to extract the desired regions from the video frames. The fusion of information is done using multiple features such as optical flow, divergence and intensity values. These features extracted from the images are used to segment the pixels into different classes in an unsupervised way. A comparative analysis is done by using multiple clustering algorithms for segmentation. Here the Markov Random Field performs more accurately than other segmentation algorithms since it characterizes the spatial interactions of pixels using a finite number of parameters. It builds a probabilistic image model that selects the most likely labeling using the maximum a posteriori (MAP) estimation. This unsupervised approach is tested on various images and achieves a frame-wise fire detection rate of 95.39%. Hence this method can be used for early detection of fire in real-time and it can be incorporated into an indoor or outdoor surveillance system.
Boosting Multi-modal Model Performance with Adaptive Gradient Modulation
While the field of multi-modal learning keeps growing fast, the deficiency of the standard joint training paradigm has become clear through recent studies. They attribute the sub-optimal performance of the jointly trained model to the modality competition phenomenon. Existing works attempt to improve the jointly trained model by modulating the training process. Despite their effectiveness, those methods can only apply to late fusion models. More importantly, the mechanism of the modality competition remains unexplored. In this paper, we first propose an adaptive gradient modulation method that can boost the performance of multi-modal models with various fusion strategies. Extensive experiments show that our method surpasses all existing modulation methods. Furthermore, to have a quantitative understanding of the modality competition and the mechanism behind the effectiveness of our modulation method, we introduce a novel metric to measure the competition strength. This metric is built on the mono-modal concept, a function that is designed to represent the competition-less state of a modality. Through systematic investigation, our results confirm the intuition that the modulation encourages the model to rely on the more informative modality. In addition, we find that the jointly trained model typically has a preferred modality on which the competition is weaker than other modalities. However, this preferred modality need not dominate others. Our code will be available at https://github.com/lihong2303/AGM_ICCV2023.
LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis
Portrait image animation using audio has rapidly advanced, enabling the creation of increasingly realistic and expressive animated faces. The challenges of this multimodality-guided video generation task involve fusing various modalities while ensuring consistency in timing and portrait. We further seek to produce vivid talking heads. To address these challenges, we present LetsTalk (LatEnt Diffusion TranSformer for Talking Video Synthesis), a diffusion transformer that incorporates modular temporal and spatial attention mechanisms to merge multimodality and enhance spatial-temporal consistency. To handle multimodal conditions, we first summarize three fusion schemes, ranging from shallow to deep fusion compactness, and thoroughly explore their impact and applicability. Then we propose a suitable solution according to the modality differences of image, audio, and video generation. For portrait, we utilize a deep fusion scheme (Symbiotic Fusion) to ensure portrait consistency. For audio, we implement a shallow fusion scheme (Direct Fusion) to achieve audio-animation alignment while preserving diversity. Our extensive experiments demonstrate that our approach generates temporally coherent and realistic videos with enhanced diversity and liveliness.
FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in Abs.Rel on MVSEC and DENSE datasets. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: https://github.com/sunpihai-up/FUSE
MMFformer: Multimodal Fusion Transformer Network for Depression Detection
Depression is a serious mental health illness that significantly affects an individual's well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, the early diagnosis of depression, thanks to the content of social networks, has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. The transformer network with residual connections captures spatial features from videos, and a transformer encoder is exploited to design important temporal dynamics in audio. Moreover, the fusion architecture fused the extracted features through late and intermediate fusion strategies to find out the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% for D-Vlog dataset and 7.74% for LMVD dataset. The code is made available publicly at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.
Fusion-based quantum computation
We introduce fusion-based quantum computing (FBQC) - a model of universal quantum computation in which entangling measurements, called fusions, are performed on the qubits of small constant-sized entangled resource states. We introduce a stabilizer formalism for analyzing fault tolerance and computation in these schemes. This framework naturally captures the error structure that arises in certain physical systems for quantum computing, such as photonics. FBQC can offer significant architectural simplifications, enabling hardware made up of many identical modules, requiring an extremely low depth of operations on each physical qubit and reducing classical processing requirements. We present two pedagogical examples of fault-tolerant schemes constructed in this framework and numerically evaluate their threshold under a hardware agnostic fusion error model including both erasure and Pauli error. We also study an error model of linear optical quantum computing with probabilistic fusion and photon loss. In FBQC the non-determinism of fusion is directly dealt with by the quantum error correction protocol, along with other errors. We find that tailoring the fault-tolerance framework to the physical system allows the scheme to have a higher threshold than schemes reported in literature. We present a ballistic scheme which can tolerate a 10.4% probability of suffering photon loss in each fusion.
Fast and Robust Dynamic Hand Gesture Recognition via Key Frames Extraction and Feature Fusion
Gesture recognition is a hot topic in computer vision and pattern recognition, which plays a vitally important role in natural human-computer interface. Although great progress has been made recently, fast and robust hand gesture recognition remains an open problem, since the existing methods have not well balanced the performance and the efficiency simultaneously. To bridge it, this work combines image entropy and density clustering to exploit the key frames from hand gesture video for further feature extraction, which can improve the efficiency of recognition. Moreover, a feature fusion strategy is also proposed to further improve feature representation, which elevates the performance of recognition. To validate our approach in a "wild" environment, we also introduce two new datasets called HandGesture and Action3D datasets. Experiments consistently demonstrate that our strategy achieves competitive results on Northwestern University, Cambridge, HandGesture and Action3D hand gesture datasets. Our code and datasets will release at https://github.com/Ha0Tang/HandGestureRecognition.
Continuous Convolutional Neural Networks for Disruption Prediction in Nuclear Fusion Plasmas
Grid decarbonization for climate change requires dispatchable carbon-free energy like nuclear fusion. The tokamak concept offers a promising path for fusion, but one of the foremost challenges in implementation is the occurrence of energetic plasma disruptions. In this study, we delve into Machine Learning approaches to predict plasma state outcomes. Our contributions are twofold: (1) We present a novel application of Continuous Convolutional Neural Networks for disruption prediction and (2) We examine the advantages and disadvantages of continuous models over discrete models for disruption prediction by comparing our model with the previous, discrete state of the art, and show that continuous models offer significantly better performance (Area Under the Receiver Operating Characteristic Curve = 0.974 v.s. 0.799) with fewer parameters
Tri-Modal Severity Fused Diagnosis across Depression and Post-traumatic Stress Disorders
Depression and post traumatic stress disorder (PTSD) often co-occur with connected symptoms, complicating automated assessment, which is often binary and disorder specific. Clinically useful diagnosis needs severity aware cross disorder estimates and decision support explanations. Our unified tri modal affective severity framework synchronizes and fuses interview text with sentence level transformer embeddings, audio with log Mel statistics with deltas, and facial signals with action units, gaze, head and pose descriptors to output graded severities for diagnosing both depression (PHQ-8; 5 classes) and PTSD (3 classes). Standardized features are fused via a calibrated late fusion classifier, yielding per disorder probabilities and feature-level attributions. This severity aware tri-modal affective fusion approach is demoed on multi disorder concurrent depression and PTSD assessment. Stratified cross validation on DAIC derived corpora outperforms unimodal/ablation baselines. The fused model matches the strongest unimodal baseline on accuracy and weighted F1, while improving decision curve utility and robustness under noisy or missing modalities. For PTSD specifically, fusion reduces regression error and improves class concordance. Errors cluster between adjacent severities; extreme classes are identified reliably. Ablations show text contributes most to depression severity, audio and facial cues are critical for PTSD, whereas attributions align with linguistic and behavioral markers. Our approach offers reproducible evaluation and clinician in the loop support for affective clinical decision making.
