Daily Papers

by AK and the research community

Nov 21

Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?

We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text, improving robustness to out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.

NAICS-Aware Graph Neural Networks for Large-Scale POI Co-visitation Prediction: A Multi-Modal Dataset and Methodology

Understanding where people go after visiting one business is crucial for urban planning, retail analytics, and location-based services. However, predicting these co-visitation patterns across millions of venues remains challenging due to extreme data sparsity and the complex interplay between spatial proximity and business relationships. Traditional approaches using only geographic distance fail to capture why coffee shops attract different customer flows than fine dining restaurants, even when co-located. We introduce NAICS-aware GraphSAGE, a novel graph neural network that integrates business taxonomy knowledge through learnable embeddings to predict population-scale co-visitation patterns. Our key insight is that business semantics, captured through detailed industry codes, provide crucial signals that pure spatial models cannot explain. The approach scales to massive datasets (4.2 billion potential venue pairs) through efficient state-wise decomposition while combining spatial, temporal, and socioeconomic features in an end-to-end framework. Evaluated on our POI-Graph dataset comprising 94.9 million co-visitation records across 92,486 brands and 48 US states, our method achieves significant improvements over state-of-the-art baselines: the R-squared value increases from 0.243 to 0.625 (a 157 percent improvement), with strong gains in ranking quality (32 percent improvement in NDCG at 10).

  • 6 authors
·
Jul 25

Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB establishes a new Pareto frontier in the tradeoff between communication and performance, offering an efficient and scalable solution for both private and non-private federated fine-tuning. Our code is publicly available at https://github.com/CERT-Lab/fed-sb.

  • 5 authors
·
Feb 21
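The Fed-SB abstract above claims that directly averaging the small matrix R yields exact updates, whereas averaging individual LoRA adapters does not. The sketch below illustrates only that linear-algebra point with toy matrices and exact rational arithmetic. All names (`frac`, `matmul`, `mat_mean`) and all matrix values are illustrative assumptions; this is not the authors' implementation.

```python
from fractions import Fraction

def frac(M):
    """Convert a nested list of ints into a matrix of exact Fractions."""
    return [[Fraction(x) for x in row] for row in M]

def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def mat_mean(mats):
    """Element-wise mean of a list of equally shaped matrices."""
    n = len(mats)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*mats)]

# Fed-SB-style setting: B and A are fixed and shared; each client trains only R_i.
B = frac([[1, 2], [0, 1], [3, -1]])        # 3 x 2 (toy values)
A = frac([[1, 0, 2], [-1, 1, 0]])          # 2 x 3 (toy values)
Rs = [frac([[1, 0], [0, 1]]),
      frac([[2, 1], [0, -1]]),
      frac([[0, 3], [1, 1]])]              # one 2 x 2 matrix per client

ideal = mat_mean([matmul(matmul(B, R), A) for R in Rs])   # mean of full client updates
aggregated = matmul(matmul(B, mat_mean(Rs)), A)           # server averages R only
fed_sb_exact = ideal == aggregated                        # holds by linearity in R

# Contrast: averaging per-client adapters B_i and A_i separately.
Bs = [frac([[1], [0]]), frac([[0], [1]])]
As = [frac([[1, 0]]), frac([[0, 1]])]
ideal_lora = mat_mean([matmul(b, a) for b, a in zip(Bs, As)])
naive = matmul(mat_mean(Bs), mat_mean(As))
fedavg_exact = ideal_lora == naive

print(fed_sb_exact, fedavg_exact)  # True False
```

Because the update B R A is linear in R when B and A are frozen, the mean of the client updates equals the update built from the mean of the R matrices. In this setup each client sends only an r x r matrix per adapted layer, which is consistent with the abstract's statement that communication cost is independent of the number of clients.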

3D radio data visualisation in open science platforms for next-generation observatories

Next-generation telescopes will bring groundbreaking discoveries, but they will also present new technological challenges. The Square Kilometre Array Observatory (SKAO) will be one of the most demanding scientific infrastructures, with a projected data output of 700 PB per year to be distributed to a network of SKA Regional Centres. Current tools are not fully suited to manage such massive data volumes; therefore, new research is required to transform science archives from data providers into service providers. In this paper we examine how a science archive can deliver advanced visualisation capabilities for the SKA science archive. In particular, we have conducted a thorough exploration of existing visualisation software for astronomy and other fields to identify tools capable of addressing Big Data requirements. Using selected technologies, we have developed a prototype archive that provides access to interactive visualisations of 3D radio data through web-based interfaces, adhering to International Virtual Observatory Alliance (IVOA) recommendations to favour interoperability and Open Science practices. In addition, we discuss how current IVOA recommendations support these visualisation capabilities and how they could be expanded. Our prototype archive includes a service to generate 3D models on the fly as a server operation, enabling remote visualisations in a flexible manner; for instance, a set of parameters can be used to customise the models and their visualisation. We have used SKA precursor and pathfinder data to test its usability and scalability, concluding that remote visualisation is a viable solution for handling high-volume data. However, our prototype is constrained by memory limitations, requiring techniques to reduce memory usage.

  • 7 authors
·
Mar 20

Understanding the Neutron Star Population with the SKA

Since their discovery in the late 1960s, the population of known neutron stars (NSs) has grown to ~2500. The last five decades of observations have yielded many surprises and demonstrated that the observational properties of NSs are remarkably diverse. The surveys that will be performed with SKA (the Square Kilometre Array) will produce a further tenfold increase in the number of Galactic NSs known. Moreover, the SKA's broad spectral coverage, sub-arraying and multi-beaming capabilities will allow us to characterise these sources with unprecedented efficiency, in turn enabling a giant leap in the understanding of their properties. Here we review the NS population and outline our strategies for studying each of the growing number of diverse classes that are populating the "NS zoo". Some of the main scientific questions that will be addressed by the much larger statistical samples and vastly improved timing efficiency provided by SKA include: (i) the spin period and spin-down rate distributions (and thus magnetic fields) at birth, and the associated information about the SNe wherein they are formed; (ii) the radio pulsar-magnetar connection; (iii) the link between normal radio pulsars, intermittent pulsars and rotating radio transients; (iv) the slowest possible spin period for a radio pulsar (revealing the conditions at the pulsar death-line); (v) proper motions of pulsars (revealing SN kick physics); (vi) the mass distribution of NSs; (vii) the fastest possible spin period for a recycled pulsar (constraining magnetosphere-accretion disc interactions, gravitational wave radiation and the equation-of-state); (viii) the origin of high eccentricity millisecond pulsars (MSPs); (ix) the formation channels for recently identified triple systems; and finally (x) how isolated MSPs are formed. We expect that the SKA will break new ground unveiling exotic systems that will challenge... [abridged]

  • 12 authors
·
Dec 30, 2014

CayleyPy Growth: Efficient growth computations and hundreds of new conjectures on Cayley graphs (Brief version)

This is the third paper of the CayleyPy project applying artificial intelligence to problems in group theory. We announce the first public release of CayleyPy, an open-source Python library for computations with Cayley and Schreier graphs. Compared with systems such as GAP and Sage, CayleyPy handles much larger graphs and performs several orders of magnitude faster. Using CayleyPy we obtained about 200 new conjectures on Cayley and Schreier graphs, focused on diameters and growth. For many Cayley graphs of symmetric groups S_n we observe quasi-polynomial diameter formulas: a small set of quadratic or linear polynomials indexed by n mod s. We conjecture that this is a general phenomenon, giving efficient diameter computation despite the problem being NP-hard. We propose a refinement of the Babai-type conjecture on diameters of S_n: an upper bound of n^2/2 + 4n in the undirected case, compared to previous O(n^2) bounds. We also provide explicit generator families, related to involutions in a square-with-whiskers pattern, conjectured to maximize the diameter; search confirms this for all n up to 15. We further conjecture an answer to a question posed by V. M. Glushkov in 1968 on directed Cayley graphs generated by a cyclic shift and a transposition. For nilpotent groups we conjecture an improvement of J. S. Ellenberg's results on upper unitriangular matrices over Z/pZ, showing linear dependence of diameter on p. Moreover, some conjectures are LLM-friendly, naturally stated as sorting problems verifiable by algorithms or Python code. To benchmark path finding we created more than 10 Kaggle datasets. CayleyPy works with arbitrary permutation or matrix groups and includes over 100 predefined generators. Our growth computation code outperforms GAP and Sage by up to 1000 times in speed and size.

  • 49 authors
·
Sep 23
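For readers unfamiliar with the terms "growth" and "diameter" in the CayleyPy abstract above, the following minimal sketch computes both for a small Cayley graph of S_n using a plain breadth-first search. It deliberately does not use the CayleyPy library or assume anything about its API; the function names (`swap_adjacent`, `growth`, `adjacent_transposition_growth`) are hypothetical, and the generator set (adjacent transpositions) is chosen only because its answer is classically known.

```python
def swap_adjacent(perm, i):
    """Return perm with positions i and i+1 exchanged (an adjacent transposition)."""
    p = list(perm)
    p[i], p[i + 1] = p[i + 1], p[i]
    return tuple(p)

def growth(start, generators):
    """Breadth-first search from `start`; returns the number of elements at each distance."""
    seen = {start}
    frontier = [start]
    layers = []
    while frontier:
        layers.append(len(frontier))
        next_frontier = []
        for state in frontier:
            for gen in generators:
                new_state = gen(state)
                if new_state not in seen:
                    seen.add(new_state)
                    next_frontier.append(new_state)
        frontier = next_frontier
    return layers

def adjacent_transposition_growth(n):
    """Growth of the Cayley graph of S_n generated by the n-1 adjacent transpositions."""
    gens = [lambda p, i=i: swap_adjacent(p, i) for i in range(n - 1)]
    return growth(tuple(range(n)), gens)

layers = adjacent_transposition_growth(4)
diameter = len(layers) - 1
print(layers, diameter)  # [1, 3, 5, 6, 5, 3, 1] 6
```

For this generator set the distance from the identity equals the number of inversions, so the diameter is n(n-1)/2, which is the bubble-sort view of the "sorting problem" framing mentioned in the abstract. A brute-force search like this visits all n! elements and is only feasible for small n.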

DESI 2024 V: Full-Shape Galaxy Clustering from Galaxies and Quasars

We present the measurements and cosmological implications of the galaxy two-point clustering using over 4.7 million unique galaxy and quasar redshifts in the range 0.1 < z < 2.1, divided into six redshift bins over a ~7,500 square degree footprint, from the first year of observations with the Dark Energy Spectroscopic Instrument (DESI Data Release 1). By fitting the full power spectrum, we extend previous DESI DR1 baryon acoustic oscillation (BAO) measurements to include redshift-space distortions and signals from the matter-radiation equality scale. For the first time, this Full-Shape analysis is blinded at the catalogue level to avoid confirmation bias, and the systematic errors are accounted for at the two-point clustering level, which automatically propagates them into any cosmological parameter. When analysing the data in terms of compressed model-agnostic variables, we obtain a combined precision of 4.7% on the amplitude of the redshift-space distortion signal, reaching with just one year of DESI data a precision similar to that obtained with 20 years of observation from previous-generation surveys. We analyse the data to directly constrain the cosmological parameters within the ΛCDM model using perturbation theory and combine this information with the reconstructed DESI DR1 galaxy BAO. Using a Big Bang Nucleosynthesis Gaussian prior on the baryon density parameter, and a Gaussian prior on the spectral index, we constrain the matter density to Ω_m = 0.296 ± 0.010 and the Hubble constant to H_0 = (68.63 ± 0.79) km s^-1 Mpc^-1. Additionally, we measure the amplitude of clustering σ_8 = 0.841 ± 0.034. The DESI DR1 results are in agreement with the ΛCDM model based on general relativity with parameters consistent with those from Planck. The cosmological interpretation of these results in combination with external datasets is presented in a companion paper.

  • 198 authors
·
Nov 18, 2024