---
license: apache-2.0
base_model:
- mistralai/Mistral-7B-Instruct-v0.3
---
# Introduction

## Task Description
This model was fine-tuned to improve its ability to perform spatial
reasoning tasks. The objective is to enable the model to interpret natural language queries about
spatial relationships, directions, and locations, and to output actionable responses.
The task addresses limitations in current LLMs, which often fail to perform precise spatial reasoning,
such as determining relationships between points on a map, planning routes, or identifying locations
based on bounding boxes. 

## Task Importance
Spatial reasoning is important for a wide range of applications such as navigation and geospatial
analysis. Many smaller LLMs, while strong in general reasoning, often lack the ability to interpret
spatial relationships with precision or utilize real-world geographic data effectively. For example, they
struggle to answer queries like “What’s between Point A and Point B?” or “Find me the fastest route
avoiding traffic at 8 AM tomorrow.” Even when the LLM has access to geospatial information, smaller models struggle to correctly
interpret user questions.

## Related Work/Gap Analysis

While there is ongoing research in integrating LLMs with geospatial systems, most existing solutions
rely on symbolic AI or rule-based systems rather than leveraging the generalization capabilities of
LLMs. Additionally, the paper “Advancing Spatial Reasoning in Large Language Models: An In-Depth
Evaluation and Enhancement Using the StepGame Benchmark” concluded that larger models like
GPT-4 perform well at mapping natural language descriptions to spatial relations but struggle with
multi-hop reasoning. That paper used StepGame as its benchmark for spatial reasoning.
Fine-tuning a model fills the gap identified in the paper, as the only solution proposed in that
research was prompt engineering with Chain-of-Thought.

Research by organizations like OpenAI and Google has focused on improving contextual reasoning
through fine-tuning, but there is limited work targeting spatial reasoning.

## Main Results

The fine-tuned model improved slightly on general knowledge tasks such as MMLU Geography and bAbI Task 17
compared to the original Mistral-7B base model. However, its performance on spatial reasoning benchmarks like SpatialEval
declined significantly, suggesting an incompatibility between the Q/A prompt style used for training on StepGame
and the multiple-choice format of SpatialEval.

# Training Data

For this fine-tuning task, the StepGame dataset was used. This dataset is large and provides multi-step
spatial reasoning challenges. The train-test split is predefined, with 50,000 rows in the train split and 10,000 in the test
split. It focuses on multi-step problem-solving over spatial relationships, such as directional logic,
relative positioning, and route-based reasoning. It presents text-based tasks that require stepwise
deductions, ensuring the model develops strong reasoning abilities beyond simple fact recall.
Each example follows a story, question, and answer template, as depicted below.

![Description of StepGame Training Data](https://huggingface.co/sareena/spatial_lora_mistral/resolve/main/stepgame.jpg)
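
To make the template concrete, below is a minimal sketch of how a StepGame-style record could be flattened into the Q/A prompt format used elsewhere in this card. The field names (`story`, `question`, `answer`) are assumptions about the dataset schema rather than verified column names.

```python
# Minimal sketch: flatten a StepGame-style record into a Q/A training prompt.
# The field names ("story", "question", "answer") are assumed, not verified
# against the actual dataset schema.

def format_stepgame_example(example: dict) -> str:
    story = example["story"]
    if isinstance(story, list):  # some variants store the story as a list of sentences
        story = " ".join(story)
    return f"Q: {story} {example['question']}\nA: {example['answer']}"

sample = {
    "story": ["X is to the left of Y.", "Z is above Y."],
    "question": "What is the relation of Z to X?",
    "answer": "upper-right",
}
print(format_stepgame_example(sample))
```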


# Training Method

For this spatial reasoning task, LoRA (Low-Rank Adaptation) was used as the training method.
LoRA allows efficient fine-tuning of large language models by freezing the majority of the model weights and updating only small,
low-rank adapter matrices within the attention layers. It significantly reduces the computational cost and memory requirements of full
fine-tuning, making it well suited to limited GPU resources. LoRA is especially effective for task-specific
adaptation when the dataset is moderately sized and the instruction formatting is consistent, as is the case with StepGame.
In earlier experiments with spatial reasoning fine-tuning, LoRA performed better than prompt tuning: while prompt tuning resulted in close to 0% accuracy on both the StepGame and
MMLU evaluations, LoRA preserved partial task performance (18% accuracy) and retained some general knowledge ability (46% accuracy on
MMLU Geography vs. 52% before training). I used a learning rate of 1e-4, a batch size of 4, and trained for 3 epochs.
This setup preserved general reasoning ability while improving spatial accuracy.
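
For reference, here is a minimal sketch of the LoRA setup described above using the `peft` and `transformers` libraries. Only the learning rate, batch size, and epoch count come from this card; the adapter rank, alpha, dropout, and target modules are illustrative assumptions, not the exact values used.

```python
# Sketch of the LoRA fine-tuning setup (lr 1e-4, batch size 4, 3 epochs).
# Rank, alpha, dropout, and target modules are assumed for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

lora_config = LoraConfig(
    r=16,                                 # assumed adapter rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only adapter weights are trainable

training_args = TrainingArguments(
    output_dir="spatial_lora_mistral",
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    num_train_epochs=3,
)
# training_args would then be passed to a Trainer together with the
# formatted StepGame examples.
```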

# Evaluation

| Model                                           | MMLU-Geography (%) | SpatialEval (%) | bAbI Task 17 (%) |
|-------------------------------------------------|--------------------|-----------------|------------------|
| mistralai/Mistral-7B-Instruct-v0.3 (base model) | 75.63              | 36.07           | 51.00            |
| sareena/spatial_lora_mistral (fine-tuned)       | 76.17              | 0.00            | 53.00            |
| meta-llama/Llama-2-7b-hf                        | 42.42              | 18.25           | 48.00            |
| google/gemma-7b                                 | 80.30              | 7.01            | 58.00            |

## Benchmark Tasks

### SpatialEval Benchmark

This benchmark provides more realistic question-answer pairs than the
StepGame benchmark. It uses place names instead of abstract letter placeholders,
but still requires the same multi-step geospatial reasoning capability.
It complements StepGame by testing broader spatial logic and more realistic
scenarios.

### bAbI Dataset (Task 17)

This benchmark was introduced by Facebook AI Research. It includes 20
synthetic question-answering tasks designed to evaluate various reasoning abilities in
models. Task 17 ("Positional Reasoning") specifically assesses
spatial reasoning through textual descriptions. Pathfinding in combination with spatial reasoning helps assess
the model’s performance on tasks such as calculating routes, which would be a common
application for a model fine-tuned for geospatial reasoning.


### MMLU benchmark geography and environmental subset

MMLU is a comprehensive evaluation suite designed to assess a model's
reasoning and knowledge across a wide range of subjects. Focusing on the geography
and environmental science subsets offers an opportunity to test both
domain-specific knowledge and broader reasoning abilities relevant to geospatial tasks.
The benchmark consists of multiple-choice questions covering topics
from elementary to advanced levels. Here it is used to assess the model’s general performance
after fine-tuning, and its ability to apply knowledge in the subjects most relevant to its
fine-tuning task.
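
As an illustration of how such multiple-choice items can be scored, one common approach is to compare the model's log-likelihood of each answer letter and pick the highest. This is a generic sketch, not the exact harness behind the numbers above.

```python
# Generic sketch: score a multiple-choice item by comparing the model's
# log-likelihood of each answer letter. Not the exact evaluation harness
# used for the table above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sareena/spatial_lora_mistral"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def pick_choice(question: str, choices: list) -> str:
    letters = "ABCD"[: len(choices)]
    prompt = question + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip(letters, choices)
    ) + "\nAnswer:"
    scores = []
    for letter in letters:
        ids = tokenizer(prompt + " " + letter, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # log-probability assigned to the final token (the answer letter)
        log_probs = torch.log_softmax(logits[0, -2], dim=-1)
        scores.append(log_probs[ids[0, -1]].item())
    return letters[scores.index(max(scores))]
```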


## Comparison Models
LLaMA-2 and Gemma represent strong alternatives from Meta and Google respectively, offering diverse architectural approaches with a similar number of parameters and 
training data sources. Including these models allowed for a more meaningful evaluation of how my fine-tuned model performs 
not just against its own baseline, but also against state-of-the-art peers on spatial reasoning and general knowledge tasks.

# Usage and Intended Uses
This model is designed to assist with natural language spatial reasoning, particularly in tasks that involve multi-step relational 
inference between objects or locations described in text. This could be implemented in agentic spatial systems and/or text-based game bots.



```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained("sareena/spatial_lora_mistral")
tokenizer = AutoTokenizer.from_pretrained("sareena/spatial_lora_mistral")

# Ask a spatial reasoning question in the Q/A prompt style used for training
inputs = tokenizer("Q: The couch is to the left of the table. The lamp is on the couch. Where is the lamp?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
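
If the repository hosts only the LoRA adapter weights rather than fully merged weights (an assumption worth checking), the adapter can instead be loaded on top of the base model with `peft`:

```python
# Alternative loading path, assuming the repo contains only LoRA adapter
# weights: attach the adapter to the base Mistral model with peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base_model, "sareena/spatial_lora_mistral")
```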

# Prompt Format
The model is trained on instruction-style input with a spatial reasoning question:

```text
Q: The couch is to the left of the table. The lamp is on the couch. Where is the lamp in relation to the table?
```

# Expected Output Format
The output is a short, natural language spatial answer:

```text
A: left 
```
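
For downstream scoring or agent use, the free-form answer can be normalized to a canonical relation label. The sketch below assumes a label set similar to StepGame's eight directions plus overlap.

```python
# Minimal sketch: map a free-form answer such as "A: left" to a canonical
# relation label. The label set below is an assumption for illustration.
RELATIONS = [
    "upper-left", "upper-right", "lower-left", "lower-right",
    "left", "right", "above", "below", "overlap",
]

def normalize_answer(text: str):
    answer = text.split("A:")[-1].strip().lower()
    # Compound labels are listed first so "upper-left" is not matched as "left"
    for relation in RELATIONS:
        if relation in answer:
            return relation
    return None

print(normalize_answer("A: left"))  # -> "left"
```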

# Limitations 

The model is still limited in deep reasoning capabilities and sometimes fails multi-hop spatial tasks.
LoRA helps balance this trade-off, but fine-tuning on more diverse spatial tasks could yield stronger generalization.
Performance on the SpatialEval benchmark dropped drastically due to the incompatibility between the prompt style used for training and the
multiple-choice formatting in SpatialEval. Future work to remediate this would be to test more prompt formats during training or to use instruction-tuned datasets
more similar to the downstream evaluations.

## Citation 

1. **Hendrycks, Dan**, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.  
   *"Measuring Massive Multitask Language Understanding."* arXiv preprint arXiv:2009.03300 (2020).

2. **Li, Fangjun**, et al.  
   *"Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark."* arXiv preprint arXiv:2401.03991 (2024). [https://arxiv.org/abs/2401.03991](https://arxiv.org/abs/2401.03991)

3. **Mirzaee, Roshanak**, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi.  
   *"SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning."* arXiv preprint arXiv:2104.05832 (2021).

4. **Shi, Zhengxiang**, et al.  
   *"StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts."* arXiv preprint arXiv:2204.08292 (2022). [https://arxiv.org/abs/2204.08292](https://arxiv.org/abs/2204.08292)

5. **Wang, Mila**, Xiang Lorraine Li, and William Yang Wang.  
   *"SpatialEval: A Benchmark for Spatial Reasoning Evaluation."* arXiv preprint arXiv:2104.08635 (2021).

6. **Weston, Jason**, Antoine Bordes, Sumit Chopra, and Tomas Mikolov.  
   *"Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks."* arXiv preprint arXiv:1502.05698 (2015).