Commit `271125a` (parent `7f77d8b`): Summary of changes to `README.md`
We also appreciate the benchmarks: [Fox](https://github.com/ucaslcl/Fox), [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
---

## 🔍 Summary of Changes

### 1. **Formatting & Clean-Up**

* Removed extra spaces, blank lines, and inconsistent indentation.
* Fixed small style issues (like missing spaces in comments).
* Added a missing newline at the end of the file.

---

### 2. **Device and Dtype Handling**

* Added automatic device detection:

  ```python
  model_device = next(self.parameters()).device
  ```
* Added adaptive dtype logic:

  ```python
  image_dtype = torch.bfloat16 if model_device.type == "cuda" else torch.float32
  ```
* Replaced all hardcoded `.cuda()` and `.to(torch.bfloat16)` calls with:

  ```python
  .to(model_device)
  .to(image_dtype)
  ```

✅ **Now works automatically on both GPU and CPU**, without device-mismatch errors.
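Put together, the pattern reads roughly as below. This is a minimal sketch, not the repository's actual code: `nn.Linear` stands in for the real model class, and the tensor names are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in for the real model; any nn.Module with parameters works the same way.
model = nn.Linear(4, 4)

# Detect the device the model's weights actually live on (CPU or CUDA).
model_device = next(model.parameters()).device

# Match the dtype to the device: bfloat16 on CUDA, float32 on CPU.
image_dtype = torch.bfloat16 if model_device.type == "cuda" else torch.float32

# Move inputs the same way the hardcoded .cuda()/.to(torch.bfloat16) calls
# were replaced: first to the model's device, then to the adaptive dtype.
image_tensor = torch.randn(1, 3, 4, 4).to(model_device).to(image_dtype)
```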
---

### 3. **Autocast and Inference Improvements**

* Wrapped generation in a conditional autocast block:

  ```python
  use_autocast = model_device.type == "cuda"
  if use_autocast:
      with torch.autocast("cuda", dtype=torch.bfloat16):
          with torch.no_grad():
              ...
  else:
      with torch.no_grad():
          ...
  ```
* Reduces memory usage and speeds up inference on GPU.
* Added `torch.no_grad()` for safer evaluation (no gradient tracking).
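As a side note, `torch.autocast` also accepts an `enabled=` flag, so the two branches can be collapsed into one block with identical behaviour. A sketch, with `model_device` hardcoded to CPU here only so the snippet is self-contained:

```python
import torch

model_device = torch.device("cpu")  # stand-in for next(model.parameters()).device
use_autocast = model_device.type == "cuda"

# enabled=False turns the autocast context into a no-op, so a single
# block covers both the GPU and CPU paths.
with torch.autocast(model_device.type, dtype=torch.bfloat16, enabled=use_autocast):
    with torch.no_grad():
        result = torch.ones(2, 2) @ torch.ones(2, 2)
```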
---

### 4. **Image Preprocessing**

* All image tensors now use:

  ```python
  .to(image_dtype)
  ```

  instead of hardcoded `torch.bfloat16`.
* Improves flexibility and prevents dtype errors when running on CPU.

---

### 5. **Generation Parameter Updates**

* Adjusted text-generation settings:

  ```python
  do_sample = False
  num_beams = 1
  max_new_tokens = 4096  # was 8192
  min_new_tokens = 1
  repetition_penalty = 1.2
  pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id
  ```

🧠 Result: faster, more controlled generation that avoids repetitive or runaway outputs.
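Collected into one place, these settings might be passed to `generate()` as a kwargs dict. A hypothetical sketch: `_Tok` stands in for the real tokenizer, and the token IDs are made up.

```python
# Hypothetical stand-in for the real tokenizer object; some tokenizers
# define no pad token, so the code falls back to the eos token.
class _Tok:
    pad_token_id = None
    eos_token_id = 2

tokenizer = _Tok()

gen_kwargs = dict(
    do_sample=False,       # greedy decoding: deterministic output
    num_beams=1,
    max_new_tokens=4096,   # halved from 8192 to bound generation length
    min_new_tokens=1,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
)
# Usage: output_ids = model.generate(input_ids, **gen_kwargs)
```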
---

### 6. **Safer Decoding**

* Cleaned up the decoding logic:

  ```python
  input_length = input_ids.unsqueeze(0).to(model_device).shape[1]
  outputs = tokenizer.decode(output_ids[0, input_length:])
  ```

✅ Avoids CUDA-specific assumptions; consistent across devices.
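The slicing idea can be illustrated device-free with plain token lists (the IDs below are illustrative, not from any real tokenizer):

```python
# generate() returns the prompt tokens followed by the new tokens, so
# decoding must skip the first input_length positions.
prompt_ids = [101, 7, 8, 9]               # tokens fed to generate()
output_ids = [[101, 7, 8, 9, 42, 43, 2]]  # batch of 1: prompt + continuation

input_length = len(prompt_ids)
generated = output_ids[0][input_length:]  # only the newly generated tokens
```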
---

### 7. **Miscellaneous**

* Added helpful comments for clarity.
* Improved readability around image transformation and the saving of results.
* Added extra blank lines for cleaner structure.

---

## ⚙️ Overall Impact

| Category        | Before                | After                                    |
| --------------- | --------------------- | ---------------------------------------- |
| Device handling | Hardcoded `.cuda()`   | Auto-detected and flexible               |
| Dtype           | Always bfloat16       | Adaptive: bfloat16 (GPU) / float32 (CPU) |
| Inference       | Could crash on CPU    | Runs safely everywhere                   |
| Generation      | Unbounded, repetitive | Tuned and stable                         |
| Readability     | Mixed formatting      | Clean and consistent                     |

---

## Citation

```bibtex