brillm05 committed
Commit a075c80 · Parent(s): 9761729
update readme

Files changed:
- README.md +63 -33
- figs/fig1.png +2 -2
- figs/fig2.png +2 -2
- figs/fig3.png +2 -2
- figs/fig4.png +2 -2
- figs/fig5.png +2 -2
- figs/tab1.png +3 -0
- figs/{fig6.png → tab2.png} +2 -2
- figs/tab3.png +3 -0
- figs/tab4.png +3 -0
- figs/tab5.png +3 -0
README.md
CHANGED

@@ -4,72 +4,104 @@ We release BriLLM-Chinese and BriLLM-English.

Our paper: https://arxiv.org/pdf/2503.11299

Our huggingface: https://huggingface.co/BriLLM/BriLLM0.5

## Overview

- This work introduces the first brain-inspired large language model (BriLLM). It is a non-Transformer, non-GPT generative language model that is not controlled by the traditional machine-learning input-output paradigm. The model is based on the Signal Fully-connected flowing (SiFu) definition of a neural network over a directed graph, and every node in the model's graph is interpretable, unlike traditional machine learning models, which offer only limited interpretability at the input and output ends.

- ## SiFu Mechanism



- > As shown in Figure 1, the SiFu model is a graph composed of multiple nodes, which are sparsely activated and use tensors to transmit a nominal signal.
- Each node (ideally, a layer of neurons) represents a certain concept or word, e.g., a noun, a verb, etc.
- Each edge models the relationship between every node pair.
- The signal is transmitted by the magnitude of its energy. The energy is strengthened, i.e., maximized, along the right route; or, at least, the right path always keeps the maximal energy for the transmitted signal.
- Each node is activated sequentially according to the maximized energy.
- The route or path is determined in a competitive way, i.e., the next node is activated only if the energy can be maximally delivered to it.

- ## Architecture



- > As shown in Figure 2, BriLLM implements the SiFu neural network for language modeling.
- Each token in the vocabulary is modeled as a node, which is defined by a hidden layer of neurons in the neural network.

- ## Training Network



- > To train a sample in BriLLM, we build an individual common neural network each time to perform regular backpropagation (BP) training. This network consists of two parts: the front part connects all input nodes (i.e., tokens), followed by the rear part, which connects all possible paths in order. Finally, a softmax layer collects all paths' energy tensors to indicate the right path with a 0-1 ground-truth vector. We adopt a cross-entropy loss for training.

- We use a subset of the Chinese version of Wikipedia, which contains over 100M Chinese characters. We truncate long sentences into shorter sentences with a maximum length of 16.
- We select a vocabulary of 4,000 tokens consisting of the most frequently used Chinese characters.

- ## Implementation Details
- BriLLM is implemented using PyTorch.
- It uses sinusoidal positional encoding, GeLU as the activation function, cross-entropy loss for next-token prediction, and an embedding size of $d_{model} = 32$.
- We used the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.
- The model size is about $512 + 4000 * 4000 * (32 * 32 + 32) \approx 16B$.
- We trained our models on one machine with 8 NVIDIA A800 GPUs for 1.5k steps.



- ## Complexity
- $n$ is the sequence length, $v$ is the vocabulary size, and $d$ is the representation dimension. The computational complexity is $O(n \cdot v \cdot d^2)$.


- ## Case Study

- 


## Installation

```bash
pip install torch
```

- ##

[BriLLM0.5](https://huggingface.co/BriLLM/BriLLM0.5)

- ##

### BriLLM-Chinese
```bash
bash run_zh.sh

@@ -81,8 +113,8 @@ bash run_en.sh

```

## Inference

### BriLLM-Chinese
```python
import json

@@ -121,7 +153,6 @@ decode_sentence = decode_tuple_list[0][0] + "".join([p[-1] for p in decode_tuple

print(decode_sentence)
```

### BriLLM-English
```python
import json

@@ -141,7 +172,6 @@ def decode_en_sentence(head, max_token=32, do_sample=False):

    decode_sentence = decode_tuple_list[0][0] + "".join([p[-1] for p in decode_tuple_list])
    return decode_sentence

with open("./vocab_wiki_4k_en.json") as f:
    node_dict = json.load(f)
vocab = Vocab.from_node_dict(node_dict)


Our paper: https://arxiv.org/pdf/2503.11299

+ Our GitHub: https://github.com/brillm05/BriLLM0.5
+
Our huggingface: https://huggingface.co/BriLLM/BriLLM0.5

## Overview

+ BriLLM redefines the foundations of generative language modeling by departing from Transformer architectures, GPT frameworks, and traditional input-output constrained paradigms. Built on the Signal Fully-connected flowing (SiFu) mechanism, a directed graph-based neural network design, BriLLM enables full interpretability across all nodes, in contrast to conventional models limited to input-output interpretability.
+
+ In this framework, tokens are represented as graph nodes, with signal flows (either randomly initialized or user-defined) propagating along paths following a "least resistance" principle. The next token to be generated emerges as the target of this signal flow. Theoretically, BriLLM supports infinitely long n-gram modeling, with model size decoupled from input and prediction length. Its signal propagation dynamics mimic human-like cognitive patterns, enabling recall activation and inherent multi-modal compatibility.
+
+ 
+
+
+ ## SiFu Mechanism



+ The SiFu (Signal Fully-connected Flowing) mechanism addresses fundamental limitations of current machine learning frameworks. Unlike traditional models that process discrete input streams through opaque computations, SiFu operates on a fully connected directed graph where:
+
+ - Each node represents an interpretable unit (token, concept, etc.)
+ - Signal tensors propagate through the graph following energy dynamics
+ - The next token is determined by maximizing signal energy
+ - All nodes can serve as both input and output interfaces



+ Signal propagation follows the principle:
+
+ $v_i = \arg\max_{v'} \left\| r \oplus v_1 \otimes e_{12} \oplus v_2 \ldots \oplus v' \right\|$
+
+ where $\oplus$ and $\otimes$ denote tensor operations for node and edge interactions, and $\|\cdot\|$ represents signal energy.
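
For illustration, a minimal sketch of this selection rule, modeling $\otimes$ as a matrix product and $\oplus$ as addition over toy random tensors (names and shapes are assumptions, not the released implementation):

```python
# Toy sketch of the SiFu routing principle: extend the current path with each
# candidate node, measure the signal "energy" (a tensor norm), and pick the argmax.
import torch

d = 32                                       # toy node dimension
candidates = 5                               # toy number of candidate next nodes

r = torch.ones(d)                            # signal carried by the path so far
node_states = torch.randn(candidates, d)     # candidate node tensors v'
edge_mats = torch.randn(candidates, d, d)    # edge tensors connecting to each candidate

def path_energy(signal, edge, node):
    # "otimes" modeled as a matrix-vector product, "oplus" as addition; energy is the L2 norm.
    return torch.linalg.norm(edge @ signal + node)

energies = torch.stack([path_energy(r, edge_mats[i], node_states[i])
                        for i in range(candidates)])
next_node = int(torch.argmax(energies))      # the route continues through the max-energy node
```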
+
+ Overall, SiFu's design as a directed fully connected graph with signal propagation confers two key advantages:
+
+ 1. **Inherent full interpretability**: User-defined entities (concepts, tokens, or interpretable units) map directly to specific graph nodes;
+ 2. **Unbounded contextual capacity**: Prediction is framed as signal propagation through node activations. Because signals propagate freely across nodes, sequence prediction naturally supports arbitrarily long contexts without increasing model size.
+
+
+ ## Architecture



+ BriLLM implements the SiFu mechanism where each vocabulary token corresponds to a node defined by a GeLU-activated neuron layer with bias $b \in \mathbb{R}^{d_{node}}$. Edges between nodes are modeled as fully connected matrices $W_{u,v} \in \mathbb{R}^{d_{node} \times d_{node}}$, enabling bidirectional signaling.
+
+ Signal propagation begins with initial tensor $e_0 = [1, 1, \ldots, 1]^T \in \mathbb{R}^{d_{node}}$ and follows:
+
+ $e_{i+1} = \text{GeLU}(W_{u_i,u_{i+1}} e_i + b_{u_i,u_{i+1}} + PE_i)$
+
+ The final prediction maximizes the L2 norm: $v_{predict} = \arg\max_v \|E_{u,v}\|_2$
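
A compact sketch of this propagation and scoring step, with toy shapes and a simple sinusoidal `positional_encoding` helper (all names, shapes, and the parameter layout are assumptions, not the repository code):

```python
# One BriLLM-style decoding step: propagate the signal along the path with
# GeLU(W e + b + PE), then score every candidate next token by the L2 norm.
import math
import torch
import torch.nn.functional as F

d_model, vocab_size = 32, 4000

def positional_encoding(pos: int, d: int) -> torch.Tensor:
    pe = torch.zeros(d)
    for i in range(0, d, 2):
        pe[i] = math.sin(pos / 10000 ** (i / d))
        pe[i + 1] = math.cos(pos / 10000 ** (i / d))
    return pe

path = [17, 256, 3]                                  # token ids seen so far (toy)
edges = list(zip(path, path[1:]))
W = {pair: torch.randn(d_model, d_model) for pair in edges}   # hypothetical edge weights
b = {pair: torch.randn(d_model) for pair in edges}            # hypothetical edge biases

e = torch.ones(d_model)                              # e_0 = [1, 1, ..., 1]
for i, pair in enumerate(edges):                     # e_{i+1} = GeLU(W e_i + b + PE_i)
    e = F.gelu(W[pair] @ e + b[pair] + positional_encoding(i, d_model))

# Candidate edges out of the last node; the prediction maximizes ||E_{u,v}||_2.
cand_W = torch.randn(vocab_size, d_model, d_model)   # stand-in for W[(last, v)] over all v
cand_b = torch.randn(vocab_size, d_model)
E = F.gelu(torch.einsum('vij,j->vi', cand_W, e) + cand_b)
v_predict = int(torch.argmax(E.norm(dim=-1)))
```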
+
+
+ ## Training Network



+ Training BriLLM involves constructing a dedicated neural network for each sequence sample. The network connects input nodes sequentially, with all potential paths integrated into a final softmax layer that identifies the correct path via cross-entropy loss optimization.
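
The objective can be pictured roughly as follows: at each step, the energies of all candidate path extensions act as logits over the vocabulary, and cross-entropy pushes the ground-truth path to carry the maximal energy. The sketch below is a dense toy version (per-sample network construction, positional encoding, and the sparse parameter sharing described later are omitted; all names and shapes are assumptions):

```python
# Toy training step: energies of all candidate edges serve as logits,
# trained with cross-entropy against the ground-truth next token.
import torch
import torch.nn.functional as F

d_model, vocab_size = 32, 50            # tiny vocabulary for illustration
sample = [4, 11, 7, 23]                 # token ids of one training sentence

# Hypothetical dense edge parameters from every node u to every node v.
W = torch.nn.Parameter(torch.randn(vocab_size, vocab_size, d_model, d_model) * 0.02)
b = torch.nn.Parameter(torch.zeros(vocab_size, vocab_size, d_model))
opt = torch.optim.AdamW([W, b], lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

e = torch.ones(d_model)                 # initial signal
loss = torch.zeros(())
for u, gold_next in zip(sample, sample[1:]):
    E = F.gelu(torch.einsum('vij,j->vi', W[u], e) + b[u])    # energy tensor per candidate v
    logits = E.norm(dim=-1)                                  # path energies as logits
    loss = loss + F.cross_entropy(logits.unsqueeze(0), torch.tensor([gold_next]))
    e = F.gelu(W[u, gold_next] @ e + b[u, gold_next])        # follow the ground-truth edge

opt.zero_grad()
loss.backward()
opt.step()
```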

+ ## Implementation Details
+
+ BriLLM is implemented using PyTorch. It uses sinusoidal positional encoding, GeLU as the activation function, cross-entropy loss for next-token prediction, and an embedding size of $d_{model} = 32$. We used the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. The model size is about $512 + 4000 * 4000 * (32 * 32 + 32) \approx 16B$. We trained our models on one machine with 8 NVIDIA A800 GPUs for 1.5k steps.
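
As a quick check of the size formula (reading it as one $d \times d$ matrix plus a $d$-dimensional bias per ordered token pair, plus a small constant term; that reading is an assumption):

```python
# Back-of-the-envelope parameter count for d_model = 32 and a 4,000-token vocabulary.
d, vocab = 32, 4000
per_edge = d * d + d                    # 1,056 parameters per (u, v) edge
total = 512 + vocab * vocab * per_edge  # 16,896,000,512, i.e., roughly 16.9B
print(per_edge, total)
```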



+ BriLLM leverages sparse token co-occurrence: most bigrams are low-frequency or absent, allowing shared parameters for inactive edges. Low-frequency bigrams use a fixed, non-updated matrix, reducing model size to 2B (Chinese) and 1B (English), i.e., 13.0\% and 5.7\% of the original size, respectively. This reduces parameters by ~90\% while accelerating training.
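
One way such sharing could look in code (the frequency threshold, lookup scheme, and names below are assumptions, not the released implementation):

```python
# Frequent bigrams get dedicated trainable edge parameters; all low-frequency or
# unseen bigrams fall back to one fixed, non-updated matrix and bias.
import torch

d_model = 32
min_count = 5                                              # hypothetical frequency threshold
bigram_counts = {(17, 256): 42, (256, 3): 7, (3, 99): 1}   # toy corpus statistics

dedicated = {
    pair: (torch.nn.Parameter(torch.randn(d_model, d_model) * 0.02),
           torch.nn.Parameter(torch.zeros(d_model)))
    for pair, count in bigram_counts.items() if count >= min_count
}
shared_W = torch.randn(d_model, d_model)                   # fixed; excluded from the optimizer
shared_b = torch.zeros(d_model)

def edge_params(u: int, v: int):
    """Return the (W, b) pair used for the edge u -> v."""
    return dedicated.get((u, v), (shared_W, shared_b))
```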
+
+ 
+
+
+ ## Case Study
+
+ ### Chinese Examples
+ 
+
+
+ ### English Examples
+ 
+
+
+ ## Comparison: Traditional LLMs vs BriLLM
+
+ 

## Installation
+
```bash
pip install torch
```

+ ## Model Checkpoints
+
[BriLLM0.5](https://huggingface.co/BriLLM/BriLLM0.5)

+ ## Training
+
### BriLLM-Chinese
```bash
bash run_zh.sh

```

## Inference
+
### BriLLM-Chinese
```python
import json

print(decode_sentence)
```

### BriLLM-English
```python
import json

    decode_sentence = decode_tuple_list[0][0] + "".join([p[-1] for p in decode_tuple_list])
    return decode_sentence

with open("./vocab_wiki_4k_en.json") as f:
    node_dict = json.load(f)
vocab = Vocab.from_node_dict(node_dict)

figs/fig1.png
CHANGED

figs/fig2.png
CHANGED

figs/fig3.png
CHANGED

figs/fig4.png
CHANGED

figs/fig5.png
CHANGED

figs/tab1.png
ADDED

figs/{fig6.png → tab2.png}
RENAMED
File without changes

figs/tab3.png
ADDED

figs/tab4.png
ADDED

figs/tab5.png
ADDED