brillm05 committed on
Commit
a075c80
·
1 Parent(s): 9761729

update readme

README.md CHANGED
@@ -4,72 +4,104 @@ We release BriLLM-Chinese and BriLLM-English.
4
 
5
  Our paper: https://arxiv.org/pdf/2503.11299
6
 
 
 
7
  Our huggingface: https://huggingface.co/BriLLM/BriLLM0.5
8
 
9
 
10
  ## Overview
11
- This work introduces the first brain-inspired large language model (BriLLM). It is a non-Transformer, non-GPT generative language model that departs from the traditional input-output controlled machine-learning paradigm. The model is based on the Signal Fully-connected Flowing (SiFu) definition over a directed graph of neurons, and every node on the graph of the whole model is interpretable, unlike traditional machine-learning models, which offer only limited interpretability at the input and output ends.
12
 
 
 
 
 
 
 
 
 
13
 
14
- ## SiFu Mechanism
15
  ![](./figs/fig1.png)
16
- > As shown in Figure 1, the SiFu model is a graph composed of multiple nodes, which are sparsely activated and use tensors to transmit a nominal signal.
17
- Each node (ideally, a layer of neurons) represents a certain concept or word, e.g., a noun, a verb, etc.
18
- Each edge models the relationship between every node pair.
19
- The signal is transmitted according to the magnitude of its energy. The energy is strengthened, i.e., maximized, when the signal follows the right route; at the least, the right path always keeps the maximal energy for the transmitted signal.
20
- Each node is activated in sequence according to the maximized energy.
21
- The route, or path, is determined in a competitive way, i.e., the next node is activated only if the energy delivered to it is maximal.
22
 
 
 
 
 
 
 
23
 
24
- ## Architecture
25
  ![](./figs/fig2.png)
26
- > As shown in Figure 2, BriLLM implements the SiFu neural network for language modeling.
27
- Each token in the vocabulary is modeled as a node, which is defined by a hidden layer of neurons in the neural network.
28
 
 
 
 
 
 
 
 
 
 
 
 
29
 
30
- ## Training Network
31
  ![](./figs/fig3.png)
32
- > To train BriLLM on a sample, we build an individual common neural network each time and perform regular BP training. This network consists of two parts: the front part connects all input nodes (i.e., tokens), and it is followed by the rear part, which connects all possible paths in order. Finally, a softmax layer collects all paths' energy tensors to indicate the right path against a 0-1 ground-truth vector. We adopt a cross-entropy loss for training.
33
 
 
 
 
34
 
35
- ## Dataset
36
- We use a subset of the Chinese version of Wikipedia, which contains over 100M Chinese characters. We truncate long sentences into shorter sentences with a maximum length of 16.
37
- We select a vocabulary of 4,000 tokens consisting of the most frequently used Chinese characters.
38
 
 
 
 
 
39
 
40
- ## Implementation Details
41
- BriLLM is implemented using PyTorch.
42
- It uses sinusoidal positional encoding, GeLU as the activation function, cross-entropy loss for next-token prediction, and an embedding size of $d_{model} = 32$.
43
- We used the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.
44
- The model size is about $512 + 4000 \times 4000 \times (32 \times 32 + 32) \approx 16\mathrm{B}$.
45
- We trained our models on one machine with 8 NVIDIA A800 GPUs for 1.5k steps.
46
  ![](./figs/fig4.png)
47
 
 
48
 
49
- ## Complexity
50
- $n$ is the sequence length, $v$ is the vocabulary size, and $d$ is the representation dimension. The computational complexity is $O(n \cdot v \cdot d^2)$.
51
 
 
 
 
52
 
53
- ## Case Study
54
  ![](./figs/fig5.png)
55
- ![](./figs/fig7.png)
56
 
57
 
58
- ## Comparison of LLM and BriLLM
59
- ![](./figs/fig6.png)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
60
 
61
 
62
  ## Installation
 
63
  ```bash
64
  pip install torch
65
  ```
66
 
67
 
68
- ## Checkpoint
 
69
  [BriLLM0.5](https://huggingface.co/BriLLM/BriLLM0.5)
70
 
71
 
72
- ## Train
 
73
  ### BriLLM-Chinese
74
  ```bash
75
  bash run_zh.sh
@@ -81,8 +113,8 @@ bash run_en.sh
81
  ```
82
 
83
 
84
-
85
  ## Inference
 
86
  ### BriLLM-Chinese
87
  ```python
88
  import json
@@ -121,7 +153,6 @@ decode_sentence = decode_tuple_list[0][0] + "".join([p[-1] for p in decode_tuple
121
  print(decode_sentence)
122
  ```
123
 
124
-
125
  ### BriLLM-English
126
  ```python
127
  import json
@@ -141,7 +172,6 @@ def decode_en_sentence(head, max_token=32, do_sample=False):
141
  decode_sentence = decode_tuple_list[0][0] + "".join([p[-1] for p in decode_tuple_list])
142
  return decode_sentence
143
 
144
-
145
  with open("./vocab_wiki_4k_en.json") as f:
146
  node_dict = json.load(f)
147
  vocab = Vocab.from_node_dict(node_dict)
 
4
 
5
  Our paper: https://arxiv.org/pdf/2503.11299
6
 
7
+ Our GitHub: https://github.com/brillm05/BriLLM0.5
8
+
9
  Our huggingface: https://huggingface.co/BriLLM/BriLLM0.5
10
 
11
 
12
  ## Overview
 
13
 
14
+ BriLLM redefines the foundations of generative language modeling by departing from Transformer architectures, GPT frameworks, and traditional input-output constrained paradigms. Built on the Signal Fully-connected Flowing (SiFu) mechanism, a directed graph-based neural network design, BriLLM enables full interpretability across all nodes, in contrast to conventional models, which are limited to input-output interpretability.
15
+
16
+ In this framework, tokens are represented as graph nodes, with signal flows (either randomly initialized or user-defined) propagating along paths following a "least resistance" principle. The next token to be generated emerges as the target of this signal flow. Theoretically, BriLLM supports infinitely long n-gram modeling, with model size decoupled from input and prediction length. Its signal propagation dynamics mimic human-like cognitive patterns, enabling recall activation and inherent multi-modal compatibility.
17
+
18
+ ![](./figs/tab1.png)
19
+
20
+
21
+ ## SiFu Mechanism
22
 
 
23
  ![](./figs/fig1.png)
 
 
 
 
 
 
24
 
25
+ The SiFu (Signal Fully-connected Flowing) mechanism addresses fundamental limitations of current machine learning frameworks. Unlike traditional models that process discrete input streams through opaque computations, SiFu operates on a fully connected directed graph where:
26
+
27
+ - Each node represents an interpretable unit (token, concept, etc.)
28
+ - Signal tensors propagate through the graph following energy dynamics
29
+ - The next token is determined by maximizing signal energy
30
+ - All nodes can serve as both input and output interfaces
31
 
 
32
  ![](./figs/fig2.png)
 
 
33
 
34
+ Signal propagation follows the principle:
35
+ $v_i = \arg\max_{v'} \left\| r \oplus v_1 \otimes e_{12} \oplus v_2 \ldots \oplus v' \right\|$
36
+
37
+ where $\oplus$ and $\otimes$ denote tensor operations for node and edge interactions, and $\|\cdot\|$ represents signal energy.
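+
+ As a concrete illustration of this principle, the next node can be picked by propagating the current signal across every outgoing edge and keeping the candidate with the largest energy. The sketch below is a toy example under that reading; the candidate set, the `signal_energy` helper, and the random edge matrices are illustrative, not the released implementation.
+
+ ```python
+ import torch
+
+ # Toy energy-maximizing selection (illustrative, not the released code):
+ # propagate the current signal across each candidate edge and activate the
+ # candidate node whose propagated signal carries the largest energy (L2 norm).
+ d_node = 4
+ signal = torch.ones(d_node)                      # current signal tensor at node u
+ candidate_edges = {                              # hypothetical edges u -> v'
+     "cat": torch.randn(d_node, d_node),
+     "dog": torch.randn(d_node, d_node),
+     "car": torch.randn(d_node, d_node),
+ }
+
+ def signal_energy(edge_weight: torch.Tensor, sig: torch.Tensor) -> float:
+     """Send the signal across one edge and measure its energy (L2 norm)."""
+     return torch.linalg.norm(edge_weight @ sig).item()
+
+ # v_i = argmax over candidates v' of the propagated signal's energy
+ next_node = max(candidate_edges, key=lambda v: signal_energy(candidate_edges[v], signal))
+ print(next_node)
+ ```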
38
+
39
+ Overall, SiFu's design as a directed fully connected graph with signal propagation confers two key advantages:
40
+ 1. **Inherent full interpretability**: User-defined entities (concepts, tokens, or interpretable units) map directly to specific graph nodes;
41
+ 2. **Unbounded contextual capacity**: Prediction is framed as signal propagation through node activations. Because signals propagate freely across nodes, sequence prediction naturally supports arbitrarily long contexts without increasing model size.
42
+
43
+
44
+ ## Architecture
45
 
 
46
  ![](./figs/fig3.png)
 
47
 
48
+ BriLLM implements the SiFu mechanism where each vocabulary token corresponds to a node defined by a GeLU-activated neuron layer with bias $b \in \mathbb{R}^{d_{node}}$. Edges between nodes are modeled as fully connected matrices $W_{u,v} \in \mathbb{R}^{d_{node} \times d_{node}}$, enabling bidirectional signaling.
49
+
50
+ Signal propagation begins with initial tensor $e_0 = [1, 1, \ldots, 1]^T \in \mathbb{R}^{d_{node}}$ and follows:
51
 
52
+ $e_{i+1} = \text{GeLU}(W_{u_i,u_{i+1}} e_i + b_{u_i,u_{i+1}} + PE_i)$
 
 
53
 
54
+ The final prediction maximizes the L2 norm: $v_{predict} = \arg\max_v \|E_{u,v}\|_2$
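+
+ The following is a minimal PyTorch sketch of this propagation and prediction rule, assuming a toy vocabulary; the tensor shapes follow the formulas above, but names such as `edge_weight`, `edge_bias`, and `positional_encoding` are illustrative rather than taken from the released code.
+
+ ```python
+ import math
+ import torch
+ import torch.nn.functional as F
+
+ d_node, vocab_size = 32, 5                       # toy sizes; the paper uses d_node = 32
+ # One weight matrix and bias per ordered node pair (u, v), as in W_{u,v} and b_{u,v}
+ edge_weight = torch.randn(vocab_size, vocab_size, d_node, d_node) * 0.02
+ edge_bias = torch.zeros(vocab_size, vocab_size, d_node)
+
+ def positional_encoding(pos: int, dim: int = d_node) -> torch.Tensor:
+     """Standard sinusoidal encoding (assumed form of the PE term)."""
+     pe = torch.zeros(dim)
+     div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
+     pe[0::2] = torch.sin(pos * div)
+     pe[1::2] = torch.cos(pos * div)
+     return pe
+
+ def propagate(path: list) -> torch.Tensor:
+     """Run e_0 = [1, ..., 1] along a path of node ids, applying
+     e_{i+1} = GeLU(W_{u_i, u_{i+1}} e_i + b_{u_i, u_{i+1}} + PE_i)."""
+     e = torch.ones(d_node)
+     for i, (u, v) in enumerate(zip(path[:-1], path[1:])):
+         e = F.gelu(edge_weight[u, v] @ e + edge_bias[u, v] + positional_encoding(i))
+     return e
+
+ # Prediction: extend the context path to every candidate v and keep the largest L2 norm
+ path = [0, 3]                                    # toy context u_0 -> u_1
+ energies = torch.stack([torch.linalg.norm(propagate(path + [v]), ord=2) for v in range(vocab_size)])
+ v_predict = int(torch.argmax(energies))
+ print(v_predict)
+ ```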
55
+
56
+
57
+ ## Training Network
58
 
 
 
 
 
 
 
59
  ![](./figs/fig4.png)
60
 
61
+ Training BriLLM involves constructing a dedicated neural network for each sequence sample. The network connects input nodes sequentially, with all potential paths integrated into a final softmax layer that identifies the correct path via cross-entropy loss optimization.
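+
+ A schematic single training step under this description might look like the sketch below, reusing the energy idea above with trainable edge parameters; the helper names, toy sizes, and the omission of positional encoding are simplifications, not the released training loop.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Schematic training step (illustrative): score every candidate continuation path by its
+ # signal energy, softmax over all paths' energies, and apply cross-entropy against the
+ # index of the correct next token (the 0-1 ground-truth vector).
+ d_node, vocab_size = 32, 5
+ edge_weight = torch.nn.Parameter(torch.randn(vocab_size, vocab_size, d_node, d_node) * 0.02)
+ edge_bias = torch.nn.Parameter(torch.zeros(vocab_size, vocab_size, d_node))
+ optimizer = torch.optim.AdamW([edge_weight, edge_bias], lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
+
+ def path_energy(path):
+     """Propagate e_0 = [1, ..., 1] along the path and return the signal's L2 norm."""
+     e = torch.ones(d_node)
+     for u, v in zip(path[:-1], path[1:]):
+         e = F.gelu(edge_weight[u, v] @ e + edge_bias[u, v])   # PE omitted for brevity
+     return torch.linalg.norm(e, ord=2)
+
+ context, gold_next = [0, 3], 2                   # toy sample: context node ids and gold target
+ energies = torch.stack([path_energy(context + [v]) for v in range(vocab_size)])
+ loss = F.cross_entropy(energies.unsqueeze(0), torch.tensor([gold_next]))
+ optimizer.zero_grad()
+ loss.backward()
+ optimizer.step()
+ ```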
62
 
 
 
63
 
64
+ ## Implementation Details
65
+
66
+ BriLLM is implemented in PyTorch. It uses sinusoidal positional encoding, GeLU as the activation function, cross-entropy loss for next-token prediction, and an embedding size of $d_{model} = 32$. We use the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The model size is about $512 + 4000 \times 4000 \times (32 \times 32 + 32) \approx 16\mathrm{B}$. We trained our models on one machine with 8 NVIDIA A800 GPUs for 1.5k steps.
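+
+ The reported size can be checked with the arithmetic below: one $32 \times 32$ weight matrix plus a 32-dimensional bias for every ordered pair of the 4,000 vocabulary nodes, plus the 512 extra parameters from the formula above.
+
+ ```python
+ # Arithmetic behind the reported model size (the 512 term is taken as given above).
+ vocab_size, d_model = 4000, 32
+ params = 512 + vocab_size * vocab_size * (d_model * d_model + d_model)
+ print(params)          # 16,896,000,512 -> roughly 16B parameters
+ ```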
67
 
 
68
  ![](./figs/fig5.png)
 
69
 
70
 
71
+ BriLLM leverages sparse token co-occurrence: most bigrams are low-frequency or absent, allowing inactive edges to share parameters. Low-frequency bigrams use a fixed, non-updated matrix, reducing the model size to 2B (Chinese) and 1B (English), i.e., 13.0% and 5.7% of the original size, respectively. This cuts parameters by roughly 90% while accelerating training.
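+
+ A minimal sketch of this sparsification idea is given below, assuming a bigram-frequency table; the threshold value and the single shared frozen matrix are illustrative choices, not the released implementation.
+
+ ```python
+ import torch
+
+ # Illustrative edge sharing: frequent bigrams keep their own trainable edge matrix,
+ # while all low-frequency or unseen bigrams share one fixed, never-updated matrix.
+ d_node = 32
+ shared_edge = torch.randn(d_node, d_node)        # frozen: excluded from the optimizer
+
+ bigram_counts = {("the", "cat"): 120, ("cat", "sat"): 95, ("sat", "moon"): 1}
+ min_count = 10                                   # hypothetical frequency threshold
+
+ trainable_edges = {
+     pair: torch.nn.Parameter(torch.randn(d_node, d_node) * 0.02)
+     for pair, count in bigram_counts.items() if count >= min_count
+ }
+
+ def edge_for(u: str, v: str) -> torch.Tensor:
+     """Dedicated matrix for frequent bigrams, shared frozen matrix otherwise."""
+     return trainable_edges.get((u, v), shared_edge)
+
+ print(edge_for("the", "cat") is shared_edge)     # False: frequent bigram has its own matrix
+ print(edge_for("sat", "moon") is shared_edge)    # True: rare bigram falls back to the shared one
+ ```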
72
+
73
+ ![](./figs/tab2.png)
74
+
75
+
76
+ ## Case Study
77
+
78
+ ### Chinese Examples
79
+ ![](./figs/tab3.png)
80
+
81
+
82
+ ### English Examples
83
+ ![](./figs/tab4.png)
84
+
85
+
86
+ ## Comparison: Traditional LLMs vs. BriLLM
87
+
88
+ ![](./figs/tab5.png)
89
 
90
 
91
  ## Installation
92
+
93
  ```bash
94
  pip install torch
95
  ```
96
 
97
 
98
+ ## Model Checkpoints
99
+
100
  [BriLLM0.5](https://huggingface.co/BriLLM/BriLLM0.5)
101
 
102
 
103
+ ## Training
104
+
105
  ### BriLLM-Chinese
106
  ```bash
107
  bash run_zh.sh
 
113
  ```
114
 
115
 
 
116
  ## Inference
117
+
118
  ### BriLLM-Chinese
119
  ```python
120
  import json
 
153
  print(decode_sentence)
154
  ```
155
 
 
156
  ### BriLLM-English
157
  ```python
158
  import json
 
172
  decode_sentence = decode_tuple_list[0][0] + "".join([p[-1] for p in decode_tuple_list])
173
  return decode_sentence
174
 
 
175
  with open("./vocab_wiki_4k_en.json") as f:
176
  node_dict = json.load(f)
177
  vocab = Vocab.from_node_dict(node_dict)
figs/fig1.png CHANGED

Git LFS Details

  • SHA256: aad52743f47491086df89c0a0e9b0580486c887c2ab2a51a24fdec85996e6a29
  • Pointer size: 131 Bytes
  • Size of remote file: 166 kB

Git LFS Details

  • SHA256: 6d27cce07e87a51ecd95db564950876373a59fb3c46bfb917e6da46575a6b364
  • Pointer size: 131 Bytes
  • Size of remote file: 666 kB
figs/fig2.png CHANGED

Git LFS Details

  • SHA256: b56d2073ae7c6857ac24caef7a8f777403293fec5b408ca8b793141743d91a1a
  • Pointer size: 130 Bytes
  • Size of remote file: 99.7 kB

Git LFS Details

  • SHA256: 06d32b2e97931839ab59f580435ff1cf6f2f376c048e266e8788fb8ad27d0c56
  • Pointer size: 131 Bytes
  • Size of remote file: 329 kB
figs/fig3.png CHANGED

Git LFS Details

  • SHA256: 4fe198022938e2e6861d49181b2bbab93268da92f02e4621c8a4b40914755750
  • Pointer size: 131 Bytes
  • Size of remote file: 206 kB

Git LFS Details

  • SHA256: ef4218c09d8fb676f8f27d0adc66731e102d4d662b5285a1b3bf469b91eb87ec
  • Pointer size: 131 Bytes
  • Size of remote file: 204 kB
figs/fig4.png CHANGED

Git LFS Details

  • SHA256: 8e09cbb72307bd4a0a19f381f82bb56a774eac8491606b3c3bf614c9db19663c
  • Pointer size: 130 Bytes
  • Size of remote file: 65.5 kB

Git LFS Details

  • SHA256: 8db39c0eb9adcb319d3c18b51a706ce0390721e8dba092a12058f778f337664f
  • Pointer size: 131 Bytes
  • Size of remote file: 389 kB
figs/fig5.png CHANGED

Git LFS Details

  • SHA256: fe31c93e0d0fcd1e753fd9189ce3c2dd4649e7762e8ae947f6d7f30e9f3b8115
  • Pointer size: 131 Bytes
  • Size of remote file: 354 kB

Git LFS Details

  • SHA256: 2559ce2b3de7690976113036d916ea438ffc091172bdf757090e911b2ec20e76
  • Pointer size: 131 Bytes
  • Size of remote file: 274 kB
figs/tab1.png ADDED

Git LFS Details

  • SHA256: 13d7d0f761b2e57b4f81904f6c354d9a17a5081e349406495f888eb15090df4d
  • Pointer size: 131 Bytes
  • Size of remote file: 157 kB
figs/{fig6.png → tab2.png} RENAMED
File without changes
figs/tab3.png ADDED

Git LFS Details

  • SHA256: 40215243c140c8b1bf9add7f96e82d1e3567979adc9dcc4691fedc50267700ac
  • Pointer size: 131 Bytes
  • Size of remote file: 942 kB
figs/tab4.png ADDED

Git LFS Details

  • SHA256: 1363231bfe9e56bfe6eebed8f131370bb4725ca889c206f222f32e8859fc86fd
  • Pointer size: 132 Bytes
  • Size of remote file: 1.14 MB
figs/tab5.png ADDED

Git LFS Details

  • SHA256: d4f6b30c3b91b429bb546a4054d5ba32f8e57b883c77d225113ba5692821d81f
  • Pointer size: 131 Bytes
  • Size of remote file: 160 kB