Commit 2426a58 by VirtuoTuring · verified · 1 Parent(s): 4c04fdf

Update README.md

Files changed (1)
  1. README.md +80 -0
@@ -101,4 +101,84 @@ Virtuo 1.0. Use, modification, and redistribution, including commercial, with pres
  ## Credits
  Virtuo Turing – Artificial Intelligence, S.A. (Portugal) and Octávio Viana.
  Base © Mistral AI (Apache-2.0).
+ Website: https://justina.cloud
+
+ # Justina Clarus 24B — safetensors (v2)
+
+ Version 2. Trained for more sessions on a larger set of PT-PT (European Portuguese) Q/A pairs, keeping the focus on the CPC (Code of Civil Procedure), the CC (Civil Code), and related topics.
+
+ ## What’s new in v2
+ - More Q/A pairs and more training iterations.
+ - Improved stylistic consistency in technical and legal PT-PT.
+ - Greater robustness to question variation within the same domain.
+
+ ## Generalization and non-memorization
+ - The model does not memorize every answer verbatim. It retains general patterns and may converge on consistent formulations.
+ - It learned the format, tone, and patterns of formal PT-PT Q/A with specialized jargon (e.g., legal, technical), and it answers in that style even for questions that differ from those in the dataset.
+   Useful for: applications that need consistency with the dataset’s tone without literal reproduction. Well suited to RAG.
+ - It captures semantic and syntactic patterns of the PT-PT legal corpus. For identical or very close questions, answers tend to be accurate (>80–90% semantic equivalence, even when not verbatim).
+   Useful for: scenarios with varied questions within the same legal theme, where better generalization matters.
+
+ ## Primary uses
+ This model is a base for:
+ 1) fine-tuning to specific legal domains;
+ 2) integration in RAG pipelines;
+ 3) injecting user-supplied context (laws, interpretations) at prompt time to compose legal text.
+
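As an illustration of use 3), the following is a minimal sketch of how user-supplied context could be placed ahead of a question. The `Pergunta:`/`Resposta:` framing follows the prompt used in the README's usage example; the `Contexto:` label, the function name `build_prompt`, and the placeholder snippet are assumptions made for this sketch, not part of the model's documented prompt format.

```python
def build_prompt(question: str, context_snippets: list[str]) -> str:
    """Compose a prompt with optional user-supplied context before the question.

    The "Contexto:" label is a hypothetical convention chosen for this sketch.
    """
    # Drop empty or whitespace-only snippets so no blank bullets are emitted.
    cleaned = [s.strip() for s in context_snippets if s.strip()]
    parts = []
    if cleaned:
        parts.append("Contexto:")
        parts.extend(f"- {snippet}" for snippet in cleaned)
    parts.append(f"Pergunta: {question.strip()}")
    parts.append("Resposta:")
    return "\n".join(parts)


if __name__ == "__main__":
    # The snippet below is a placeholder, not real statutory text.
    print(build_prompt(
        "Indique as espécies de ações no art. 10.º, n.º 1 CPC.",
        ["<texto do artigo fornecido pelo utilizador>"],
    ))
```

The returned string can be passed to the tokenizer in place of a hand-written prompt. Whether this layout is the best one for the model is untested here.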
+ ## Legal coverage
+ - CPC: types of actions, procedures, jurisdiction, appeals, res judicata.
+ - CC: obligations, tort, personality rights.
+ - Reinforcements: abuse of rights; family and minors; sexual freedom; popular actions.
+ - Additional context (selective): CPP, CP, CRP, ECHR, GDPR (+ Law 58/2019), Law 112/2009, Law 130/2015, LAV, Reg. (EU) 1215/2012, Reg. (EU) 2019/1111, Dir. 2004/25/EC, CVM 194/196.
+
+ ## Data and schema
+ Training data is JSONL: one self-contained Q/A pair per line.
+
+ Schema
+ {"messages":[{"role":"user","content":"…"},{"role":"assistant","content":"…"}]}
+
+ Examples
+ {"messages":[{"role":"user","content":"Quais são as espécies fundamentais de ações no Direito Processual Civil segundo o artigo 10.º, n.º 1 CPC?"},{"role":"assistant","content":"Segundo o artigo 10.º, n.º 1 CPC, distinguem-se duas espécies fundamentais de ações no esquema do processo civil."}]}
+ {"messages":[{"role":"user","content":"O que caracteriza a ação declarativa?"},{"role":"assistant","content":"A ação declarativa pode ser de simples apreciação, de condenação ou constitutiva (art. 10.º, n.os 2 e 3 CPC), pedindo-se em cada subespécie providências distintas."}]}
+
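When preparing additional pairs for fine-tuning, it may help to check each JSONL line against the schema above before training. The sketch below assumes every record holds exactly one `user` message followed by one `assistant` message, as in the examples shown. The function name `validate_record` is hypothetical, and the real dataset may allow other shapes.

```python
import json


def validate_record(line: str) -> bool:
    """Return True if a JSONL line matches the user/assistant pair schema."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    # Assumption: exactly two messages, user first and assistant second.
    if not isinstance(messages, list) or len(messages) != 2:
        return False
    for message, expected_role in zip(messages, ("user", "assistant")):
        if not isinstance(message, dict):
            return False
        if message.get("role") != expected_role:
            return False
        content = message.get("content")
        # Content must be a non-empty string.
        if not isinstance(content, str) or not content.strip():
            return False
    return True


if __name__ == "__main__":
    sample = (
        '{"messages":[{"role":"user","content":"O que caracteriza a ação declarativa?"},'
        '{"role":"assistant","content":"A ação declarativa pode ser de simples apreciação, '
        'de condenação ou constitutiva."}]}'
    )
    print(validate_record(sample))
```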
+ ## Usage
+ Distributed as safetensors for use with transformers.
+
+ Python (FP16/BF16)
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+ repo = "VirtuoTuring/justina_clarus-24b-safetensors"
+ tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
+ # Prefer BF16 where the GPU supports it; otherwise fall back to FP16.
+ dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
+ model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=dtype, device_map="auto")
+ prompt = "Pergunta: Indique as espécies de ações no art. 10.º, n.º 1 CPC.\nResposta:"
+ # do_sample=True is needed for temperature and top_p to take effect.
+ out = model.generate(**tok(prompt, return_tensors="pt").to(model.device),
+                      max_new_tokens=400, do_sample=True, temperature=0.2, top_p=0.9)
+ print(tok.decode(out[0], skip_special_tokens=True))
+
+ Python 4-bit (bitsandbytes)
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+ import torch
+ repo = "VirtuoTuring/justina_clarus-24b-safetensors"
+ # NF4 with double quantization; compute in BF16 when supported, else FP16.
+ bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
+                          bnb_4bit_use_double_quant=True,
+                          bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16)
+ tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
+ model = AutoModelForCausalLM.from_pretrained(repo, quantization_config=bnb, device_map="auto")
+
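To choose between the two loading modes, a rough estimate of weight memory can help. The sketch below takes the nominal parameter count from the "24B" in the model name, which is an assumption since the exact count is not stated here. It covers weights only, so activations, the KV cache, and quantization metadata come on top of these figures.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: nominal 24e9 parameters, read off the model name.
N_PARAMS = 24e9

if __name__ == "__main__":
    for label, bits in (("BF16/FP16", 16), ("4-bit NF4", 4)):
        print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions the half-precision weights need roughly four times the memory of the 4-bit ones.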
+ ## Good practice
+ - Cite article numbers when applicable.
+ - Validate against official sources. Human review is mandatory for filings.
+ - For production, prefer low temperature and explicit token limits.
+
+ ## Limitations
+ - Context window of about 4k tokens.
+ - Not a substitute for legal professionals or courts.
+ - May miss special regimes or recent legislative changes.
+
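Because the context window is limited, injected context has to share the budget with the generated answer. The following is a minimal sketch that assumes a window of exactly 4096 tokens (the README only says "~4k") and reuses the `max_new_tokens=400` from the usage example. The helper names and the whitespace-based token counter are stand-ins for illustration. In practice the count would come from the model's tokenizer, e.g. `len(tok(text)["input_ids"])`.

```python
from typing import Callable


def prompt_budget(context_window: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt after reserving room for the answer."""
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return budget


def fit_snippets(snippets: list[str],
                 count_tokens: Callable[[str], int],
                 budget: int) -> list[str]:
    """Keep snippets in order until the next one would exceed the budget."""
    kept, used = [], 0
    for snippet in snippets:
        cost = count_tokens(snippet)
        if used + cost > budget:
            break
        kept.append(snippet)
        used += cost
    return kept


if __name__ == "__main__":
    # Assumption: 4096-token window; whitespace split is a crude stand-in counter.
    budget = prompt_budget(4096, 400)
    print(budget)
    print(fit_snippets(["a b c", "d e", "f g h i"], lambda s: len(s.split()), 5))
```

The budget here ignores the tokens used by the question and any prompt framing, which would need to be subtracted as well.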
+ ## License
+ Virtuo 1.0. Use, modification, and redistribution, including commercial use, are permitted provided that notices are preserved and Virtuo Turing – Artificial Intelligence, S.A. is referenced.
+
+ ## Credits
+ Virtuo Turing – Artificial Intelligence, S.A. (Portugal) and Octávio Viana.
+ Base © Mistral AI (Apache-2.0).
  Website: https://justina.cloud