Update README.md

README.md +80 -0

@@ -101,4 +101,84 @@ Virtuo 1.0. Uso, modificação e redistribuição, incluindo comercial, com pres
 ## Credits
 Virtuo Turing – Artificial Intelligence, S.A. (Portugal) and Octávio Viana.
 Base © Mistral AI (Apache-2.0).
+Website: https://justina.cloud
+# Justina Clarus 24B — safetensors (v2)
+Version 2. Trained for more sessions on a larger set of PT-PT (European Portuguese) Q/A pairs, keeping the focus on the CPC (Code of Civil Procedure), the CC (Civil Code), and related topics.
+## What’s new in v2
+- More Q/A pairs and more training iterations.
+- Improved stylistic consistency in technical and legal PT-PT.
+- Greater robustness to variations in question wording within the same domain.
+## Generalization and non-memorization
+- The model does not memorize every answer verbatim. It retains general patterns and tends to converge on consistent formulations.
+- It learned the format, tone, and patterns of formal PT-PT Q/A with specialized jargon (e.g., legal, technical), and it answers in that style even for questions that differ from those in the dataset.
+  Useful for: applications that need consistency with the dataset's tone without literal reproduction. Well suited to RAG.
+- It captures semantic and syntactic patterns of the PT-PT legal corpus. For questions identical or very close to those in the dataset, answers tend to be accurate (>80–90% semantic equivalence, even when not verbatim).
+  Useful for: scenarios with varied questions within the same legal theme, where generalization matters.
+## Primary uses
+This model is a base for:
+1) fine-tuning to specific legal domains;
+2) integration into RAG pipelines;
+3) injecting user-supplied context at prompt time (laws, interpretations) to compose legal text.
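Item 3 above can be sketched as a small prompt builder. This is an illustration only, not the authors' method: the `build_prompt` helper and the `Contexto:` header are hypothetical, while the `Pergunta:` / `Resposta:` framing follows the usage example in this README. The passage text shown is a placeholder for whatever the user supplies.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Compose a prompt that places user-supplied legal text before the question."""
    # Number the passages so an answer can refer back to them.
    # This layout is an assumption, not a format the model is documented to expect.
    context = "\n".join(f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1))
    parts = []
    if context:
        parts.append(f"Contexto:\n{context}")
    parts.append(f"Pergunta: {question.strip()}")
    parts.append("Resposta:")
    return "\n".join(parts)


prompt = build_prompt(
    "Indique as espécies de ações no art. 10.º, n.º 1 CPC.",
    ["Artigo 10.º CPC (texto fornecido pelo utilizador)."],
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of a bare question.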
+## Legal coverage
+- CPC: types of actions, procedures, jurisdiction, appeals, res judicata.
+- CC: obligations, tort, personality rights.
+- Reinforcements: abuse of rights; family and minors; sexual freedom; popular actions.
+- Additional context (selective): CPP, CP, CRP, ECHR, GDPR (+ Law 58/2019), Law 112/2009, Law 130/2015, LAV, Reg. (EU) 1215/2012, Reg. (EU) 2019/1111, Dir. 2004/25/EC, CVM 194/196.
+## Data and schema
+Training data is stored as JSONL, with one self-contained Q/A pair per line.
+Schema
+{"messages":[{"role":"user","content":"…"},{"role":"assistant","content":"…"}]}
+Examples
+{"messages":[{"role":"user","content":"Quais são as espécies fundamentais de ações no Direito Processual Civil segundo o artigo 10.º, n.º 1 CPC?"},{"role":"assistant","content":"Segundo o artigo 10.º, n.º 1 CPC, distinguem-se duas espécies fundamentais de ações no esquema do processo civil."}]}
+{"messages":[{"role":"user","content":"O que caracteriza a ação declarativa?"},{"role":"assistant","content":"A ação declarativa pode ser de simples apreciação, de condenação ou constitutiva (art. 10.º, n.os 2 e 3 CPC), pedindo-se em cada subespécie providências distintas."}]}
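A record in this schema can be checked before it is added to a fine-tuning set. The sketch below uses only the standard library; `validate_record` is a hypothetical helper, and the rule that each record holds exactly one user turn followed by one assistant turn is an assumption read off the examples above.

```python
import json


def validate_record(line: str) -> dict:
    """Parse one JSONL line and check it is a single user/assistant pair."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    messages = record.get("messages")
    # Assumption: exactly two messages per record, as in the examples.
    if not isinstance(messages, list) or len(messages) != 2:
        raise ValueError("expected exactly two messages")
    if not all(isinstance(m, dict) for m in messages):
        raise ValueError("each message must be a JSON object")
    roles = [m.get("role") for m in messages]
    if roles != ["user", "assistant"]:
        raise ValueError(f"unexpected roles: {roles}")
    for m in messages:
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            raise ValueError("content must be a non-empty string")
    return record


sample = json.dumps(
    {
        "messages": [
            {"role": "user", "content": "O que caracteriza a ação declarativa?"},
            {"role": "assistant", "content": "A ação declarativa pode ser de simples apreciação, de condenação ou constitutiva."},
        ]
    },
    ensure_ascii=False,  # keep accented characters readable in the JSONL file
)
print(validate_record(sample)["messages"][0]["role"])
```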
+## Usage
+The weights are distributed in safetensors format for use with the Hugging Face transformers library.
+Python (FP16/BF16)
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+repo = "VirtuoTuring/justina_clarus-24b-safetensors"
+tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
+# Prefer BF16 where the GPU supports it; otherwise fall back to FP16.
+dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
+model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=dtype, device_map="auto")
+prompt = "Pergunta: Indique as espécies de ações no art. 10.º, n.º 1 CPC.\nResposta:"
+inputs = tok(prompt, return_tensors="pt").to(model.device)
+# do_sample=True is required for temperature and top_p to take effect.
+out = model.generate(**inputs, max_new_tokens=400, do_sample=True,
+                     temperature=0.2, top_p=0.9)
+print(tok.decode(out[0], skip_special_tokens=True))
+Python 4-bit (bitsandbytes)
+from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+import torch
+repo = "VirtuoTuring/justina_clarus-24b-safetensors"
+# NF4 with double quantization; compute in BF16 where supported, else FP16.
+compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
+bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
+                         bnb_4bit_use_double_quant=True,
+                         bnb_4bit_compute_dtype=compute_dtype)
+tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
+model = AutoModelForCausalLM.from_pretrained(repo, quantization_config=bnb, device_map="auto")
+## Good practice
+- Cite article numbers when applicable.
+- Validate against official sources. Human review is mandatory for filings.
+- For production, prefer low temperature and explicit token limits.
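The first practice above can be partly automated by checking whether a generated answer cites an article at all. The following is a rough sketch, not part of the model's tooling: `cited_articles` and its pattern are hypothetical and cover only the `artigo N.º` and `art. N.º` forms seen in the examples in this README. It does not replace validation against official sources.

```python
import re

# Matches "artigo 10.º" and "art. 10.º"; plural forms and other
# citation styles are deliberately not covered by this sketch.
ARTICLE_RE = re.compile(r"\bart(?:igo|\.)?\s*(\d+)\.º", re.IGNORECASE)


def cited_articles(text: str) -> list[str]:
    """Return the article numbers cited in text, in order of appearance."""
    return ARTICLE_RE.findall(text)


answer = (
    "A ação declarativa pode ser de simples apreciação, de condenação "
    "ou constitutiva (art. 10.º, n.os 2 e 3 CPC)."
)
if not cited_articles(answer):
    print("No article citation found; flag for review.")
else:
    print("Cited articles:", cited_articles(answer))
```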
+## Limitations
+- Context window ~4k tokens.
+- Not a substitute for legal professionals or courts.
+- May miss special regimes or recent legislative changes.
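Given the context window noted above, a long prompt may need trimming so that the prompt plus the generated tokens still fit. The sketch below works on a plain list of token IDs; `fit_to_window` is a hypothetical helper, and the exact figure of 4096 is an assumption, since this README only states roughly 4k.

```python
def fit_to_window(
    token_ids: list[int],
    max_context: int = 4096,   # assumed exact value for the "~4k" window
    max_new_tokens: int = 400,  # matches the generate() call in Usage
) -> list[int]:
    """Trim a tokenized prompt so prompt plus generation fits the window."""
    budget = max_context - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Keep the tail: in a "Pergunta: ... Resposta:" prompt the question
    # sits at the end, so the oldest context is dropped first.
    return token_ids[-budget:] if len(token_ids) > budget else token_ids


ids = list(range(5000))  # stand-in for tok(prompt)["input_ids"]
trimmed = fit_to_window(ids)
print(len(trimmed))
```

Trimming from the front also drops any leading special token, so in practice it may be preferable to shorten the injected context before tokenizing.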
+## License
+Virtuo 1.0. Use, modification, and redistribution, including commercial, with notices preserved and reference to Virtuo Turing – Artificial Intelligence, S.A.
+## Credits
+Virtuo Turing – Artificial Intelligence, S.A. (Portugal) and Octávio Viana.
+Base © Mistral AI (Apache-2.0).
 Website: https://justina.cloud