Update README.md

c1cf5c9 verified about 1 month ago

6.07 kB

metadata

language:
  - en
  - de
license: apache-2.0
tags:
  - text-generation
  - social-group-identification
  - span-extraction
  - information-extraction
  - qwen3
base_model: Qwen/Qwen3-4B
pipeline_tag: text-generation

Social Group Identification Models

A family of fine-tuned Qwen3 models for extracting social group mentions from text in English and German. These models identify human collectives characterized by shared attributes (professional, demographic, role-based, etc.) and return structured spans following precise extraction rules.

Supported groups include: teachers, students, doctors, children, parents, patients, diabetics, single parents, colleagues, and any other human collective with identifiable shared properties.

Excluded: Named individuals, organizations/institutions, non-humans, quantifiers alone.

Usage

Task Prompt

Use the following prompt with your input text, always appending /no_think at the end:

Click to expand full prompt

## Task: Identify Social Groups in Sentences
**Definition**: A social group is a collection of people characterized by shared attributes. Extract human groups that are plural or generic singular representing a category.

### Core Rules

#### 1. Social Groups Include:
**Any human collective** characterized by shared attributes (plural or generic singular representing a category), such as:
- **Professional/occupational**: "Lehrkräfte" / "teachers", "Ärzte" / "doctors", "Studenten" / "students"
- **Demographic**: "Kinder" / "children", "Jugendliche" / "teenagers", "Senioren" / "seniors"
- **Role-based**: "Eltern" / "parents", "Patienten" / "patients", "Kunden" / "customers"
- **Characteristic-based**: "Diabetiker" / "diabetics", "Alleinerziehende" / "single parents"
- **Social/relational**: "Freunde" / "friends", "Nachbarn" / "neighbors", "Kollegen" / "colleagues"
- **Any other human group** with identifiable shared properties

*Note: Generic singular forms are included when they represent the category, not specific individuals.*

#### 2. Boundary Cases to Exclude:
- **Organizations/institutions**: "Unilever", "NASA", "Harvard", "Bundestag", "SPD" (entities, not groups)
- **Named individuals**: "Angela Merkel", "John Smith" (specific persons)
- **Non-humans**: "Hunde" / "dogs", "Roboter" / "robots" (not human groups)
- **Quantifiers alone**: "alle" / "all", "viele" / "many", "einige" / "some" (not group identifiers)
- **Articles and Numeralia**: "der" / "the", "hundert" / "hundreds" (no relevant attribute)

#### 3. Span Extraction Rules:
**Longest Valid Span Principle**: Extract the complete descriptive phrase that defines the social group.

**Include Essential Modifiers**:
- Descriptive attributes: "ältere Alumni" / "older alumni"
- Professional specifications: "erfahrene Chirurgen" / "experienced surgeons"
- Demographic details: "Kinder unter 12 Jahren" / "children under 12"
- Complex descriptions: "Menschen mit chronischen Erkrankungen" / "people with chronic illnesses"

**Personal Experience Exclusion**: Remove parts that define groups only through speaker's personal relationship:
- "Kollegen, die mich mobben" → "Kollegen" / "colleagues who bully me" → "colleagues"
- "Leute mit ähnlichen Erfahrungen" → "Leute" / "people with similar experiences" → "people"

**Coordination Handling**:
- **Separate groups**: "Männer und Frauen" → [Männer || Frauen]
- **Different attributes**: "junge Ärzte und erfahrene Krankenschwestern" → [junge Ärzte || erfahrene Krankenschwestern]
- **Shared attributes**: "kleine Jungen und Mädchen" → [kleine Jungen und Mädchen]

#### 4. Extraction Guidelines:
**Syntactic Position Independence**: Extract groups regardless of grammatical role (subject, object, prepositional phrase, genitive/possessive).

**Semantic Function Independence**: Extract groups regardless of semantic function (descriptive, predicative, vocative).

### Output Format:
Social Groups: [Group 1 || Group 2 || Group 3] (**Don't add any explanation**)

**Now analyze this sentence**:

⚠️ Critical: The `/no_think` Token

Always append /no_think to your prompts. This token is essential for proper output formatting and ensures the model returns only the structured response without intermediate reasoning.

Example input: [Task prompt] + "Teachers and students discussed the curriculum." + " /no_think"
Expected output: Social Groups: [Teachers || students]

Performance Comparison

Model	Parameters	Avg F1 (All)	Avg F1 (Non-Empty)	Avg Time (s)
SocialGroupIdentification-Qwen3-0.6B-v1.2	0.6B	0.8774	0.5560	0.049
SocialGroupIdentification-Qwen3-1.7B-v1.2	1.7B	0.8792	0.5542	0.073
SocialGroupIdentification-Qwen3-4B-v1.2	4B	0.9160	0.6969	0.146
SocialGroupIdentification-Qwen3-8B-v1.2	8B	0.9175	0.6932	0.220

Test datasets: German articles, German tweets, English articles, English Reddit.

Model Details

Base Model: Qwen3 (0.6B, 1.7B, 4B, 8B variants)
Languages: English, German
Task: Span extraction for social group identification
Output Format: Structured list with || delimiter

Limitations

Optimized for German and English only
Performance varies by genre: higher on formal text (articles), lower on informal social media (tweets)

Citation

@misc{socialgroupidentification2025,
  author = {Schwager, Nils and Büttner, Jonas and Jügens, Pascal},
  title = {Social Group Identification Models},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/collections/nsschw/socialgroupidentification-68e3cae684fb332790c3a52b}
}