---
license: mit
datasets:
- cnmoro/LexicalTriplets
language:
- en
- pt
pipeline_tag: feature-extraction
library_name: sentence-transformers
---
|
|
|
|
|
This model was trained on [cnmoro/LexicalTriplets](https://huggingface.co/datasets/cnmoro/LexicalTriplets) to produce **lexical** embeddings (not semantic!).

It can be used to compute lexical similarity between words or phrases.
|
|
|
|
|
Concept:

- "Some text" will be similar to "Sm txt"
- "King" will **not** be similar to "Queen" or "Royalty"
- "Dog" will **not** be similar to "Animal"
- "Doge" will be similar to "Dog"
|
|
|
|
|
```python
import re
import unicodedata

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "cnmoro/LexicalEmbed-Base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

def preprocess(text):
    # Strip accents: decompose characters (NFD), then drop combining marks.
    text = unicodedata.normalize('NFD', text)
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    # Lowercase, replace punctuation with spaces, then collapse whitespace.
    text = re.sub(r'[^\w\s]+', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

texts = ["hello world", "hel wor"]
texts = [preprocess(s) for s in texts]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # The custom model code (trust_remote_code) returns one embedding per input text.
    embeddings = model(**inputs)

cosine_sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine Similarity: {cosine_sim.item()}")  # 0.8966174125671387
```
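
The card lists `library_name: sentence-transformers`, so the model should also be loadable through that library. A minimal sketch, assuming the repository ships a sentence-transformers configuration and reusing the `preprocess` helper from above:

```python
from sentence_transformers import SentenceTransformer, util

# Assumption: the repo provides a sentence-transformers config; the custom
# modeling code is pulled in via trust_remote_code.
st_model = SentenceTransformer("cnmoro/LexicalEmbed-Base", trust_remote_code=True)

embeddings = st_model.encode(
    [preprocess("hello world"), preprocess("hel wor")],
    convert_to_tensor=True,
)
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```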