nvidia/Plan2Align-NV

KuangDW committed on Apr 15

Commit 8dfab00

1 parent: 2e5836c

add embed.sh and cython file

Files changed (4)

laser/.gitignore +0 -1
laser/tasks/embed/README.md +44 -0
laser/tasks/embed/embed.sh +79 -0
vecalign/.gitignore +1 -2

laser/.gitignore CHANGED

@@ -3,7 +3,6 @@ source/lib/__pycache__
 models
 tools-external
 tasks/mldoc/MLDoc
-embed
 tasks/bucc/downloaded
 tasks/similarity/dev/
 tasks/xnli/XNLI-1.0*

laser/tasks/embed/README.md ADDED

	@@ -0,0 +1,44 @@

+# LASER: calculation of sentence embeddings
+Tool to calculate sentence embeddings for an arbitrary text file:
+```
+bash ./embed.sh INPUT-FILE OUTPUT-FILE [LANGUAGE]
+```
+The input is first tokenized, and then sentence embeddings are generated. If a `language` is specified,
+`embed.sh` looks for a language-specific LASER3 encoder named `{model_dir}/laser3-{language}.v{version}.pt`.
+Otherwise it defaults to LASER2, which covers the same 93 languages as [the original LASER encoder](https://arxiv.org/pdf/1812.10464.pdf).
+**NOTE:** please set the model location (`model_dir` in `embed.sh`) before running. We recommend downloading the models from the NLLB
+release (see [here](/nllb/README.md)). Optionally, you can also select the model version number for downloaded LASER3 models. This currently defaults to `1` (initial release).
+## Output format
+The embeddings are stored in float32 matrices in raw binary format.
+They can be read in Python by:
+```
+import numpy as np
+dim = 1024
+X = np.fromfile("my_embeddings.bin", dtype=np.float32, count=-1)
+X.resize(X.shape[0] // dim, dim)
+```
+`X` is an N x 1024 matrix, where N is the number of lines in the text file.
+## Examples
+To encode an input text in any of the 93 languages supported by LASER2 (e.g. Afrikaans, English, French):
+```
+./embed.sh input_file output_file
+```
+To use a language-specific encoder (if available), such as Wolof, Hausa, or Irish:
+```
+./embed.sh input_file output_file wol_Latn
+```
+```
+./embed.sh input_file output_file hau_Latn
+```
+```
+./embed.sh input_file output_file gle_Latn
+```
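
As a reviewer's sketch of the "Output format" section above: the raw file carries no header, so the reader has to supply the embedding width. The helper below is a minimal, slightly more defensive variant of the README snippet. It assumes the 1024-dimensional float32 layout the README states; the name `load_embeddings` is hypothetical and not part of LASER. It checks that the file size is a whole number of rows before reshaping, instead of resizing in place.

```python
import os
import tempfile

import numpy as np


def load_embeddings(path, dim=1024):
    """Load a raw float32 embedding matrix (hypothetical helper, not part of LASER)."""
    flat = np.fromfile(path, dtype=np.float32)
    # A size that is not a multiple of dim means a truncated file or a wrong dim.
    if flat.size % dim != 0:
        raise ValueError(f"{path}: {flat.size} floats is not a multiple of dim={dim}")
    return flat.reshape(-1, dim)  # one row per input line


if __name__ == "__main__":
    # Round-trip a small placeholder matrix so the sketch runs without LASER installed.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_embeddings.bin")
        np.zeros((3, 1024), dtype=np.float32).tofile(path)
        print(load_embeddings(path).shape)
```

The explicit size check is a design choice rather than something the README requires; it surfaces a mismatched `dim` as an error instead of silently dropping trailing values.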

laser/tasks/embed/embed.sh ADDED

	@@ -0,0 +1,79 @@

+#!/bin/bash
+# Copyright (c) Facebook, Inc. and its affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+#
+# LASER  Language-Agnostic SEntence Representations
+# is a toolkit to calculate multilingual sentence embeddings
+# and to use them for document classification, bitext filtering
+# and mining
+#
+# --------------------------------------------------------
+#
+# bash script to calculate sentence embeddings for arbitrary
+# text file
+#############################
+# BEGIN PARAMETERS TO SET
+#############################
+# location of models (e.g. /path/to/models); no trailing slash
+model_dir="laser"
+# version number for LASER3 models
+version=1
+#############################
+# END PARAMETERS TO SET
+#############################
+if [ -z ${model_dir} ]; then
+    echo "Please set model directory within script"
+    exit 1
+elif [ ! -d ${model_dir} ]; then
+    echo "Can't find model directory: $model_dir"
+    exit 1
+fi
+if [ -z ${LASER} ] ; then
+  echo "Please set the environment variable 'LASER'"
+  exit 1
+fi
+if [ $# -lt 2 ] ; then
+  echo "usage: embed.sh input-file output-file [language]"
+  exit 1
+fi
+infile=$1
+outfile=$2
+language=$3
+# default to laser2
+model_file=${model_dir}/laser2.pt
+spm=${model_dir}/laser2.spm
+if [ ! -z ${language} ]; then
+    model_file=${model_dir}/laser3-$language.v$version.pt
+    lang_specific_spm=${model_dir}/laser3-$language.v$version.spm
+    if [[ -s $lang_specific_spm ]]; then
+        spm=$lang_specific_spm
+    fi
+fi
+if [[ ! -s $model_file ]]; then
+    echo "couldn't find model file: $model_file"
+    exit 1
+fi
+if [[ ! -s $spm ]]; then
+    echo "couldn't find spm: $spm"
+    exit 1
+fi
+python3 ${LASER}/source/embed.py \
+    --input     ${infile}        \
+    --encoder   ${model_file}    \
+    --spm-model $spm             \
+    --output    ${outfile}       \
+    --verbose
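
The model-selection logic in `embed.sh` can be easier to review when restated outside the shell. The sketch below mirrors the fallback order visible in the script: LASER2 by default, a `laser3-{language}.v{version}.pt` encoder when a language is given, and the LASER2 SentencePiece model unless a language-specific `.spm` exists. The function names `resolve_model_files` and `_non_empty` are hypothetical and not part of LASER, and the non-empty check only approximates the shell's `-s` test for regular files.

```python
import os
import tempfile


def _non_empty(path):
    # Approximates the shell test `[[ -s path ]]` for regular files.
    return os.path.isfile(path) and os.path.getsize(path) > 0


def resolve_model_files(model_dir, language=None, version=1):
    """Return (encoder, spm) paths using the same fallback order as embed.sh.

    Hypothetical helper for illustration; it is not part of LASER.
    """
    # Default to LASER2.
    encoder = os.path.join(model_dir, "laser2.pt")
    spm = os.path.join(model_dir, "laser2.spm")
    if language:
        stem = os.path.join(model_dir, f"laser3-{language}.v{version}")
        encoder = stem + ".pt"
        # Use the language-specific spm only if it exists; otherwise keep laser2.spm.
        if _non_empty(stem + ".spm"):
            spm = stem + ".spm"
    for kind, path in (("model file", encoder), ("spm", spm)):
        if not _non_empty(path):
            raise FileNotFoundError(f"couldn't find {kind}: {path}")
    return encoder, spm


if __name__ == "__main__":
    # Placeholder files stand in for real models so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as model_dir:
        for name in ("laser2.pt", "laser2.spm", "laser3-wol_Latn.v1.pt"):
            with open(os.path.join(model_dir, name), "w") as fh:
                fh.write("placeholder")
        print(resolve_model_files(model_dir))
        print(resolve_model_files(model_dir, "wol_Latn"))
```

In the second call the encoder resolves to the Wolof LASER3 file while the SentencePiece model stays at `laser2.spm`, which is the same fallback the script applies when no language-specific `.spm` is present.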

vecalign/.gitignore CHANGED

@@ -1,5 +1,4 @@
 build/
-dp_core.c*
 dp_core.html
 __pycache__/
 .idea
@@ -7,4 +6,4 @@ __pycache__/
 .pytest_cache/
 venv/
 fairseq/
-scores/
+scores/