# Hub ## Docs - [Signing commits with GPG](https://huggingface.co/docs/hub/security-gpg.md) - [Managing Spaces with Github Actions](https://huggingface.co/docs/hub/spaces-github-actions.md) - [Next Steps](https://huggingface.co/docs/hub/repositories-next-steps.md) - [Webhook guide: build a Discussion bot based on BLOOM](https://huggingface.co/docs/hub/webhooks-guide-discussion-bot.md) - [Manual Configuration](https://huggingface.co/docs/hub/datasets-manual-configuration.md) - [Sign in with Hugging Face](https://huggingface.co/docs/hub/oauth.md) - [Datasets Overview](https://huggingface.co/docs/hub/datasets-overview.md) - [Data files Configuration](https://huggingface.co/docs/hub/datasets-data-files-configuration.md) - [Advanced Access Control in Organizations with Resource Groups](https://huggingface.co/docs/hub/security-resource-groups.md) - [Using Spaces for Organization Cards](https://huggingface.co/docs/hub/spaces-organization-cards.md) - [DDUF](https://huggingface.co/docs/hub/dduf.md) - [File formats](https://huggingface.co/docs/hub/datasets-polars-file-formats.md) - [The HF PRO subscription 🔥](https://huggingface.co/docs/hub/pro.md) - [Spaces as MCP servers](https://huggingface.co/docs/hub/spaces-mcp-servers.md) - [Malware Scanning](https://huggingface.co/docs/hub/security-malware.md) - [Shiny on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-shiny.md) - [Digital Object Identifier (DOI)](https://huggingface.co/docs/hub/doi.md) - [Third-party scanner: JFrog](https://huggingface.co/docs/hub/security-jfrog.md) - [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/open_clip.md) - [Single Sign-On (SSO)](https://huggingface.co/docs/hub/security-sso.md) - [TF-Keras (legacy)](https://huggingface.co/docs/hub/tf-keras.md) - [Tasks](https://huggingface.co/docs/hub/models-tasks.md) - [Models Download Stats](https://huggingface.co/docs/hub/models-download-stats.md) - [Licenses](https://huggingface.co/docs/hub/repositories-licenses.md) - [How to configure SAML SSO with Azure](https://huggingface.co/docs/hub/security-sso-azure-saml.md) - [Spaces Overview](https://huggingface.co/docs/hub/spaces-overview.md) - [Integrate your library with the Hub](https://huggingface.co/docs/hub/models-adding-libraries.md) - [Polars](https://huggingface.co/docs/hub/datasets-polars.md) - [Notifications](https://huggingface.co/docs/hub/notifications.md) - [GGUF](https://huggingface.co/docs/hub/gguf.md) - [Using MLX at Hugging Face](https://huggingface.co/docs/hub/mlx.md) - [Using PaddleNLP at Hugging Face](https://huggingface.co/docs/hub/paddlenlp.md) - [Widget Examples](https://huggingface.co/docs/hub/models-widgets-examples.md) - [Using fastai at Hugging Face](https://huggingface.co/docs/hub/fastai.md) - [Using timm at Hugging Face](https://huggingface.co/docs/hub/timm.md) - [Dataset Cards](https://huggingface.co/docs/hub/datasets-cards.md) - [Organization cards](https://huggingface.co/docs/hub/organizations-cards.md) - [Aim on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-aim.md) - [Spaces Configuration Reference](https://huggingface.co/docs/hub/spaces-config-reference.md) - [Using Unity Sentis Models from Hugging Face](https://huggingface.co/docs/hub/unity-sentis.md) - [Datasets Download Stats](https://huggingface.co/docs/hub/datasets-download-stats.md) - [Cookie limitations in Spaces](https://huggingface.co/docs/hub/spaces-cookie-limitations.md) - [Embed the Dataset Viewer in a 
webpage](https://huggingface.co/docs/hub/datasets-viewer-embed.md) - [Using ESPnet at Hugging Face](https://huggingface.co/docs/hub/espnet.md) - [Pandas](https://huggingface.co/docs/hub/datasets-pandas.md) - [Organizations, Security, and the Hub API](https://huggingface.co/docs/hub/other.md) - [Embedding Atlas](https://huggingface.co/docs/hub/datasets-embedding-atlas.md) - [The Model Hub](https://huggingface.co/docs/hub/models-the-hub.md) - [Using Sentence Transformers at Hugging Face](https://huggingface.co/docs/hub/sentence-transformers.md) - [Using SpeechBrain at Hugging Face](https://huggingface.co/docs/hub/speechbrain.md) - [Dash on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-dash.md) - [Model Cards](https://huggingface.co/docs/hub/model-cards.md) - [Using _Adapters_ at Hugging Face](https://huggingface.co/docs/hub/adapters.md) - [Search](https://huggingface.co/docs/hub/search.md) - [Storage Regions on the Hub](https://huggingface.co/docs/hub/storage-regions.md) - [Using 🧨 `diffusers` at Hugging Face](https://huggingface.co/docs/hub/diffusers.md) - [Perform vector similarity search](https://huggingface.co/docs/hub/datasets-duckdb-vector-similarity-search.md) - [Spaces](https://huggingface.co/docs/hub/spaces.md) - [Models](https://huggingface.co/docs/hub/models.md) - [How to configure OIDC SSO with Azure](https://huggingface.co/docs/hub/security-sso-azure-oidc.md) - [Your First Docker Space: Text Generation with T5](https://huggingface.co/docs/hub/spaces-sdks-docker-first-demo.md) - [Static HTML Spaces](https://huggingface.co/docs/hub/spaces-sdks-static.md) - [Single Sign-On (SSO)](https://huggingface.co/docs/hub/enterprise-sso.md) - [Livebook on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-livebook.md) - [Downloading datasets](https://huggingface.co/docs/hub/datasets-downloading.md) - [How to configure OIDC SSO with Google Workspace](https://huggingface.co/docs/hub/security-sso-google-oidc.md) - [Gating Group Collections](https://huggingface.co/docs/hub/enterprise-hub-gating-group-collections.md) - [How to configure SCIM with Microsoft Entra ID (Azure AD)](https://huggingface.co/docs/hub/security-sso-entra-id-scim.md) - [Using Flair at Hugging Face](https://huggingface.co/docs/hub/flair.md) - [Authentication](https://huggingface.co/docs/hub/datasets-polars-auth.md) - [Spark](https://huggingface.co/docs/hub/datasets-spark.md) - [Using ML-Agents at Hugging Face](https://huggingface.co/docs/hub/ml-agents.md) - [fenic](https://huggingface.co/docs/hub/datasets-fenic.md) - [Two-Factor Authentication (2FA)](https://huggingface.co/docs/hub/security-2fa.md) - [Langfuse on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-langfuse.md) - [Gated datasets](https://huggingface.co/docs/hub/datasets-gated.md) - [How to handle URL parameters in Spaces](https://huggingface.co/docs/hub/spaces-handle-url-parameters.md) - [Video Dataset](https://huggingface.co/docs/hub/datasets-video.md) - [Libraries](https://huggingface.co/docs/hub/models-libraries.md) - [How to configure SAML SSO with Google Workspace](https://huggingface.co/docs/hub/security-sso-google-saml.md) - [Agents on the Hub](https://huggingface.co/docs/hub/agents.md) - [Streamlit Spaces](https://huggingface.co/docs/hub/spaces-sdks-streamlit.md) - [Using PEFT at Hugging Face](https://huggingface.co/docs/hub/peft.md) - [Model Card components](https://huggingface.co/docs/hub/model-cards-components.md) - [Hugging Face Hub 
documentation](https://huggingface.co/docs/hub/index.md) - [Perform SQL operations](https://huggingface.co/docs/hub/datasets-duckdb-sql.md) - [Data Studio](https://huggingface.co/docs/hub/data-studio.md) - [Academia Hub](https://huggingface.co/docs/hub/academia-hub.md) - [Collections](https://huggingface.co/docs/hub/collections.md) - [Uploading datasets](https://huggingface.co/docs/hub/datasets-adding.md) - [Using mlx-image at Hugging Face](https://huggingface.co/docs/hub/mlx-image.md) - [User Studies](https://huggingface.co/docs/hub/model-cards-user-studies.md) - [Docker Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker.md) - [Advanced Topics](https://huggingface.co/docs/hub/spaces-advanced.md) - [Transforming your dataset](https://huggingface.co/docs/hub/datasets-polars-operations.md) - [Using spaCy at Hugging Face](https://huggingface.co/docs/hub/spacy.md) - [Spaces Dev Mode: Seamless development in Spaces](https://huggingface.co/docs/hub/spaces-dev-mode.md) - [Model(s) Release Checklist](https://huggingface.co/docs/hub/model-release-checklist.md) - [Embed your Space in another website](https://huggingface.co/docs/hub/spaces-embed.md) - [Analytics](https://huggingface.co/docs/hub/enterprise-hub-analytics.md) - [ZenML on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-zenml.md) - [Datasets](https://huggingface.co/docs/hub/datasets.md) - [Team & Enterprise plans](https://huggingface.co/docs/hub/enterprise-hub.md) - [Advanced Topics](https://huggingface.co/docs/hub/models-advanced.md) - [Organizations](https://huggingface.co/docs/hub/organizations.md) - [Widgets](https://huggingface.co/docs/hub/models-widgets.md) - [Using 🤗 `transformers` at Hugging Face](https://huggingface.co/docs/hub/transformers.md) - [How to configure OIDC SSO with Okta](https://huggingface.co/docs/hub/security-sso-okta-oidc.md) - [Spaces Settings](https://huggingface.co/docs/hub/spaces-settings.md) - [Using Stanza at Hugging Face](https://huggingface.co/docs/hub/stanza.md) - [Dask](https://huggingface.co/docs/hub/datasets-dask.md) - [Using BERTopic at Hugging Face](https://huggingface.co/docs/hub/bertopic.md) - [Resource groups](https://huggingface.co/docs/hub/enterprise-hub-resource-groups.md) - [Webhooks](https://huggingface.co/docs/hub/webhooks.md) - [Advanced Single Sign-On (SSO)](https://huggingface.co/docs/hub/enterprise-hub-advanced-sso.md) - [🟧 Label Studio on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-label-studio.md) - [How to configure SAML SSO with Okta](https://huggingface.co/docs/hub/security-sso-okta-saml.md) - [Using GPU Spaces](https://huggingface.co/docs/hub/spaces-gpus.md) - [Secrets Scanning](https://huggingface.co/docs/hub/security-secrets.md) - [Custom Python Spaces](https://huggingface.co/docs/hub/spaces-sdks-python.md) - [SQL Console: Query Hugging Face datasets in your browser](https://huggingface.co/docs/hub/datasets-viewer-sql-console.md) - [Spaces ZeroGPU: Dynamic GPU Allocation for Spaces](https://huggingface.co/docs/hub/spaces-zerogpu.md) - [Argilla on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla.md) - [Security](https://huggingface.co/docs/hub/security.md) - [Getting Started with Repositories](https://huggingface.co/docs/hub/repositories-getting-started.md) - [Using TensorBoard](https://huggingface.co/docs/hub/tensorboard.md) - [Displaying carbon emissions for your model](https://huggingface.co/docs/hub/model-cards-co2.md) - 
[Distilabel](https://huggingface.co/docs/hub/datasets-distilabel.md) - [marimo on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-marimo.md) - [Using sample-factory at Hugging Face](https://huggingface.co/docs/hub/sample-factory.md) - [JupyterLab on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-jupyter.md) - [Using 🤗 Datasets](https://huggingface.co/docs/hub/datasets-usage.md) - [Uploading models](https://huggingface.co/docs/hub/models-uploading.md) - [GGUF usage with llama.cpp](https://huggingface.co/docs/hub/gguf-llamacpp.md) - [Using Asteroid at Hugging Face](https://huggingface.co/docs/hub/asteroid.md) - [Pull requests and Discussions](https://huggingface.co/docs/hub/repositories-pull-requests-discussions.md) - [Gated models](https://huggingface.co/docs/hub/models-gated.md) - [Moderation](https://huggingface.co/docs/hub/moderation.md) - [Run with Docker](https://huggingface.co/docs/hub/spaces-run-with-docker.md) - [Using Stable-Baselines3 at Hugging Face](https://huggingface.co/docs/hub/stable-baselines3.md) - [Managing Spaces with CircleCI Workflows](https://huggingface.co/docs/hub/spaces-circleci.md) - [Annotated Model Card Template](https://huggingface.co/docs/hub/model-card-annotated.md) - [Repositories](https://huggingface.co/docs/hub/repositories.md) - [Webhook guide: Setup an automatic metadata quality review for models and datasets](https://huggingface.co/docs/hub/webhooks-guide-metadata-review.md) - [Access control in organizations](https://huggingface.co/docs/hub/organizations-security.md) - [Network Security](https://huggingface.co/docs/hub/enterprise-hub-network-security.md) - [Docker Spaces Examples](https://huggingface.co/docs/hub/spaces-sdks-docker-examples.md) - [Query datasets](https://huggingface.co/docs/hub/datasets-duckdb-select.md) - [Panel on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-panel.md) - [FiftyOne](https://huggingface.co/docs/hub/datasets-fiftyone.md) - [Daft](https://huggingface.co/docs/hub/datasets-daft.md) - [Tabby on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-tabby.md) - [Using OpenCV in Spaces](https://huggingface.co/docs/hub/spaces-using-opencv.md) - [WebDataset](https://huggingface.co/docs/hub/datasets-webdataset.md) - [Argilla](https://huggingface.co/docs/hub/datasets-argilla.md) - [Downloading models](https://huggingface.co/docs/hub/models-downloading.md) - [How to configure SCIM with Okta](https://huggingface.co/docs/hub/security-sso-okta-scim.md) - [How to get a user's plan and status in Spaces](https://huggingface.co/docs/hub/spaces-get-user-plan.md) - [Image Dataset](https://huggingface.co/docs/hub/datasets-image.md) - [GGUF usage with GPT4All](https://huggingface.co/docs/hub/gguf-gpt4all.md) - [Billing](https://huggingface.co/docs/hub/billing.md) - [THE LANDSCAPE OF ML DOCUMENTATION TOOLS](https://huggingface.co/docs/hub/model-card-landscape-analysis.md) - [Managing organizations](https://huggingface.co/docs/hub/organizations-managing.md) - [Using Keras at Hugging Face](https://huggingface.co/docs/hub/keras.md) - [Adding a Sign-In with HF button to your Space](https://huggingface.co/docs/hub/spaces-oauth.md) - [Appendix](https://huggingface.co/docs/hub/model-card-appendix.md) - [Advanced Security](https://huggingface.co/docs/hub/enterprise-hub-advanced-security.md) - [DuckDB](https://huggingface.co/docs/hub/datasets-duckdb.md) - [Evidence on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-evidence.md) - [Audit 
Logs](https://huggingface.co/docs/hub/audit-logs.md) - [Combine datasets and export](https://huggingface.co/docs/hub/datasets-duckdb-combine-and-export.md) - [Advanced Compute Options](https://huggingface.co/docs/hub/advanced-compute-options.md) - [Repository Settings](https://huggingface.co/docs/hub/repositories-settings.md) - [Hub Rate limits](https://huggingface.co/docs/hub/rate-limits.md) - [Pickle Scanning](https://huggingface.co/docs/hub/security-pickle.md) - [Handling Spaces Dependencies in Gradio Spaces](https://huggingface.co/docs/hub/spaces-dependencies.md) - [Hub API Endpoints](https://huggingface.co/docs/hub/api.md) - [Storage limits](https://huggingface.co/docs/hub/storage-limits.md) - [Model Card Guidebook](https://huggingface.co/docs/hub/model-card-guidebook.md) - [Disk usage on Spaces](https://huggingface.co/docs/hub/spaces-storage.md) - [Audio Dataset](https://huggingface.co/docs/hub/datasets-audio.md) - [Use AI Models Locally](https://huggingface.co/docs/hub/local-apps.md) - [Gradio Spaces](https://huggingface.co/docs/hub/spaces-sdks-gradio.md) - [Hugging Face Dataset Upload Decision Guide](https://huggingface.co/docs/hub/datasets-upload-guide-llm.md) - [Use Ollama with any GGUF Model on Hugging Face Hub](https://huggingface.co/docs/hub/ollama.md) - [User access tokens](https://huggingface.co/docs/hub/security-tokens.md) - [Using SpanMarker at Hugging Face](https://huggingface.co/docs/hub/span_marker.md) - [Spaces Changelog](https://huggingface.co/docs/hub/spaces-changelog.md) - [Tokens Management](https://huggingface.co/docs/hub/enterprise-hub-tokens-management.md) - [Inference Providers](https://huggingface.co/docs/hub/models-inference.md) - [Hugging Face MCP Server](https://huggingface.co/docs/hub/hf-mcp-server.md) - [Authentication for private and gated datasets](https://huggingface.co/docs/hub/datasets-duckdb-auth.md) - [Giskard on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-giskard.md) - [Optimizations](https://huggingface.co/docs/hub/datasets-polars-optimizations.md) - [Paper Pages](https://huggingface.co/docs/hub/paper-pages.md) - [Using SetFit with Hugging Face](https://huggingface.co/docs/hub/setfit.md) - [Jupyter Notebooks on the Hugging Face Hub](https://huggingface.co/docs/hub/notebooks.md) - [Libraries](https://huggingface.co/docs/hub/datasets-libraries.md) - [Models Frequently Asked Questions](https://huggingface.co/docs/hub/models-faq.md) - [Spaces Custom Domain](https://huggingface.co/docs/hub/spaces-custom-domain.md) - [How to Add a Space to ArXiv](https://huggingface.co/docs/hub/spaces-add-to-arxiv.md) - [File names and splits](https://huggingface.co/docs/hub/datasets-file-names-and-splits.md) - [More ways to create Spaces](https://huggingface.co/docs/hub/spaces-more-ways-to-create.md) - [PyArrow](https://huggingface.co/docs/hub/datasets-pyarrow.md) - [Using `Transformers.js` at Hugging Face](https://huggingface.co/docs/hub/transformers-js.md) - [Configure the Dataset Viewer](https://huggingface.co/docs/hub/datasets-viewer-configure.md) - [Third-party scanner: Protect AI](https://huggingface.co/docs/hub/security-protectai.md) - [User Provisioning (SCIM)](https://huggingface.co/docs/hub/enterprise-hub-scim.md) - [Webhook guide: Setup an automatic system to re-train a model when a dataset changes](https://huggingface.co/docs/hub/webhooks-guide-auto-retrain.md) - [Git over SSH](https://huggingface.co/docs/hub/security-git-ssh.md) - [Using AllenNLP at Hugging 
Face](https://huggingface.co/docs/hub/allennlp.md) - [Using RL-Baselines3-Zoo at Hugging Face](https://huggingface.co/docs/hub/rl-baselines3-zoo.md) - [ChatUI on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-chatui.md) - [Datasets](https://huggingface.co/docs/hub/enterprise-hub-datasets.md) - [Xet History & Overview](https://huggingface.co/docs/hub/xet/overview.md) - [Backward Compatibility with LFS](https://huggingface.co/docs/hub/xet/legacy-git-lfs.md) - [Xet: our Storage Backend](https://huggingface.co/docs/hub/xet/index.md) - [Deduplication](https://huggingface.co/docs/hub/xet/deduplication.md) - [Security Model](https://huggingface.co/docs/hub/xet/security.md) - [Using Xet Storage](https://huggingface.co/docs/hub/xet/using-xet-storage.md) ### Signing commits with GPG https://huggingface.co/docs/hub/security-gpg.md # Signing commits with GPG `git` has an authentication layer to control who can push commits to a repo, but it does not authenticate the actual commit authors. In other words, you can commit changes as `Elon Musk `, push them to your preferred `git` host (for instance github.com), and your commit will link to Elon's GitHub profile. (Try it! But don't blame us if Elon gets mad at you for impersonating him.) The reasons we implemented GPG signing were: - To provide finer-grained security, especially as more and more Enterprise users rely on the Hub. - To provide ML benchmarks backed by a cryptographically-secure source. See Ale Segala's [How (and why) to sign `git` commits](https://withblue.ink/2020/05/17/how-and-why-to-sign-git-commits.html) for more context. You can prove a commit was authored by you with GNU Privacy Guard (GPG) and a key server. GPG is a cryptographic tool used to verify the authenticity of a message's origin. We'll explain how to set this up on Hugging Face below. The Pro Git book is, as usual, a good resource about commit signing: [Pro Git: Signing your work](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work). ## Setting up signed commits verification You will need to install [GPG](https://gnupg.org/) on your system in order to execute the following commands. > It's included by default in most Linux distributions. > On Windows, it is included in Git Bash (which comes with `git` for Windows). You can sign your commits locally using [GPG](https://gnupg.org/). Then configure your profile to mark these commits as **verified** on the Hub, so other people can be confident that they come from a trusted source. For a more in-depth explanation of how git and GPG interact, please visit the [git documentation on the subject](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work) Commits can have the following signing statuses: | Status | Explanation | | ----------------- | ------------------------------------------------------------ | | Verified | The commit is signed and the signature is verified | | Unverified | The commit is signed but the signature could not be verified | | No signing status | The commit is not signed | For a commit to be marked as **verified**, you need to upload the public key used to sign it on your Hugging Face account. Use the `gpg --list-secret-keys` command to list the GPG keys for which you have both a public and private key. A private key is required for signing commits or tags. If you don't have a GPG key pair or you don't want to use the existing keys to sign your commits, go to **Generating a new GPG key**. 
Otherwise, go straight to [Adding a GPG key to your account](#adding-a-gpg-key-to-your-account).

## Generating a new GPG key

To generate a GPG key, run the following:

```bash
gpg --gen-key
```

GPG will then guide you through the process of creating a GPG key pair.

Make sure you specify an email address for this key, and that the email address matches the one you specified in your Hugging Face [account](https://huggingface.co/settings/account).

## Adding a GPG key to your account

1. First, select or generate a GPG key on your computer. Make sure the email address of the key matches the one in your Hugging Face [account](https://huggingface.co/settings/account) and that the email of your account is verified.

2. Export the public part of the selected key:

```bash
gpg --armor --export <YOUR KEY ID>
```

3. Then visit your profile [settings page](https://huggingface.co/settings/keys) and click on **Add GPG Key**.

Copy & paste the output of the `gpg --export` command in the text area and click on **Add Key**.

4. Congratulations! 🎉 You've just added a GPG key to your account!

## Configure git to sign your commits with GPG

The last step is to configure git to sign your commits:

```bash
git config user.signingkey <YOUR KEY ID>
git config user.email <YOUR EMAIL>
```

Then add the `-S` flag to your `git commit` commands to sign your commits!

```bash
git commit -S -m "My first signed commit"
```

Once pushed on the Hub, you should see the commit with a "Verified" badge.

> [!TIP]
> To sign all commits by default in any local repository on your computer, you can run `git config --global commit.gpgsign true`.

### Managing Spaces with Github Actions

https://huggingface.co/docs/hub/spaces-github-actions.md

# Managing Spaces with Github Actions

You can keep your app in sync with your GitHub repository with **GitHub Actions**. Remember that for files larger than 10MB, Spaces requires Git-LFS. If you don't want to use Git-LFS, you may need to review your files and check your history. Use a tool like [BFG Repo-Cleaner](https://rtyley.github.io/bfg-repo-cleaner/) to remove any large files from your history. BFG Repo-Cleaner will keep a local copy of your repository as a backup.

First, you should set up your GitHub repository and Spaces app together. Add your Spaces app as an additional remote to your existing Git repository.

```bash
git remote add space https://huggingface.co/spaces/HF_USERNAME/SPACE_NAME
```

Then force push to sync everything for the first time:

```bash
git push --force space main
```

Next, set up a GitHub Action to push your main branch to Spaces. In the example below:

* Replace `HF_USERNAME` with your username and `SPACE_NAME` with your Space name.
* Create a [GitHub secret](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-an-environment) with your `HF_TOKEN`. You can find your Hugging Face API token under **API Tokens** on your Hugging Face profile.

```yaml
name: Sync to Hugging Face hub
on:
  push:
    branches: [main]

  # to run this workflow manually from the Actions tab
  workflow_dispatch:

jobs:
  sync-to-hub:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
          lfs: true
      - name: Push to hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: git push https://HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/HF_USERNAME/SPACE_NAME main
```

Finally, create an Action that automatically checks the file size of any new pull request:

```yaml
name: Check file size
on: # or directly `on: [push]` to run the action on every push on any branch
  pull_request:
    branches: [main]

  # to run this workflow manually from the Actions tab
  workflow_dispatch:

jobs:
  sync-to-hub:
    runs-on: ubuntu-latest
    steps:
      - name: Check large files
        uses: ActionsDesk/lfs-warning@v2.0
        with:
          filesizelimit: 10485760 # this is 10MB so we can sync to HF Spaces
```

### Next Steps

https://huggingface.co/docs/hub/repositories-next-steps.md

# Next Steps

These next sections highlight features and additional information that you may find useful to make the most out of the Git repositories on the Hugging Face Hub.

## How to programmatically manage repositories

Hugging Face supports accessing repos with Python via the [`huggingface_hub` library](https://huggingface.co/docs/huggingface_hub/index). The operations that we've explored, such as downloading repositories and uploading files, are available through the library, as well as other useful functions!

If you prefer to use git directly, please read the sections below.

## Learning more about Git

A good place to visit if you want to continue learning about Git is [this Git tutorial](https://learngitbranching.js.org/). For even more background on Git, you can take a look at [GitHub's Git Guides](https://github.com/git-guides).

## How to use branches

To effectively use Git repos collaboratively and to work on features without releasing premature code, you can use **branches**. Branches allow you to separate your "work in progress" code from your "production-ready" code, with the additional benefit of letting multiple people work on a project without frequently conflicting with each other's contributions. You can use branches to isolate experiments in their own branch, and even [adopt team-wide practices for managing branches](https://ericmjl.github.io/essays-on-data-science/workflow/gitflow/).

To learn about Git branching, you can try out the [Learn Git Branching interactive tutorial](https://learngitbranching.js.org/).

## Using tags

Git allows you to *tag* commits so that you can easily note milestones in your project. As such, you can use tags to mark commits in your Hub repos!

To learn about using tags, you can visit [this DevConnected post](https://devconnected.com/how-to-create-git-tags/).

Beyond making it easy to identify important commits in your repo's history, using Git tags also allows you to do A/B testing, [clone a repository at a specific tag](https://www.techiedelight.com/clone-specific-tag-with-git/), and more! The `huggingface_hub` library also supports working with tags, such as [downloading files from a specific tagged commit](https://huggingface.co/docs/huggingface_hub/main/en/how-to-downstream#hfhuburl).

## How to duplicate or fork a repo (including LFS pointers)

If you'd like to copy a repository, there are two options, depending on whether you want to preserve the Git history.
### Duplicating without Git history

In many scenarios, if you want your own copy of a particular codebase, you might not be concerned about the previous Git history. In this case, you can quickly duplicate a repo with the handy [Repo Duplicator](https://huggingface.co/spaces/huggingface-projects/repo_duplicator)! You'll have to create a User Access Token, which you can read more about in the [security documentation](./security-tokens).

### Duplicating with the Git history (Fork)

A duplicate of a repository with the commit history preserved is called a *fork*. You may choose to fork one of your own repos, but it is also common to fork other people's projects if you would like to tinker with them.

**Note that you will need to [install Git LFS](https://git-lfs.github.com/) and the [`huggingface_hub` CLI](https://huggingface.co/docs/huggingface_hub/index) to follow this process**.

When you want to fork or [rebase](https://git-scm.com/docs/git-rebase) a repository with LFS files, you cannot use the usual Git approach that you might be familiar with, since you need to be careful not to break the LFS pointers. Forking can take time depending on your bandwidth, because you will have to fetch and re-upload all the LFS files in your fork.

For example, say you have an upstream repository, **upstream**, and you just created your own repository on the Hub which is **myfork** in this example.

1. Create a destination repository (e.g. **myfork**) in https://huggingface.co

2. Clone your fork repository:

```
git clone git@hf.co:me/myfork
```

3. Fetch non-LFS files:

```
cd myfork
git lfs install --skip-smudge --local # affects only this clone
git remote add upstream git@hf.co:friend/upstream
git fetch upstream
```

4. Fetch large files. This can take some time depending on your download bandwidth:

```
git lfs fetch --all upstream # this can take time depending on your download bandwidth
```

4.a. If you want to completely override the fork history (which should only have an initial commit), run:

```
git reset --hard upstream/main
```

4.b. If you want to rebase instead of overriding, run the following command and resolve any conflicts:

```
git rebase upstream/main
```

5. Prepare your LFS files to push:

```
git lfs install --force --local # this reinstalls the LFS hooks
hf lfs-enable-largefiles . # needed if some files are bigger than 5GB
```

6. And finally push:

```
git push --force origin main # this can take time depending on your upload bandwidth
```

Now you have your own fork or rebased repo in the Hub!

### Webhook guide: build a Discussion bot based on BLOOM

https://huggingface.co/docs/hub/webhooks-guide-discussion-bot.md

# Webhook guide: build a Discussion bot based on BLOOM

> [!TIP]
> Webhooks are now publicly available!

Here's a short guide on how to use Hugging Face Webhooks to build a bot that replies to Discussion comments on the Hub with a response generated by BLOOM, a multilingual language model, using the free Inference API.

## Create your Webhook in your user profile

First, let's create a Webhook from your [settings](https://huggingface.co/settings/webhooks).

- Input a few target repositories that your Webhook will listen to.
- You can put a dummy Webhook URL for now, but defining your webhook will let you look at the events that will be sent to it (and you can replay them, which will be useful for debugging).
- Input a secret, as it will be more secure.
- Subscribe to Community (PR & discussions) events, as we are building a Discussion bot.
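If you prefer to script this step instead of using the web form, the `huggingface_hub` library exposes a webhooks API. Below is a minimal sketch; the target repository, URL, and secret are placeholders, and the exact parameters may differ slightly depending on your `huggingface_hub` version, so check the webhooks guide in the library docs.

```python
# Sketch: create the same Webhook programmatically (placeholders throughout).
from huggingface_hub import HfApi

api = HfApi(token="hf_xxx")  # a token for the account that owns the Webhook

webhook = api.create_webhook(
    url="https://example.com/not-yet-deployed",        # dummy URL for now, update it later
    watched=[{"type": "model", "name": "user/repo"}],  # target repositories to listen to
    domains=["discussion"],                            # Community (PR & discussions) events
    secret="my-webhook-secret",                        # used to verify incoming requests
)
print(webhook.id)
```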
Your Webhook will look like this: ![webhook-creation](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/webhook-creation.png) ## Create a new `Bot` user profile In this guide, we create a separate user account to host a Space and to post comments: ![discussion-bot-profile](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/discussion-bot-profile.png) > [!TIP] > When creating a bot that will interact with other users on the Hub, we ask that you clearly label the account as a "Bot" (see profile screenshot). ## Create a Space that will react to your Webhook The third step is actually to listen to the Webhook events. An easy way is to use a Space for this. We use the user account we created, but you could do it from your main user account if you wanted to. The Space's code is [here](https://huggingface.co/spaces/discussion-bot/webhook/tree/main). We used NodeJS and Typescript to implement it, but any language or framework would work equally well. Read more about Docker Spaces [here](https://huggingface.co/docs/hub/spaces-sdks-docker). **The main `server.ts` file is [here](https://huggingface.co/spaces/discussion-bot/webhook/blob/main/server.ts)** Let's walk through what happens in this file: ```ts app.post("/", async (req, res) => { if (req.header("X-Webhook-Secret") !== process.env.WEBHOOK_SECRET) { console.error("incorrect secret"); return res.status(400).json({ error: "incorrect secret" }); } ... ``` Here, we listen to POST requests made to `/`, and then we check that the `X-Webhook-Secret` header is equal to the secret we had previously defined (you need to also set the `WEBHOOK_SECRET` secret in your Space's settings to be able to verify it). ```ts const event = req.body.event; if ( event.action === "create" && event.scope === "discussion.comment" && req.body.comment.content.includes(BOT_USERNAME) ) { ... ``` The event's payload is encoded as JSON. Here, we specify that we will run our Webhook only when: - the event concerns a discussion comment - the event is a creation, i.e. a new comment has been posted - the comment's content contains `@discussion-bot`, i.e. our bot was just mentioned in a comment. In that case, we will continue to the next step: ```ts const INFERENCE_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"; const PROMPT = `Pretend that you are a bot that replies to discussions about machine learning, and reply to the following comment:\n`; const response = await fetch(INFERENCE_URL, { method: "POST", body: JSON.stringify({ inputs: PROMPT + req.body.comment.content }), }); if (response.ok) { const output = await response.json(); const continuationText = output[0].generated_text.replace( PROMPT + req.body.comment.content, "" ); ... ``` This is the coolest part: we call the Inference API for the BLOOM model, prompting it with `PROMPT`, and we get the continuation text, i.e., the part generated by the model. 
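If you want to try this call outside of the Space (for example, while iterating on the prompt), the same request can be made from Python. This is only a rough sketch mirroring the TypeScript above: the endpoint, prompt, and `generated_text` response field are the ones used in the snippet, and the example comment is made up.

```python
# Rough Python equivalent of the Inference API call above, for local prototyping.
import requests

INFERENCE_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
PROMPT = (
    "Pretend that you are a bot that replies to discussions about machine learning, "
    "and reply to the following comment:\n"
)

comment = "@discussion-bot how should I pick a learning rate?"  # made-up example comment
response = requests.post(INFERENCE_URL, json={"inputs": PROMPT + comment})
response.raise_for_status()

generated_text = response.json()[0]["generated_text"]
continuation = generated_text.replace(PROMPT + comment, "")  # keep only the model's reply
print(continuation)
```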
Finally, we will post it as a reply in the same discussion thread: ```ts const commentUrl = req.body.discussion.url.api + "/comment"; const commentApiResponse = await fetch(commentUrl, { method: "POST", headers: { Authorization: `Bearer ${process.env.HF_TOKEN}`, "Content-Type": "application/json", }, body: JSON.stringify({ comment: continuationText }), }); const apiOutput = await commentApiResponse.json(); ``` ## Configure your Webhook to send events to your Space Last but not least, you'll need to configure your Webhook to send POST requests to your Space. Let's first grab our Space's "direct URL" from the contextual menu. Click on "Embed this Space" and copy the "Direct URL". ![embed this Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/embed-space.png) ![direct URL](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/direct-url.png) Update your webhook to send requests to that URL: ![webhook settings](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/webhook-creation.png) ## Result ![discussion-result](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/discussion-result.png) ### Manual Configuration https://huggingface.co/docs/hub/datasets-manual-configuration.md # Manual Configuration This guide will show you how to configure a custom structure for your dataset repository. The [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87) showcases each section of the documentation. A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to define the splits, subsets and builder parameters that are used by the Viewer. It is also possible to define multiple subsets (also called "configurations") for the same dataset (e.g. if the dataset has various independent files). ## Splits If you have multiple files and want to define which file goes into which split, you can use YAML at the top of your README.md. For example, given a repository like this one: ``` my_dataset_repository/ ├── README.md ├── data.csv └── holdout.csv ``` You can define a subset for your splits by adding the `configs` field in the YAML block at the top of your README.md: ```yaml --- configs: - config_name: default data_files: - split: train path: "data.csv" - split: test path: "holdout.csv" --- ``` You can select multiple files per split using a list of paths: ``` my_dataset_repository/ ├── README.md ├── data/ │ ├── abc.csv │ └── def.csv └── holdout/ └── ghi.csv ``` ```yaml --- configs: - config_name: default data_files: - split: train path: - "data/abc.csv" - "data/def.csv" - split: test path: "holdout/ghi.csv" --- ``` Or you can use glob patterns to automatically list all the files you need: ```yaml --- configs: - config_name: default data_files: - split: train path: "data/*.csv" - split: test path: "holdout/*.csv" --- ``` > [!WARNING] > Note that `config_name` field is required even if you have a single subset. ## Multiple Subsets Your dataset might have several subsets of data that you want to be able to use separately. For example each subset has its own dropdown in the Dataset Viewer the Hugging Face Hub. 
In that case, you can define a list of subsets inside the `configs` field in YAML:

```
my_dataset_repository/
├── README.md
├── main_data.csv
└── additional_data.csv
```

```yaml
---
configs:
- config_name: main_data
  data_files: "main_data.csv"
- config_name: additional_data
  data_files: "additional_data.csv"
---
```

Note that the order of subsets shown in the viewer is the default one first, then alphabetical.

> [!TIP]
> You can set a default subset using `default: true`
>
> ```yaml
> - config_name: main_data
>   data_files: "main_data.csv"
>   default: true
> ```
>
> This is useful to set which subset the Dataset Viewer shows first, and which subset data libraries load by default.

## Builder parameters

Besides `data_files`, other builder-specific parameters can be passed via YAML, allowing more flexibility in how the data is loaded without requiring any custom code. For example, define which separator to use in which subset to load your `csv` files:

```yaml
---
configs:
- config_name: tab
  data_files: "main_data.csv"
  sep: "\t"
- config_name: comma
  data_files: "additional_data.csv"
  sep: ","
---
```

Refer to the [specific builders' documentation](/docs/datasets/package_reference/builder_classes) to see what parameters they have.

### Sign in with Hugging Face

https://huggingface.co/docs/hub/oauth.md

# Sign in with Hugging Face

You can use the HF OAuth / OpenID Connect flow to create a **"Sign in with HF"** flow in any website or app. This will allow users to sign in to your website or app using their HF account, by clicking a button similar to this one:

![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-xl-dark.svg)

After clicking this button, your users will be presented with a permissions modal to authorize your app:

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/oauth-accept-application.png)

## Creating an oauth app

You can create your application in your [settings](https://huggingface.co/settings/applications/new):

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/oauth-create-application.png)

### If you are hosting in Spaces

> [!TIP]
> If you host your app on Spaces, then the flow will be even easier to implement (and built directly into Gradio); check our [Spaces OAuth guide](https://huggingface.co/docs/hub/spaces-oauth).

### Automated oauth app creation

Hugging Face supports CIMD, aka [Client ID Metadata Documents](https://datatracker.ietf.org/doc/draft-ietf-oauth-client-id-metadata-document/), which allows you to create an oauth app for your website in an automated manner:

- Add an endpoint to your website `/.well-known/oauth-cimd` which returns the following JSON:

```json
{
  client_id: "[your website url]/.well-known/oauth-cimd",
  client_name: "Your Website",
  redirect_uris: ["[your website url]/oauth/callback/huggingface"],
  token_endpoint_auth_method: "none",
  logo_uri: "https://....", // optional
  client_uri: "[your website url]", // optional
}
```

- Use `"[your website url]/.well-known/oauth-cimd"` as the client ID, and PKCE as the auth mechanism.

This is particularly useful for ephemeral environments or MCP clients. See an [implementation example](https://github.com/huggingface/chat-ui/pull/1978) in Hugging Chat.

## Currently supported scopes

The currently supported scopes are:

- `openid`: Get the ID token in addition to the access token.
- `profile`: Get the user's profile information (username, avatar, etc.)
- `email`: Get the user's email address.
- `read-billing`: Know whether the user has a payment method set up.
- `read-repos`: Get read access to the user's personal repos.
- `contribute-repos`: Can create repositories and access those created by this app. Cannot access any other repositories unless additional permissions are granted.
- `write-repos`: Get write/read access to the user's personal repos.
- `manage-repos`: Get full access to the user's personal repos. Also grants repo creation and deletion.
- `inference-api`: Get access to the [Inference Providers](https://huggingface.co/docs/inference-providers/index); you will be able to make inference requests on behalf of the user.
- `jobs`: Run [jobs](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs).
- `webhooks`: Manage [webhooks](https://huggingface.co/docs/huggingface_hub/main/en/guides/webhooks).
- `write-discussions`: Open discussions and Pull Requests on behalf of the user, as well as interact with discussions (including reactions, posting/editing comments, closing discussions, ...). To open Pull Requests on private repos, you need to request the `read-repos` scope as well.

All other information is available in the [OpenID metadata](https://huggingface.co/.well-known/openid-configuration).

> [!WARNING]
> Please contact us if you need any extra scopes.

## Accessing organization resources

By default, the oauth app does not need to access organization resources. But some scopes like `read-repos` or `read-billing` apply to organizations as well. The user can select which organizations to grant access to when authorizing the app. If you require access to a specific organization, you can add `orgIds=ORG_ID` as a query parameter to the OAuth authorization URL. You have to replace `ORG_ID` with the organization ID, which is available in the `organizations.sub` field of the userinfo response.

## Branding

You are free to use your own design for the button. Below are some SVG images helpfully provided. Check out [our badges](https://huggingface.co/datasets/huggingface/badges#sign-in-with-hugging-face) with explanations for integrating them in markdown or HTML.
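The badges below all point at the authorization endpoint. For reference, here is a small sketch of how that URL can be assembled for your own button; `CLIENT_ID`, `REDIRECT_URI`, and `STATE` are placeholders taken from your OAuth app settings, and `orgIds` is only needed if you require access to a specific organization (see above).

```python
# Sketch: build the "Sign in with HF" authorization URL (placeholders throughout).
from urllib.parse import urlencode

params = {
    "client_id": "CLIENT_ID",
    "redirect_uri": "REDIRECT_URI",  # e.g. https://your-app.example/oauth/callback/huggingface
    "scope": "openid profile",       # add other scopes (e.g. "email") if your app requests them
    "state": "STATE",                # random value to validate in your callback
    # "orgIds": "ORG_ID",            # optional: request access to a specific organization
}
authorize_url = "https://huggingface.co/oauth/authorize?" + urlencode(params)
print(authorize_url)
```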
[![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-sm.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-sm-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-md.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-md-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-lg.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-lg-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-xl.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-xl-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) ### Datasets Overview https://huggingface.co/docs/hub/datasets-overview.md # Datasets Overview ## Datasets on the Hub The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/nyu-mll/glue), include a [Dataset Viewer](./data-studio) to showcase the data. Each dataset is a [Git repository](./repositories) that contains the data required to generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure will ensure that the dataset page on the Hub will have a Viewer. ## Search for datasets Like models and spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets). There's a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you. ## Privacy Since datasets are repositories, you can [toggle their visibility between private and public](./repositories-settings#private-repositories) through the Settings tab. 
If a dataset is owned by an [organization](./organizations), the privacy settings apply to all the members of the organization. ### Data files Configuration https://huggingface.co/docs/hub/datasets-data-files-configuration.md # Data files Configuration There are no constraints on how to structure dataset repositories. However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`. ## What are splits and subsets? Machine learning datasets typically have splits and may also have subsets. A dataset is generally made of _splits_ (e.g. `train` and `test`) that are used during different stages of training and evaluating a model. A _subset_ (also called _configuration_) is a sub-dataset contained within a larger dataset. Subsets are especially common in multilingual speech datasets where there may be a different subset for each language. If you're interested in learning more about splits and subsets, check out the [Splits and subsets](/docs/datasets-server/configs_and_splits) guide! ![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif) ## Automatic splits detection Splits are automatically detected based on file and directory names. For example, this is a dataset with `train`, `test`, and `validation` splits: ``` my_dataset_repository/ ├── README.md ├── train.csv ├── test.csv └── validation.csv ``` To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation and the [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135). ## Manual splits and subsets configuration You can choose the data files to show in the Dataset Viewer for your dataset using YAML. It is useful if you want to specify which file goes into which split manually. You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files). Here is an example of a configuration defining a subset called "benchmark" with a `test` split. ```yaml configs: - config_name: benchmark data_files: - split: test path: benchmark.csv ``` See the documentation on [Manual configuration](./datasets-manual-configuration) for more information. Look also to the [example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87). ## Supported file formats See the [File formats](./datasets-adding#file-formats) doc page to find the list of supported formats and recommendations for your dataset. If your dataset uses CSV or TSV files, you can find more information in the [example datasets](https://huggingface.co/collections/datasets-examples/format-csv-and-tsv-655f681cb9673a4249cccb3d). ## Image, Audio and Video datasets For image/audio/video classification datasets, you can also use directories to name the image/audio/video classes. And if your images/audio/video files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. 
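As a quick illustration, here is a hypothetical layout with a `metadata.csv` stored next to the images, and how it can be loaded with the `datasets` library; the guides listed just below cover the exact conventions (column names, supported metadata formats, and so on).

```python
# Hypothetical repository layout, illustrating metadata files stored next to the images:
#
#   my_image_dataset/
#   └── train/
#       ├── metadata.csv   # columns: file_name, caption
#       ├── 0001.png
#       └── 0002.png
#
# A folder (or Hub repo) structured like this can be loaded with the `datasets` library:
from datasets import load_dataset

ds = load_dataset("imagefolder", data_dir="my_image_dataset")
print(ds["train"][0])  # {'image': <PIL image>, 'caption': '...'}
```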
We provide two guides that you can check out: - [How to create an image dataset](./datasets-image) ([example datasets](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65)) - [How to create an audio dataset](./datasets-audio) ([example datasets](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607)) - [How to create a video dataset](./datasets-video) ### Advanced Access Control in Organizations with Resource Groups https://huggingface.co/docs/hub/security-resource-groups.md # Advanced Access Control in Organizations with Resource Groups > [!WARNING] > This feature is part of the Team & Enterprise plans. In your Hugging Face organization, you can use Resource Groups to control which members have access to specific repositories. ## How does it work? Resource Groups allow organization administrators to group related repositories together, allowing different teams in your organization to work on independent sets of repositories. A repository can belong to only one Resource Group. Organizations members need to be added to the Resource Group to access its repositories. An Organization Member can belong to several Resource Groups. Members are assigned a role in each Resource Group that determines their permissions for the group's repositories. Four distinct roles exist for Resource Groups: - `read`: Grants read access to repositories within the Resource Group. - `contributor`: Provides extra write rights to the subset of the Organization's repositories created by the user (i.e., users can create repos and then modify only those repos). Similar to the 'Write' role, but limited to repos created by the user. - `write`: Offers write access to all repositories in the Resource Group. Users can create, delete, or rename any repository in the Resource Group. - `admin`: In addition to write permissions on repositories, admin members can administer the Resource Group — add, remove, and alter the roles of other members. They can also transfer repositories in and out of the Resource Group. In addition, Organization admins can manage all resource groups inside the organization. Resource Groups also affect the visibility of private repositories inside the organization. A private repository that is part of a Resource Group will only be visible to members of that Resource Group. Public repositories, on the other hand, are visible to anyone, inside and outside the organization. ## Getting started Head to your Organization's settings, then navigate to the "Resource Group" tab in the left menu. If you are an admin of the organization, you can create and manage Resource Groups from that page. After creating a resource group and giving it a meaningful name, you can start adding repositories and users to it. Remember that a repository can be part of only one Resource Group. You'll be warned when trying to add a repository that already belongs to another Resource Group. ## Programmatic management (API) See [Resource Groups API Section](https://huggingface.co/docs/hub/en/api#resource-groups-api) ### Using Spaces for Organization Cards https://huggingface.co/docs/hub/spaces-organization-cards.md # Using Spaces for Organization Cards Organization cards are a way to describe your organization to other users. They take the form of a `README.md` static file, inside a Space repo named `README`. Please read more in the [dedicated doc section](./organizations-cards). 
### DDUF https://huggingface.co/docs/hub/dduf.md # DDUF ## Overview DDUF (**D**DUF’s **D**iffusion **U**nified **F**ormat) is a single-file format for diffusion models that aims to unify the different model distribution methods and weight-saving formats by packaging all model components into a single file. It is language-agnostic and built to be parsable from a remote location without downloading the entire file. This work draws inspiration from the [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) format. Check out the [DDUF](https://huggingface.co/DDUF) org to start using some of the most popular diffusion models in DDUF. > [!TIP] > We welcome contributions with open arms! > > To create a widely adopted file format, we need early feedback from the community. Nothing is set in stone, and we value everyone's input. Is your use case not covered? Please let us know in the DDUF organization [discussions](https://huggingface.co/spaces/DDUF/README/discussions/2). Its key features include the following. 1. **Single file** packaging. 2. Based on **ZIP file format** to leverage existing tooling. 3. No compression, ensuring **`mmap` compatibility** for fast loading and saving. 4. **Language-agnostic**: tooling can be implemented in Python, JavaScript, Rust, C++, etc. 5. **HTTP-friendly**: metadata and file structure can be fetched remotely using HTTP Range requests. 6. **Flexible**: each model component is stored in its own directory, following the current Diffusers structure. 7. **Safe**: uses [Safetensors](https://huggingface.co/docs/diffusers/using-diffusers/other-formats#safetensors) as a weight-saving format and prohibits nested directories to prevent ZIP bombs. ## Technical specifications Technically, a `.dduf` file **is** a [`.zip` archive](https://en.wikipedia.org/wiki/ZIP_(file_format)). By building on a universally supported file format, we ensure robust tooling already exists. However, some constraints are enforced to meet diffusion models' requirements: - Data must be stored uncompressed (flag `0`), allowing lazy-loading using memory-mapping. - Data must be stored using ZIP64 protocol, enabling saving files above 4GB. - The archive can only contain `.json`, `.safetensors`, `.model` and `.txt` files. - A `model_index.json` file must be present at the root of the archive. It must contain a key-value mapping with metadata about the model and its components. - Each component must be stored in its own directory (e.g., `vae/`, `text_encoder/`). Nested files must use UNIX-style path separators (`/`). - Each directory must correspond to a component in the `model_index.json` index. - Each directory must contain a json config file (one of `config.json`, `tokenizer_config.json`, `preprocessor_config.json`, `scheduler_config.json`). - Sub-directories are forbidden. Want to check if your file is valid? Check it out using this Space: https://huggingface.co/spaces/DDUF/dduf-check. ## Usage The `huggingface_hub` provides tooling to handle DDUF files in Python. It includes built-in rules to validate file integrity and helpers to read and export DDUF files. The goal is to see this tooling adopted in the Python ecosystem, such as in the `diffusers` integration. Similar tooling can be developed for other languages (JavaScript, Rust, C++, etc.). ### How to read a DDUF file? Pass a path to `read_dduf_file` to read a DDUF file. Only the metadata is read, meaning this is a lightweight call that won't explode your memory. 
In the example below, we consider that you've already downloaded the [`FLUX.1-dev.dduf`](https://huggingface.co/DDUF/FLUX.1-dev-DDUF/blob/main/FLUX.1-dev.dduf) file locally. ```python >>> from huggingface_hub import read_dduf_file # Read DDUF metadata >>> dduf_entries = read_dduf_file("FLUX.1-dev.dduf") ``` `read_dduf_file` returns a mapping where each entry corresponds to a file in the DDUF archive. A file is represented by a `DDUFEntry` dataclass that contains the filename, offset, and length of the entry in the original DDUF file. This information is useful to read its content without loading the whole file. In practice, you won't have to handle low-level reading but rely on helpers instead. For instance, here is how to load the `model_index.json` content: ```python >>> import json >>> json.loads(dduf_entries["model_index.json"].read_text()) {'_class_name': 'FluxPipeline', '_diffusers_version': '0.32.0.dev0', '_name_or_path': 'black-forest-labs/FLUX.1-dev', ... ``` For binary files, you'll want to access the raw bytes using `as_mmap`. This returns bytes as a memory-mapping on the original file. The memory-mapping allows you to read only the bytes you need without loading everything in memory. For instance, here is how to load safetensors weights: ```python >>> import safetensors.torch >>> with dduf_entries["vae/diffusion_pytorch_model.safetensors"].as_mmap() as mm: ... state_dict = safetensors.torch.load(mm) # `mm` is a bytes object ``` > [!TIP] > `as_mmap` must be used in a context manager to benefit from the memory-mapping properties. ### How to write a DDUF file? Pass a folder path to `export_folder_as_dduf` to export a DDUF file. ```python # Export a folder as a DDUF file >>> from huggingface_hub import export_folder_as_dduf >>> export_folder_as_dduf("FLUX.1-dev.dduf", folder_path="path/to/FLUX.1-dev") ``` This tool scans the folder, adds the relevant entries and ensures the exported file is valid. If anything goes wrong during the process, a `DDUFExportError` is raised. For more flexibility, use [`export_entries_as_dduf`] to explicitly specify a list of files to include in the final DDUF file: ```python # Export specific files from the local disk. >>> from huggingface_hub import export_entries_as_dduf >>> export_entries_as_dduf( ... dduf_path="stable-diffusion-v1-4-FP16.dduf", ... entries=[ # List entries to add to the DDUF file (here, only FP16 weights) ... ("model_index.json", "path/to/model_index.json"), ... ("vae/config.json", "path/to/vae/config.json"), ... ("vae/diffusion_pytorch_model.fp16.safetensors", "path/to/vae/diffusion_pytorch_model.fp16.safetensors"), ... ("text_encoder/config.json", "path/to/text_encoder/config.json"), ... ("text_encoder/model.fp16.safetensors", "path/to/text_encoder/model.fp16.safetensors"), ... # ... add more entries here ... ] ... ) ``` `export_entries_as_dduf` works well if you've already saved your model on the disk. But what if you have a model loaded in memory and want to serialize it directly into a DDUF file? `export_entries_as_dduf` lets you do that by providing a Python `generator` that tells how to serialize the data iteratively: ```python (...) # Export state_dicts one by one from a loaded pipeline >>> def as_entries(pipe: DiffusionPipeline) -> Generator[Tuple[str, bytes], None, None]: ... # Build a generator that yields the entries to add to the DDUF file. ... # The first element of the tuple is the filename in the DDUF archive. The second element is the content of the file. ... 
# Entries will be evaluated lazily when the DDUF file is created (only 1 entry is loaded in memory at a time) ... yield "vae/config.json", pipe.vae.to_json_string().encode() ... yield "vae/diffusion_pytorch_model.safetensors", safetensors.torch.save(pipe.vae.state_dict()) ... yield "text_encoder/config.json", pipe.text_encoder.config.to_json_string().encode() ... yield "text_encoder/model.safetensors", safetensors.torch.save(pipe.text_encoder.state_dict()) ... # ... add more entries here >>> export_entries_as_dduf(dduf_path="my-cool-diffusion-model.dduf", entries=as_entries(pipe)) ``` ### Loading a DDUF file with Diffusers Diffusers has a built-in integration for DDUF files. Here is an example on how to load a pipeline from a stored checkpoint on the Hub: ```py from diffusers import DiffusionPipeline import torch pipe = DiffusionPipeline.from_pretrained( "DDUF/FLUX.1-dev-DDUF", dduf_file="FLUX.1-dev.dduf", torch_dtype=torch.bfloat16 ).to("cuda") image = pipe( "photo a cat holding a sign that says Diffusers", num_inference_steps=50, guidance_scale=3.5 ).images[0] image.save("cat.png") ``` ## F.A.Q. ### Why build on top of ZIP? ZIP provides several advantages: - Universally supported file format - No additional dependencies for reading - Built-in file indexing - Wide language support ### Why not use a TAR with a table of contents at the beginning of the archive? See the explanation in this [comment](https://github.com/huggingface/huggingface_hub/pull/2692#issuecomment-2519863726). ### Why no compression? - Enables direct memory mapping of large files - Ensures consistent and predictable remote file access - Prevents CPU overhead during file reading - Maintains compatibility with safetensors ### Can I modify a DDUF file? No. For now, DDUF files are designed to be immutable. To update a model, create a new DDUF file. ### Which frameworks/apps support DDUFs? - [Diffusers](https://github.com/huggingface/diffusers) We are constantly reaching out to other libraries and frameworks. If you are interested in adding support to your project, open a Discussion in the [DDUF org](https://huggingface.co/spaces/DDUF/README/discussions). ### File formats https://huggingface.co/docs/hub/datasets-polars-file-formats.md # File formats Polars supports the following file formats when reading from Hugging Face: - [Parquet](https://docs.pola.rs/api/python/stable/reference/api/polars.read_parquet.html) - [CSV](https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html) - [JSON Lines](https://docs.pola.rs/api/python/stable/reference/api/polars.read_ndjson.html) The examples below show the default settings only. Use the links above to view all available parameters in the API reference guide. # Parquet Parquet is the preferred file format as it stores the schema with type information within the file. This avoids any ambiguity with parsing and speeds up reading. 
To read a Parquet file in Polars, use the `read_parquet` function: ```python pl.read_parquet("hf://datasets/roneneldan/TinyStories/data/train-00000-of-00004-2d5a1467fff1081b.parquet") ``` # CSV The `read_csv` function can be used to read a CSV file: ```python pl.read_csv("hf://datasets/lhoestq/demo1/data/train.csv") ``` # JSON Polars supports reading newline-delimited JSON — also known as [json lines](https://jsonlines.org/) — with the `read_ndjson` function: ```python pl.read_ndjson("hf://datasets/proj-persona/PersonaHub/persona.jsonl") ``` ### The HF PRO subscription 🔥 https://huggingface.co/docs/hub/pro.md # The HF PRO subscription 🔥 The PRO subscription unlocks essential features for serious users, including: - Higher [storage capacity](./storage-limits) for public and private repositories - Higher bandwidth and API [rate limits](./rate-limits) - Included credits for [Inference Providers](/docs/inference-providers/) - Higher tier for [ZeroGPU Spaces](./spaces-zerogpu) usage - Ability to create ZeroGPU Spaces and use [Dev Mode](./spaces-dev-mode) - Ability to publish Social Posts and Community Blogs - Leverage the [Data Studio](./datasets-viewer) on private datasets - Run and schedule serverless [CPU/GPU Jobs](https://huggingface.co/docs/huggingface_hub/en/guides/jobs) View the full list of benefits at **https://huggingface.co/pro**, then subscribe at https://huggingface.co/subscribe/pro ### Spaces as MCP servers https://huggingface.co/docs/hub/spaces-mcp-servers.md # Spaces as MCP servers You can **turn any public Space that has a visible `MCP` badge into a callable tool** available in any MCP-compatible client. You can add as many Spaces as you want, without writing a single line of code. ## Setup your MCP Client From your [Hub MCP settings](https://huggingface.co/settings/mcp), select your MCP client (VSCode, Cursor, Claude Code, etc.), then follow the setup instructions. ![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/wWm_GeuWF17OrMyJT4tMx.png) > [!WARNING] > You need a valid Hugging Face token with READ permissions to use MCP tools. If you don't have one, create a new "Read" access token in your [token settings](https://huggingface.co/settings/tokens). ## Add an existing Space to your MCP tools ![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/ex9KRpvamn84ZaOlSp_Bj.png) 1. Browse compatible [Spaces](https://huggingface.co/spaces?filter=mcp-server) to find Spaces that are usable via MCP. You can also look for the grey **MCP** badge on any Spaces card. 2. Click the badge and choose **Add to MCP tools**, then confirm when asked. 3. The Space should be listed in your MCP Server settings in the Spaces Tools section. ![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/uI4PsneUZoWn_TExhNJyt.png) ## Use Spaces from your MCP client If your MCP client is configured correctly, the Spaces you added will be available instantly without changing anything (if a Space doesn't show up, restart your client and it should appear). Most MCP clients will list what tools are currently loaded so you can make sure the Space is available. > [!TIP] > For ZeroGPU Spaces, your quota is used when the tool is called. If you run out of quota, you can subscribe to PRO to get 25 minutes of daily quota (8x more than free users). For example, a PRO account lets you generate up to 600 images per day using FLUX.1-schnell.
## Build your own MCP-compatible Gradio Space To create your own MCP-enabled Space, you need to [create a new Gradio Space](https://huggingface.co/new-space?sdk=gradio), then enable MCP support in the code. Get started with [Gradio Spaces](https://huggingface.co/docs/hub/en/spaces-sdks-gradio) and check the [detailed MCP guide](https://www.gradio.app/guides/building-mcp-server-with-gradio) for more details. First, install Gradio with MCP support: ```bash pip install "gradio[mcp]" ``` Then create your app with clear type hints and docstrings: ```python import gradio as gr def letter_counter(word: str, letter: str) -> int: """Count occurrences of a letter in a word. Args: word: The word to search in letter: The letter to count Returns: Number of times the letter appears in the word """ return word.lower().count(letter.lower()) demo = gr.Interface(fn=letter_counter, inputs=["text", "text"], outputs="number") demo.launch(mcp_server=True) # exposes an MCP schema automatically ``` Push the app to a **Gradio Space** and it will automatically receive the **MCP** badge. Anyone can then add it as a tool with a single click. > [!TIP] > It's also quite easy to convert an existing Gradio Space to an MCP server. Duplicate it from the context menu, then add the `mcp_server=True` parameter to your `launch()` method and ensure your functions have clear type hints and docstrings - you can use AI tools to automate this quite easily. ## Be creative by mixing Spaces! As Hugging Face Spaces is the largest directory of AI apps, you can find many creative tools that can be used as MCP tools. Mixing and matching different Spaces can lead to powerful and creative workflows. This video demonstrates the use of Lightricks/ltx-video-distilled and ResembleAI/Chatterbox in Claude Code to generate a video with audio. ### Malware Scanning https://huggingface.co/docs/hub/security-malware.md # Malware Scanning We run every file of your repositories through a [malware scanner](https://www.clamav.net/). Scanning is triggered at each commit. Here is an [example view](https://huggingface.co/mcpotato/42-eicar-street/tree/main) of an infected file: > [!TIP] > If your file has neither an "ok" nor an "infected" badge, it could mean that it is either currently being scanned, waiting to be scanned, or that there was an error during the scan. It can take up to a few minutes to be scanned. If at least one file has been scanned as unsafe, a message will warn users: > [!TIP] > As the repository owner, we advise you to remove the suspicious file. The repository will then appear as safe again. ### Shiny on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-shiny.md # Shiny on Spaces [Shiny](https://shiny.posit.co/) is an open-source framework for building simple, beautiful, and performant data applications. The goal when developing Shiny was to build something simple enough to teach someone in an afternoon but extensible enough to power large, mission-critical applications. You can create a useful Shiny app in a few minutes, but if the scope of your project grows, you can be sure that Shiny can accommodate that application. The main feature that differentiates Shiny from other frameworks is its reactive execution model. When you write a Shiny app, the framework infers the relationship between inputs, outputs, and intermediary calculations and uses those relationships to render only the things that need to change as a result of a user's action.
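As a rough illustration of that model, here is a minimal Shiny for Python sketch (Shiny for Python is covered in more detail below); the input and output names are arbitrary, and a reasonably recent version of the `shiny` package is assumed:

```python
from shiny import App, render, ui

# One slider input and one text output that depends on it
app_ui = ui.page_fluid(
    ui.input_slider("n", "Pick a number", min=1, max=100, value=20),
    ui.output_text("doubled"),
)

def server(input, output, session):
    # Shiny records that this output reads input.n(), so it re-renders
    # only when the slider changes - no manual callbacks or caching needed.
    @render.text
    def doubled():
        return f"{input.n()} doubled is {input.n() * 2}"

app = App(app_ui, server)
```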
The result is that users can easily develop efficient, extensible applications without explicitly caching data or writing callback functions. ## Shiny for Python [Shiny for Python](https://shiny.rstudio.com/py/) is a pure Python implementation of Shiny. This gives you access to all of the great features of Shiny like reactivity, complex layouts, and modules without needing to use R. Shiny for Python is ideal for Hugging Face applications because it integrates smoothly with other Hugging Face tools. To get started deploying a Space, click this button to select your hardware and specify if you want a public or private Space. The Space template will populate a few files to get your app started. _app.py_ This file defines your app's logic. To learn more about how to modify this file, see [the Shiny for Python documentation](https://shiny.rstudio.com/py/docs/overview.html). As your app gets more complex, it's a good idea to break your application logic up into [modules](https://shiny.rstudio.com/py/docs/workflow-modules.html). _Dockerfile_ The Dockerfile for a Shiny for Python app is very minimal because the library doesn't have many system dependencies, but you may need to modify this file if your application has additional system dependencies. The one essential feature of this file is that it exposes and runs the app on the port specified in the space README file (which is 7860 by default). __requirements.txt__ The Space will automatically install dependencies listed in the requirements.txt file. Note that you must include shiny in this file. ## Shiny for R [Shiny for R](https://shiny.rstudio.com/) is a popular and well-established application framework in the R community and is a great choice if you want to host an R app on Hugging Face infrastructure or make use of some of the great [Shiny R extensions](https://github.com/nanxstats/awesome-shiny-extensions). To integrate Hugging Face tools into an R app, you can either use [httr2](https://httr2.r-lib.org/) to call Hugging Face APIs, or [reticulate](https://rstudio.github.io/reticulate/) to call one of the Hugging Face Python SDKs. To deploy an R Shiny Space, click this button and fill out the space metadata. This will populate the Space with all the files you need to get started. _app.R_ This file contains all of your application logic. If you prefer, you can break this file up into `ui.R` and `server.R`. _Dockerfile_ The Dockerfile builds off of the [rocker shiny](https://hub.docker.com/r/rocker/shiny) image. You'll need to modify this file to use additional packages. If you are using a lot of tidyverse packages we recommend switching the base image to [rocker/shinyverse](https://hub.docker.com/r/rocker/shiny-verse). You can install additional R packages by adding them under the `RUN install2.r` section of the dockerfile, and github packages can be installed by adding the repository under `RUN installGithub.r`. There are two main requirements for this Dockerfile: - First, the file must expose the port that you have listed in the README. The default is 7860 and we recommend not changing this port unless you have a reason to. - Second, for the moment you must use the development version of [httpuv](https://github.com/rstudio/httpuv) which resolves an issue with app timeouts on Hugging Face. ### Digital Object Identifier (DOI) https://huggingface.co/docs/hub/doi.md # Digital Object Identifier (DOI) The Hugging Face Hub offers the possibility to generate DOI for your models or datasets. 
DOIs (Digital Object Identifiers) are strings uniquely identifying a digital object, anything from articles to figures, including datasets and models. DOIs are tied to object metadata, including the object's URL, version, creation date, description, etc. They are a commonly accepted reference to digital resources across research and academic communities; they are analogous to a book's ISBN. ## How to generate a DOI? To do this, you must go to the settings of your model or dataset. In the DOI section, a button called "Generate DOI" should appear: To generate the DOI for this model or dataset, you need to click on this button and acknowledge that some features on the Hub will be restricted and some of your information (your full name) will be transferred to our partner DataCite. When generating a DOI, you can optionally personalize the author name list, allowing you to credit all contributors to your model or dataset. After you agree to those terms, your model or dataset will get a DOI assigned, and a new tag should appear in your model or dataset header allowing you to cite it. ## Can I regenerate a new DOI if my model or dataset changes? If ever there's a new version of a model or dataset, a new DOI can easily be assigned, and the previous version of the DOI is marked as outdated. This makes it easy to refer to a specific version of an object, even if it has changed. You just need to click on "Generate new DOI" and tadaam! 🎉 a new DOI is assigned for the current revision of your model or dataset. ## Why is there a 'locked by DOI' message on the delete, rename and change visibility actions on my model or dataset? DOIs make it easier to find information about a model or dataset and to share it with the world via a permanent link that will never expire or change. As such, datasets/models with DOIs are intended to persist perpetually and may only be deleted, renamed, or have their visibility changed by filing a request with our support (website at huggingface.co). ## Further Reading - [Introducing DOI: the Digital Object Identifier to Datasets and Models](https://huggingface.co/blog/introducing-doi) ### Third-party scanner: JFrog https://huggingface.co/docs/hub/security-jfrog.md # Third-party scanner: JFrog [JFrog](https://jfrog.com/)'s security scanner detects malicious behavior in machine learning models. ![JFrog report for the danger.dat file contained in mcpotato/42-eicar-street](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/jfrog-report.png) *Example of a report for [danger.dat](https://huggingface.co/mcpotato/42-eicar-street/blob/main/danger.dat)* We [partnered with JFrog](https://hf.co/blog/jfrog) to provide scanning in order to make the Hub safer. Model files are scanned by the JFrog scanner and we expose the scanning results on the Hub interface. JFrog's scanner is built with the goal of reducing false positives. Indeed, what we currently observe is that code contained within model weights is not always malicious. When code is detected in a file, JFrog's scanner will parse and analyze it to check for potentially malicious usage. Here is an example repository you can check out to see the feature in action: [mcpotato/42-eicar-street](https://huggingface.co/mcpotato/42-eicar-street). ## Model security refresher To share models, we serialize the data structures we use to interact with the models, in order to facilitate storage and transport.
Some serialization formats are vulnerable to nasty exploits, such as arbitrary code execution (looking at you pickle), making sharing models potentially dangerous. As Hugging Face has become a popular platform for model sharing, we’d like to protect the community from this, hence why we have developed tools like [picklescan](https://github.com/mmaitre314/picklescan) and why we integrate third party scanners. Pickle is not the only exploitable format out there, [see for reference](https://github.com/Azure/counterfit/wiki/Abusing-ML-model-file-formats-to-create-malware-on-AI-systems:-A-proof-of-concept) how one can exploit Keras Lambda layers to achieve arbitrary code execution. ### Using OpenCLIP at Hugging Face https://huggingface.co/docs/hub/open_clip.md # Using OpenCLIP at Hugging Face [OpenCLIP](https://github.com/mlfoundations/open_clip) is an open-source implementation of OpenAI's CLIP. ## Exploring OpenCLIP on the Hub You can find OpenCLIP models by filtering at the left of the [models page](https://huggingface.co/models?library=open_clip&sort=trending). OpenCLIP models hosted on the Hub have a model card with useful information about the models. Thanks to OpenCLIP Hugging Face Hub integration, you can load OpenCLIP models with a few lines of code. You can also deploy these models using [Inference Endpoints](https://huggingface.co/inference-endpoints). ## Installation To get started, you can follow the [OpenCLIP installation guide](https://github.com/mlfoundations/open_clip#usage). You can also use the following one-line install through pip: ``` $ pip install open_clip_torch ``` ## Using existing models All OpenCLIP models can easily be loaded from the Hub: ```py import open_clip model, preprocess = open_clip.create_model_from_pretrained('hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K') tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K') ``` Once loaded, you can encode the image and text to do [zero-shot image classification](https://huggingface.co/tasks/zero-shot-image-classification): ```py import torch from PIL import Image import requests url = 'http://images.cocodataset.org/val2017/000000039769.jpg' image = Image.open(requests.get(url, stream=True).raw) image = preprocess(image).unsqueeze(0) text = tokenizer(["a diagram", "a dog", "a cat"]) with torch.no_grad(), torch.cuda.amp.autocast(): image_features = model.encode_image(image) text_features = model.encode_text(text) image_features /= image_features.norm(dim=-1, keepdim=True) text_features /= text_features.norm(dim=-1, keepdim=True) text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1) print("Label probs:", text_probs) ``` It outputs the probability of each possible class: ```text Label probs: tensor([[0.0020, 0.0034, 0.9946]]) ``` If you want to load a specific OpenCLIP model, you can click `Use in OpenCLIP` in the model card and you will be given a working snippet! ## Additional resources * OpenCLIP [repository](https://github.com/mlfoundations/open_clip) * OpenCLIP [docs](https://github.com/mlfoundations/open_clip/tree/main/docs) * OpenCLIP [models in the Hub](https://huggingface.co/models?library=open_clip&sort=trending) ### Single Sign-On (SSO) https://huggingface.co/docs/hub/security-sso.md # Single Sign-On (SSO) The Hugging Face Hub gives you the ability to implement mandatory Single Sign-On (SSO) for your organization. We support both SAML 2.0 and OpenID Connect (OIDC) protocols. > [!WARNING] > This feature is part of the Team & Enterprise plans. 
For enhanced capabilities like automated user provisioning (JIT/SCIM) and global SSO enforcement, see our [Advanced SSO documentation](./enterprise-hub-advanced-sso). ## How does it work? When Single Sign-On is enabled, the members of your organization must authenticate through your Identity Provider (IdP) to access any content under the organization's namespace. Public content will still be available to users who are not members of the organization. **We use email addresses to identify SSO users. As a user, make sure that your organizational email address (e.g. your company email) has been added to [your user account](https://huggingface.co/settings/account).** When users log in, they will be prompted to complete the Single Sign-On authentication flow with a banner similar to the following: Single Sign-On only applies to your organization. Members may belong to other organizations on Hugging Face. We support [role mapping](#role-mapping) and [resource group mapping](#resource-group-mapping). Based on attributes provided by your Identity Provider, you can dynamically assign [roles](./organizations-security#access-control-in-organizations) to organization members, or give them access to [resource groups](./enterprise-hub-resource-groups) defined in your organization. ### Supported Identity Providers You can easily integrate Hugging Face Hub with a variety of Identity Providers, such as Okta, OneLogin or Azure Active Directory (Azure AD). Hugging Face Hub can work with any OIDC-compliant or SAML Identity Provider. ## How to configure OIDC/SAML provider in the Hub We provide guides for configuring SSO with several common providers, which you can also use as inspiration for others: - [How to configure SAML with Okta in the Hub](./security-sso-okta-saml) - [How to configure OIDC with Okta in the Hub](./security-sso-okta-oidc) - [How to configure SAML with Azure in the Hub](./security-sso-azure-saml) - [How to configure OIDC with Azure in the Hub](./security-sso-azure-oidc) - [How to configure SAML with Google Workspace in the Hub](./security-sso-google-saml) - [How to configure OIDC with Google Workspace in the Hub](./security-sso-google-oidc) ### Users Management #### Session Timeout This value sets the duration of the session for members of your organization. After this time, members will be prompted to re-authenticate with your Identity Provider to access the organization's resources. The default value is 7 days. #### Role Mapping When enabled, Role Mapping allows you to dynamically assign [roles](./organizations-security#access-control-in-organizations) to organization members based on data provided by your Identity Provider. This section allows you to define a mapping from your IdP's user profile data to the assigned role in Hugging Face. - IdP Role Attribute Mapping A JSON path to an attribute in your user's IdP profile data. - Role Mapping A mapping from the IdP attribute value to the assigned role in the Hugging Face organization. You must map at least one admin role. If there is no match, a user will be assigned the default role for your organization. The default role can be customized in the `Members` section of the organization's settings. Role synchronization is performed on login. #### Resource Group Mapping When enabled, Resource Group Mapping allows you to dynamically assign members to [resource groups](./enterprise-hub-resource-groups) in your organization, based on data provided by your Identity Provider. - IdP Attribute Path A JSON path to an attribute in your user's IdP profile data.
- Resource Group Mapping A mapping from the IdP attribute value to a resource group in your Hugging Face organization. If there is no match, the user will not be assigned to any resource group. #### Matching email domains When enabled, Matching email domains only allows organization members to complete SSO if the email provided by your identity provider matches one of their emails on Hugging Face. To add an email domain, fill out the 'Matching email domains' field, press Enter on your keyboard, and save. #### External Collaborators This enables certain users within your organization to access resources without completing the Single Sign-On (SSO) flow described before. This can be helpful when you work with external parties who aren't part of your organization's Identity Provider (IdP) but require access to specific resources. To add a user as an "External Collaborator", visit the `SSO/Users Management` section in your organization's settings. Once added, these users won't need to go through the SSO process. However, they will still be subject to your organization's access controls ([Resource Groups](./security-resource-groups)). It's crucial to manage their access carefully to maintain your organization's data security. ### TF-Keras (legacy) https://huggingface.co/docs/hub/tf-keras.md ## TF-Keras (legacy) `tf-keras` is the name given to the Keras 2.x version. It is now hosted as a separate GitHub repo [here](https://github.com/keras-team/tf-keras). Though it's a legacy framework, there are still [4.5k+ models](https://huggingface.co/models?library=tf-keras&sort=trending) hosted on the Hub. These models can be loaded using the `huggingface_hub` library. You **must** have either `tf-keras` or `keras` 2.x installed to be able to load them. You can also host your `tf-keras` model on the Hub. However, keep in mind that `tf-keras` is a legacy framework. To reach a maximum number of users, we recommend creating your model using Keras 3.x and sharing it natively as described above. For more details about uploading `tf-keras` models, check out the [`push_to_hub_keras` documentation](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/mixins#huggingface_hub.push_to_hub_keras). ```py from huggingface_hub import push_to_hub_keras push_to_hub_keras(model, "your-username/your-model-name", "your-tensorboard-log-directory", tags = ["object-detection", "some_other_tag"], **model_save_kwargs, ) ``` ## Additional resources - [GitHub repo](https://github.com/keras-team/tf-keras) - Blog post [Putting Keras on 🤗 Hub for Collaborative Training and Reproducibility](https://merveenoyan.medium.com/putting-keras-on-hub-for-collaborative-training-and-reproducibility-9018301de877) (April 2022) ### Tasks https://huggingface.co/docs/hub/models-tasks.md # Tasks ## What's a task? Tasks, or pipeline types, describe the "shape" of each model's API (inputs and outputs) and are used to determine which Inference API and widget we want to display for any given model. This classification is relatively coarse-grained (you can always add more fine-grained task names in your model tags), so **you should rarely have to create a new task**. If you want to add support for a new task, this document explains the required steps. ## Overview Having a new task integrated into the Hub means that: * Users can search for all models – and datasets – of a given task. * The Inference API supports the task. * Users can try out models directly with the widget. 🏆 Note that you don't need to implement all the steps by yourself.
Adding a new task is a community effort, and multiple people can contribute. 🧑‍🤝‍🧑 To begin the process, open a new issue in the [huggingface_hub](https://github.com/huggingface/huggingface_hub/issues) repository. Please use the "Adding a new task" template. ⚠️Before doing any coding, it's suggested to go over this document. ⚠️ The first step is to upload a model for your proposed task. Once you have a model in the Hub for the new task, the next step is to enable it in the Inference API. There are three types of support that you can choose from: * 🤗 using a `transformers` model * 🐳 using a model from an [officially supported library](./models-libraries) * 🖨️ using a model with custom inference code. This experimental option has downsides, so we recommend using one of the other approaches. Finally, you can add a couple of UI elements, such as the task icon and the widget, that complete the integration in the Hub. 📷 Some steps are orthogonal; you don't need to do them in order. **You don't need the Inference API to add the icon.** This means that, even if there isn't full integration yet, users can still search for models of a given task. ## Adding new tasks to the Hub ### Using Hugging Face transformers library If your model is a `transformers`-based model, there is a 1:1 mapping between the Inference API task and a `pipeline` class. Here are some example PRs from the `transformers` library: * [Adding ImageClassificationPipeline](https://github.com/huggingface/transformers/pull/11598) * [Adding AudioClassificationPipeline](https://github.com/huggingface/transformers/pull/13342) Once the pipeline is submitted and deployed, you should be able to use the Inference API for your model. ### Using Community Inference API with a supported library The Hub also supports over 10 open-source libraries in the [Community Inference API](https://github.com/huggingface/api-inference-community). **Adding a new task is relatively straightforward and requires 2 PRs:** * PR 1: Add the new task to the API [validation](https://github.com/huggingface/api-inference-community/blob/main/api_inference_community/validation.py). This code ensures that the inference input is valid for a given task. Some PR examples: * [Add text-to-image](https://github.com/huggingface/huggingface_hub/commit/5f040a117cf2a44d704621012eb41c01b103cfca#diff-db8bbac95c077540d79900384cfd524d451e629275cbb5de7a31fc1cd5d6c189) * [Add audio-classification](https://github.com/huggingface/huggingface_hub/commit/141e30588a2031d4d5798eaa2c1250d1d1b75905#diff-db8bbac95c077540d79900384cfd524d451e629275cbb5de7a31fc1cd5d6c189) * [Add tabular-classification](https://github.com/huggingface/huggingface_hub/commit/dbea604a45df163d3f0b4b1d897e4b0fb951c650#diff-db8bbac95c077540d79900384cfd524d451e629275cbb5de7a31fc1cd5d6c189) * PR 2: Add the new task to a library docker image. You should also add a template to [`docker_images/common/app/pipelines`](https://github.com/huggingface/api-inference-community/tree/main/docker_images/common/app/pipelines) to facilitate integrating the task in other libraries. Here is an example PR: * [Add text-classification to spaCy](https://github.com/huggingface/huggingface_hub/commit/6926fd9bec23cb963ce3f58ec53496083997f0fa#diff-3f1083a92ca0047b50f9ad2d04f0fe8dfaeee0e26ab71eb8835e365359a1d0dc) ### Adding Community Inference API for a quick prototype **My model is not supported by any library. Am I doomed? 😱** We recommend using [Hugging Face Spaces](./spaces) for these use cases. 
### UI elements The Hub allows users to filter models by a given task. To do this, you need to add the task to several places. You'll also get to pick an icon for the task! 1. Add the task type to `Types.ts` In [huggingface.js/packages/tasks/src/pipelines.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/pipelines.ts), you need to do a couple of things: * Add the type to `PIPELINE_DATA`. Note that pipeline types are sorted into different categories (NLP, Audio, Computer Vision, and others). * You will also need to make minor changes in [huggingface.js/packages/tasks/src/tasks/index.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/tasks/index.ts) 2. Choose an icon You can add an icon in the [lib/Icons](https://github.com/huggingface/huggingface.js/tree/main/packages/widgets/src/lib/components/Icons) directory. We usually choose Carbon icons from https://icones.js.org/collection/carbon. Also add the icon to [PipelineIcon](https://github.com/huggingface/huggingface.js/blob/main/packages/widgets/src/lib/components/PipelineIcon/PipelineIcon.svelte). ### Widget Once the task is in production, what could be more exciting than implementing some way for users to play directly with the models in their browser? 🤩 You can find all the widgets [here](https://huggingface.co/spaces/huggingfacejs/inference-widgets). If you are interested in contributing a widget, you can look at the [implementation](https://github.com/huggingface/huggingface.js/tree/main/packages/widgets) of all the widgets. ### Models Download Stats https://huggingface.co/docs/hub/models-download-stats.md # Models Download Stats ## How are downloads counted for models? Counting the number of downloads for models is not a trivial task, as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models) and different formats depending on the library (GGUF, PyTorch, TensorFlow, etc.). To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files for download counting. No information is sent from the user, and no additional calls are made for this. The count is done server-side as the Hub serves files for downloads. Every HTTP request to these files, including `GET` and `HEAD`, will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` or `adapter_config.json`. ## Which are the query files for different libraries? By default, the Hub looks at `config.json`, `config.yaml`, `hyperparams.yaml`, `params.json`, and `meta.yaml`. Some libraries override these defaults by specifying their own filter (specifying `countDownloads`). The code that defines these overrides is [open-source](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts). For example, for the `nemo` library, all files with the `.nemo` extension are used to count downloads. ## Can I add my query files for my library? Yes, you can open a Pull Request [here](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts). Here is a minimal [example](https://github.com/huggingface/huggingface.js/pull/885/files) adding download metrics for VFIMamba.
Check out the [integration guide](./models-adding-libraries#register-your-library) for more details. ## How are `GGUF` files handled? GGUF files are self-contained and are not tied to a single library, so all of them are counted for downloads. This will double count downloads in the case a user performs cloning of a whole repository, but most users and interfaces download a single GGUF file for a given repo. ## How is `diffusers` handled? The `diffusers` library is an edge case and has its filter configured in the internal codebase. The filter ensures repos tagged as `diffusers` count both files loaded via the library as well as through UIs that require users to manually download the top-level safetensors. ``` filter: [ { bool: { /// Include documents that match at least one of the following rules should: [ /// Downloaded from diffusers lib { term: { path: "model_index.json" }, }, /// Direct downloads (LoRa, Auto1111 and others) /// Filter out nested safetensors and pickle weights to avoid double counting downloads from the diffusers lib { regexp: { path: "[^/]*\\.safetensors" }, }, { regexp: { path: "[^/]*\\.ckpt" }, }, { regexp: { path: "[^/]*\\.bin" }, }, ], minimum_should_match: 1, }, }, ] } ``` ### Licenses https://huggingface.co/docs/hub/repositories-licenses.md # Licenses You are able to add a license to any repo that you create on the Hugging Face Hub to let other users know about the permissions that you want to attribute to your code or data. The license can be specified in your repository's `README.md` file, known as a _card_ on the Hub, in the card's metadata section. Remember to seek out and respect a project's license if you're considering using their code or data. A full list of the available licenses is available here: | Fullname | License identifier (to use in repo card) | | -------------------------------------------------------------- | ---------------------------------------- | | Apache license 2.0 | `apache-2.0` | | MIT | `mit` | | OpenRAIL license family | `openrail` | | BigScience OpenRAIL-M | `bigscience-openrail-m` | | CreativeML OpenRAIL-M | `creativeml-openrail-m` | | BigScience BLOOM RAIL 1.0 | `bigscience-bloom-rail-1.0` | | BigCode Open RAIL-M v1 | `bigcode-openrail-m` | | Academic Free License v3.0 | `afl-3.0` | | Artistic license 2.0 | `artistic-2.0` | | Boost Software License 1.0 | `bsl-1.0` | | BSD license family | `bsd` | | BSD 2-clause "Simplified" license | `bsd-2-clause` | | BSD 3-clause "New" or "Revised" license | `bsd-3-clause` | | BSD 3-clause Clear license | `bsd-3-clause-clear` | | Computational Use of Data Agreement | `c-uda` | | Creative Commons license family | `cc` | | Creative Commons Zero v1.0 Universal | `cc0-1.0` | | Creative Commons Attribution 2.0 | `cc-by-2.0` | | Creative Commons Attribution 2.5 | `cc-by-2.5` | | Creative Commons Attribution 3.0 | `cc-by-3.0` | | Creative Commons Attribution 4.0 | `cc-by-4.0` | | Creative Commons Attribution Share Alike 3.0 | `cc-by-sa-3.0` | | Creative Commons Attribution Share Alike 4.0 | `cc-by-sa-4.0` | | Creative Commons Attribution Non Commercial 2.0 | `cc-by-nc-2.0` | | Creative Commons Attribution Non Commercial 3.0 | `cc-by-nc-3.0` | | Creative Commons Attribution Non Commercial 4.0 | `cc-by-nc-4.0` | | Creative Commons Attribution No Derivatives 4.0 | `cc-by-nd-4.0` | | Creative Commons Attribution Non Commercial No Derivatives 3.0 | `cc-by-nc-nd-3.0` | | Creative Commons Attribution Non Commercial No Derivatives 4.0 | `cc-by-nc-nd-4.0` | | Creative Commons Attribution Non Commercial Share 
Alike 2.0 | `cc-by-nc-sa-2.0` | | Creative Commons Attribution Non Commercial Share Alike 3.0 | `cc-by-nc-sa-3.0` | | Creative Commons Attribution Non Commercial Share Alike 4.0 | `cc-by-nc-sa-4.0` | | Community Data License Agreement – Sharing, Version 1.0 | `cdla-sharing-1.0` | | Community Data License Agreement – Permissive, Version 1.0 | `cdla-permissive-1.0` | | Community Data License Agreement – Permissive, Version 2.0 | `cdla-permissive-2.0` | | Do What The F\*ck You Want To Public License | `wtfpl` | | Educational Community License v2.0 | `ecl-2.0` | | Eclipse Public License 1.0 | `epl-1.0` | | Eclipse Public License 2.0 | `epl-2.0` | | Etalab Open License 2.0 | `etalab-2.0` | | European Union Public License 1.1 | `eupl-1.1` | | European Union Public License 1.2 | `eupl-1.2` | | GNU Affero General Public License v3.0 | `agpl-3.0` | | GNU Free Documentation License family | `gfdl` | | GNU General Public License family | `gpl` | | GNU General Public License v2.0 | `gpl-2.0` | | GNU General Public License v3.0 | `gpl-3.0` | | GNU Lesser General Public License family | `lgpl` | | GNU Lesser General Public License v2.1 | `lgpl-2.1` | | GNU Lesser General Public License v3.0 | `lgpl-3.0` | | ISC | `isc` | | H Research License | `h-research` | | Intel Research Use License Agreement | `intel-research` | | LaTeX Project Public License v1.3c | `lppl-1.3c` | | Microsoft Public License | `ms-pl` | | Apple Sample Code license | `apple-ascl` | | Apple Model License for Research | `apple-amlr` | | Mozilla Public License 2.0 | `mpl-2.0` | | Open Data Commons License Attribution family | `odc-by` | | Open Database License family | `odbl` | | Open Model, Data & Weights License Agreement | `openmdw-1.0` | | Open Rail++-M License | `openrail++` | | Open Software License 3.0 | `osl-3.0` | | PostgreSQL License | `postgresql` | | SIL Open Font License 1.1 | `ofl-1.1` | | University of Illinois/NCSA Open Source License | `ncsa` | | The Unlicense | `unlicense` | | zLib License | `zlib` | | Open Data Commons Public Domain Dedication and License | `pddl` | | Lesser General Public License For Linguistic Resources | `lgpl-lr` | | DeepFloyd IF Research License Agreement | `deepfloyd-if-license` | | FAIR Noncommercial Research License | `fair-noncommercial-research-license` | | Llama 2 Community License Agreement | `llama2` | | Llama 3 Community License Agreement | `llama3` | | Llama 3.1 Community License Agreement | `llama3.1` | | Llama 3.2 Community License Agreement | `llama3.2` | | Llama 3.3 Community License Agreement | `llama3.3` | | Llama 4 Community License Agreement | `llama4` | | Grok 2 Community License Agreement | `grok2-community` | | Gemma Terms of Use | `gemma` | | Unknown | `unknown` | | Other | `other` | In case of `license: other` please add the license's text to a `LICENSE` file inside your repo (or contact us to add the license you use to this list), and set a name for it in `license_name`. ### How to configure SAML SSO with Azure https://huggingface.co/docs/hub/security-sso-azure-saml.md # How to configure SAML SSO with Azure In this guide, we will use Azure as the SSO provider and with the Security Assertion Markup Language (SAML) protocol as our preferred identity protocol. We currently support SP-initiated and IdP-initiated authentication. User provisioning is part of Enterprise Plus's [Advanced SSO](./enterprise-hub-advanced-sso). > [!WARNING] > This feature is part of the Team & Enterprise plans. 
### Step 1: Create a new application in your Identity Provider Open a new tab/window in your browser and sign in to the Azure portal of your organization. Navigate to "Enterprise applications" and click the "New application" button. You'll be redirected to this page; click on "Create your own application", fill in the name of your application, and then "Create" the application. Then select "Single Sign-On", and select SAML. ### Step 2: Configure your application on Azure Open a new tab/window in your browser and navigate to the SSO section of your organization's settings. Select the SAML protocol. Copy the "SP Entity Id" from the organization's settings on Hugging Face, and paste it in the "Identifier (Entity Id)" field on Azure (1). Copy the "Assertion Consumer Service URL" from the organization's settings on Hugging Face, and paste it in the "Reply URL" field on Azure (2). The URL looks like this: `https://huggingface.co/organizations/[organizationIdentifier]/saml/consume`. Then under "SAML Certificates", verify that "Signing Option" is set to "Sign SAML response and assertion". Save your new application. ### Step 3: Finalize configuration on Hugging Face In your Azure application, under "Set up", find the following field: - Login Url And under "SAML Certificates": - Download the "Certificate (base64)" You will need them to finalize the SSO setup on Hugging Face. In the SSO section of your organization's settings, copy-paste these values from Azure: - Login Url -> Sign-on URL - Certificate -> Public certificate The public certificate must have the following format: ``` -----BEGIN CERTIFICATE----- {certificate} -----END CERTIFICATE----- ``` You can now click on "Update and Test SAML configuration" to save the settings. You should be redirected to your SSO provider (IdP) login prompt. Once logged in, you'll be redirected to your organization's settings page. A green check mark near the SAML selector will attest that the test was successful. ### Step 4: Enable SSO in your organization Now that Single Sign-On is configured and tested, you can enable it for members of your organization by clicking on the "Enable" button. Once enabled, members of your organization must complete the SSO authentication flow described in [How does it work?](./security-sso#how-does-it-work). ### Spaces Overview https://huggingface.co/docs/hub/spaces-overview.md # Spaces Overview Hugging Face Spaces make it easy for you to create and deploy ML-powered demos in minutes. Watch the following video for a quick introduction to Spaces: In the following sections, you'll learn the basics of creating a Space, configuring it, and deploying your code to it. ## Creating a new Space **To make a new Space**, visit the [Spaces main page](https://huggingface.co/spaces) and click on **Create new Space**. Along with choosing a name for your Space, selecting an optional license, and setting your Space's visibility, you'll be prompted to choose the **SDK** for your Space. The Hub offers three SDK options: Gradio, Docker and static HTML. If you select "Gradio" as your SDK, you'll be taken to a new repo showing the following page: Under the hood, Spaces stores your code inside a git repository, just like the model and dataset repositories. Thanks to this, the same tools we use for all the [other repositories on the Hub](./repositories) (`git` and `git-xet`) also work for Spaces. Follow the same flow as in [Getting Started with Repositories](./repositories-getting-started) to add files to your Space.
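If you prefer working from Python instead of `git`, the `huggingface_hub` library can also commit files to a Space repository. Here is a minimal sketch; the Space id and file name are placeholders for your own:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already authenticated (e.g. via `hf auth login`)

# Commit a local file to an existing Space repository
# ("your-username/your-space" and "app.py" are placeholders).
api.upload_file(
    path_or_fileobj="app.py",
    path_in_repo="app.py",
    repo_id="your-username/your-space",
    repo_type="space",
)
```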
Each time a new commit is pushed, the Space will automatically rebuild and restart. For step-by-step tutorials to creating your first Space, see the guides below: * [Creating a Gradio Space](./spaces-sdks-gradio) * [Creating a Docker Space](./spaces-sdks-docker-first-demo) ## Hardware resources Each Spaces environment is limited to 16GB RAM, 2 CPU cores and 50GB of (not persistent) disk space by default, which you can use free of charge. You can upgrade to better hardware, including a variety of GPU accelerators and persistent storage, for a [competitive price](https://huggingface.co/pricing#spaces). To request an upgrade, please click the _Settings_ button in your Space and select your preferred hardware environment. | **Hardware** | **GPU Memory** | **CPU** | **Memory** | **Disk** | **Hourly Price** | |--------------------- |---------------- |---------- |------------ |---------- | ---------------- | | CPU Basic | - | 2 vCPU | 16 GB | 50 GB | Free! | | CPU Upgrade | - | 8 vCPU | 32 GB | 50 GB | $0.03 | | Nvidia T4 - small | 16GB | 4 vCPU | 15 GB | 50 GB | $0.60 | | Nvidia T4 - medium | 16GB | 8 vCPU | 30 GB | 100 GB | $0.90 | | Nvidia A10G - small | 24GB | 4 vCPU | 15 GB | 110 GB | $1.05 | | Nvidia A10G - large | 24GB | 12 vCPU | 46 GB | 200 GB | $3.15 | | 2x Nvidia A10G - large| 48GB | 24 vCPU | 92 GB | 1000 GB | $5.70 | | 4x Nvidia A10G - large| 96GB | 48 vCPU | 184 GB | 2000 GB | $10.80 | | Nvidia A100 - large | 40GB | 12 vCPU | 142 GB | 1000 GB | $4.13 | | **Storage tier** | **Size** | **Persistent** | **Monthly price** | |--------------------- |---------------------- |------------------ | --------------------- | | Ephemeral (default) | 50GB | No | Free! | | Small | Ephemeral + 20GB | Yes | $5 | | Medium | Ephemeral + 150GB | Yes | $25 | | Large | Ephemeral + 1TB | yes | $100 | Note: Find more detailed and comprehensive pricing information on [our pricing page](https://huggingface.co/pricing). Do you have an awesome Space but need help covering the hardware upgrade costs? We love helping out those with an innovative Space so please feel free to apply for a community GPU grant using the link in the _Settings_ tab of your Space and see if yours makes the cut! Read more in our dedicated sections on [Spaces GPU Upgrades](./spaces-gpus) and [Spaces Storage Upgrades](./spaces-storage). ## Managing secrets and environment variables[[managing-secrets]] If your app requires environment variables (for instance, secret keys or tokens), do not hard-code them inside your app! Instead, go to the Settings page of your Space repository and add a new **variable** or **secret**. Use variables if you need to store non-sensitive configuration values and secrets for storing access tokens, API keys, or any sensitive value or credentials. You can use: * **Variables** if you need to store non-sensitive configuration values. They are publicly accessible and viewable and will be automatically added to Spaces duplicated from yours. * **Secrets** to store access tokens, API keys, or any sensitive values or credentials. They are private and their value cannot be read from the Space's settings page once set. They won't be added to Spaces duplicated from your repository. 
Accessing secrets and variables is different depending on your Space SDK: - For Static Spaces, both are available through client-side JavaScript in `window.huggingface.variables` - For Docker Spaces, check out [environment management with Docker](./spaces-sdks-docker#secrets-and-variables-management) For other Spaces, both are exposed to your app as environment variables. Here is a very simple example of accessing the previously declared `MODEL_REPO_ID` variable in Python (it would be the same for secrets): ```py import os print(os.getenv('MODEL_REPO_ID')) ``` Spaces owners are warned when our `Spaces Secrets Scanner` [finds hard-coded secrets](./security-secrets). ## Duplicating a Space Duplicating a Space can be useful if you want to build a new demo using another demo as an initial template. Duplicated Spaces can also be useful if you want to have an individual Upgraded Space for your use with fast inference. If you want to duplicate a Space, you can click the three dots at the top right of the space and click **Duplicate this Space**. Once you do this, you will be able to change the following attributes: * Owner: The duplicated Space can be under your account or any organization in which you have write access * Space name * Visibility: The Space is private by default. Read more about private repositories [here](./repositories-settings#private-repositories). * Hardware: You can choose the hardware on which the Space will be running. Read more about hardware upgrades [here](./spaces-gpus). * Storage: If the original repo uses persistent storage, you will be prompted to choose a storage tier. Read more about persistent storage [here](./spaces-storage). * Secrets and variables: If the original repo has set some secrets and variables, you'll be able to set them while duplicating the repo. Some Spaces might have environment variables that you may need to set up. In these cases, the duplicate workflow will auto-populate the public Variables from the source Space, and give you a warning about setting up the Secrets. The duplicated Space will use a free CPU hardware by default, but you can later upgrade if needed. ## Networking If your Space needs to make any network requests, you can make requests through the standard HTTP and HTTPS ports (80 and 443) along with port 8080. Any requests going to other ports will be blocked. ## Lifecycle management On free hardware, your Space will "go to sleep" and stop executing after a period of time if unused. If you wish for your Space to run indefinitely, consider [upgrading to paid hardware](./spaces-gpus). You can also manually pause your Space from the **Settings** tab. A paused Space stops executing until manually restarted by its owner. Paused time is not billed. ## Helper environment variables In some cases, you might be interested in having programmatic access to the Space author or repository name. This feature is particularly useful when you expect users to duplicate your Space. To help with this, Spaces exposes different environment variables at runtime. Given a Space [`osanseviero/i-like-flan`](https://huggingface.co/spaces/osanseviero/i-like-flan): * `CPU_CORES`: 4 * `MEMORY`: 15Gi * `SPACE_AUTHOR_NAME`: osanseviero * `SPACE_REPO_NAME`: i-like-flan * `SPACE_TITLE`: I Like Flan (specified in the README file) * `SPACE_ID`: `osanseviero/i-like-flan` * `SPACE_HOST`: `osanseviero-i-like-flan.hf.space` * `SPACE_CREATOR_USER_ID`: `6032802e1f993496bc14d9e3` - This is the ID of the user that originally created the Space. 
It's useful if the Space is under an organization. You can get the user information with an API call to `https://huggingface.co/api/users/{SPACE_CREATOR_USER_ID}/overview`. In case [OAuth](./spaces-oauth) is enabled for your Space, the following variables will also be available: * `OAUTH_CLIENT_ID`: the client ID of your OAuth app (public) * `OAUTH_CLIENT_SECRET`: the client secret of your OAuth app * `OAUTH_SCOPES`: scopes accessible by your OAuth app. Currently, this is always `"openid profile"`. * `OPENID_PROVIDER_URL`: The URL of the OpenID provider. The OpenID metadata will be available at [`{OPENID_PROVIDER_URL}/.well-known/openid-configuration`](https://huggingface.co/.well-known/openid-configuration). ## Clone the Repository You can easily clone your Space repo locally. Start by clicking on the dropdown menu in the top right of your Space page: Select "Clone repository", and then you'll be able to follow the instructions to clone the Space repo to your local machine using HTTPS or SSH. ## Linking Models and Datasets on the Hub You can showcase all the models and datasets that your Space links to by adding their identifier in your Space's README metadata. To do so, you can define them under the `models` and `datasets` keys. In addition to listing the artefacts in the README file, you can also record them in any `.py`, `.ini` or `.html` file. We'll parse it auto-magically! Here's an example linking two models from a Space: ``` title: My lovely space emoji: 🤗 colorFrom: blue colorTo: green sdk: docker pinned: false models: - reach-vb/musicgen-large-fp16-endpoint - reach-vb/wav2vec2-large-xls-r-1B-common_voice7-lt-ft ``` ### Integrate your library with the Hub https://huggingface.co/docs/hub/models-adding-libraries.md # Integrate your library with the Hub The Hugging Face Hub aims to facilitate sharing machine learning models, checkpoints, and artifacts. This endeavor includes integrating the Hub into many of the amazing third-party libraries in the community. Some of the ones already integrated include [spaCy](https://spacy.io/usage/projects#huggingface_hub), [Sentence Transformers](https://sbert.net/), [OpenCLIP](https://github.com/mlfoundations/open_clip), and [timm](https://huggingface.co/docs/timm/index), among many others. Integration means users can download files from and upload files to the Hub directly from your library. We hope you will integrate your library and join us in democratizing artificial intelligence for everyone. Integrating the Hub with your library provides many benefits, including: - Free model hosting for you and your users. - Built-in file versioning - even for huge files - made possible by [Git-Xet](./xet/using-xet-storage#git-xet). - Community features (discussions, pull requests, likes). - Usage metrics for all models run with your library. This tutorial will help you integrate the Hub into your library so your users can benefit from all the features offered by the Hub. Before you begin, we recommend you create a [Hugging Face account](https://huggingface.co/join) from which you can manage your repositories and files. If you need help with the integration, feel free to open an [issue](https://github.com/huggingface/huggingface_hub/issues/new/choose), and we would be more than happy to help you. ## Implementation Implementing an integration of a library with the Hub often means providing built-in methods to load models from the Hub and to allow users to push new models to the Hub.
This section will cover the basics of how to do that using the `huggingface_hub` library. For more in-depth guidance, check out [this guide](https://huggingface.co/docs/huggingface_hub/guides/integrations). ### Installation To integrate your library with the Hub, you will need to add the `huggingface_hub` library as a dependency: ```bash pip install huggingface_hub ``` For more details about `huggingface_hub` installation, check out [this guide](https://huggingface.co/docs/huggingface_hub/installation). > [!TIP] > In this guide, we will focus on Python libraries. If you've implemented your library in JavaScript, you can use [`@huggingface/hub`](https://www.npmjs.com/package/@huggingface/hub) instead. The rest of the logic (i.e. hosting files, code samples, etc.) does not depend on the code language. > > ``` > npm add @huggingface/hub > ``` Users will need to authenticate once they have successfully installed the `huggingface_hub` library. The easiest way to authenticate is to save the token on the machine. Users can do that from the terminal with the `hf auth login` command: ``` hf auth login ``` The command tells them if they are already logged in and prompts them for their token. The token is then validated and saved in their `HF_HOME` directory (defaults to `~/.cache/huggingface/token`). Any script or library interacting with the Hub will use this token when sending requests. Alternatively, users can programmatically log in using `login()` in a notebook or a script: ```py from huggingface_hub import login login() ``` Authentication is optional when downloading files from public repos on the Hub. ### Download files from the Hub Integrations allow users to download a model from the Hub and instantiate it directly from your library. This is often made possible by providing a method (usually called `from_pretrained` or `load_from_hf`) that is specific to your library. To instantiate a model from the Hub, your library has to: - download files from the Hub. This is what we will discuss now. - instantiate the Python model from these files. Use the [`hf_hub_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.hf_hub_download) method to download files from a repository on the Hub. Downloaded files are stored in the cache: `~/.cache/huggingface/hub`. Users won't have to re-download the file the next time they use it, which saves a lot of time for large files. Furthermore, if the repository is updated with a new version of the file, `huggingface_hub` will automatically download the latest version and store it in the cache. Users don't have to worry about updating their files manually. For example, download the `config.json` file from the [lysandre/arxiv-nlp](https://huggingface.co/lysandre/arxiv-nlp) repository: ```python >>> from huggingface_hub import hf_hub_download >>> config_path = hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json") >>> config_path '/home/lysandre/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade/config.json' ``` `config_path` now contains a path to the downloaded file. You are guaranteed that the file exists and is up-to-date. If your library needs to download an entire repository, use [`snapshot_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.snapshot_download). It will take care of downloading all the files in parallel.
The return value is a path to the directory containing the downloaded files. ```py >>> from huggingface_hub import snapshot_download >>> snapshot_download(repo_id="lysandre/arxiv-nlp") '/home/lysandre/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade' ``` Many options exist to download files from a specific revision, to filter which files to download, to provide a custom cache directory, to download to a local directory, etc. Check out the [download guide](https://huggingface.co/docs/huggingface_hub/en/guides/download) for more details. ### Upload files to the Hub You might also want to provide a method so that users can push their own models to the Hub. This allows the community to build an ecosystem of models compatible with your library. The `huggingface_hub` library offers methods to create repositories and upload files: - `create_repo` creates a repository on the Hub. - `upload_file` and `upload_folder` upload files to a repository on the Hub. The `create_repo` method creates a repository on the Hub. Use the `repo_id` parameter to provide a name for your repository: ```python >>> from huggingface_hub import create_repo >>> create_repo(repo_id="test-model") 'https://huggingface.co/lysandre/test-model' ``` When you check your Hugging Face account, you should now see a `test-model` repository under your namespace. The [`upload_file`](https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api#huggingface_hub.HfApi.upload_file) method uploads a file to the Hub. This method requires the following: - A path to the file to upload. - The final path in the repository. - The repository you wish to push the files to. For example: ```python >>> from huggingface_hub import upload_file >>> upload_file( ... path_or_fileobj="/home/lysandre/dummy-test/README.md", ... path_in_repo="README.md", ... repo_id="lysandre/test-model" ... ) 'https://huggingface.co/lysandre/test-model/blob/main/README.md' ``` If you check your Hugging Face account, you should see the file inside your repository. Usually, a library will serialize the model to a local directory and then upload the entire folder to the Hub at once. This can be done using [`upload_folder`](https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api#huggingface_hub.HfApi.upload_folder): ```py >>> from huggingface_hub import upload_folder >>> upload_folder( ... folder_path="/home/lysandre/dummy-test", ... repo_id="lysandre/test-model", ... ) ``` For more details about how to upload files, check out the [upload guide](https://huggingface.co/docs/huggingface_hub/en/guides/upload). ## Model cards Model cards are files that accompany the models and provide handy information. Under the hood, model cards are simple Markdown files with additional metadata. Model cards are essential for discoverability, reproducibility, and sharing! You can find a model card as the README.md file in any model repo. See the [model cards guide](./model-cards) for more details about how to create a good model card. If your library allows pushing a model to the Hub, it is recommended to generate a minimal model card with prefilled metadata (typically `library_name`, `pipeline_tag` or `tags`) and information on how the model has been trained. This will help provide a standardized description for all models built with your library. ## Register your library Well done! You should now have a library able to load a model from the Hub and, optionally, push new models to it.
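Before moving on: the minimal model card described in the previous section can also be prefilled programmatically as part of your push flow. Below is a brief sketch using `huggingface_hub`'s `ModelCard` and `ModelCardData` utilities; the metadata values (library name, tags, pipeline tag) are illustrative examples only:

```python
from huggingface_hub import ModelCard, ModelCardData

# Illustrative metadata; adapt the values to your library and model.
card_data = ModelCardData(
    library_name="my-library",          # hypothetical library id
    pipeline_tag="text-classification", # example task
    tags=["my-library"],
    license="apache-2.0",
)

content = f"""---
{card_data.to_yaml()}
---

# My model

Trained with my-library. Add dataset, hyperparameters, and evaluation details here.
"""

card = ModelCard(content)
card.save("README.md")                   # write it next to the serialized model
# card.push_to_hub("username/my-model")  # or push it directly to an existing repo
```

Calling something like `card.save()` from your library's push method keeps every uploaded model documented by default.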
The next step is to make sure that your models on the Hub are well-documented and integrated with the platform. To do so, libraries can be registered on the Hub, which comes with a few benefits for the users: - a pretty label can be shown on the model page (e.g. `KerasNLP` instead of `keras-nlp`) - a link to your library repository and documentation is added to each model page - a custom download count rule can be defined - code snippets can be generated to show how to load the model using your library To register a new library, please open a Pull Request [here](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts) following the instructions below: - The library id should be lowercased and hyphen-separated (example: `"adapter-transformers"`). Make sure to preserve alphabetical order when opening the PR. - set `repoName` and `prettyLabel` with user-friendly casing (example: `DeepForest`). - set `repoUrl` with a link to the library source code (usually a GitHub repository). - (optional) set `docsUrl` with a link to the docs of the library. If the documentation is in the GitHub repo referenced above, no need to set it twice. - set `filter` to `false`. - (optional) define how downloads must be counted by setting `countDownload`. Downloads can be tracked by file extensions or filenames. Make sure to not duplicate the counting. For instance, if loading a model requires 3 files, the download count rule must count downloads only on 1 of the 3 files. Otherwise, the download count will be overestimated. **Note:** if the library uses one of the default config files (`config.json`, `config.yaml`, `hyperparams.yaml`, `params.json`, and `meta.yaml`, see [here](https://huggingface.co/docs/hub/models-download-stats#which-are-the-query-files-for-different-libraries)), there is no need to manually define a download count rule. - (optional) define `snippets` to let the user know how they can quickly instantiate a model. More details below. Before opening the PR, make sure that at least one model is referenced on https://huggingface.co/models?other=my-library-name. If not, the model card metadata of the relevant models must be updated with `library_name: my-library-name` (see [example](https://huggingface.co/google/gemma-scope/blob/main/README.md?code=true#L3)). If you are not the owner of the models on the Hub, please open PRs (see [example](https://huggingface.co/MCG-NJU/VFIMamba/discussions/1)). Here is a minimal [example](https://github.com/huggingface/huggingface.js/pull/885/files) adding integration for VFIMamba. ### Code snippets We recommend adding a code snippet to explain how to use a model in your downstream library. To add a code snippet, you should update the [model-libraries-snippets.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts) file with instructions for your model. For example, the [Asteroid](https://huggingface.co/asteroid-team) integration includes a brief code snippet for how to load and use an Asteroid model: ```typescript const asteroid = (model: ModelData) => `from asteroid.models import BaseModel model = BaseModel.from_pretrained("${model.id}")`; ``` Doing so will also add a tag to your model so users can quickly identify models from your library. 
Once your snippet has been added to [model-libraries-snippets.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts), you can reference it in [model-libraries.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts) as described above. ## Document your library Finally, you can add your library to the Hub's documentation. See, for example, the [Setfit PR](https://github.com/huggingface/hub-docs/pull/1150) that added [SetFit](./setfit) to the documentation. ### Polars https://huggingface.co/docs/hub/datasets-polars.md # Polars [Polars](https://pola.rs/) is an in-memory DataFrame library on top of an [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) query engine. It is fast, easy to use, and [open source](https://github.com/pola-rs/polars/). Starting from version `1.2.0`, Polars provides _native_ support for the Hugging Face file system. This means that all the benefits of the Polars query optimizer (e.g. predicate and projection pushdown) are applied and Polars will only load the data necessary to complete the query. This significantly speeds up reading, especially for large datasets (see [optimizations](./datasets-polars-optimizations)). You can use Hugging Face paths (`hf://`) to access data on the Hub. ## Getting started To get started, you can simply `pip install` Polars into your environment: ```bash pip install polars ``` Once you have installed Polars, you can directly query a dataset based on a Hugging Face URL. No other dependencies are needed for this. ```python import polars as pl pl.read_parquet("hf://datasets/roneneldan/TinyStories/data/train-00000-of-00004-2d5a1467fff1081b.parquet") ``` > [!TIP] > Polars provides two APIs: a lazy API (`scan_parquet`) and an eager API (`read_parquet`). We recommend using the eager API for interactive workloads and the lazy API for performance as it allows for better query optimization. For more information on the topic, check out the [Polars user guide](https://docs.pola.rs/user-guide/concepts/lazy-api/#when-to-use-which). Polars supports globbing to download multiple files at once into a single DataFrame. ```python pl.read_parquet("hf://datasets/roneneldan/TinyStories/data/train-*.parquet") ``` ### Hugging Face URLs A Hugging Face URL can be constructed from the `username` and `dataset` name like this: - `hf://datasets/{username}/{dataset}/{path_to_file}` The path may include globbing patterns such as `**/*.parquet` to query all the files matching the pattern. Additionally, for any unsupported [file formats](./datasets-polars-file-formats) you can use the auto-converted parquet files that Hugging Face provides using the `@~parquet` branch: - `hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file}` ### Notifications https://huggingface.co/docs/hub/notifications.md # Notifications Notifications allow you to know when new activities (**Pull Requests or discussions**) happen on models, datasets, and Spaces belonging to users or organizations you are watching. By default, you'll receive a notification if: - Someone mentions you in a discussion/PR. - A new comment is posted in a discussion/PR you participated in. - A new discussion/PR or comment is posted in one of the repositories of an organization or user you are watching.
![Notifications page](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-page.png) You'll get new notifications by email and [directly on the website](https://huggingface.co/notifications); you can change this in your [notifications settings](#notifications-settings). ## Filtering and managing notifications On the [notifications page](https://huggingface.co/notifications), you have several options for filtering and managing your notifications more effectively: - Filter by Repository: Choose to display notifications from a specific repository only. - Filter by Read Status: Display only unread notifications or all notifications. - Filter by Participation: Show notifications you have participated in or those in which you have been directly mentioned. Additionally, you can take the following actions to manage your notifications: - Mark as Read/Unread: Change the status of notifications to mark them as read or unread. - Mark as Done: Once marked as done, notifications will no longer appear in the notification center (they are deleted). By default, changes made to notifications will only apply to the selected notifications on the screen. However, you can also apply changes to all matching notifications (like in Gmail for instance) for greater convenience. ## Watching users and organizations By default, you'll be watching all the organizations you are a member of and will be notified of any new activity on those. You can also choose to get notified on arbitrary users or organizations. To do so, use the "Watch repos" button on their HF profiles. Note that you can also quickly watch/unwatch users and organizations directly from your [notifications settings](#notifications-settings). _Unlike GitHub or similar services, you cannot watch a specific repository. You must watch users/organizations to get notified about any new activity on any of their repositories. The goal is to simplify this functionality for users as much as possible and to make sure you don't miss anything you might be interested in._ ## Notifications settings On your [notifications settings](https://huggingface.co/settings/notifications) page, you can choose specific channels to get notified on depending on the type of activity, for example, receiving an email for direct mentions but only a web notification for new activity on watched users and organizations. By default, you'll get an email and a web notification for any new activity, but feel free to adjust your settings depending on your needs. _Note that clicking the unsubscribe link in an email will unsubscribe you for the type of activity, e.g. direct mentions._ ![Notifications settings page](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/notifications-settings.png) You can quickly add any user/organization to your watch list by searching them by name using the dedicated search bar. Unsubscribe from a specific user/organization simply by unticking the corresponding checkbox. ## Mute notifications for a specific repository It's possible to mute notifications for a particular repository by using the "Mute notifications" action in the repository's contextual menu. This will prevent you from receiving any new notifications for that particular repository. You can unmute the repository at any time by clicking the "Unmute notifications" action in the same repository menu.
![mute notification menu](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-mute-menu.png) _Note that if a repository is muted, you won't receive any new notifications unless you're directly mentioned or participating in a discussion._ The list of muted repositories is available from the notifications settings page: ![Notifications settings page muted repositories](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-settings-muted.png) ## Mute notifications for a specific discussion or PR You can also mute notifications for individual discussions or pull requests by clicking the mute icon in the header. Doing this prevents you from receiving any further notifications from that specific discussion or PR, including direct mentions. You can unmute at any time by clicking the same icon again. ![Notifications mute discussions](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-mute-discussion.png) ### GGUF https://huggingface.co/docs/hub/gguf.md # GGUF Hugging Face Hub supports all file formats, but has built-in features for the [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md), a binary format that is optimized for quick loading and saving of models, making it highly efficient for inference purposes. GGUF is designed for use with GGML and other executors. GGUF was developed by [@ggerganov](https://huggingface.co/ggerganov), who is also the developer of [llama.cpp](https://github.com/ggerganov/llama.cpp), a popular C/C++ LLM inference framework. Models initially developed in frameworks like PyTorch can be converted to GGUF format for use with those engines. Unlike tensor-only file formats like [safetensors](https://huggingface.co/docs/safetensors) – which is also a recommended model format for the Hub – GGUF encodes both the tensors and a standardized set of metadata. ## Finding GGUF files You can browse all models with GGUF files by filtering by the GGUF tag: [hf.co/models?library=gguf](https://huggingface.co/models?library=gguf). Moreover, you can use the [ggml-org/gguf-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) tool to convert/quantize your model weights into GGUF weights. For example, you can check out [TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF) to see GGUF files in action. ## Viewer for metadata & tensors info The Hub has a viewer for GGUF files that lets users check out the metadata & tensor info (name, shape, precision). The viewer is available on the model page ([example](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF?show_tensors=mixtral-8x7b-instruct-v0.1.Q4_0.gguf)) & the files page ([example](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/tree/main?show_tensors=mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf)). ## Usage with open-source tools * [llama.cpp](./gguf-llamacpp) * [GPT4All](./gguf-gpt4all) * [Ollama](./ollama) ## Parsing the metadata with @huggingface/gguf We've also created a JavaScript GGUF parser that works on remotely hosted files (e.g. Hugging Face Hub).
```bash npm install @huggingface/gguf ``` ```ts import { gguf } from "@huggingface/gguf"; // remote GGUF file from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF const URL_LLAMA = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/191239b/llama-2-7b-chat.Q2_K.gguf"; const { metadata, tensorInfos } = await gguf(URL_LLAMA); ``` Find more information [here](https://github.com/huggingface/huggingface.js/tree/main/packages/gguf). ## Quantization Types | type | source | description | |---------------------------|--------|-------------| | F64 | [Wikipedia](https://en.wikipedia.org/wiki/Double-precision_floating-point_format) | 64-bit standard IEEE 754 double-precision floating-point number. | | I64 | [GH](https://github.com/ggerganov/llama.cpp/pull/6062) | 64-bit fixed-width integer number. | | F32 | [Wikipedia](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) | 32-bit standard IEEE 754 single-precision floating-point number. | | I32 | [GH](https://github.com/ggerganov/llama.cpp/pull/6045) | 32-bit fixed-width integer number. | | F16 | [Wikipedia](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) | 16-bit standard IEEE 754 half-precision floating-point number. | | BF16 | [Wikipedia](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) | 16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number. | | I16 | [GH](https://github.com/ggerganov/llama.cpp/pull/6045) | 16-bit fixed-width integer number. | | Q8_0 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249) | 8-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). | | Q8_1 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290) | 8-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today) | | Q8_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 8-bit quantization (`q`). Each block has 256 weights. Only used for quantizing intermediate results. All 2-6 bit dot products are implemented for this quantization type. Weight formula: `w = q * block_scale`. | | I8 | [GH](https://github.com/ggerganov/llama.cpp/pull/6045) | 8-bit fixed-width integer number. | | Q6_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 6-bit quantization (`q`). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: `w = q * block_scale(8-bit)`, resulting in 6.5625 bits-per-weight. | | Q5_0 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249) | 5-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). | | Q5_1 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290) | 5-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). | | Q5_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 5-bit quantization (`q`). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: `w = q * block_scale(6-bit) + block_min(6-bit)`, resulting in 5.5 bits-per-weight. 
| | Q4_0 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249) | 4-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). | | Q4_1 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290) | 4-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). | | Q4_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 4-bit quantization (`q`). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: `w = q * block_scale(6-bit) + block_min(6-bit)`, resulting in 4.5 bits-per-weight. | | Q3_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 3-bit quantization (`q`). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: `w = q * block_scale(6-bit)`, resulting in 3.4375 bits-per-weight. | | Q2_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 2-bit quantization (`q`). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: `w = q * block_scale(4-bit) + block_min(4-bit)`, resulting in 2.625 bits-per-weight. | | IQ4_NL | [GH](https://github.com/ggerganov/llama.cpp/pull/5590) | 4-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`. | | IQ4_XS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 4-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 4.25 bits-per-weight. | | IQ3_S | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 3-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 3.44 bits-per-weight. | | IQ3_XXS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 3-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 3.06 bits-per-weight. | | IQ2_XXS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 2-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 2.06 bits-per-weight. | | IQ2_S | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 2-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 2.5 bits-per-weight. | | IQ2_XS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 2-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 2.31 bits-per-weight. | | IQ1_S | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 1-bit quantization (`q`). Super-blocks with 256 weights. 
Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 1.56 bits-per-weight. | | IQ1_M | [GH](https://github.com/ggerganov/llama.cpp/pull/6302) | 1-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 1.75 bits-per-weight. | *If there's any inaccuracy in the table above, please open a PR on [this file](https://github.com/huggingface/huggingface.js/blob/main/packages/gguf/src/quant-descriptions.ts).* ### Using MLX at Hugging Face https://huggingface.co/docs/hub/mlx.md # Using MLX at Hugging Face [MLX](https://github.com/ml-explore/mlx) is a model training and serving framework for Apple silicon made by Apple Machine Learning Research. It comes with a variety of examples: - [Generate text with MLX-LM](https://github.com/ml-explore/mlx-lm/tree/main) and [generating text with MLX-LM for models in GGUF format](https://github.com/ml-explore/mlx-examples/tree/main/llms/gguf_llm). - Large-scale text generation with [LLaMA](https://github.com/ml-explore/mlx-examples/tree/main/llms/llama). - Fine-tuning with [LoRA](https://github.com/ml-explore/mlx-examples/tree/main/lora). - Generating images with [Stable Diffusion](https://github.com/ml-explore/mlx-examples/tree/main/stable_diffusion). - Speech recognition with [OpenAI's Whisper](https://github.com/ml-explore/mlx-examples/tree/main/whisper). ## Exploring MLX on the Hub You can find MLX models by filtering at the left of the [models page](https://huggingface.co/models?library=mlx&sort=trending). There's also an open [MLX community](https://huggingface.co/mlx-community) of contributors converting and publishing weights in MLX format. Thanks to the MLX Hugging Face Hub integration, you can load MLX models with a few lines of code. ## Installation MLX comes as a standalone package, and there's a subpackage called MLX-LM with Hugging Face integration for Large Language Models. To install MLX-LM, you can use the following one-line install through `pip`: ```bash pip install mlx-lm ``` You can get more information about it [here](https://github.com/ml-explore/mlx-lm/tree/main). If you install `mlx-lm`, you don't need to install `mlx`. If you don't want to use `mlx-lm` but only MLX, you can install MLX itself as follows. With `pip`: ```bash pip install mlx ``` With `conda`: ```bash conda install -c conda-forge mlx ``` ## Using Existing Models MLX-LM has useful utilities to generate text. The following line directly downloads and loads the model and starts generating text. ```bash python -m mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.2 --prompt "hello" ``` For a full list of generation options, run: ```bash python -m mlx_lm.generate --help ``` You can also load a model and start generating text through Python like below: ```python from mlx_lm import load, generate model, tokenizer = load("mistralai/Mistral-7B-Instruct-v0.2") response = generate(model, tokenizer, prompt="hello", verbose=True) ``` MLX-LM supports popular LLM architectures including LLaMA, Phi-2, Mistral, and Qwen. Models other than the supported ones can easily be downloaded as follows: ```bash pip install -U huggingface_hub export HF_XET_HIGH_PERFORMANCE=1 hf download --local-dir / ``` ## Converting and Sharing Models You can convert, and optionally quantize, LLMs from the Hugging Face Hub as follows: ```bash python -m mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.1 -q ``` If you want to push the model directly after the conversion, you can do it like below.
```bash python -m mlx_lm.convert \ --hf-path mistralai/Mistral-7B-v0.1 \ -q \ --upload-repo / ``` ## Additional Resources * [MLX Repository](https://github.com/ml-explore/mlx) * [MLX Docs](https://ml-explore.github.io/mlx/) * [MLX-LM](https://github.com/ml-explore/mlx-lm/tree/main) * [MLX Examples](https://github.com/ml-explore/mlx-examples/tree/main) * [All MLX models on the Hub](https://huggingface.co/models?library=mlx&sort=trending) ### Using PaddleNLP at Hugging Face https://huggingface.co/docs/hub/paddlenlp.md # Using PaddleNLP at Hugging Face Leveraging the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) framework, [`PaddleNLP`](https://github.com/PaddlePaddle/PaddleNLP) is an easy-to-use and powerful NLP library with an awesome pre-trained model zoo, supporting a wide range of NLP tasks from research to industrial applications. ## Exploring PaddleNLP in the Hub You can find `PaddleNLP` models by filtering at the left of the [models page](https://huggingface.co/models?library=paddlenlp&sort=downloads). All models on the Hub come with the following features: 1. An automatically generated model card with a brief description and metadata tags that help with discoverability. 2. An interactive widget you can use to play with the model directly in the browser. 3. An Inference API that allows you to make inference requests. 4. Easily deploy your model as a Gradio app on Spaces. ## Installation To get started, you can follow [PaddlePaddle Quick Start](https://www.paddlepaddle.org.cn/en/install) to install the PaddlePaddle Framework with your favorite OS, Package Manager and Compute Platform. `paddlenlp` offers a quick one-line install through pip: ```bash pip install -U paddlenlp ``` ## Using existing models Similar to `transformers` models, the `paddlenlp` library provides a simple one-liner to load models from the Hugging Face Hub by setting `from_hf_hub=True`! Depending on how you want to use them, you can use the high-level API via the `Taskflow` function, or use `AutoModel` and `AutoTokenizer` for more control. ```py # Taskflow provides a simple end-to-end capability and a more optimized experience for inference from paddlenlp import Taskflow taskflow = Taskflow("fill-mask", task_path="PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True) # If you want more control, you will need to define the tokenizer and model. from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True) model = AutoModelForMaskedLM.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True) ``` If you want to see how to load a specific model, you can click `Use in paddlenlp` and you will be given a working snippet to load it! ## Sharing your models You can share your `PaddleNLP` models by using the `save_to_hf_hub` method under all `Model` and `Tokenizer` classes. ```py from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True) model = AutoModelForMaskedLM.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True) tokenizer.save_to_hf_hub(repo_id="/") model.save_to_hf_hub(repo_id="/") ``` ## Additional resources - PaddlePaddle Installation [guide](https://www.paddlepaddle.org.cn/en/install). - PaddleNLP [GitHub Repo](https://github.com/PaddlePaddle/PaddleNLP).
- [PaddlePaddle on the Hugging Face Hub](https://huggingface.co/PaddlePaddle) ### Widget Examples https://huggingface.co/docs/hub/models-widgets-examples.md # Widget Examples Note that each widget example can also optionally describe the corresponding model output, directly in the `output` property. See [the spec](./models-widgets#example-outputs) for more details. ## Natural Language Processing ### Fill-Mask ```yaml widget: - text: "Paris is the <mask> of France." example_title: "Capital" - text: "The goal of life is <mask>." example_title: "Philosophy" ``` ### Question Answering ```yaml widget: - text: "What's my name?" context: "My name is Clara and I live in Berkeley." example_title: "Name" - text: "Where do I live?" context: "My name is Sarah and I live in London" example_title: "Location" ``` ### Summarization ```yaml widget: - text: "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct." example_title: "Eiffel Tower" - text: "Laika, a dog that was the first living creature to be launched into Earth orbit, on board the Soviet artificial satellite Sputnik 2, on November 3, 1957. It was always understood that Laika would not survive the mission, but her actual fate was misrepresented for decades. Laika was a small (13 pounds [6 kg]), even-tempered, mixed-breed dog about two years of age. She was one of a number of stray dogs that were taken into the Soviet spaceflight program after being rescued from the streets. Only female dogs were used because they were considered to be anatomically better suited than males for close confinement." example_title: "First in Space" ``` ### Table Question Answering ```yaml widget: - text: "How many stars does the transformers repository have?"
table: Repository: - "Transformers" - "Datasets" - "Tokenizers" Stars: - 36542 - 4512 - 3934 Contributors: - 651 - 77 - 34 Programming language: - "Python" - "Python" - "Rust, Python and NodeJS" example_title: "Github stars" ``` ### Text Classification ```yaml widget: - text: "I love football so much" example_title: "Positive" - text: "I don't really like this type of food" example_title: "Negative" ``` ### Text Generation ```yaml widget: - text: "My name is Julien and I like to" example_title: "Julien" - text: "My name is Merve and my favorite" example_title: "Merve" ``` ### Text2Text Generation ```yaml widget: - text: "My name is Julien and I like to" example_title: "Julien" - text: "My name is Merve and my favorite" example_title: "Merve" ``` ### Token Classification ```yaml widget: - text: "My name is Sylvain and I live in Paris" example_title: "Parisian" - text: "My name is Sarah and I live in London" example_title: "Londoner" ``` ### Translation ```yaml widget: - text: "My name is Sylvain and I live in Paris" example_title: "Parisian" - text: "My name is Sarah and I live in London" example_title: "Londoner" ``` ### Zero-Shot Classification ```yaml widget: - text: "I have a problem with my car that needs to be resolved asap!!" candidate_labels: "urgent, not urgent, phone, tablet, computer" multi_class: true example_title: "Car problem" - text: "Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app." candidate_labels: "mobile, website, billing, account access" multi_class: false example_title: "Phone issue" ``` ### Sentence Similarity ```yaml widget: - source_sentence: "That is a happy person" sentences: - "That is a happy dog" - "That is a very happy person" - "Today is a sunny day" example_title: "Happy" ``` ### Conversational ```yaml widget: - text: "Hey my name is Julien! How are you?" example_title: "Julien" - text: "Hey my name is Clara! How are you?" 
example_title: "Clara" ``` ### Feature Extraction ```yaml widget: - text: "My name is Sylvain and I live in Paris" example_title: "Parisian" - text: "My name is Sarah and I live in London" example_title: "Londoner" ``` ## Audio ### Text-to-Speech ```yaml widget: - text: "My name is Sylvain and I live in Paris" example_title: "Parisian" - text: "My name is Sarah and I live in London" example_title: "Londoner" ``` ### Automatic Speech Recognition ```yaml widget: - src: https://cdn-media.huggingface.co/speech_samples/sample1.flac example_title: Librispeech sample 1 - src: https://cdn-media.huggingface.co/speech_samples/sample2.flac example_title: Librispeech sample 2 ``` ### Audio-to-Audio ```yaml widget: - src: https://cdn-media.huggingface.co/speech_samples/sample1.flac example_title: Librispeech sample 1 - src: https://cdn-media.huggingface.co/speech_samples/sample2.flac example_title: Librispeech sample 2 ``` ### Audio Classification ```yaml widget: - src: https://cdn-media.huggingface.co/speech_samples/sample1.flac example_title: Librispeech sample 1 - src: https://cdn-media.huggingface.co/speech_samples/sample2.flac example_title: Librispeech sample 2 ``` ### Voice Activity Detection ```yaml widget: - src: https://cdn-media.huggingface.co/speech_samples/sample1.flac example_title: Librispeech sample 1 - src: https://cdn-media.huggingface.co/speech_samples/sample2.flac example_title: Librispeech sample 2 ``` ## Computer Vision ### Image Classification ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg example_title: Tiger - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg example_title: Teapot ``` ### Object Detection ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg example_title: Football Match - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg example_title: Airport ``` ### Image Segmentation ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg example_title: Football Match - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg example_title: Airport ``` ### Image-to-Image ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/canny-edge.jpg prompt: Girl with Pearl Earring # `prompt` field is optional in case the underlying model supports text guidance ``` ### Image-to-Video ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/canny-edge.jpg prompt: Girl with Pearl Earring # `prompt` field is optional in case the underlying model supports text guidance ``` ### Text-to-Image ```yaml widget: - text: "A cat playing with a ball" example_title: "Cat" - text: "A dog jumping over a fence" example_title: "Dog" ``` ### Document Question Answering ```yaml widget: - text: "What is the invoice number?" src: "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png" - text: "What is the purchase amount?" src: "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/contract.jpeg" ``` ### Visual Question Answering ```yaml widget: - text: "What animal is it?" src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg" - text: "Where is it?" 
src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg" ``` ### Zero-Shot Image Classification ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png candidate_labels: playing music, playing sports example_title: Cat & Dog ``` ## Other ### Structured Data Classification ```yaml widget: - structured_data: fixed_acidity: - 7.4 - 7.8 - 10.3 volatile_acidity: - 0.7 - 0.88 - 0.32 citric_acid: - 0 - 0 - 0.45 residual_sugar: - 1.9 - 2.6 - 6.4 chlorides: - 0.076 - 0.098 - 0.073 free_sulfur_dioxide: - 11 - 25 - 5 total_sulfur_dioxide: - 34 - 67 - 13 density: - 0.9978 - 0.9968 - 0.9976 pH: - 3.51 - 3.2 - 3.23 sulphates: - 0.56 - 0.68 - 0.82 alcohol: - 9.4 - 9.8 - 12.6 example_title: "Wine" ``` ### Using fastai at Hugging Face https://huggingface.co/docs/hub/fastai.md # Using fastai at Hugging Face `fastai` is an open-source Deep Learning library that leverages PyTorch and Python to provide high-level components to train fast and accurate neural networks with state-of-the-art outputs on text, vision, and tabular data. ## Exploring fastai in the Hub You can find `fastai` models by filtering at the left of the [models page](https://huggingface.co/models?library=fastai&sort=downloads). All models on the Hub come up with the following features: 1. An automatically generated model card with a brief description and metadata tags that help for discoverability. 2. An interactive widget you can use to play out with the model directly in the browser (for Image Classification) 3. An Inference API that allows to make inference requests (for Image Classification). ## Using existing models The `huggingface_hub` library is a lightweight Python client with utility functions to download models from the Hub. ```bash pip install huggingface_hub["fastai"] ``` Once you have the library installed, you just need to use the `from_pretrained_fastai` method. This method not only loads the model, but also validates the `fastai` version when the model was saved, which is important for reproducibility. ```py from huggingface_hub import from_pretrained_fastai learner = from_pretrained_fastai("espejelomar/identify-my-cat") _,_,probs = learner.predict(img) print(f"Probability it's a cat: {100*probs[1].item():.2f}%") # Probability it's a cat: 100.00% ``` If you want to see how to load a specific model, you can click `Use in fastai` and you will be given a working snippet that you can load it! ## Sharing your models You can share your `fastai` models by using the `push_to_hub_fastai` method. ```py from huggingface_hub import push_to_hub_fastai push_to_hub_fastai(learner=learn, repo_id="espejelomar/identify-my-cat") ``` ## Additional resources * fastai [course](https://course.fast.ai/). * fastai [website](https://www.fast.ai/). * Integration with Hub [docs](https://docs.fast.ai/huggingface.html). * Integration with Hub [announcement](https://huggingface.co/blog/fastai). ### Using timm at Hugging Face https://huggingface.co/docs/hub/timm.md # Using timm at Hugging Face `timm`, also known as [pytorch-image-models](https://github.com/rwightman/pytorch-image-models), is an open-source collection of state-of-the-art PyTorch image models, pretrained weights, and utility scripts for training, inference, and validation. This documentation focuses on `timm` functionality in the Hugging Face Hub instead of the `timm` library itself. For detailed information about the `timm` library, visit [its documentation](https://huggingface.co/docs/timm). 
You can find a number of `timm` models on the Hub using the filters on the left of the [models page](https://huggingface.co/models?library=timm&sort=downloads). All models on the Hub come with several useful features: 1. An automatically generated model card, which model authors can complete with [information about their model](./model-cards). 2. Metadata tags help users discover the relevant `timm` models. 3. An [interactive widget](./models-widgets) you can use to play with the model directly in the browser. 4. An [Inference API](./models-inference) that allows users to make inference requests. ## Using existing models from the Hub Any `timm` model from the Hugging Face Hub can be loaded with a single line of code as long as you have `timm` installed! Once you've selected a model from the Hub, pass the model's ID prefixed with `hf-hub:` to `timm`'s `create_model` method to download and instantiate the model. ```py import timm # Loading https://huggingface.co/timm/eca_nfnet_l0 model = timm.create_model("hf-hub:timm/eca_nfnet_l0", pretrained=True) ``` If you want to see how to load a specific model, you can click **Use in timm** and you will be given a working snippet to load it! ### Inference The snippet below shows how you can perform inference on a `timm` model loaded from the Hub: ```py import timm import torch from PIL import Image from timm.data import resolve_data_config from timm.data.transforms_factory import create_transform # Load from Hub 🔥 model = timm.create_model( 'hf-hub:nateraw/resnet50-oxford-iiit-pet', pretrained=True ) # Set model to eval mode for inference model.eval() # Create Transform transform = create_transform(**resolve_data_config(model.pretrained_cfg, model=model)) # Get the labels from the model config labels = model.pretrained_cfg['label_names'] top_k = min(len(labels), 5) # Use your own image file here... image = Image.open('boxer.jpg').convert('RGB') # Process PIL image with transforms and add a batch dimension x = transform(image).unsqueeze(0) # Pass inputs to model forward function to get outputs out = model(x) # Apply softmax to get predicted probabilities for each class probabilities = torch.nn.functional.softmax(out[0], dim=0) # Grab the values and indices of top 5 predicted classes values, indices = torch.topk(probabilities, top_k) # Prepare a nice dict of top k predictions predictions = [ {"label": labels[i], "score": v.item()} for i, v in zip(indices, values) ] print(predictions) ``` This should leave you with a list of predictions, like this: ```py [ {'label': 'american_pit_bull_terrier', 'score': 0.9999998807907104}, {'label': 'staffordshire_bull_terrier', 'score': 1.0000000149011612e-07}, {'label': 'miniature_pinscher', 'score': 1.0000000149011612e-07}, {'label': 'chihuahua', 'score': 1.0000000149011612e-07}, {'label': 'beagle', 'score': 1.0000000149011612e-07} ] ``` ## Sharing your models You can share your `timm` models directly to the Hugging Face Hub. This will publish a new version of your model to the Hugging Face Hub, creating a model repo for you if it doesn't already exist. Before pushing a model, make sure that you've logged in to Hugging Face: ```sh python -m pip install huggingface_hub hf auth login ``` Alternatively, if you prefer working from a Jupyter or Colaboratory notebook, once you've installed `huggingface_hub` you can log in with: ```py from huggingface_hub import notebook_login notebook_login() ``` Then, push your model using the `push_to_hf_hub` method: ```py import timm # Build or load a model, e.g. 
timm's pretrained resnet18 model = timm.create_model('resnet18', pretrained=True, num_classes=4) ########################### # [Fine tune your model...] ########################### # Push it to the 🤗 Hub timm.models.hub.push_to_hf_hub( model, 'resnet18-random-classifier', model_config={'labels': ['a', 'b', 'c', 'd']} ) # Load your model from the Hub model_reloaded = timm.create_model( 'hf-hub:/resnet18-random-classifier', pretrained=True ) ``` ## Inference Widget and API All `timm` models on the Hub are automatically equipped with an [inference widget](./models-widgets), pictured below for [nateraw/timm-resnet50-beans](https://huggingface.co/nateraw/timm-resnet50-beans). Additionally, `timm` models are available through the [Inference API](./models-inference), which you can access through HTTP with cURL, Python's `requests` library, or your preferred method for making network requests. ```sh curl https://api-inference.huggingface.co/models/nateraw/timm-resnet50-beans \ -X POST \ --data-binary '@beans.jpeg' \ -H "Authorization: Bearer $HF_API_TOKEN" # [{"label":"angular_leaf_spot","score":0.9845947027206421},{"label":"bean_rust","score":0.01368315052241087},{"label":"healthy","score":0.001722085871733725}] ``` ## Additional resources * timm (pytorch-image-models) [GitHub Repo](https://github.com/rwightman/pytorch-image-models). * timm [documentation](https://huggingface.co/docs/timm). * Additional documentation at [timmdocs](https://timm.fast.ai) by [Aman Arora](https://github.com/amaarora). * [Getting Started with PyTorch Image Models (timm): A Practitioner’s Guide](https://towardsdatascience.com/getting-started-with-pytorch-image-models-timm-a-practitioners-guide-4e77b4bf9055) by [Chris Hughes](https://github.com/Chris-hughes10). ### Dataset Cards https://huggingface.co/docs/hub/datasets-cards.md # Dataset Cards ## What are Dataset Cards? Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used. You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration) options. Tags are defined in a YAML metadata section at the top of the `README.md` file. ## Dataset card metadata A dataset repo will render its README.md as a dataset card. To control how the Hub displays the card, you should create a YAML section in the README file to define some metadata. Start by adding three `---` at the top, then include all of the relevant metadata, and close the section with another group of `---` like the example below: ```yaml language: - "List of ISO 639-1 code for your language" - lang1 - lang2 pretty_name: "Pretty Name of the Dataset" tags: - tag1 - tag2 license: "any valid license identifier" task_categories: - task1 - task2 ``` The metadata that you add to the dataset card enables certain interactions on the Hub. For example: * Allow users to filter and discover datasets at https://huggingface.co/datasets.
* If you choose a license using the keywords listed in the right column of [this table](./repositories-licenses), the license will be displayed on the dataset page. When creating a README.md file in a dataset repository on the Hub, use the Metadata UI to fill in the main metadata. For the full list of metadata fields, see the detailed [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1). ### Dataset card creation guide For a step-by-step guide on creating a dataset card, check out the [Create a dataset card](https://huggingface.co/docs/datasets/dataset_card) guide. Reading through existing dataset cards, such as the [ELI5 dataset card](https://huggingface.co/datasets/eli5/blob/main/README.md), is a great way to familiarize yourself with the common conventions. ### Linking a Paper If the dataset card includes a link to a Paper page (either on HF or an arXiv abstract/PDF), the Hub will extract the arXiv ID and include it in the dataset tags with the format `arxiv:<PAPER ID>`. Clicking on the tag will let you: * Visit the Paper page * Filter for other models on the Hub that cite the same paper. Read more about paper pages [here](./paper-pages). ### Force set a dataset modality The Hub will automatically detect the modality of a dataset based on the files it contains (audio, video, geospatial, etc.). If you want to force a specific modality, you can add a tag to the dataset card metadata: `3d`, `audio`, `geospatial`, `image`, `tabular`, `text`, `timeseries`, `video`. For example, to force the modality to `audio`, add the following to the dataset card metadata: ```yaml tags: - audio ``` ### Associate a library to the dataset The dataset page automatically shows libraries and tools that are able to natively load the dataset, but if you want to show another specific library, you can add a tag to the dataset card metadata: `argilla`, `dask`, `datasets`, `distilabel`, `fiftyone`, `mlcroissant`, `pandas`, `webdataset`. See the [list of supported libraries](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/dataset-libraries.ts) for more information, or to propose adding a new library. For example, to associate the `argilla` library with the dataset card, add the following to the dataset card metadata: ```yaml tags: - argilla ``` ### Organization cards https://huggingface.co/docs/hub/organizations-cards.md # Organization cards You can create an organization card to help users learn more about what your organization is working on and how users can use your libraries, models, datasets, and Spaces. An organization card is displayed on an organization's profile: If you're a member of an organization, you'll see a button to create or edit your organization card on the organization's main page. An organization card is a static `README.md` file inside a Space repo named `README`. The card can be as simple as Markdown text, or you can create a more customized appearance with HTML. The card for the [Hugging Face Course organization](https://huggingface.co/huggingface-course), shown above, [contains the following HTML](https://huggingface.co/spaces/huggingface-course/README/blob/main/README.md): ```html This is the organization grouping all the models and datasets used in the Hugging Face course. ``` For more examples, take a look at: * [Amazon's](https://huggingface.co/spaces/amazon/README/blob/main/README.md) organization card source code * [spaCy's](https://huggingface.co/spaces/spacy/README/blob/main/README.md) organization card source code.
### Aim on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-aim.md # Aim on Spaces **Aim** is an easy-to-use & supercharged open-source experiment tracker. Aim logs your training runs and enables a beautiful UI to compare them and an API to query them programmatically. ML engineers and researchers use Aim explorers to compare 1000s of training runs in a few clicks. Check out the [Aim docs](https://aimstack.readthedocs.io/en/latest/) to learn more about Aim. If you have an idea for a new feature or have noticed a bug, feel free to [open a feature request or report a bug](https://github.com/aimhubio/aim/issues/new/choose). In the following sections, you'll learn how to deploy Aim on Hugging Face Spaces and explore your training runs directly from the Hub. ## Deploy Aim on Spaces You can deploy Aim on Spaces with a single click! Once you have created the Space, you'll see the `Building` status, and once it becomes `Running`, your Space is ready to go! Now, when you navigate to your Space's **App** section, you can access the Aim UI. ## Compare your experiments with Aim on Spaces Let's use a quick example of a PyTorch CNN trained on MNIST to demonstrate end-to-end Aim on Spaces deployment. The full example is in the [Aim repo examples folder](https://github.com/aimhubio/aim/blob/main/examples/pytorch_track.py). ```python from aim import Run from aim.pytorch import track_gradients_dists, track_params_dists # Initialize a new Run aim_run = Run() ... items = {'accuracy': acc, 'loss': loss} aim_run.track(items, epoch=epoch, context={'subset': 'train'}) # Track weights and gradients distributions track_params_dists(model, aim_run) track_gradients_dists(model, aim_run) ``` The experiments tracked by Aim are stored in the `.aim` folder. **To display the logs with the Aim UI in your Space, you need to compress the `.aim` folder to a `tar.gz` file and upload it to your Space using `git` or the Files and Versions section of your Space.** Here's a bash command for that: ```bash tar -czvf aim_repo.tar.gz .aim ``` That’s it! Now open the App section of your Space and the Aim UI is available with your logs. Here is what to expect: ![Aim UI on HF Hub Spaces](https://user-images.githubusercontent.com/23078323/232034340-0ba3ebbf-0374-4b14-ba80-1d36162fc994.png) Filter your runs using Aim’s Pythonic search. You can write pythonic [queries](https://aimstack.readthedocs.io/en/latest/using/search.html) against EVERYTHING you have tracked - metrics, hyperparams, etc. Check out some [examples](https://huggingface.co/aimstack) on HF Hub Spaces. > [!TIP] > Note that if your logs are in TensorBoard format, you can easily convert them to Aim with one command and use the many advanced and high-performance training run comparison features available. ## More on HF Spaces - [HF Docker spaces](https://huggingface.co/docs/hub/spaces-sdks-docker) - [HF Docker space examples](https://huggingface.co/docs/hub/spaces-sdks-docker-examples) ## Feedback and Support If you have improvement suggestions or need support, please open an issue on the [Aim GitHub repo](https://github.com/aimhubio/aim). The [Aim community Discord](https://github.com/aimhubio/aim#-community) is also available for community discussions. ### Spaces Configuration Reference https://huggingface.co/docs/hub/spaces-config-reference.md # Spaces Configuration Reference Spaces are configured through the `YAML` block at the top of the **README.md** file at the root of the repository. All the accepted parameters are listed below.
**`title`** : _string_ Display title for the Space. **`emoji`** : _string_ Space emoji (emoji-only character allowed). **`colorFrom`** : _string_ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray). **`colorTo`** : _string_ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray). **`sdk`** : _string_ Can be either `gradio`, `docker`, or `static`. **`python_version`**: _string_ Any valid Python `3.x` or `3.x.x` version. Defaults to `3.10`. **`sdk_version`** : _string_ Specify the version of Gradio to use. All versions of Gradio are supported. **`suggested_hardware`** : _string_ Specify the suggested [hardware](https://huggingface.co/docs/hub/spaces-gpus) on which this Space must be run. Useful for Spaces that are meant to be duplicated by other users. Setting this value will not automatically assign hardware to this Space. Value must be a valid hardware flavor. Current valid hardware flavors: - CPU: `"cpu-basic"`, `"cpu-upgrade"` - GPU: `"t4-small"`, `"t4-medium"`, `"l4x1"`, `"l4x4"`, `"a10g-small"`, `"a10g-large"`, `"a10g-largex2"`, `"a10g-largex4"`, `"a100-large"` - TPU: `"v5e-1x1"`, `"v5e-2x2"`, `"v5e-2x4"` **`suggested_storage`** : _string_ Specify the suggested [permanent storage](https://huggingface.co/docs/hub/spaces-storage) on which this Space must be run. Useful for Spaces that are meant to be duplicated by other users. Setting this value will not automatically assign permanent storage to this Space. Value must be one of `"small"`, `"medium"` or `"large"`. **`app_file`** : _string_ Path to your main application file (which contains either `gradio` Python code or `static` HTML code). Path is relative to the root of the repository. **`app_build_command`** : _string_ For static Spaces, command to run first to generate the HTML to render. Example: `npm run build`. This is used in conjunction with `app_file` which points to the built index file: e.g. `app_file: dist/index.html`. On each update, the build command will run in a Job and the build output will be stored in `refs/convert/build`, which will be served by the Space. See an example at https://huggingface.co/spaces/coyotte508/static-vite. **`app_port`** : _int_ Port on which your application is running. Used only if `sdk` is `docker`. Default port is `7860`. **`base_path`**: _string_ For non-static Spaces, initial URL to render. Needs to start with `/`. For static Spaces, use `app_file` instead. **`fullWidth`**: _boolean_ Whether your Space is rendered inside a full-width (when `true`) or fixed-width column (i.e. "container" CSS) inside the iframe. Defaults to `true`. **`header`**: _string_ Can be either `mini` or `default`. If `header` is set to `mini`, the Space will be displayed full-screen with a mini floating header. **`short_description`**: _string_ A short description of the Space. This will be displayed in the Space's thumbnail. **`models`** : _List[string]_ HF model IDs (like `openai-community/gpt2` or `deepset/roberta-base-squad2`) used in the Space. Will be parsed automatically from your code if not specified here. **`datasets`** : _List[string]_ HF dataset IDs (like `mozilla-foundation/common_voice_13_0` or `oscar-corpus/OSCAR-2109`) used in the Space. Will be parsed automatically from your code if not specified here. **`tags`** : _List[string]_ List of terms that describe your Space task or scope. **`thumbnail`**: _string_ URL for defining a custom thumbnail for social sharing. **`pinned`** : _boolean_ Whether the Space stays on top of your profile.
Can be useful if you have a lot of Spaces so you and others can quickly see your best Space. **`hf_oauth`** : _boolean_ Whether a connected OAuth app is associated to this Space. See [Adding a Sign-In with HF button to your Space](https://huggingface.co/docs/hub/spaces-oauth) for more details. **`hf_oauth_scopes`** : _List[string]_ Authorized scopes of the connected OAuth app. `openid` and `profile` are authorized by default and do not need this parameter. See [Adding a Sign-In with HF button to your space](https://huggingface.co/docs/hub/spaces-oauth) for more details. **`hf_oauth_expiration_minutes`** : _int_ Duration of the OAuth token in minutes. Defaults to 480 minutes (8 hours). Maximum duration is 43200 minutes (30 days). See [Adding a Sign-In with HF button to your space](https://huggingface.co/docs/hub/spaces-oauth) for more details. **`hf_oauth_authorized_org`** : _string_ or _List[string]_ Restrict OAuth access to members of specific organizations. See [Adding a Sign-In with HF button to your space](https://huggingface.co/docs/hub/spaces-oauth) for more details. **`disable_embedding`** : _boolean_ Whether the Space iframe can be embedded in other websites. Defaults to false, i.e. Spaces *can* be embedded. **`startup_duration_timeout`**: _string_ Set a custom startup duration timeout for your Space. This is the maximum time your Space is allowed to start before it times out and is flagged as unhealthy. Defaults to 30 minutes, but any valid duration (like `1h`, `30m`) is acceptable. **`custom_headers`** : _Dict[string, string]_ Set custom HTTP headers that will be added to all HTTP responses when serving your Space. For now, only the [cross-origin-embedder-policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cross-Origin-Embedder-Policy) (COEP), [cross-origin-opener-policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cross-Origin-Opener-Policy) (COOP), and [cross-origin-resource-policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cross-Origin-Resource-Policy) (CORP) headers are allowed. These headers can be used to set up a cross-origin isolated environment and enable powerful features like `SharedArrayBuffer`, for example: ```yaml custom_headers: cross-origin-embedder-policy: require-corp cross-origin-opener-policy: same-origin cross-origin-resource-policy: cross-origin ``` *Note:* all headers and values must be lowercase. **`preload_from_hub`**: _List[string]_ Specify a list of Hugging Face Hub models or other large files to be preloaded during the build time of your Space. This optimizes the startup time by having the files ready when your application starts. This is particularly useful for Spaces that rely on large models or datasets that would otherwise need to be downloaded at runtime. The format for each item is `"repository_name"` to download all files from a repository, or `"repository_name file1,file2"` for downloading specific files within that repository. You can also specify a specific commit to download using the format `"repository_name file1,file2 commit_sha256"`. 
Example usage: ```yaml preload_from_hub: - warp-ai/wuerstchen-prior text_encoder/model.safetensors,prior/diffusion_pytorch_model.safetensors - coqui/XTTS-v1 - openai-community/gpt2 config.json 11c5a3d5811f50298f278a704980280950aedb10 ``` In this example, the Space will preload specific .safetensors files from `warp-ai/wuerstchen-prior`, the complete `coqui/XTTS-v1` repository, and a specific revision of the `config.json` file in the `openai-community/gpt2` repository from the Hugging Face Hub during build time. > [!WARNING] > Files are saved in the default `huggingface_hub` disk cache `~/.cache/huggingface/hub`. If your application expects them elsewhere, or if you have changed your `HF_HOME` variable, this pre-loading does not follow that at this time. ### Using Unity Sentis Models from Hugging Face https://huggingface.co/docs/hub/unity-sentis.md # Using Unity Sentis Models from Hugging Face [Unity 3D](https://unity.com/) is one of the most popular game engines in the world. [Unity Sentis](https://unity.com/products/sentis) is the inference engine that runs on Unity 2023 or above. It is an API that allows you to easily integrate and run neural network models in your game or application, making use of hardware acceleration. Because Unity can export to many different form factors, including PC, mobile, and consoles, this is an easy way to run neural network models on many different types of hardware. ## Exploring Sentis Models in the Hub You will find `unity-sentis` models by filtering at the left of the [models page](https://huggingface.co/models?library=unity-sentis). All the Sentis models in the Hub come with code and instructions to easily get you started using the model in Unity. All Sentis models under the `unity` namespace (for example, [unity/sentis-yolotinyv7](https://huggingface.co/unity/sentis-yolotinyv7)) have been validated to work, so you can be sure they will run in Unity. To get more details about using Sentis, you can read its [documentation](https://docs.unity3d.com/Packages/com.unity.sentis@latest). To get help from others using Sentis, you can ask in its [discussion forum](https://discussions.unity.com/c/ai-beta/sentis). ## Types of files Each repository will contain several types of files: * ``sentis`` files: These are the main model files that contain the neural networks that run on Unity. * ``ONNX`` files: This is an alternative format you can include in addition to, or instead of, the Sentis files. It can be useful for visualization with third-party tools such as [Netron](https://github.com/lutzroeder/netron). * ``cs`` files: These are C# files that contain the code to run the model on Unity. * ``info.json``: This file contains information about the files in the repository. * Data files. These are other files that are needed to run the model. They could include vocabulary files, lists of class names, etc. Some typical files will have extensions ``json`` or ``txt``. * ``README.md``. This is the model card. It contains instructions on how to use the model and other relevant information. ## Running the model Always refer to the instructions on the model card. It is expected that you have some knowledge of Unity and some basic knowledge of C#. 1. Open Unity 2023 or above and create a new scene. 2. Install the ``com.unity.sentis`` package from the [package manager](https://docs.unity3d.com/Manual/upm-ui-quick.html). 3. Download your model files (``*.sentis``) and data files and put them in the StreamingAssets folder, which is a subfolder inside the Assets folder.
(If this folder does not exist you can create it). 4. Place your C# file on an object in the scene such as the Main Camera. 5. Refer to the model card to see if there are any other objects you need to create in the scene. In most cases, we only provide the basic implementation to get you up and running. It is up to you to find creative uses. For example, you may want to combine two or more models to do interesting things. ## Sharing your own Sentis models We encourage you to share your own Sentis models on Hugging Face. These may be models you trained yourself or models you have converted to the [Sentis format](https://docs.unity3d.com/Packages/com.unity.sentis@1.3/manual/serialize-a-model.html) and have tested to run in Unity. Please provide the models in the Sentis format for each repository you upload. This provides an extra check that they will run in Unity and is also the preferred format for large models. You can also include the original ONNX versions of the model files. Provide a C# file with a minimal implementation. For example, an image processing model should have code that shows how to prepare the image for the input and construct the image from the output. Alternatively, you can link to some external sample code. This will make it easy for others to download and use the model in Unity. Provide any data files needed to run the model. For example, vocabulary files. Finally, please provide an ``info.json`` file, which lists your project's files. This helps in counting the downloads. Some examples of the contents of ``info.json`` are: ``` { "code": [ "mycode.cs"], "models": [ "model1.sentis", "model2.sentis"], "data": [ "vocab.txt" ] } ``` Or if your code sample is external: ``` { "sampleURL": [ "http://sampleunityproject"], "models": [ "model1.sentis", "model2.sentis"] } ``` ## Additional Information We also have some full [sample projects](https://github.com/Unity-Technologies/sentis-samples) to help you get started using Sentis. ### Datasets Download Stats https://huggingface.co/docs/hub/datasets-download-stats.md # Datasets Download Stats ## How are downloads counted for datasets? Counting the number of downloads for datasets is not a trivial task, as a single dataset repository might contain multiple files, from multiple subsets and splits (e.g. train/validation/test) and sometimes with many files in a single split. To solve this issue and avoid counting one person's download multiple times, we treat all files downloaded by a user (based on their IP address) within a 5-minute window in a given repository as a single dataset download. This counting happens automatically on our servers when files are downloaded (through GET or HEAD requests), with no need to collect any user information or make additional calls. ## Before September 2024 The Hub used to provide download stats only for the datasets loadable via the `datasets` library. To determine the number of downloads, the Hub previously counted every time `load_dataset` was called in Python, excluding Hugging Face's CI tooling on GitHub. No information was sent from the user, and no additional calls were made for this. The count was done server-side as we served files for downloads. This meant that: * The download count was the same regardless of whether the data is directly stored on the Hub repo or if the repository has a [script](/docs/datasets/dataset_script) to load the data from an external source. 
* If a user manually downloaded the data using tools like `wget` or the Hub's user interface (UI), those downloads were not included in the download count. ### Cookie limitations in Spaces https://huggingface.co/docs/hub/spaces-cookie-limitations.md # Cookie limitations in Spaces In Hugging Face Spaces, applications have certain limitations when using cookies. This is primarily due to the structure of the Spaces' pages (`https://huggingface.co/spaces/<user>/<space-name>`), which contain applications hosted on a different domain (`*.hf.space`) within an iframe. For security reasons, modern browsers tend to restrict the use of cookies from iframe pages hosted on a different domain than the parent page. ## Impact on Hosting Streamlit Apps with Docker SDK One instance where these cookie restrictions can become problematic is when hosting Streamlit applications using the Docker SDK. By default, Streamlit enables cookie-based XSRF protection. As a result, certain components that submit data to the server, such as `st.file_uploader()`, will not work properly on HF Spaces where cookie usage is restricted. To work around this issue, you would need to set the `server.enableXsrfProtection` option in Streamlit to `false`. There are two ways to do this: 1. Command line argument: The option can be specified as a command line argument when running the Streamlit application. Here is the example command: ```shell streamlit run app.py --server.enableXsrfProtection false ``` 2. Configuration file: Alternatively, you can specify the option in the Streamlit configuration file `.streamlit/config.toml`. You would write it like this: ```toml [server] enableXsrfProtection = false ``` > [!TIP] > When you are using the Streamlit SDK, you don't need to worry about this because the SDK does it for you. ### Embed the Dataset Viewer in a webpage https://huggingface.co/docs/hub/datasets-viewer-embed.md # Embed the Dataset Viewer in a webpage You can embed the Dataset Viewer in your own webpage using an iframe. The URL to use is `https://huggingface.co/datasets/<namespace>/<dataset-name>/embed/viewer`, where `<namespace>` is the owner of the dataset (user or organization) and `<dataset-name>` is the name of the dataset. You can also pass other parameters like the subset, split, filter, search, or selected row. For example, the following iframe embeds the Dataset Viewer for the `glue` dataset from the `nyu-mll` organization: ```html ``` You can also get the embed code directly from the Dataset Viewer interface. Click on the `Embed` button in the top right corner of the Dataset Viewer: It will open a modal with the iframe code that you can copy and paste into your webpage: ## Parameters All the parameters of the dataset viewer page can also be passed to the embedded viewer (filter, search, specific split, etc.) by adding them to the iframe URL. For example, to show the results of the search on `mangrove` in the `test` split of the `rte` subset of the `nyu-mll/glue` dataset, you can use the following URL: ```html ``` You can get this code directly from the Dataset Viewer interface by performing the search, clicking on the `⋮` button then `Embed`: It will open a modal with the iframe code that you can copy and paste into your webpage: ## Examples The embedded dataset viewer is used in multiple Machine Learning tools and platforms to display datasets. Here are a few examples. Open a [pull request](https://github.com/huggingface/hub-docs/blob/main/docs/hub/datasets-viewer-embed.md) if you want to appear in this section!
### Tool: ZenML [`htahir1`](https://huggingface.co/htahir1) shares a [blog post](https://www.zenml.io/blog/embedding-huggingface-datasets-visualizations-with-zenml) showing how you can use the [ZenML](https://huggingface.co/zenml) integration with the Datasets Viewer to visualize a Hugging Face dataset within a ZenML pipeline. ### Tool: Metaflow + Outerbounds [`eddie-OB`](https://huggingface.co/eddie-OB) shows in a [demo video](https://www.linkedin.com/posts/eddie-mattia_the-team-at-hugging-facerecently-released-activity-7219416449084272641-swIu) how to include the dataset viewer in Metaflow cards on [Outerbounds](https://huggingface.co/outerbounds). ### Tool: AutoTrain [`abhishek`](https://huggingface.co/abhishek) showcases how the dataset viewer is integrated into [AutoTrain](https://huggingface.co/autotrain) in a [demo video](https://x.com/abhi1thakur/status/1813892464144798171). ### Datasets: Alpaca-style datasets gallery [`davanstrien`](https://huggingface.co/davanstrien) showcases the [collection of Alpaca-style datasets](https://huggingface.co/collections/librarian-bots/alpaca-style-datasets-66964d3e490f463859002588) in a [space](https://huggingface.co/spaces/davanstrien/collection_dataset_viewer). ### Datasets: Docmatix [`andito`](https://huggingface.co/andito) uses the embedded viewer in the [blog post](https://huggingface.co/blog/docmatix) announcing the release of [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix), a huge dataset for Document Visual Question Answering (DocVQA). ### App: Electric Vehicle Charge Finder [`cfahlgren1`](https://huggingface.co/cfahlgren1) [embeds](https://x.com/calebfahlgren/status/1813356638239125735) the dataset viewer in the [Electric Vehicle Charge Finder app](https://charge-finder.vercel.app/). ### App: Masader - Arabic NLP data catalogue [`Zaid`](https://huggingface.co/Zaid) [showcases](https://x.com/zaidalyafeai/status/1815365207775932576) the dataset viewer in [Masader - the Arabic NLP data catalogue](https://arbml.github.io/masader//). ### Using ESPnet at Hugging Face https://huggingface.co/docs/hub/espnet.md # Using ESPnet at Hugging Face `espnet` is an end-to-end toolkit for speech processing, including automatic speech recognition, text to speech, speech enhancement, diarization, and other tasks. ## Exploring ESPnet in the Hub You can find hundreds of `espnet` models by filtering at the left of the [models page](https://huggingface.co/models?library=espnet&sort=downloads). All models on the Hub come with useful features: 1. An automatically generated model card with a description, a training configuration, licenses and more. 2. Metadata tags that help for discoverability and contain information such as license, language and datasets. 3. An interactive widget you can use to play with the model directly in the browser. 4. An Inference API that allows you to make inference requests. ## Using existing models For a full guide on loading pre-trained models, we recommend checking out the [official guide](https://github.com/espnet/espnet_model_zoo). If you're interested in doing inference, different classes for different tasks have a `from_pretrained` method that allows loading models from the Hub. For example: * `Speech2Text` for Automatic Speech Recognition. * `Text2Speech` for Text to Speech. * `SeparateSpeech` for Audio Source Separation.
Here is an inference example: ```py import soundfile from espnet2.bin.tts_inference import Text2Speech text2speech = Text2Speech.from_pretrained("model_name") speech = text2speech("foobar")["wav"] soundfile.write("out.wav", speech.numpy(), text2speech.fs, "PCM_16") ``` If you want to see how to load a specific model, you can click `Use in ESPnet` and you will be given a working snippet that you can load it! ## Sharing your models `ESPnet` outputs a `zip` file that can be uploaded to Hugging Face easily. For a full guide on sharing models, we recommend checking out the [official guide](https://github.com/espnet/espnet_model_zoo#register-your-model)). The `run.sh` script allows to upload a given model to a Hugging Face repository. ```bash ./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo ``` ## Additional resources * ESPnet [docs](https://espnet.github.io/espnet/index.html). * ESPnet model zoo [repository](https://github.com/espnet/espnet_model_zoo). * Integration [docs](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md). ### Pandas https://huggingface.co/docs/hub/datasets-pandas.md # Pandas [Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit. Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub. ## Load a DataFrame You can load data from local files or from remote storage like Hugging Face Datasets. Pandas supports many formats including CSV, JSON and Parquet: ```python >>> import pandas as pd >>> df = pd.read_csv("path/to/data.csv") ``` To load a file from Hugging Face, the path needs to start with `hf://`. For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet`: ```python >>> import pandas as pd >>> df = pd.read_parquet("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet") >>> df text label 0 I rented I AM CURIOUS-YELLOW from my video sto... 0 1 "I Am Curious: Yellow" is a risible and preten... 0 2 If only to avoid making this type of film in t... 0 3 This film was probably inspired by Godard's Ma... 0 4 Oh, brother...after hearing about this ridicul... 0 ... ... ... 24995 A hit at the time but now better categorised a... 1 24996 I love this movie like no other. Another time ... 1 24997 This film and it's sequel Barry Mckenzie holds... 1 24998 'The Adventures Of Barry McKenzie' started lif... 1 24999 The story centers around Barry McKenzie who mu... 1 ``` For more information on the Hugging Face paths and how they are implemented, please refer to the [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system). ## Save a DataFrame You can save a pandas DataFrame using `to_csv/to_json/to_parquet` to a local file or to Hugging Face directly. 
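For a local file, this is a single method call. As a minimal sketch (the DataFrame contents and file names here are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})

# Write to a local file in the format of your choice
df.to_csv("data.csv", index=False)
df.to_json("data.jsonl", orient="records", lines=True)
df.to_parquet("data.parquet")
```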
To save the DataFrame on Hugging Face, you first need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using: ``` hf auth login ``` Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using: ```python from huggingface_hub import HfApi HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") ``` Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in Pandas: ```python import pandas as pd df.to_parquet("hf://datasets/username/my_dataset/imdb.parquet") # or write in separate files if the dataset has train/validation/test splits df_train.to_parquet("hf://datasets/username/my_dataset/train.parquet") df_valid.to_parquet("hf://datasets/username/my_dataset/validation.parquet") df_test .to_parquet("hf://datasets/username/my_dataset/test.parquet") ``` ## Use Images You can load a folder with a metadata file containing a field for the names or paths to the images, structured like this: ``` Example 1: Example 2: folder/ folder/ ├── metadata.csv ├── metadata.csv ├── img000.png └── images ├── img001.png ├── img000.png ... ... └── imgNNN.png └── imgNNN.png ``` You can iterate on the images paths like this: ```python import pandas as pd folder_path = "path/to/folder/" df = pd.read_csv(folder_path + "metadata.csv") for image_path in (folder_path + df["file_name"]): ... ``` Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.csv` or `.jsonl` file with a `file_name` field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face. ```python from huggingface_hub import HfApi api = HfApi() api.upload_folder( folder_path=folder_path, repo_id="username/my_image_dataset", repo_type="dataset", ) ``` ### Image methods and Parquet Using [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods) you enable `PIL.Image` methods on an image column. It also enables saving the dataset as one single Parquet file containing both the images and the metadata: ```python import pandas as pd from pandas_image_methods import PILMethods pd.api.extensions.register_series_accessor("pil")(PILMethods) df["image"] = (folder_path + df["file_name"]).pil.open() df.to_parquet("data.parquet") ``` All the `PIL.Image` methods are available, e.g. ```python df["image"] = df["image"].pil.rotate(90) ``` ## Use Audios You can load a folder with a metadata file containing a field for the names or paths to the audios, structured like this: ``` Example 1: Example 2: folder/ folder/ ├── metadata.csv ├── metadata.csv ├── rec000.wav └── audios ├── rec001.wav ├── rec000.wav ... ... └── recNNN.wav └── recNNN.wav ``` You can iterate on the audios paths like this: ```python import pandas as pd folder_path = "path/to/folder/" df = pd.read_csv(folder_path + "metadata.csv") for audio_path in (folder_path + df["file_name"]): ... ``` Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-audio#additional-columns) (a `metadata.csv` or `.jsonl` file with a `file_name` field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and audio. 
```python from huggingface_hub import HfApi api = HfApi() api.upload_folder( folder_path=folder_path, repo_id="username/my_audio_dataset", repo_type="dataset", ) ``` ### Audio methods and Parquet Using [pandas-audio-methods](https://github.com/lhoestq/pandas-audio-methods) you enable `soundfile` methods on an audio column. It also enables saving the dataset as one single Parquet file containing both the audios and the metadata: ```python import pandas as pd from pandas_audio_methods import SFMethods pd.api.extensions.register_series_accessor("sf")(SFMethods) df["audio"] = (folder_path + df["file_name"]).sf.open() df.to_parquet("data.parquet") ``` This makes it easy to use with `librosa` e.g. for resampling: ```python df["audio"] = [librosa.load(audio, sr=16_000) for audio in df["audio"]] df["audio"] = df["audio"].sf.write() ``` ## Use Transformers You can use `transformers` pipelines on pandas DataFrames to classify, generate text, images, etc. This section shows a few examples with `tqdm` for progress bars. > [!TIP] > Pipelines don't accept a `tqdm` object as input but you can use a python generator instead, in the form `x for x in tqdm(...)` ### Text Classification ```python from transformers import pipeline from tqdm import tqdm pipe = pipeline("text-classification", model="clapAI/modernBERT-base-multilingual-sentiment") # Compute labels df["label"] = [y["label"] for y in pipe(x for x in tqdm(df["text"]))] # Compute labels and scores df[["label", "score"]] = [(y["label"], y["score"]) for y in pipe(x for x in tqdm(df["text"]))] ``` ### Text Generation ```python from transformers import pipeline from tqdm import tqdm pipe = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct") # Generate chat response prompt = "What is the main topic of this sentence ? REPLY IN LESS THAN 3 WORDS. Sentence: '{}'" df["output"] = [y["generated_text"][1]["content"] for y in pipe([{"role": "user", "content": prompt.format(x)}] for x in tqdm(df["text"]))] ``` ### Organizations, Security, and the Hub API https://huggingface.co/docs/hub/other.md # Organizations, Security, and the Hub API ## Contents - [Organizations](./organizations) - [Managing Organizations](./organizations-managing) - [Organization Cards](./organizations-cards) - [Access control in organizations](./organizations-security) - [Enterprise Hub](./enterprise-hub) - [Moderation](./moderation) - [Billing](./billing) - [Digital Object Identifier (DOI)](./doi) - [Security](./security) - [User Access Tokens](./security-tokens) - [Signing commits with GPG](./security-gpg) - [Malware Scanning](./security-malware) - [Pickle Scanning](./security-pickle) - [Hub API Endpoints](./api) - [Webhooks](./webhooks) ### Embedding Atlas https://huggingface.co/docs/hub/datasets-embedding-atlas.md # Embedding Atlas [Embedding Atlas](https://apple.github.io/embedding-atlas/) is an interactive visualization tool for exploring large embedding spaces. It enables you to visualize, cross-filter, and search embeddings alongside associated metadata, helping you understand patterns and relationships in high-dimensional data. All computation happens in your computer, ensuring your data remains private and secure. 
Here is an [example atlas](https://huggingface.co/spaces/davanstrien/megascience) for the [MegaScience](https://huggingface.co/datasets/MegaScience/MegaScience) dataset hosted as a Static Space: ## Key Features - **Interactive exploration**: Navigate through millions of embeddings with smooth, responsive visualization - **Browser-based computation**: Compute embeddings and projections locally without sending data to external servers - **Cross-filtering**: Link and filter data across multiple metadata columns - **Search capabilities**: Find similar data points to a given query or existing item - **Multiple integration options**: Use via command line, Jupyter widgets, or web interface ## Prerequisites First, install Embedding Atlas: ```bash pip install embedding-atlas ``` If you plan to load private datasets from the Hugging Face Hub, you'll also need to [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login): ```bash hf auth login ``` ## Loading Datasets from the Hub Embedding Atlas provides seamless integration with the Hugging Face Hub, allowing you to visualize embeddings from any dataset directly. ### Using the Command Line The simplest way to visualize a Hugging Face dataset is through the command line interface. Try it with the IMDB dataset: ```bash # Load the IMDB dataset from the Hub embedding-atlas stanfordnlp/imdb # Specify the text column for embedding computation embedding-atlas stanfordnlp/imdb --text "text" # Load only a sample for faster exploration embedding-atlas stanfordnlp/imdb --text "text" --sample 5000 ``` For your own datasets, use the same pattern: ```bash # Load your dataset from the Hub embedding-atlas username/dataset-name # Load multiple splits embedding-atlas username/dataset-name --split train --split test # Specify custom text column embedding-atlas username/dataset-name --text "content" ``` ### Using Python and Jupyter You can also use Embedding Atlas in Jupyter notebooks for interactive exploration: ```python from embedding_atlas.widget import EmbeddingAtlasWidget from datasets import load_dataset import pandas as pd # Load the IMDB dataset from Hugging Face Hub dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]") # Convert to pandas DataFrame df = dataset.to_pandas() # Create interactive widget widget = EmbeddingAtlasWidget(df) widget ``` For your own datasets: ```python from embedding_atlas.widget import EmbeddingAtlasWidget from datasets import load_dataset import pandas as pd # Load your dataset from the Hub dataset = load_dataset("username/dataset-name", split="train") df = dataset.to_pandas() # Create interactive widget widget = EmbeddingAtlasWidget(df) widget ``` ### Working with Pre-computed Embeddings If you have datasets with pre-computed embeddings, you can load them directly: ```bash # Load dataset with pre-computed coordinates embedding-atlas username/dataset-name \ --x "embedding_x" \ --y "embedding_y" # Load with pre-computed nearest neighbors embedding-atlas username/dataset-name \ --neighbors "neighbors_column" ``` ## Customizing Embeddings Embedding Atlas uses [SentenceTransformers](https://huggingface.co/sentence-transformers) by default but supports custom embedding models: ```bash # Use a specific embedding model embedding-atlas stanfordnlp/imdb \ --text "text" \ --model "sentence-transformers/all-MiniLM-L6-v2" # For models requiring remote code execution embedding-atlas username/dataset-name \ --model "custom/model" \ --trust-remote-code ``` ### UMAP Projection Parameters Fine-tune the 
dimensionality reduction for your specific use case: ```bash embedding-atlas stanfordnlp/imdb \ --text "text" \ --umap-n-neighbors 30 \ --umap-min-dist 0.1 \ --umap-metric "cosine" ``` ## Use Cases ### Exploring Text Datasets Visualize and explore text corpora to identify clusters, outliers, and patterns: ```python from embedding_atlas.widget import EmbeddingAtlasWidget from datasets import load_dataset import pandas as pd # Load a text classification dataset dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]") df = dataset.to_pandas() # Visualize with metadata widget = EmbeddingAtlasWidget(df) widget ``` ## Additional Resources - [Embedding Atlas GitHub Repository](https://github.com/apple/embedding-atlas) - [Official Documentation](https://apple.github.io/embedding-atlas/) - [Interactive Demo](https://apple.github.io/embedding-atlas/upload/) - [Command Line Reference](https://apple.github.io/embedding-atlas/tool.html) ### The Model Hub https://huggingface.co/docs/hub/models-the-hub.md # The Model Hub ## What is the Model Hub? The Model Hub is where the members of the Hugging Face community can host all of their model checkpoints for simple storage, discovery, and sharing. Download pre-trained models with the [`huggingface_hub` client library](https://huggingface.co/docs/huggingface_hub/index), with 🤗 [`Transformers`](https://huggingface.co/docs/transformers/index) for fine-tuning and other usages, or with any of the over [15 integrated libraries](./models-libraries). You can even leverage [Inference Providers](/docs/inference-providers/) or [Inference Endpoints](https://huggingface.co/docs/inference-endpoints) to use models in production settings. You can refer to the following video for a guide on navigating the Model Hub: To learn how to upload models to the Hub, you can refer to the [Repositories Getting Started Guide](./repositories-getting-started). ### Using Sentence Transformers at Hugging Face https://huggingface.co/docs/hub/sentence-transformers.md # Using Sentence Transformers at Hugging Face `sentence-transformers` is a library that provides easy methods to compute embeddings (dense vector representations) for sentences, paragraphs, and images. Texts are embedded in a vector space such that similar text is close, which enables applications such as semantic search, clustering, and retrieval. ## Exploring sentence-transformers in the Hub You can find over 500 `sentence-transformers` models by filtering at the left of the [models page](https://huggingface.co/models?library=sentence-transformers&sort=downloads). Most of these models support different tasks, such as [`feature-extraction`](https://huggingface.co/models?library=sentence-transformers&pipeline_tag=feature-extraction&sort=downloads) to generate the embedding, and [`sentence-similarity`](https://huggingface.co/models?library=sentence-transformers&pipeline_tag=sentence-similarity&sort=downloads) to determine how similar a given sentence is to others. You can also find an overview of the official pre-trained models in [the official docs](https://www.sbert.net/docs/pretrained_models.html). All models on the Hub come with the following features: 1. An automatically generated model card with a description, example code snippets, architecture overview, and more. 2. Metadata tags that help for discoverability and contain information such as license. 3. An interactive widget you can use to play with the model directly in the browser. 4. An Inference API that allows you to make inference requests.
## Using existing models The pre-trained models on the Hub can be loaded with a single line of code ```py from sentence_transformers import SentenceTransformer model = SentenceTransformer('model_name') ``` Here is an example that encodes sentences and then computes the distance between them for doing semantic search. ```py from sentence_transformers import SentenceTransformer, util model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') query_embedding = model.encode('How big is London') passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census', 'London is known for its financial district']) print("Similarity:", util.dot_score(query_embedding, passage_embedding)) ``` If you want to see how to load a specific model, you can click `Use in sentence-transformers` and you will be given a working snippet to load it! ## Sharing your models You can share your Sentence Transformers models by using the `save_to_hub` method from a trained model. ```py from sentence_transformers import SentenceTransformer # Load or train a model model.save_to_hub("my_new_model") ``` This command creates a repository with an automatically generated model card, an inference widget, example code snippets, and more! [Here](https://huggingface.co/osanseviero/my_new_model) is an example. ## Additional resources * Sentence Transformers [library](https://github.com/UKPLab/sentence-transformers). * Sentence Transformers [docs](https://www.sbert.net/). * Integration with Hub [announcement](https://huggingface.co/blog/sentence-transformers-in-the-hub). ### Using SpeechBrain at Hugging Face https://huggingface.co/docs/hub/speechbrain.md # Using SpeechBrain at Hugging Face `speechbrain` is an open-source and all-in-one conversational toolkit for audio/speech. The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, speech separation, language identification, multi-microphone signal processing, and many others. ## Exploring SpeechBrain in the Hub You can find `speechbrain` models by filtering at the left of the [models page](https://huggingface.co/models?library=speechbrain). All models on the Hub come with the following features: 1. An automatically generated model card with a brief description. 2. Metadata tags that help for discoverability with information such as the language, license, paper, and more. 3. An interactive widget you can use to play with the model directly in the browser. 4. An Inference API that allows you to make inference requests. ## Using existing models `speechbrain` offers different interfaces to manage pretrained models for different tasks, such as `EncoderClassifier`, `SepformerSeparation`, and `SpectralMaskEnhancement`. These classes have a `from_hparams` method you can use to load a model from the Hub. Here is an example of running inference for sound recognition on urban sounds. ```py import torchaudio from speechbrain.pretrained import EncoderClassifier classifier = EncoderClassifier.from_hparams( source="speechbrain/urbansound8k_ecapa" ) out_prob, score, index, text_lab = classifier.classify_file('speechbrain/urbansound8k_ecapa/dog_bark.wav') ``` If you want to see how to load a specific model, you can click `Use in speechbrain` and you will be given a working snippet to load it! ## Additional resources * SpeechBrain [website](https://speechbrain.github.io/).
* SpeechBrain [docs](https://speechbrain.readthedocs.io/en/latest/index.html). ### Dash on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-dash.md # Dash on Spaces With Dash Open Source, you can create data apps on your laptop in pure Python, no JavaScript required. Get familiar with Dash by building a [sample app](https://dash.plotly.com/tutorial) with open source. Scale up with [Dash Enterprise](https://plotly.com/dash/) when your Dash app is ready for department or company-wide consumption. Or, launch your initiative with Dash Enterprise from the start to unlock developer productivity gains and hands-on acceleration from Plotly's team. ## Deploy Dash on Spaces To get started with Dash on Spaces, click the button below: This will start building your Space using Plotly's Dash Docker template. If successful, you should see a similar application to the [Dash template app](https://huggingface.co/spaces/dash/dash-app-template). ## Customizing your Dash app If you have never built with Dash before, we recommend getting started with our [Dash in 20 minutes tutorial](https://dash.plotly.com/tutorial). When you create a Dash Space, you'll get a few key files to help you get started: ### 1. app.py This is the main app file that defines the core logic of your project. Dash apps are often structured as modules, and you can optionally separate your layout, callbacks, and data into other files, like `layout.py`, etc. Inside of `app.py` you will see: 1. `from dash import Dash, html` We import the `Dash` object to define our app, and the `html` library, which gives us building blocks to assemble our project. 2. `app = Dash()` Here, we define our app. Layout, server, and callbacks are _bound_ to the `app` object. 3. `server = app.server` Here, we define our server variable, which is used to run the app in production. 4. `app.layout = ` The starter app layout is defined as a list of Dash components, an individual Dash component, or a function that returns either. The `app.layout` is your initial layout that will be updated as a single-page application by callbacks and other logic in your project. 5. `if __name__ == '__main__': app.run(debug=True)` If you are running your project locally with `python app.py`, `app.run(...)` will execute and start up a development server to work on your project, with features including hot reloading, the callback graph, and more. In production, we recommend `gunicorn`, which is a production-grade server. Debug features will not be enabled when running your project with `gunicorn`, so this line will never be reached. A minimal sketch that puts these pieces together is shown at the end of this section. ### 2. Dockerfile The Dockerfile for a Dash app is minimal since Dash has few system dependencies. The key requirements are: - It installs the dependencies listed in `requirements.txt` (using `uv`) - It creates a non-root user for security - It runs the app with `gunicorn` using `gunicorn app:server --workers 4` You may need to modify this file if your application requires additional system dependencies, permissions, or other CLI flags. ### 3. requirements.txt The Space will automatically install dependencies listed in the `requirements.txt` file. At minimum, you must include `dash` and `gunicorn` in this file. You will want to add any other required packages your app needs. The Dash Space template provides a basic setup that you can extend based on your needs.
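Putting the pieces above together, a minimal sketch of such an `app.py` could look like the following (the layout contents are placeholders, not part of the actual template):

```python
from dash import Dash, html

# Define the app; layout, server, and callbacks are bound to this object
app = Dash()

# Expose the underlying Flask server for gunicorn (`gunicorn app:server`)
server = app.server

# Initial layout: a component, a list of components, or a function returning either
app.layout = html.Div([
    html.H1("Hello from Spaces"),
    html.P("Edit app.py to build out your Dash app."),
])

if __name__ == "__main__":
    # Local development server with hot reloading; not used under gunicorn
    app.run(debug=True)
```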
## Additional Resources and Support - [Dash documentation](https://dash.plotly.com) - [Dash GitHub repository](https://github.com/plotly/dash) - [Dash Community Forums](https://community.plotly.com) - [Dash Enterprise](https://plotly.com/dash) - [Dash template Space](https://huggingface.co/spaces/plotly/dash-app-template) ## Troubleshooting If you encounter issues: 1. Make sure your app runs locally using `python app.py` 2. Check that all required packages are listed in `requirements.txt` 3. Verify the port configuration matches (7860 is the default for Spaces) 4. Check Space logs for any Python errors For more help, visit the [Plotly Community Forums](https://community.plotly.com) or [open an issue](https://github.com/plotly/dash/issues). ### Model Cards https://huggingface.co/docs/hub/model-cards.md # Model Cards ## What are Model Cards? Model cards are files that accompany the models and provide handy information. Under the hood, model cards are simple Markdown files with additional metadata. Model cards are essential for discoverability, reproducibility, and sharing! You can find a model card as the `README.md` file in any model repo. The model card should describe: - the model - its intended uses & potential limitations, including biases and ethical considerations as detailed in [Mitchell, 2018](https://arxiv.org/abs/1810.03993) - the training params and experimental info (you can embed or link to an experiment tracking platform for reference) - which datasets were used to train your model - the model's evaluation results The model card template is available [here](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md). How to fill out each section of the model card is described in [the Annotated Model Card](https://huggingface.co/docs/hub/model-card-annotated). Model Cards on the Hub have two key parts, with overlapping information: - [Metadata](#model-card-metadata) - [Text descriptions](#model-card-text) ## Model card metadata A model repo will render its `README.md` as a model card. The model card is a [Markdown](https://en.wikipedia.org/wiki/Markdown) file, with a [YAML](https://en.wikipedia.org/wiki/YAML) section at the top that contains metadata about the model. The metadata you add to the model card supports discovery and easier use of your model. For example: * Allowing users to filter models at https://huggingface.co/models. * Displaying the model's license. * Adding datasets to the metadata will add a message reading `Datasets used to train:` to your model page and link the relevant datasets, if they're available on the Hub. Dataset, metric, and language identifiers are those listed on the [Datasets](https://huggingface.co/datasets), [Metrics](https://huggingface.co/metrics), and [Languages](https://huggingface.co/languages) pages. ### Adding metadata to your model card There are a few different ways to add metadata to your model card, including: - Using the metadata UI - Directly editing the YAML section of the `README.md` file - Via the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub) Python library, see the [docs](https://huggingface.co/docs/huggingface_hub/guides/model-cards#update-metadata) for more details. Many libraries with [Hub integration](./models-libraries) will automatically add metadata to the model card when you upload a model. #### Using the metadata UI You can add metadata to your model card using the metadata UI.
To access the metadata UI, go to the model page and click on the `Edit model card` button in the top right corner of the model card. This will open an editor showing the model card `README.md` file, as well as a UI for editing the metadata. This UI will allow you to add key metadata to your model card and many of the fields will autocomplete based on the information you provide. Using the UI is the easiest way to add metadata to your model card, but it doesn't support all of the metadata fields. If you want to add metadata that isn't supported by the UI, you can edit the YAML section of the `README.md` file directly. #### Editing the YAML section of the `README.md` file You can also directly edit the YAML section of the `README.md` file. If the model card doesn't already have a YAML section, you can add one by adding three `---` at the top of the file, then include all of the relevant metadata, and close the section with another group of `---` like the example below: ```yaml --- language: - "List of ISO 639-1 code for your language" - lang1 - lang2 thumbnail: "url to a thumbnail used in social sharing" tags: - tag1 - tag2 license: "any valid license identifier" datasets: - dataset1 - dataset2 metrics: - metric1 - metric2 base_model: "base model Hub identifier" --- ``` You can find the detailed model card metadata specification here. ### Specifying a library You can specify the supported libraries in the model card metadata section. Find more about our supported libraries [here](./models-libraries). The library will be specified in the following order of priority: 1. Specifying `library_name` in the model card (recommended if your model is not a `transformers` model). This information can be added via the metadata UI or directly in the model card YAML section: ```yaml library_name: flair ``` 2. Having a tag with the name of a library that is supported ```yaml tags: - flair ``` If it's not specified, the Hub will try to automatically detect the library type. However, this approach is discouraged, and repo creators should use the explicit `library_name` as much as possible. 1. By looking into the presence of files such as `*.nemo` or `*.mlmodel`, the Hub can determine if a model is from NeMo or CoreML. 2. In the past, if nothing was detected and there was a `config.json` file, it was assumed the library was `transformers`. For model repos created after August 2024, this is not the case anymore – so you need to `library_name: transformers` explicitly. ### Specifying a base model If your model is a fine-tune, an adapter, or a quantized version of a base model, you can specify the base model in the model card metadata section. This information can also be used to indicate if your model is a merge of multiple existing models. Hence, the `base_model` field can either be a single model ID, or a list of one or more base_models (specified by their Hub identifiers). ```yaml base_model: HuggingFaceH4/zephyr-7b-beta ``` This metadata will be used to display the base model on the model page. 
Users can also use this information to filter models by base model or find models that are derived from a specific base model: For a fine-tuned model: For an adapter (LoRA, PEFT, etc): For a quantized version of another model: For a merge of two or more models: In the merge case, you specify a list of two or more base_models: ```yaml base_model: - Endevor/InfinityRP-v1-7B - l3utterfly/mistral-7b-v0.1-layla-v4 ``` The Hub will infer the type of relationship from the current model to the base model (`"adapter", "merge", "quantized", "finetune"`) but you can also set it explicitly if needed: `base_model_relation: quantized` for instance. ### Specifying a new version If a new version of your model is available in the Hub, you can specify it in a `new_version` field. For example, on `l3utterfly/mistral-7b-v0.1-layla-v3`: ```yaml new_version: l3utterfly/mistral-7b-v0.1-layla-v4 ``` This metadata will be used to display a link to the latest version of a model on the model page. If the model linked in `new_version` also has a `new_version` field, the very latest version will always be displayed. ### Specifying a dataset You can specify the datasets used to train your model in the model card metadata section. The datasets will be displayed on the model page and users will be able to filter models by dataset. You should use the Hub dataset identifier, which is the same as the dataset's repo name: ```yaml datasets: - imdb - HuggingFaceH4/no_robots ``` ### Specifying a task (`pipeline_tag`) You can specify the `pipeline_tag` in the model card metadata. The `pipeline_tag` indicates the type of task the model is intended for. This tag will be displayed on the model page and users can filter models on the Hub by task. This tag is also used to determine which [widget](./models-widgets#enabling-a-widget) to use for the model and which APIs to use under the hood. For `transformers` models, the pipeline tag is automatically inferred from the model's `config.json` file, but you can override it in the model card metadata if required. Editing this field in the metadata UI will ensure that the pipeline tag is valid. Some other libraries with Hub integration will also automatically add the pipeline tag to the model card metadata. ### Specifying a license You can specify the license in the model card metadata section. The license will be displayed on the model page and users will be able to filter models by license. Using the metadata UI, you will see a dropdown of the most common licenses. If required, you can also specify a custom license by adding `other` as the license value and specifying the name and a link to the license in the metadata. ```yaml # Example from https://huggingface.co/coqui/XTTS-v1 --- license: other license_name: coqui-public-model-license license_link: https://coqui.ai/cpml --- ``` If the license is not available via a URL, you can link to a LICENSE stored in the model repo. ### Evaluation Results You can specify your **model's evaluation results** in a structured way in the model card metadata. Results are parsed by the Hub and displayed in a widget on the model page. Here is an example of how it looks for the [bigcode/starcoder](https://huggingface.co/bigcode/starcoder) model: The metadata spec was based on Papers with code's [model-index specification](https://github.com/paperswithcode/model-index). This allows us to directly index the results into Papers with code's leaderboards when appropriate.
You can also link the source from where the eval results have been computed. Here is a partial example to describe [01-ai/Yi-34B](https://huggingface.co/01-ai/Yi-34B)'s score on the ARC benchmark. The result comes from the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), which is defined as the `source`: ```yaml --- model-index: - name: Yi-34B results: - task: type: text-generation dataset: name: ai2_arc type: ai2_arc metrics: - name: AI2 Reasoning Challenge (25-Shot) type: AI2 Reasoning Challenge (25-Shot) value: 64.59 source: name: Open LLM Leaderboard url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard --- ``` For more details on how to format this data, check out the [Model Card specifications](https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1). ### CO2 Emissions The model card is also a great place to show information about the CO2 impact of your model. Visit our [guide on tracking and reporting CO2 emissions](./model-cards-co2) to learn more. ### Linking a Paper If the model card includes a link to a Paper page (either on HF or an arXiv abstract/PDF), the Hugging Face Hub will extract the arXiv ID and include it in the model tags with the format `arxiv:`. Clicking on the tag will let you: * Visit the Paper page * Filter for other models on the Hub that cite the same paper. Read more about Paper pages [here](./paper-pages). ## Model Card text Details on how to fill out a human-readable model card without Hub-specific metadata (so that it may be printed out, cut+pasted, etc.) are available in the [Annotated Model Card](./model-card-annotated). ## FAQ ### How are model tags determined? Each model page lists all the model's tags in the page header, below the model name. These are primarily computed from the model card metadata, although some are added automatically, as described in [Enabling a Widget](./models-widgets#enabling-a-widget). ### Can I add custom tags to my model? Yes, you can add custom tags to your model by adding them to the `tags` field in the model card metadata. The metadata UI will suggest some popular tags, but you can add any tag you want. For example, you could indicate that your model is focused on finance by adding a `finance` tag. ### How can I indicate that my model is not suitable for all audiences? You can add a `not-for-all-audiences` tag to your model card metadata. When this tag is present, a message will be displayed on the model page indicating that the model is not for all audiences. Users can click through this message to view the model card. ### Can I write LaTeX in my model card? Yes! The Hub uses the [KaTeX](https://katex.org/) math typesetting library to render math formulas server-side before parsing the Markdown. You have to use the following delimiters: - `$$ ... $$` for display mode - `\\(...\\)` for inline mode (no space between the slashes and the parenthesis). Then you'll be able to write: $$ \LaTeX $$ $$ \mathrm{MSE} = \left(\frac{1}{n}\right)\sum_{i=1}^{n}(y_{i} - x_{i})^{2} $$ $$ E=mc^2 $$ ### Using _Adapters_ at Hugging Face https://huggingface.co/docs/hub/adapters.md # Using _Adapters_ at Hugging Face > Note: _Adapters_ has replaced the `adapter-transformers` library and is fully compatible in terms of model weights. See [here](https://docs.adapterhub.ml/transitioning.html) for more.
[_Adapters_](https://github.com/adapter-hub/adapters) is an add-on library to 🤗 `transformers` for efficiently fine-tuning pre-trained language models using adapters and other parameter-efficient methods. _Adapters_ also provides various methods for composition of adapter modules during training and inference. You can learn more about this in the [_Adapters_ paper](https://arxiv.org/abs/2311.11077). ## Exploring _Adapters_ on the Hub You can find _Adapters_ models by filtering at the left of the [models page](https://huggingface.co/models?library=adapter-transformers&sort=downloads). Some adapter models can be found in the Adapter Hub [repository](https://github.com/adapter-hub/hub). Models from both sources are aggregated on the [AdapterHub website](https://adapterhub.ml/explore/). ## Installation To get started, you can refer to the [AdapterHub installation guide](https://docs.adapterhub.ml/installation.html). You can also use the following one-line install through pip: ``` pip install adapters ``` ## Using existing models For a full guide on loading pre-trained adapters, we recommend checking out the [official guide](https://docs.adapterhub.ml/loading.html). As a brief summary, a full setup consists of three steps: 1. Load a base `transformers` model with the `AutoAdapterModel` class provided by _Adapters_. 2. Use the `load_adapter()` method to load and add an adapter. 3. Activate the adapter via `active_adapters` (for inference) or activate and set it as trainable via `train_adapter()` (for training). Make sure to also check out [composition of adapters](https://docs.adapterhub.ml/adapter_composition.html). ```py from adapters import AutoAdapterModel # 1. model = AutoAdapterModel.from_pretrained("FacebookAI/roberta-base") # 2. adapter_name = model.load_adapter("AdapterHub/roberta-base-pf-imdb") # 3. model.active_adapters = adapter_name # or model.train_adapter(adapter_name) ``` You can also use `list_adapters` to find all adapter models programmatically: ```py from adapters import list_adapters # source can be "ah" (AdapterHub), "hf" (hf.co) or None (for both, default) adapter_infos = list_adapters(source="hf", model_name="FacebookAI/roberta-base") ``` If you want to see how to load a specific model, you can click `Use in Adapters` and you will be given a working snippet that you can load it! ## Sharing your models For a full guide on sharing models with _Adapters_, we recommend checking out the [official guide](https://docs.adapterhub.ml/huggingface_hub.html#uploading-to-the-hub). You can share your adapter by using the `push_adapter_to_hub` method from a model that already contains an adapter. ```py model.push_adapter_to_hub( "my-awesome-adapter", "awesome_adapter", adapterhub_tag="sentiment/imdb", datasets_tag="imdb" ) ``` This command creates a repository with an automatically generated model card and all necessary metadata. ## Additional resources * _Adapters_ [repository](https://github.com/adapter-hub/adapters) * _Adapters_ [docs](https://docs.adapterhub.ml) * _Adapters_ [paper](https://arxiv.org/abs/2311.11077) * Integration with Hub [docs](https://docs.adapterhub.ml/huggingface_hub.html) ### Search https://huggingface.co/docs/hub/search.md # Search You can easily search anything on the Hub with **Full-text search**. We index model cards, dataset cards, and Spaces app.py files. 
Go directly to https://huggingface.co/search or, using the search bar at the top of https://huggingface.co, you can select "Try Full-text search" to help find what you seek on the Hub across models, datasets, and Spaces: ## Filter with ease By default, models, datasets, and Spaces are all searched when a user enters a query. If you prefer, you can filter the search to only models, datasets, or Spaces. You can also copy and share the URL from your browser's address bar, which contains the filter information as query parameters. For example, searching for `llama` with a filter to show `Spaces` only produces the URL https://huggingface.co/search/full-text?q=llama&type=space ### Storage Regions on the Hub https://huggingface.co/docs/hub/storage-regions.md # Storage Regions on the Hub > [!WARNING] > This feature is part of the Team & Enterprise plans. Regions allow you to specify where your organization's models, datasets and Spaces are stored. For non-Enterprise Hub users, repositories are always stored in the US. This offers two key benefits: - Regulatory and legal compliance - Performance (faster download/upload speeds and lower latency) Currently available regions: - US 🇺🇸 - EU 🇪🇺 - Coming soon: Asia-Pacific 🌏 ## Getting started with Storage Regions Organizations subscribed to Enterprise Hub can access the Regions settings page to manage their repositories' storage locations. This page displays: - An audit of your organization's repository locations - Options to select where new repositories will be stored > [!TIP] > Some advanced compute options for Spaces, such as ZeroGPU, may not be available in all regions. ## Repository Tag Any repository (model or dataset) stored in a non-default location displays its Region as a tag, allowing organization members to quickly identify repository locations. ## Regulatory and legal compliance Regulated industries often require data storage in specific regions. For EU companies, you can use the Hub for ML development in a GDPR-compliant manner, with datasets, models and inference endpoints stored in EU data centers. ## Performance Storing models and datasets closer to your team and infrastructure significantly improves performance for both uploads and downloads. This impact is substantial given the typically large size of model weights and dataset files. For example, European users storing repositories in the EU region can expect approximately 4-5x faster upload and download speeds compared to US storage. ## Spaces Both a Space's storage and runtime use the chosen region. Available hardware configurations vary by region, and some features may not be available in all regions, such as persistent storage associated with a Space. ### Using 🧨 `diffusers` at Hugging Face https://huggingface.co/docs/hub/diffusers.md # Using 🧨 `diffusers` at Hugging Face Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you’re looking for a simple inference solution or want to train your own diffusion model, Diffusers is a modular toolbox that supports both. The library is designed with a focus on usability over performance, simple over easy, and customizability over abstractions. ## Exploring Diffusers in the Hub There are over 10,000 `diffusers` compatible pipelines on the Hub, which you can find by filtering at the left of [the models page](https://huggingface.co/models?library=diffusers&sort=downloads).
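If you prefer to query the Hub programmatically, a similar filter can be applied with the `huggingface_hub` client. This is a minimal sketch (the `limit` value and the field printed are just illustrative):

```py
from huggingface_hub import HfApi

api = HfApi()

# List the five most-downloaded diffusers-compatible models on the Hub,
# the programmatic equivalent of filtering by library on the models page.
for model in api.list_models(library="diffusers", sort="downloads", limit=5):
    print(model.id)
```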
Diffusion systems are typically composed of multiple components such as a text encoder, UNet, VAE, and scheduler. Even though they are not standalone models, the pipeline abstraction makes it easy to use them for inference or training. You can find diffusion pipelines for many different tasks: * Generating images from natural language text prompts ([text-to-image](https://huggingface.co/models?library=diffusers&pipeline_tag=text-to-image&sort=downloads)). * Transforming images using natural language text prompts ([image-to-image](https://huggingface.co/models?library=diffusers&pipeline_tag=image-to-image&sort=downloads)). * Generating videos from natural language descriptions ([text-to-video](https://huggingface.co/models?library=diffusers&pipeline_tag=text-to-video&sort=downloads)). You can try out the models directly in the browser if you want to test them out without downloading them, thanks to the in-browser widgets! ## Diffusers repository files A [Diffusers](https://hf.co/docs/diffusers/index) model repository contains all the required model sub-components such as the variational autoencoder for encoding images and decoding latents, text encoder, transformer model, and more. These sub-components are organized into a multi-folder layout. Each subfolder contains the weights and configuration - where applicable - for each component, similar to a [Transformers](./transformers) model. Weights are usually stored as safetensors files and the configuration is usually a JSON file with information about the model architecture. ## Using existing pipelines All `diffusers` pipelines are a line away from being used! To run generation, we recommend always starting from the `DiffusionPipeline`: ```py from diffusers import DiffusionPipeline pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0") ``` If you want to load a specific pipeline component such as the UNet, you can do so as follows: ```py from diffusers import UNet2DConditionModel unet = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet") ``` ## Sharing your pipelines and models All the [pipeline classes](https://huggingface.co/docs/diffusers/main/api/pipelines/overview), [model classes](https://huggingface.co/docs/diffusers/main/api/models/overview), and [scheduler classes](https://huggingface.co/docs/diffusers/main/api/schedulers/overview) are fully compatible with the Hub. More specifically, they can be easily loaded from the Hub using the `from_pretrained()` method and can be shared with others using the `push_to_hub()` method. For more details, please check out the [documentation](https://huggingface.co/docs/diffusers/main/en/using-diffusers/push_to_hub). ## Additional resources * Diffusers [library](https://github.com/huggingface/diffusers). * Diffusers [docs](https://huggingface.co/docs/diffusers/index). ### Perform vector similarity search https://huggingface.co/docs/hub/datasets-duckdb-vector-similarity-search.md # Perform vector similarity search The Fixed-Length Arrays feature was added in DuckDB version 0.10.0. This lets you use vector embeddings in DuckDB tables, making your data analysis even more powerful. Additionally, the array_cosine_similarity function was introduced. This function measures the cosine of the angle between two vectors, indicating their similarity. A value of 1 means they’re perfectly aligned, 0 means they’re perpendicular, and -1 means they’re completely opposite. Let's explore how to use this function for similarity searches.
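As a quick sanity check before working with a real dataset, here is a minimal sketch of the function using the DuckDB Python client (the vectors are arbitrary; the walkthrough below uses the DuckDB CLI instead):

```py
import duckdb

# array_cosine_similarity expects fixed-size arrays, so the list literals are cast to FLOAT[3].
# Identical vectors score 1.0, orthogonal vectors score 0.0.
print(duckdb.sql("""
    SELECT
        array_cosine_similarity([1.0, 2.0, 3.0]::FLOAT[3], [1.0, 2.0, 3.0]::FLOAT[3]) AS identical,
        array_cosine_similarity([1.0, 0.0, 0.0]::FLOAT[3], [0.0, 1.0, 0.0]::FLOAT[3]) AS orthogonal
"""))
```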
In this section, we’ll show you how to perform similarity searches using DuckDB. We will use the [asoria/awesome-chatgpt-prompts-embeddings](https://huggingface.co/datasets/asoria/awesome-chatgpt-prompts-embeddings) dataset. First, let's preview a few records from the dataset: ```bash FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' SELECT act, prompt, len(embedding) as embed_len LIMIT 3; ┌──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────┐ │ act │ prompt │ embed_len │ │ varchar │ varchar │ int64 │ ├──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────┤ │ Linux Terminal │ I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output insid… │ 384 │ │ English Translator… │ I want you to act as an English translator, spelling corrector and improver. I will speak to you in any language and you will detect the language, translate it and answer… │ 384 │ │ `position` Intervi… │ I want you to act as an interviewer. I will be the candidate and you will ask me the interview questions for the `position` position. I want you to only reply as the inte… │ 384 │ └──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────┘ ``` Next, let's choose an embedding to use for the similarity search: ```bash FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' SELECT embedding WHERE act = 'Linux Terminal'; ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ embedding │ │ float[] │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ [-0.020781303, -0.029143505, -0.0660217, -0.00932716, -0.02601602, -0.011426172, 0.06627567, 0.11941507, 0.0013917526, 0.012889079, 0.053234346, -0.07380514, 0.04871567, -0.043601237, -0.0025319182, 0.0448… │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ ``` Now, let's use the selected embedding to find similar records: ```bash SELECT act, prompt, array_cosine_similarity(embedding::float[384], (SELECT embedding FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' WHERE act = 'Linux Terminal')::float[384]) AS similarity FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' ORDER BY similarity DESC LIMIT 3; ┌──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┐ │ act │ prompt │ similarity │ │ varchar │ varchar │ float │ 
├──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┤ │ Linux Terminal │ I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output insi… │ 1.0 │ │ JavaScript Console │ I want you to act as a javascript console. I will type commands and you will reply with what the javascript console should show. I want you to only reply with the termin… │ 0.7599728 │ │ R programming Inte… │ I want you to act as a R interpreter. I'll type commands and you'll reply with what the terminal should show. I want you to only reply with the terminal output inside on… │ 0.7303775 │ └──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┘ ``` That's it! You have successfully performed a vector similarity search using DuckDB. ### Spaces https://huggingface.co/docs/hub/spaces.md # Spaces [Hugging Face Spaces](https://huggingface.co/spaces) offer a simple way to host ML demo apps directly on your profile or your organization's profile. This allows you to create your ML portfolio, showcase your projects at conferences or to stakeholders, and work collaboratively with other people in the ML ecosystem. We have built-in support for an awesome SDK that let you build cool apps in Python in a matter of minutes: **[Gradio](https://gradio.app/)**, but you can also unlock the whole power of Docker and host an arbitrary Dockerfile. Finally, you can create static Spaces using JavaScript and HTML. You'll also be able to upgrade your Space to run [on a GPU or other accelerated hardware](./spaces-gpus). ⚡️ ## Contents - [Spaces Overview](./spaces-overview) - [Handling Spaces Dependencies](./spaces-dependencies) - [Spaces Settings](./spaces-settings) - [Using OpenCV in Spaces](./spaces-using-opencv) - [Using Spaces for Organization Cards](./spaces-organization-cards) - [More ways to create Spaces](./spaces-more-ways-to-create) - [Managing Spaces with Github Actions](./spaces-github-actions) - [How to Add a Space to ArXiv](./spaces-add-to-arxiv) - [Spaces Dev Mode](./spaces-dev-mode) - [Spaces GPU Upgrades](./spaces-gpus) - [Spaces Persistent Storage](./spaces-storage) - [Gradio Spaces](./spaces-sdks-gradio) - [Docker Spaces](./spaces-sdks-docker) - [Static HTML Spaces](./spaces-sdks-static) - [Custom Python Spaces](./spaces-sdks-python) - [Embed your Space](./spaces-embed) - [Run your Space with Docker](./spaces-run-with-docker) - [Reference](./spaces-config-reference) - [Changelog](./spaces-changelog) ## Contact Feel free to ask questions on the [forum](https://discuss.huggingface.co/c/spaces/24) if you need help with making a Space, or if you run into any other issues on the Hub. If you're interested in infra challenges, custom demos, advanced GPUs, or something else, please reach out to us by sending an email to **website at huggingface.co**. You can also tag us [on Twitter](https://twitter.com/huggingface)! 🤗 ### Models https://huggingface.co/docs/hub/models.md # Models The Hugging Face Hub hosts many models for a [variety of machine learning tasks](https://huggingface.co/tasks). 
Models are stored in repositories, so they benefit from [all the features](./repositories) possessed by every repo on the Hugging Face Hub. Additionally, model repos have attributes that make exploring and using models as easy as possible. These docs will take you through everything you'll need to know to find models on the Hub, upload your models, and make the most of everything the Model Hub offers! ## Contents - [The Model Hub](./models-the-hub) - [Model Cards](./model-cards) - [CO2 emissions](./model-cards-co2) - [Gated models](./models-gated) - [Libraries](./models-libraries) - [Uploading Models](./models-uploading) - [Downloading Models](./models-downloading) - [Widgets](./models-widgets) - [Widget Examples](./models-widgets-examples) - [Inference API](./models-inference) - [Frequently Asked Questions](./models-faq) - [Advanced Topics](./models-advanced) - [Integrating libraries with the Hub](./models-adding-libraries) - [Tasks](./models-tasks) ### How to configure OIDC SSO with Azure https://huggingface.co/docs/hub/security-sso-azure-oidc.md # How to configure OIDC SSO with Azure This guide will use Azure as the SSO provider and the OpenID Connect (OIDC) protocol as our preferred identity protocol. > [!WARNING] > This feature is part of the Team & Enterprise plans. ### Step 1: Create a new application in your Identity Provider Open a new tab/window in your browser and sign in to the Azure portal of your organization. Navigate to the Microsoft Entra ID admin center and click on "Enterprise applications". You'll be redirected to this page. Then click "New application" at the top and "Create your own application". Input a name for your application (for example, Hugging Face SSO), then select "Register an application to integrate with Microsoft Entra ID (App you're developing)". ### Step 2: Configure your application on Azure Open a new tab/window in your browser and navigate to the SSO section of your organization's settings. Select the OIDC protocol. Copy the "Redirection URI" from the organization's settings on Hugging Face and paste it into the "Redirect URI" field on Azure Entra ID. Make sure you select "Web" in the dropdown menu. The URL looks like this: `https://huggingface.co/organizations/[organizationIdentifier]/oidc/consume`. Save your new application. ### Step 3: Finalize configuration on Hugging Face We will need to collect the following information to finalize the setup on Hugging Face: - The Client ID of the OIDC app - A Client secret of the OIDC app - The Issuer URL of the OIDC app In Microsoft Entra ID, navigate to Enterprise applications, and click on your newly created application in the list. In the application overview, click on "Single sign-on", then "Go to application". In the OIDC app overview, you will find a copiable field named "Application (client) ID". Copy that ID to your clipboard and paste it into the "Client ID" field on Hugging Face. Next, click "Endpoints" in the top menu in Microsoft Entra. Copy the value in the "OpenID connect metadata document" field and paste it into the "Issuer URL" field in Hugging Face. Back in Microsoft Entra, navigate to "Certificates & secrets", and generate a new secret by clicking "New client secret". Once you have created the secret, copy the secret value and paste it into the "Client secret" field on Hugging Face. You can now click "Update and Test OIDC configuration" to save the settings. You should be redirected to your SSO provider (IdP) login prompt.
Once logged in, you'll be redirected to your organization's settings page. A green check mark near the OIDC selector will attest that the test was successful. ### Step 4: Enable SSO in your organization Now that Single Sign-On is configured and tested, you can enable it for members of your organization by clicking on the "Enable" button. Once enabled, members of your organization must complete the SSO authentication flow described in [How does it work?](./security-sso#how-does-it-work). ### Your First Docker Space: Text Generation with T5 https://huggingface.co/docs/hub/spaces-sdks-docker-first-demo.md # Your First Docker Space: Text Generation with T5 In the following sections, you'll learn the basics of creating a Docker Space, configuring it, and deploying your code to it. We'll create a **Text Generation** Space with Docker that'll be used to demo the [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) model, which can generate text given some input text, using FastAPI as the server. You can find a completed version of this hosted [here](https://huggingface.co/spaces/DockerTemplates/fastapi_t5). ## Create a new Docker Space We'll start by [creating a brand new Space](https://huggingface.co/new-space) and choosing **Docker** as our SDK. Hugging Face Spaces are Git repositories, meaning that you can work on your Space incrementally (and collaboratively) by pushing commits. Take a look at the [Getting Started with Repositories](./repositories-getting-started) guide to learn about how you can create and edit files before continuing. If you prefer to work with a UI, you can also do the work directly in the browser. Selecting **Docker** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Docker Space by setting the `sdk` property to `docker` in your `README.md` file's YAML block. ```yaml sdk: docker ``` You have the option to change the default application port of your Space by setting the `app_port` property in your `README.md` file's YAML block. The default port is `7860`. ```yaml app_port: 7860 ``` ## Add the dependencies For the **Text Generation** Space, we'll be building a FastAPI app that showcases a text generation model called Flan T5. For the model inference, we'll be using a [🤗 Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to use the model. We need to start by installing a few dependencies. This can be done by creating a **requirements.txt** file in our repository, and adding the following dependencies to it: ``` fastapi==0.74.* requests==2.27.* sentencepiece==0.1.* torch==1.11.* transformers==4.* uvicorn[standard]==0.17.* ``` These dependencies will be installed in the Dockerfile we'll create later. ## Create the app Let's kick off the process with a dummy FastAPI app to see that we can get an endpoint working. The first step is to create an app file, in this case, we'll call it `main.py`. ```python from fastapi import FastAPI app = FastAPI() @app.get("/") def read_root(): return {"Hello": "World!"} ``` ## Create the Dockerfile The main step for a Docker Space is creating a Dockerfile. You can read more about Dockerfiles [here](https://docs.docker.com/get-started/). Although we're using FastAPI in this tutorial, Dockerfiles give great flexibility to users allowing you to build a new generation of ML demos. 
Let's write the Dockerfile for our application: ```Dockerfile # read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker # you will also find guides on how best to write your Dockerfile FROM python:3.9 # The two following lines are requirements for the Dev Mode to be functional # Learn more about the Dev Mode at https://huggingface.co/dev-mode-explorers RUN useradd -m -u 1000 user WORKDIR /app COPY --chown=user ./requirements.txt requirements.txt RUN pip install --no-cache-dir --upgrade -r requirements.txt COPY --chown=user . /app CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"] ``` When the changes are saved, the Space will rebuild and your demo should be up after a couple of seconds! [Here](https://huggingface.co/spaces/DockerTemplates/fastapi_dummy) is an example result at this point. ### Testing locally **Tip for power users (you can skip):** If you're developing locally, this is a good moment to do `docker build` and `docker run` to debug locally, but it's even easier to push the changes to the Hub and see how it looks! ```bash docker build -t fastapi . docker run -it -p 7860:7860 fastapi ``` If you have [Secrets](spaces-sdks-docker#secret-management), you can use `docker buildx` and pass the secrets as build arguments: ```bash export SECRET_EXAMPLE="my_secret_value" docker buildx build --secret id=SECRET_EXAMPLE,env=SECRET_EXAMPLE -t fastapi . ``` and run with `docker run`, passing the secrets as environment variables: ```bash export SECRET_EXAMPLE="my_secret_value" docker run -it -p 7860:7860 -e SECRET_EXAMPLE=$SECRET_EXAMPLE fastapi ``` ## Adding some ML to our app As mentioned before, the idea is to use a Flan T5 model for text generation. We'll want to add some HTML and CSS for an input field, so let's create a directory called static with `index.html`, `style.css`, and `script.js` files. At this moment, your file structure should look like this: ```bash /static /static/index.html /static/script.js /static/style.css Dockerfile main.py README.md requirements.txt ``` Let's go through all the steps to make this work. We'll skip some of the details of the CSS and HTML. You can find the whole code in the Files and versions tab of the [DockerTemplates/fastapi_t5](https://huggingface.co/spaces/DockerTemplates/fastapi_t5) Space. 1. Write the FastAPI endpoint to do inference. We'll use the `pipeline` from `transformers` to load the [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) model. We'll set an endpoint called `infer_t5` that receives an input and outputs the result of the inference call: ```python from transformers import pipeline pipe_flan = pipeline("text2text-generation", model="google/flan-t5-small") @app.get("/infer_t5") def t5(input): output = pipe_flan(input) return {"output": output[0]["generated_text"]} ``` 2. Write the `index.html` with a simple form for the page. A minimal version could look like this (the full page is in the Space's files): ```html <h1>Text generation using Flan T5</h1> <p>Model: <a href="https://huggingface.co/google/flan-t5-small">google/flan-t5-small</a></p> <form class="text-gen-form"> <label for="text-gen-input">Text prompt</label> <input id="text-gen-input" type="text" /> <button type="submit">Submit</button> </form> <p class="text-gen-output"></p> ``` 3. In the `main.py` file, mount the static files and show the HTML file in the root route: ```python from fastapi.responses import FileResponse from fastapi.staticfiles import StaticFiles app.mount("/", StaticFiles(directory="static", html=True), name="static") @app.get("/") def index() -> FileResponse: return FileResponse(path="/app/static/index.html", media_type="text/html") ``` 4.
In the `script.js` file, handle the form submission and request: ```javascript const textGenForm = document.querySelector(".text-gen-form"); const translateText = async (text) => { const inferResponse = await fetch(`infer_t5?input=${text}`); const inferJson = await inferResponse.json(); return inferJson.output; }; textGenForm.addEventListener("submit", async (event) => { event.preventDefault(); const textGenInput = document.getElementById("text-gen-input"); const textGenParagraph = document.querySelector(".text-gen-output"); textGenParagraph.textContent = await translateText(textGenInput.value); }); ``` 5. Grant permissions to the right directories. As discussed in the [Permissions Section](./spaces-sdks-docker#permissions), the container runs with user ID 1000. That means that the Space might face permission issues. For example, `transformers` downloads and caches models under the path defined by `HF_HOME`. The easiest way to solve this is to create a user with the right permissions and use it to run the container application. We can do this by adding the following lines to the `Dockerfile`: ```Dockerfile # Switch to the "user" user USER user # Set home to the user's home directory ENV HOME=/home/user \ PATH=/home/user/.local/bin:$PATH ``` The final `Dockerfile` should look like this: ```Dockerfile # read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker # you will also find guides on how best to write your Dockerfile FROM python:3.9 # The two following lines are requirements for the Dev Mode to be functional # Learn more about the Dev Mode at https://huggingface.co/dev-mode-explorers RUN useradd -m -u 1000 user WORKDIR /app COPY --chown=user ./requirements.txt requirements.txt RUN pip install --no-cache-dir --upgrade -r requirements.txt COPY --chown=user . /app USER user ENV HOME=/home/user \ PATH=/home/user/.local/bin:$PATH CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"] ``` Success! Your app should be working now! Check out [DockerTemplates/fastapi_t5](https://huggingface.co/spaces/DockerTemplates/fastapi_t5) to see the final result. What a journey! Please remember that Docker Spaces give you lots of freedom, so you're not limited to using FastAPI. From a [Go Endpoint](https://huggingface.co/spaces/DockerTemplates/test-docker-go) to a [Shiny App](https://huggingface.co/spaces/DockerTemplates/shiny-with-python), the limit is the moon! Check out [some official examples](./spaces-sdks-docker-examples). You can also upgrade your Space to a GPU if needed 😃 ## Debugging You can debug your Space by checking the **Build** and **Container** logs. Click on the **Open Logs** button to open the modal. If everything went well, you will see `Pushing Image` and `Scheduling Space` on the **Build** tab. On the **Container** tab, you will see the application status, in this case, `Uvicorn running on http://0.0.0.0:7860`. Additionally, you can enable the Dev Mode on your Space. The Dev Mode allows you to connect to your running Space via VSCode or SSH. Learn more here: https://huggingface.co/dev-mode-explorers ## Read More - [Docker Spaces](spaces-sdks-docker) - [List of Docker Spaces examples](spaces-sdks-docker-examples) ### Static HTML Spaces https://huggingface.co/docs/hub/spaces-sdks-static.md # Static HTML Spaces Spaces also accommodate custom HTML for your app instead of using Streamlit or Gradio. Set `sdk: static` inside the `YAML` block at the top of your Spaces **README.md** file. Then you can place your HTML code within an **index.html** file.
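For instance, the YAML block at the top of a static Space's **README.md** could be as small as this (the `title` value is just an illustrative placeholder):

```yaml
---
title: My static page
sdk: static
---
```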
Here are some examples of Spaces using custom HTML: * [Smarter NPC](https://huggingface.co/spaces/mishig/smarter_npc): Display a PlayCanvas project with an iframe in Spaces. * [Huggingfab](https://huggingface.co/spaces/pierreant-p/huggingfab): Display a Sketchfab model in Spaces. * [Diffuse the rest](https://huggingface.co/spaces/huggingface-projects/diffuse-the-rest): Draw and diffuse the rest ## Adding a build step before serving Static Spaces support adding a custom build step before serving your static assets. This is useful for frontend frameworks like React, Svelte and Vue that require a build process before serving the application. The build command runs automatically when your Space is updated. Add `app_build_command` inside the `YAML` block at the top of your Spaces **README.md** file, and `app_file`. For example: - `app_build_command: npm run build` - `app_file: dist/index.html` Example spaces: - [Svelte App](https://huggingface.co/spaces/julien-c/vite-svelte) - [React App](https://huggingface.co/spaces/coyotte508/static-vite) Under the hood, it will [launch a build](https://huggingface.co/spaces/huggingface/space-build), storing the generated files in a special `refs/convert/build` ref. ## Space variables Custom [environment variables](./spaces-overview#managing-secrets) can be passed to your Space. OAuth information such as the client ID and scope are also available as environment variables, if you have [enabled OAuth](./spaces-oauth) for your Space. To use these variables in JavaScript, you can use the `window.huggingface.variables` object. For example, to access the `OAUTH_CLIENT_ID` variable, you can use `window.huggingface.variables.OAUTH_CLIENT_ID`. Here is an example of a Space using custom environment variables and oauth enabled and displaying the variables in the HTML: * [Static Variables](https://huggingface.co/spaces/huggingfacejs/static-variables) ### Single Sign-On (SSO) https://huggingface.co/docs/hub/enterprise-sso.md # Single Sign-On (SSO) > [!WARNING] > This feature is part of the Team & Enterprise plans. Single sign-on (SSO) allows organizations to securely manage user authentication through their own identity provider (IdP). Both SAML 2.0 and OpenID Connect (OIDC) protocols are supported. Please note that this feature is intended to manage access to organization-specific resources such as private models, datasets, and Spaces. However, by default it does not replace the core authentication mechanism for the Hugging Face platform, meaning that users still need to login with their own HF account. To replace the core authentication, i.e. for enhanced capabilities like automated user provisioning (JIT/SCIM) and global SSO enforcement, see our [Advanced SSO documentation](./enterprise-hub-advanced-sso). This feature allows organizations to: - Enforce mandatory authentication through your company's IdP - Automatically manage user access and roles based on your IdP attributes - Support popular providers like Okta, OneLogin, and Azure Active Directory - Maintain security while allowing external collaborators when needed - Control session timeouts and role mappings This Enterprise Hub feature helps organizations maintain consistent security policies while giving their teams seamless access to Hugging Face resources. 
[Getting started with SSO →](./security-sso) ### Livebook on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-livebook.md # Livebook on Spaces **Livebook** is an open-source tool for writing interactive code notebooks in [Elixir](https://elixir-lang.org/). It's part of a growing collection of Elixir tools for [numerical computing](https://github.com/elixir-nx/nx), [data science](https://github.com/elixir-nx/explorer), and [Machine Learning](https://github.com/elixir-nx/bumblebee). Some of Livebook's most exciting features are: - **Reproducible workflows**: Livebook runs your code in a predictable order, all the way down to package management - **Smart cells**: perform complex tasks, such as data manipulation and running machine learning models, with a few clicks using Livebook's extensible notebook cells - **Elixir powered**: use the power of the Elixir programming language to write concurrent and distributed notebooks that scale beyond your machine To learn more about it, watch this [15-minute video](https://www.youtube.com/watch?v=EhSNXWkji6o). Or visit [Livebook's website](https://livebook.dev/). Or follow its [Twitter](https://twitter.com/livebookdev) and [blog](https://news.livebook.dev/) to keep up with new features and updates. ## Your first Livebook Space You can get Livebook up and running in a Space with just a few clicks. Click the button below to start creating a new Space using Livebook's Docker template: Then: 1. Give your Space a name 2. Set the password of your Livebook 3. Set its visibility to public 4. Create your Space ![Creating a Livebok Space ](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-new-space.png) This will start building your Space using Livebook's Docker image. The visibility of the Space must be set to public for the Smart cells feature in Livebook to function properly. However, your Livebook instance will be protected by Livebook authentication. > [!TIP] > Smart cell is a type of Livebook cell that provides a UI component for accomplishing a specific task. The code for the task is generated automatically based on the user's interactions with the UI, allowing for faster completion of high-level tasks without writing code from scratch. Once the app build is finished, go to the "App" tab in your Space and log in to your Livebook using the password you previously set: ![Livebook authentication](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-authentication.png) That's it! Now you can start using Livebook inside your Space. If this is your first time using Livebook, you can learn how to use it with its interactive notebooks within Livebook itself: ![Livebook's learn notebooks](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-learn-section.png) ## Livebook integration with Hugging Face Models Livebook has an [official integration with Hugging Face models](https://livebook.dev/integrations/hugging-face). With this feature, you can run various Machine Learning models within Livebook with just a few clicks. 
Here's a quick video showing how to do that: ## How to update Livebook's version To update Livebook to its latest version, go to the Settings page of your Space and click on "Factory reboot this Space": ![Factory reboot a Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-factory-reboot.png) ## Caveats The following caveats apply to running Livebook inside a Space: - The Space's visibility setting must be public. Otherwise, Smart cells won't work. That said, your Livebook instance will still be behind Livebook authentication since you've set the `LIVEBOOK_PASSWORD` secret. - Livebook global configurations will be lost once the Space restarts. Consider using the [desktop app](https://livebook.dev/#install) if you find yourself in need of persisting configuration across deployments. ## Feedback and support If you have improvement suggestions or need specific support, please join the [Livebook community on GitHub](https://github.com/livebook-dev/livebook/discussions). ### Downloading datasets https://huggingface.co/docs/hub/datasets-downloading.md # Downloading datasets ## Integrated libraries If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use this dataset" button on the dataset page to see how to do so. For example, [`samsum`](https://huggingface.co/datasets/Samsung/samsum?library=datasets) shows how to do so with 🤗 Datasets below. ## Using the Hugging Face Client Library You can use the [`huggingface_hub`](/docs/huggingface_hub) library to create, delete, update and retrieve information from repos. For example, to download the `HuggingFaceH4/ultrachat_200k` dataset from the command line, run ```bash hf download HuggingFaceH4/ultrachat_200k --repo-type dataset ``` See the [HF CLI download documentation](https://huggingface.co/docs/huggingface_hub/en/guides/cli#download-a-dataset-or-a-space) for more information. You can also integrate this into your own library! For example, you can quickly load a CSV dataset with a few lines using Pandas. ```py from huggingface_hub import hf_hub_download import pandas as pd REPO_ID = "YOUR_REPO_ID" FILENAME = "data.csv" dataset = pd.read_csv( hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset") ) ``` ## Using Git Since all datasets on the Hub are Xet-backed Git repositories, you can clone the datasets locally by [installing git-xet](./xet/using-xet-storage#git-xet) and running: ```bash git xet install git lfs install git clone git@hf.co:datasets/ # example: git clone git@hf.co:datasets/allenai/c4 ``` If you have write-access to the particular dataset repo, you'll also have the ability to commit and push revisions to the dataset. Add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes and/or access private repos. ### How to configure OIDC SSO with Google Workspace https://huggingface.co/docs/hub/security-sso-google-oidc.md # How to configure OIDC SSO with Google Workspace In this guide, we will use Google Workspace as the SSO provider with the OpenID Connect (OIDC) protocol as our preferred identity protocol. We currently support SP-initiated authentication. User provisioning is part of Enterprise Plus's [Advanced SSO](./enterprise-hub-advanced-sso). > [!WARNING] > This feature is part of the Team & Enterprise plans. 
### Step 1: Create OIDC App in Google Workspace - In your Google Cloud console, search and navigate to `Google Auth Platform` > `Clients`. - Click `Create Client`. - For Application Type, select `Web Application`. - Provide a name for your application. - Retrieve the `Redirection URI` from your Hugging Face organization settings: go to the `SSO` tab and select the `OIDC` protocol. - Click `Create`. - A pop-up will appear with the `Client ID` and `Client Secret`; copy those and paste them into your Hugging Face organization settings. In the `SSO` tab (make sure `OIDC` is selected), paste the corresponding values for `Client Identifier` and `Client Secret`. ### Step 2: Configure Hugging Face with Google's OIDC Details - At this point, the **Client ID** and **Client Secret** should be set in your Hugging Face organization settings `SSO` tab. - Set the **Issuer URL** to `https://accounts.google.com`. ### Step 3: Test and Enable SSO > [!WARNING] > Before testing, ensure you have granted access to the application for the appropriate users. The admin performing the test must have access. - Now, in your Hugging Face SSO settings, click on **"Update and Test OIDC configuration"**. - You should be redirected to your Google login prompt. Once logged in, you'll be redirected to your organization's settings page. - A green check mark near the OIDC selector will confirm that the test was successful. - Once the test is successful, you can enable SSO for your organization by clicking the "Enable" button. - Once enabled, members of your organization must complete the SSO authentication flow described in [How does it work?](./security-sso#how-does-it-work). ### Gating Group Collections https://huggingface.co/docs/hub/enterprise-hub-gating-group-collections.md # Gating Group Collections > [!WARNING] > This feature is part of the Team & Enterprise plans. Gating Group Collections allow organizations to grant (or reject) access to all the models and datasets in a collection at once, rather than per repo. Users will only have to go through **a single access request**. To enable Gating Group in a collection: - the collection owner must be an organization - the organization must be subscribed to a Team or Enterprise plan - all models and datasets in the collection must be owned by the same organization as the collection - each model or dataset in the collection may only belong to one Gating Group Collection (but they can still be included in non-gating i.e. _regular_ collections). > [!TIP] > Gating only applies to models and datasets; any other resource that is part of the collection (such as a Space or a Paper) won't be affected. ## Manage gating group as an organization admin To enable access requests, go to the collection page and click on **Gating group** in the bottom-right corner. By default, gating group is disabled: click on **Configure Access Requests** to open the settings. By default, access to the repos in the collection is automatically granted to users when they request it. This is referred to as **automatic approval**. In this mode, any user can access your repos once they’ve agreed to share their contact information with you. If you want to manually approve which users can access repos in your collection, you must set it to **Manual Review**. When this is the case, you will notice a new option: **Notifications frequency**, which lets you configure when to get notified about new users requesting access. It can be set to once a day or real-time. By default, emails are sent to the first 5 admins of the organization.
You can also set a different email address in the **Notifications email** field. ### Review access requests Once access requests are enabled, you have full control of who can access repos in your gating group collection, whether the approval mode is manual or automatic. You can review and manage requests either from the UI or via the API. **Approving a request for a repo in a gating group collection will automatically approve access to all repos (models and datasets) in that collection.** #### From the UI You can review who has access to all the repos in your Gating Group Collection from the settings page of any of the repos in the collection, by clicking on the **Review access requests** button: This will open a modal with 3 lists of users: - **pending**: the list of users waiting for approval to access your repository. This list is empty unless you’ve selected **Manual Review**. You can either **Accept** or **Reject** each request. If the request is rejected, the user cannot access your repository and cannot request access again. - **accepted**: the complete list of users with access to your repository. You can choose to **Reject** access at any time for any user, whether the approval mode is manual or automatic. You can also **Cancel** the approval, which will move the user to the **pending** list. - **rejected**: the list of users you’ve manually rejected. Those users cannot access your repositories. If they go to your repository, they will see a message _Your request to access this repo has been rejected by the repo’s authors_. #### Via the API You can programmatically manage access requests in a Gated Group Collection through the API of any of its models or datasets. Visit our [gated models](https://huggingface.co/docs/hub/models-gated#via-the-api) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#via-the-api) documentation to know more about it. #### Download access report You can download access reports for the Gated Group Collection through the settings page of any of its models or datasets. Visit our [gated models](https://huggingface.co/docs/hub/models-gated#download-access-report) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#download-access-report) documentation to know more about it. #### Customize requested information Organizations can customize the gating parameters as well as the user information that is collected per gated repo. Please, visit our [gated models](https://huggingface.co/docs/hub/models-gated#customize-requested-information) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#customize-requested-information) documentation for more details. > [!WARNING] > There is currently no way to customize the gate parameters and requested information in a centralized way. If you want to collect the same data no matter which collection's repository a user requests access throughout, you need to add the same gate parameters in the metadata of all the models and datasets of the collection, and keep it synced. ## Access gated repos in a Gating Group Collection as a user A Gated Group Collection shows a specific icon next to its name: To get access to the models and datasets in a Gated Group Collection, a single access request on the page of any of those repositories is needed. Once your request is approved, you will be able to access all the other repositories in the collection, including future ones. 
Visit our [gated models](https://huggingface.co/docs/hub/models-gated#access-gated-models-as-a-user) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#access-gated-datasets-as-a-user) documentation to learn more about requesting access to a repository. ### How to configure SCIM with Microsoft Entra ID (Azure AD) https://huggingface.co/docs/hub/security-sso-entra-id-scim.md # How to configure SCIM with Microsoft Entra ID (Azure AD) This guide explains how to set up automatic user and group provisioning between Microsoft Entra ID and your Hugging Face organization using SCIM. > [!WARNING] > This feature is part of the Enterprise Plus plan. ### Step 1: Get SCIM configuration from Hugging Face 1. Navigate to your organization's settings page on Hugging Face. 2. Go to the **SSO** tab, then click on the **SCIM** sub-tab. 3. Copy the **SCIM Tenant URL**. You will need this for the Entra ID configuration. 4. Click **Generate an access token**. A new SCIM token will be generated. Copy this token immediately and store it securely, as you will not be able to see it again. ### Step 2: Configure Provisioning in Microsoft Entra ID 1. In the Microsoft Entra admin center, navigate to your Hugging Face Enterprise Application. 2. In the left-hand menu, select **Provisioning**. 3. Click **Get started**. 4. Change the **Provisioning Mode** from "Manual" to **Automatic**. ### Step 3: Enter Admin Credentials 1. In the **Admin Credentials** section, paste the **SCIM Tenant URL** from Hugging Face into the **Tenant URL** field. 2. Paste the **SCIM token** from Hugging Face into the **Secret Token** field. 3. Click **Test Connection**. You should see a success notification. 4. Click **Save**. ### Step 4: Configure Attribute Mappings 1. Under the **Mappings** section, click on **Provision Microsoft Entra ID Users**. 2. The default attribute mappings often require adjustments for robust provisioning. We recommend using the following configuration. You can delete attributes that are not listed here: | `customappsso` Attribute | Microsoft Entra ID Attribute | Matching precedence | |---|---|---| | `userName` | `Replace([mailNickname], ".", "", "", "", "", "")` | | | `active` | `Switch([IsSoftDeleted], , "False", "True", "True", "False")` | | | `emails[type eq "work"].value` | `userPrincipalName` | | | `name.givenName` | `givenName` | | | `name.familyName` | `surname` | | | `name.formatted` | `Join(" ", [givenName], [surname])` | | | `externalId` | `objectId` | `1` | 3. The Username needs to comply with the following rules. > [!WARNING] > > Only regular characters and `-` are accepted in the Username. > `--` (double dash) is forbidden. > `-` cannot start or end the name. > Digit-only names are not accepted. > Minimum length is 2 and maximum length is 42. > Username has to be unique within your org. > 4. After configuring the user mappings, go back to the Provisioning screen and click on **Provision Microsoft Entra ID Groups** to review group mappings. The default settings for groups are usually sufficient. ### Step 5: Start Provisioning 1. On the main Provisioning screen, set the **Provisioning Status** to **On**. 2. Under **Settings**, you can configure the **Scope** to either "Sync only assigned users and groups" or "Sync all users and groups". We recommend starting with "Sync only assigned users and groups". 3. Save your changes. The initial synchronization can take up to 40 minutes to start. You can monitor the progress in the **Provisioning logs** tab. 
#### Assigning Users and Groups for Provisioning To control which users and groups are provisioned to your Hugging Face organization, you need to assign them to the Hugging Face Enterprise Application in Microsoft Entra ID. This is done in the **Users and groups** tab of your application. 1. Navigate to your Hugging Face Enterprise Application in the Microsoft Entra admin center. 2. Go to the **Users and groups** tab. 3. Click **Add user/group**. 4. Select the users and groups you want to provision and click **Assign**. Only the users and groups you assign here will be provisioned to Hugging Face if you have set the **Scope** to "Sync only assigned users and groups". > [!TIP] > Active Directory Plan Considerations > > With Free, Office 365, and Premium P1/P2 plans, you can assign individual users to the application for provisioning. > With Premium P1/P2 plans, you can also assign groups. This is the recommended approach for managing access at scale, as you can manage group membership in AD, and the changes will automatically be reflected in Hugging Face. > ### Step 6: Verify Provisioning in Hugging Face Once the synchronization is complete, navigate back to your Hugging Face organization settings: - Provisioned users will appear in the **Users Management** tab. - Provisioned groups will appear in the **SCIM** tab under **SCIM Groups**. These groups can then be assigned to [Resource Groups](./security-resource-groups) for fine-grained access control. ### Step 7: Link SCIM Groups to Hugging Face Resource Groups Once your groups are provisioned from Entra ID, you can link them to Hugging Face Resource Groups to manage permissions at scale. This allows all members of a SCIM group to automatically receive specific roles (like read or write) for a collection of resources. 1. In your Hugging Face organization settings, navigate to the **SSO** -> **SCIM** tab. 2. You will see a list of your provisioned groups under **SCIM Groups**. 3. Locate the group you wish to configure and click **Link resource groups** in its row. 4. A dialog will appear. Click **Link a Resource Group**. 5. From the dropdown menus, select the **Resource Group** you want to link and the **Role Assignment** you want to grant to the members of the SCIM group. 6. Click **Link to SCIM group** and save the mapping. ### Using Flair at Hugging Face https://huggingface.co/docs/hub/flair.md # Using Flair at Hugging Face [Flair](https://github.com/flairNLP/flair) is a very simple framework for state-of-the-art NLP, developed by [Humboldt University of Berlin](https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/) and friends. ## Exploring Flair in the Hub You can find `flair` models by filtering at the left of the [models page](https://huggingface.co/models?library=flair). All models on the Hub come with these useful features: 1. An automatically generated model card with a brief description. 2. An interactive widget you can use to play with the model directly in the browser. 3. An Inference API that allows you to make inference requests. ## Installation To get started, you can follow the [Flair installation guide](https://github.com/flairNLP/flair?tab=readme-ov-file#requirements-and-installation).
You can also use the following one-line install through pip: ``` $ pip install -U flair ``` ## Using existing models All `flair` models can easily be loaded from the Hub: ```py from flair.data import Sentence from flair.models import SequenceTagger # load tagger tagger = SequenceTagger.load("flair/ner-multi") ``` Once loaded, you can use `predict()` to perform inference: ```py sentence = Sentence("George Washington ging nach Washington.") tagger.predict(sentence) # print sentence print(sentence) ``` It outputs the following: ```text Sentence[6]: "George Washington ging nach Washington." → ["George Washington"/PER, "Washington"/LOC] ``` If you want to load a specific Flair model, you can click `Use in Flair` in the model card and you will be given a working snippet! ## Additional resources * Flair [repository](https://github.com/flairNLP/flair) * Flair [docs](https://flairnlp.github.io/docs/intro) * Official Flair [models](https://huggingface.co/flair) on the Hub (mainly trained by [@alanakbik](https://huggingface.co/alanakbik) and [@stefan-it](https://huggingface.co/stefan-it)) ### Authentication https://huggingface.co/docs/hub/datasets-polars-auth.md # Authentication In order to access private or gated datasets, you need to authenticate first. Authentication works by providing an access token which will be used to authenticate and authorize your access to gated and private datasets. The first step is to create an access token for your account. This can be done by visiting [Hugging Face Settings - Tokens](https://huggingface.co/settings/tokens). There are three ways to provide the token: setting an environment variable, passing a parameter to the reader, or using the Hugging Face CLI. ## Environment variable If you set the environment variable `HF_TOKEN`, Polars will automatically use it when requesting datasets from Hugging Face. ```bash export HF_TOKEN="hf_xxxxxxxxxxxxx" ``` ## Parameters You can also explicitly provide the access token to the reader (e.g. `read_parquet`) through the `storage_options` parameter. For a full overview of all the parameters, check out the [API reference guide](https://docs.pola.rs/api/python/stable/reference/api/polars.read_parquet.html). ```python pl.read_parquet( "hf://datasets/roneneldan/TinyStories/data/train-*.parquet", storage_options={"token": ACCESS_TOKEN}, ) ``` ## CLI Alternatively, you can use the [Hugging Face CLI](/docs/huggingface_hub/en/guides/cli) to authenticate. After successfully logging in with `hf auth login`, an access token will be stored in the `HF_HOME` directory, which defaults to `~/.cache/huggingface`. Polars will then use this token for authentication. If multiple methods are specified, they are prioritized in the following order: - Parameters (`storage_options`) - Environment variable (`HF_TOKEN`) - CLI ### Spark https://huggingface.co/docs/hub/datasets-spark.md # Spark Spark enables real-time, large-scale data processing in a distributed environment. You can use `pyspark_huggingface` to access Hugging Face datasets repositories in PySpark via the "huggingface" Data Source. Try out [Spark Notebooks](https://huggingface.co/spaces/Dataset-Tools/Spark-Notebooks) on Hugging Face Spaces to get Notebooks with PySpark and `pyspark_huggingface` pre-installed.
## Set up ### Installation To be able to read and write to Hugging Face Datasets, you need to install the `pyspark_huggingface` library: ``` pip install pyspark_huggingface ``` This will also install required dependencies like `huggingface_hub` for authentication, and `pyarrow` for reading and writing datasets. ### Authentication You need to authenticate to Hugging Face to read private/gated dataset repositories or to write to your dataset repositories. You can use the CLI, for example: ``` hf auth login ``` It's also possible to provide your Hugging Face token with the `HF_TOKEN` environment variable or by passing the `token` option to the reader. For more details about authentication, check out [this guide](https://huggingface.co/docs/huggingface_hub/quick-start#authentication). ### Enable the "huggingface" Data Source PySpark 4 came with a new Data Source API which allows you to use datasets from custom sources. If `pyspark_huggingface` is installed, PySpark auto-imports it and enables the "huggingface" Data Source. The library also backports the Data Source API for the "huggingface" Data Source for PySpark 3.5, 3.4 and 3.3. However, in this case `pyspark_huggingface` should be imported explicitly to activate the backport and enable the "huggingface" Data Source: ```python >>> import pyspark_huggingface huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4) ``` ## Read The "huggingface" Data Source allows you to read datasets from Hugging Face, using `pyarrow` under the hood to stream Arrow data. This is compatible with all the datasets in a [supported format](https://huggingface.co/docs/hub/datasets-adding#file-formats) on Hugging Face, like Parquet datasets. For example, here is how to load the [stanfordnlp/imdb](https://huggingface.co/stanfordnlp/imdb) dataset: ```python >>> import pyspark_huggingface >>> from pyspark.sql import SparkSession >>> spark = SparkSession.builder.appName("demo").getOrCreate() >>> df = spark.read.format("huggingface").load("stanfordnlp/imdb") ``` Here is another example with the [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) dataset. It is a gated repository; users have to accept the terms of use before accessing it. It also has multiple subsets, namely "3M" and "7M", so we need to specify which one to load. We use the `.format()` function to use the "huggingface" Data Source, and `.load()` to load the dataset (more precisely, the config or subset named "7M" containing 7M samples). Then we compute the number of dialogues per language and filter the dataset.
After logging-in to access the gated repository, we can run: ```python >>> import pyspark_huggingface >>> from pyspark.sql import SparkSession >>> spark = SparkSession.builder.appName("demo").getOrCreate() >>> df = spark.read.format("huggingface").option("config", "7M").load("BAAI/Infinity-Instruct") >>> df.show() +---+----------------------------+-----+----------+--------------------+ | id| conversations|label|langdetect| source| +---+----------------------------+-----+----------+--------------------+ | 0| [{human, def exti...| | en| code_exercises| | 1| [{human, See the ...| | en| flan| | 2| [{human, This is ...| | en| flan| | 3| [{human, If you d...| | en| flan| | 4| [{human, In a Uni...| | en| flan| | 5| [{human, Read the...| | en| flan| | 6| [{human, You are ...| | en| code_bagel| | 7| [{human, I want y...| | en| Subjective| | 8| [{human, Given th...| | en| flan| | 9|[{human, 因果联系原则是法...| | zh-cn| Subjective| | 10| [{human, Provide ...| | en|self-oss-instruct...| | 11| [{human, The univ...| | en| flan| | 12| [{human, Q: I am ...| | en| flan| | 13| [{human, What is ...| | en| OpenHermes-2.5| | 14| [{human, In react...| | en| flan| | 15| [{human, Write Py...| | en| code_exercises| | 16| [{human, Find the...| | en| MetaMath| | 17| [{human, Three of...| | en| MetaMath| | 18| [{human, Chandra ...| | en| MetaMath| | 19|[{human, 用经济学知识分析...| | zh-cn| Subjective| +---+----------------------------+-----+----------+--------------------+ ``` This loads the dataset in a streaming fashion, and the output DataFrame has one partition per data file in the dataset to enable efficient distributed processing. To compute the number of dialogues per language we run this code that uses the `columns` option and a `groupBy()` operation. The `columns` option is useful to only load the data we need, since PySpark doesn't enable predicate push-down with the Data Source API. There is also a `filters` option to only load data with values within a certain range. ```python >>> df_langdetect_only = ( ... spark.read.format("huggingface") ... .option("config", "7M") ... .option("columns", '["langdetect"]') ... .load("BAAI/Infinity-Instruct") ... ) >>> df_langdetect_only.groupBy("langdetect").count().show() +----------+-------+ |langdetect| count| +----------+-------+ | en|6697793| | zh-cn| 751313| +----------+-------+ ``` To filter the dataset and only keep dialogues in Chinese: ```python >>> df_chinese_only = ( ... spark.read.format("huggingface") ... .option("config", "7M") ... .option("filters", '[("langdetect", "=", "zh-cn")]') ... .load("BAAI/Infinity-Instruct") ... 
) >>> df_chinese_only.show() +---+----------------------------+-----+----------+----------+ | id| conversations|label|langdetect| source| +---+----------------------------+-----+----------+----------+ | 9|[{human, 因果联系原则是法...| | zh-cn|Subjective| | 19|[{human, 用经济学知识分析...| | zh-cn|Subjective| | 38| [{human, 某个考试共有A、...| | zh-cn|Subjective| | 39|[{human, 撰写一篇关于斐波...| | zh-cn|Subjective| | 57|[{human, 总结世界历史上的...| | zh-cn|Subjective| | 61|[{human, 生成一则广告词。...| | zh-cn|Subjective| | 66|[{human, 描述一个有效的团...| | zh-cn|Subjective| | 94|[{human, 如果比利和蒂芙尼...| | zh-cn|Subjective| |102|[{human, 生成一句英文名言...| | zh-cn|Subjective| |106|[{human, 写一封感谢信,感...| | zh-cn|Subjective| |118| [{human, 生成一个故事。}...| | zh-cn|Subjective| |174|[{human, 高胆固醇水平的后...| | zh-cn|Subjective| |180|[{human, 基于以下角色信息...| | zh-cn|Subjective| |192|[{human, 请写一篇文章,概...| | zh-cn|Subjective| |221|[{human, 以诗歌形式表达对...| | zh-cn|Subjective| |228|[{human, 根据给定的指令,...| | zh-cn|Subjective| |236|[{human, 打开一个新的生成...| | zh-cn|Subjective| |260|[{human, 生成一个有关未来...| | zh-cn|Subjective| |268|[{human, 如果有一定数量的...| | zh-cn|Subjective| |273| [{human, 题目:小明有5个...| | zh-cn|Subjective| +---+----------------------------+-----+----------+----------+ ``` It is also possible to apply filters or remove columns on the loaded DataFrame, but it is more efficient to do it while loading, especially on Parquet datasets. Indeed, Parquet contains metadata at the file and row group level, which makes it possible to skip entire parts of the dataset that don't contain samples satisfying the criteria. Columns in Parquet can also be loaded independently, which makes it possible to skip the excluded columns and avoid loading unnecessary data. ### Options Here is the list of available options you can pass to `read.option()`: * `config` (string): select a dataset subset/config * `split` (string): select a dataset split (default is "train") * `token` (string): your Hugging Face token For Parquet datasets: * `columns` (string): select a subset of columns to load, e.g. `'["id"]'` * `filters` (string): skip files and row groups that don't match certain criteria, e.g. `'[("source", "=", "code_exercises")]'`. Filters are passed to [pyarrow.parquet.ParquetDataset](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html). Any other option is passed as an argument to [datasets.load_dataset](https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset). ### Run SQL queries Once you have your PySpark DataFrame ready, you can run SQL queries using `spark.sql`: ```python >>> import pyspark_huggingface >>> from pyspark.sql import SparkSession >>> spark = SparkSession.builder.appName("demo").getOrCreate() >>> df = ( ... spark.read.format("huggingface") ... .option("config", "7M") ... .option("columns", '["source"]') ... .load("BAAI/Infinity-Instruct") ...
) >>> spark.sql("SELECT source, count(*) AS total FROM {df} GROUP BY source ORDER BY total DESC", df=df).show() +--------------------+-------+ | source| total| +--------------------+-------+ | flan|2435840| | Subjective|1342427| | OpenHermes-2.5| 855478| | MetaMath| 690138| | code_exercises| 590958| |Orca-math-word-pr...| 398168| | code_bagel| 386649| | MathInstruct| 329254| |python-code-datas...| 88632| |instructional_cod...| 82920| | CodeFeedback| 79513| |self-oss-instruct...| 50467| |Evol-Instruct-Cod...| 43354| |CodeExercise-Pyth...| 27159| |code_instructions...| 23130| | Code-Instruct-700k| 10860| |Glaive-code-assis...| 9281| |python_code_instr...| 2581| |Python-Code-23k-S...| 2297| +--------------------+-------+ ``` Again, specifying the `columns` option is not necessary, but is useful to avoid loading unnecessary data and make the query faster. ## Write You can write a PySpark Dataframe to Hugging Face with the "huggingface" Data Source. It uploads Parquet files in parallel in a distributed manner, and only commits the files once they're all uploaded. It works like this: ```python >>> import pyspark_huggingface >>> df.write.format("huggingface").save("username/dataset_name") ``` Here is how we can use this function to write the filtered version of the [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) dataset back to Hugging Face. First you need to [create a dataset repository](https://huggingface.co/new-dataset), e.g. `username/Infinity-Instruct-Chinese-Only` (you can set it to private if you want). Then, make sure you are authenticated and you can use the "huggingface" Data Source, set the `mode` to "overwrite" (or "append" if you want to extend an existing dataset), and push to Hugging Face with `.save()`: ```python >>> df_chinese_only.write.format("huggingface").mode("overwrite").save("username/Infinity-Instruct-Chinese-Only") ``` ### Mode Two modes are available when pushing a dataset to Hugging Face: * "overwrite": overwrite the dataset if it already exists * "append": append the dataset to an existing dataset ### Options Here is the list of available options you can pass to `write.option()`: * `token` (string): your Hugging Face token Contributions are welcome to add more options here, in particular `subset` and `split`. ### Using ML-Agents at Hugging Face https://huggingface.co/docs/hub/ml-agents.md # Using ML-Agents at Hugging Face `ml-agents` is an open-source toolkit that enables games and simulations made with Unity to serve as environments for training intelligent agents. ## Exploring ML-Agents in the Hub You can find `ml-agents` models by filtering at the left of the [models page](https://huggingface.co/models?library=ml-agents). All models on the Hub come up with useful features: 1. An automatically generated model card with a description, a training configuration, and more. 2. Metadata tags that help for discoverability. 3. Tensorboard summary files to visualize the training metrics. 4. A link to the Spaces web demo where you can visualize your agent playing in your browser. ## Install the library To install the `ml-agents` library, you need to clone the repo: ``` # Clone the repository git clone https://github.com/Unity-Technologies/ml-agents # Go inside the repository and install the package cd ml-agents pip3 install -e ./ml-agents-envs pip3 install -e ./ml-agents ``` ## Using existing models You can simply download a model from the Hub using `mlagents-load-from-hf`. 
``` mlagents-load-from-hf --repo-id="ThomasSimonini/MLAgents-Pyramids" --local-dir="./downloads" ``` You need to define two parameters: - `--repo-id`: the name of the Hugging Face repo you want to download. - `--local-dir`: the path to download the model. ## Visualize an agent playing You can easily watch any model playing directly in your browser: 1. Go to your model repo. 2. In the `Watch Your Agent Play` section, click on the link. 3. In the demo, on step 1, choose your model repository, which is the model id. 4. In step 2, choose what model you want to replay. ## Sharing your models You can easily upload your models using `mlagents-push-to-hf`: ``` mlagents-push-to-hf --run-id="First Training" --local-dir="results/First Training" --repo-id="ThomasSimonini/MLAgents-Pyramids" --commit-message="Pyramids" ``` You need to define four parameters: - `--run-id`: the name of the training run. - `--local-dir`: where the model was saved. - `--repo-id`: the name of the Hugging Face repo you want to create or update, in the form `username/repo_name`. - `--commit-message`: the commit message. ## Additional resources * ML-Agents [documentation](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Hugging-Face-Integration.md) * Official Unity ML-Agents Spaces [demos](https://huggingface.co/unity) ### fenic https://huggingface.co/docs/hub/datasets-fenic.md # fenic [fenic](https://github.com/typedef-ai/fenic) is a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. fenic provides support for reading datasets directly from the Hugging Face Hub. ## Getting Started To get started, pip install `fenic`: ```bash pip install fenic ``` ### Create a Session Instantiate a fenic session with the default configuration (sufficient for reading datasets and other non-semantic operations): ```python import fenic as fc session = fc.Session.get_or_create(fc.SessionConfig()) ``` ## Overview fenic is an opinionated data processing framework that combines: - **DataFrame API**: PySpark-inspired operations for familiar data manipulation - **Semantic Operations**: Built-in AI/LLM operations including semantic functions, embeddings, and clustering - **Model Integration**: Native support for AI providers (Anthropic, OpenAI, Cohere, Google) - **Query Optimization**: Automatic optimization through logical plan transformations ## Read from Hugging Face Hub fenic can read datasets directly from the Hugging Face Hub using the `hf://` protocol. This functionality is built into fenic's DataFrameReader interface.
### Supported Formats fenic supports reading the following formats from Hugging Face: - **Parquet files** (`.parquet`) - **CSV files** (`.csv`) ### Reading Datasets To read a dataset from the Hugging Face Hub: ```python import fenic as fc session = fc.Session.get_or_create(fc.SessionConfig()) # Read a CSV file from a public dataset df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv") # Read Parquet files using glob patterns df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet") # Read from a specific dataset revision df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet") ``` ### Reading with Schema Management ```python # Read multiple CSV files with schema merging df = session.read.csv("hf://datasets/username/dataset_name/*.csv", merge_schemas=True) # Read multiple Parquet files with schema merging df = session.read.parquet("hf://datasets/username/dataset_name/*.parquet", merge_schemas=True) ``` > **Note:** In fenic, a schema is the set of column names and their data types. When you enable `merge_schemas`, fenic tries to reconcile differences across files by filling missing columns with nulls and widening types where it can. Some layouts still cannot be merged—consult the fenic docs for [CSV schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.csv) and [Parquet schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.parquet). ### Authentication To read private datasets, you need to set your Hugging Face token as an environment variable: ```shell export HF_TOKEN="your_hugging_face_token_here" ``` ### Path Format The Hugging Face path format in fenic follows this structure: ``` hf://{repo_type}/{repo_id}/{path_to_file} ``` You can also specify dataset revisions or versions: ``` hf://{repo_type}/{repo_id}@{revision}/{path_to_file} ``` Features: - Supports glob patterns (`*`, `**`) - Dataset revisions/versions using `@` notation: - Specific commit: `@d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e` - Branch: `@refs/convert/parquet` - Branch alias: `@~parquet` - Requires `HF_TOKEN` environment variable for private datasets ### Mixing Data Sources fenic allows you to combine multiple data sources in a single read operation, including mixing different protocols: ```python # Mix HF and local files in one read call df = session.read.parquet([ "hf://datasets/cais/mmlu/astronomy/*.parquet", "file:///local/data/*.parquet", "./relative/path/data.parquet" ]) ``` This flexibility allows you to seamlessly combine data from Hugging Face Hub and local files in your data processing pipeline. ## Processing Data from Hugging Face Once loaded from Hugging Face, you can use fenic's full DataFrame API: ### Basic DataFrame Operations ```python import fenic as fc session = fc.Session.get_or_create(fc.SessionConfig()) # Load IMDB dataset from Hugging Face df = session.read.parquet("hf://datasets/imdb/plain_text/train-*.parquet") # Filter and select positive_reviews = df.filter(fc.col("label") == 1).select("text", "label") # Group by and aggregate label_counts = df.group_by("label").agg( fc.count("*").alias("count") ) ``` ### AI-Powered Operations To use semantic and embedding operations, configure language and embedding models in your SessionConfig. 
Once configured: ```python import fenic as fc # Requires OPENAI_API_KEY to be set for language and embedding calls session = fc.Session.get_or_create( fc.SessionConfig( semantic=fc.SemanticConfig( language_models={ "gpt-4o-mini": fc.OpenAILanguageModel( model_name="gpt-4o-mini", rpm=60, tpm=60000, ) }, embedding_models={ "text-embedding-3-small": fc.OpenAIEmbeddingModel( model_name="text-embedding-3-small", rpm=60, tpm=60000, ) }, ) ) ) # Load a text dataset from Hugging Face df = session.read.parquet("hf://datasets/imdb/plain_text/train-00000-of-00001.parquet") # Add embeddings to text columns df_with_embeddings = df.select( "*", fc.semantic.embed(fc.col("text")).alias("embedding") ) # Apply semantic functions for sentiment analysis df_analyzed = df_with_embeddings.select( "*", fc.semantic.analyze_sentiment( fc.col("text"), model_alias="gpt-4o-mini", # Optional: specify model ).alias("sentiment") ) ``` ## Example: Analyzing MMLU Dataset ```python import fenic as fc # Requires OPENAI_API_KEY to be set for semantic calls session = fc.Session.get_or_create( fc.SessionConfig( semantic=fc.SemanticConfig( language_models={ "gpt-4o-mini": fc.OpenAILanguageModel( model_name="gpt-4o-mini", rpm=60, tpm=60000, ) }, ) ) ) # Load MMLU astronomy subset from Hugging Face df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet") # Process the data processed_df = (df # Filter for specific criteria .filter(fc.col("subject") == "astronomy") # Select relevant columns .select("question", "choices", "answer") # Add difficulty analysis using semantic.map .select( "*", fc.semantic.map( "Rate the difficulty of this question from 1-5: {{question}}", question=fc.col("question"), model_alias="gpt-4o-mini" # Optional: specify model ).alias("difficulty") ) ) # Show results processed_df.show() ``` ## Resources - [fenic GitHub Repository](https://github.com/typedef-ai/fenic) - [fenic Documentation](https://docs.fenic.ai/latest/) ### Two-Factor Authentication (2FA) https://huggingface.co/docs/hub/security-2fa.md # Two-Factor Authentication (2FA) Using two-factor authentication verifies a user's identity with two methods, adding extra security to ensure only authorized individuals can access an account, even if the password is compromised. If you choose to enable two-factor authentication, at every login you will need to provide: - Username or email & password (normal login credentials) - One-time security code via app ## Enable Two-factor Authentication (2FA) To enable Two-factor Authentication with a one-time password: On the Hugging Face Hub: 1. Go to your [Authentication settings](https://hf.co/settings/authentication) 2. Select Add Two-Factor Authentication On your device (usually your phone): 1. Install a compatible application. For example: - Authy - Google Authenticator - Microsoft Authenticator - FreeOTP 2. In the application, add a new entry in one of two ways: - Scan the code displayed on screen Hub with your device’s camera to add the entry automatically - Enter the details provided to add the entry manually In Hugging Face Hub: 1. Enter the six-digit pin number from your authentication device into "Code" 2. Save If you entered the correct pin, the Hub displays a list of recovery codes. Download them and keep them in a safe place. > [!TIP] > You will be prompted for 2FA every time you log in, and every 30 days ## Recovery codes Right after you've successfully activated 2FA with a one-time password, you're requested to download a collection of generated recovery codes. 
If you ever lose access to your one-time password authenticator, you can use one of these recovery codes to log in to your account. - Each code can be used only **once** to sign in to your account - You should copy and print the codes, or download them for storage in a safe place. If you choose to download them, the file is called **huggingface-recovery-codes.txt** If you lose the recovery codes, or want to generate new ones, you can use the [Authentication settings](https://hf.co/settings/authentication) page. ## Regenerate two-factor authentication recovery codes To regenerate 2FA recovery codes: 1. Access your [Authentication settings](https://hf.co/settings/authentication) 2. If you’ve already configured 2FA, select Recovery Code 3. Click on Regenerate recovery codes > [!WARNING] > If you regenerate 2FA recovery codes, save them. You can’t use any previously created recovery codes. ## Sign in with two-factor authentication enabled When you sign in with 2FA enabled, the process is only slightly different than the standard sign-in procedure. After entering your username and password, you'll encounter an additional prompt, depending on the type of 2FA you've set up. When prompted, provide the pin from your one-time password authenticator's app or a recovery code to complete the sign-in process. ## Disable two-factor authentication To disable 2FA: 1. Access your [Authentication settings](https://hf.co/settings/authentication) 2. Click on "Remove". This clears all your 2FA registrations. ## Recovery options If you no longer have access to your authentication device, you can still recover access to your account: - Use a saved recovery code, if you saved them when you enabled two-factor authentication - Requesting help with two-factor authentication ### Use a recovery code To use a recovery code: 1. Enter your username or email, and password, on the [Hub sign-in page](https://hf.co/login) 2. When prompted for a two-factor code, click on "Lost access to your two-factor authentication app? Use a recovery code" 3. Enter one of your recovery codes After you use a recovery code, you cannot re-use it. You can still use the other recovery codes you saved. ### Requesting help with two-factor authentication In case you've forgotten your password and lost access to your two-factor authentication credentials, you can reach out to support (website@huggingface.co) to regain access to your account. You'll be required to verify your identity using a recovery authentication factor, such as an SSH key or personal access token. ### Langfuse on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-langfuse.md # Langfuse on Spaces This guide shows you how to deploy Langfuse on Hugging Face Spaces and start instrumenting your LLM application for observability. This integration helps you to experiment with LLM APIs on the Hugging Face Hub, manage your prompts in one place, and evaluate model outputs. ## What is Langfuse? [Langfuse](https://langfuse.com) is an open-source LLM engineering platform that helps teams collaboratively debug, evaluate, and iterate on their LLM applications. Key features of Langfuse include LLM tracing to capture the full context of your application's execution flow, prompt management for centralized and collaborative prompt iteration, evaluation metrics to assess output quality, dataset creation for testing and benchmarking, and a playground to experiment with prompts and model configurations. _This video is a 10 min walkthrough of the Langfuse features:_ ## Why LLM Observability? 
- As language models become more prevalent, understanding their behavior and performance is important. - **LLM observability** involves monitoring and understanding the internal states of an LLM application through its outputs. - It is essential for addressing challenges such as: - **Complex control flows** with repeated or chained calls, making debugging challenging. - **Non-deterministic outputs**, adding complexity to consistent quality assessment. - **Varied user intents**, requiring deep understanding to improve user experience. - Building LLM applications involves intricate workflows, and observability helps in managing these complexities. ## Step 1: Set up Langfuse on Spaces The Langfuse Hugging Face Space allows you to get up and running with a deployed version of Langfuse with just a few clicks. To get started, click the button above or follow these steps: 1. Create a [**new Hugging Face Space**](https://huggingface.co/new-space) 2. Select **Docker** as the Space SDK 3. Select **Langfuse** as the Space template 4. Enable **persistent storage** to ensure your Langfuse data is persisted across restarts 5. Ensure the space is set to **public** visibility so the Langfuse APIs/SDKs can access the app (see note below for more details) 6. [Optional but recommended] For a secure deployment, replace the default values of the **environment variables**: - `NEXTAUTH_SECRET`: Used to validate login session cookies; generate a secret with at least 256 bits of entropy using `openssl rand -base64 32`. - `SALT`: Used to salt hashed API keys; generate a secret with at least 256 bits of entropy using `openssl rand -base64 32`. - `ENCRYPTION_KEY`: Used to encrypt sensitive data. Must be 256 bits (64 hexadecimal characters); generate it via `openssl rand -hex 32`. 7. Click **Create Space**! ![Clone the Langfuse Space](https://langfuse.com/images/cookbook/huggingface/huggingface-space-setup.png) ### User Access Your Langfuse Space is pre-configured with Hugging Face OAuth for secure authentication, so you'll need to authorize `read` access to your Hugging Face account upon first login by following the instructions in the pop-up. Once inside the app, you can use [the native Langfuse features](https://langfuse.com/docs/rbac) to manage Organizations, Projects, and Users. The Langfuse space _must_ be set to **public** visibility so that the Langfuse APIs/SDKs can reach the app. This means that by default, _any_ logged-in Hugging Face user will be able to access the Langfuse space. You can prevent new users from signing up and accessing the space via two different methods: #### 1. (Recommended) Hugging Face native org-level OAuth restrictions If you want to restrict access to members of specific organization(s) only, you can simply set the `hf_oauth_authorized_org` metadata field in the space's `README.md` file, as shown [here](https://huggingface.co/docs/hub/spaces-oauth#create-an-oauth-app). Once configured, only users who are members of the specified organization(s) will be able to access the space. #### 2. Manual access control You can also restrict access on a per-user basis by setting the `AUTH_DISABLE_SIGNUP` environment variable to `true`. Be sure that you've first signed in and authenticated to the space before setting this variable; otherwise, your own user profile won't be able to authenticate.
> [!TIP] > **Note:** If you've set the `AUTH_DISABLE_SIGNUP` environment variable to `true` to restrict access, and want to grant a new user access to the space, you'll need to first set it back to `false` (wait for rebuild to complete), add the user and have them authenticate with OAuth, and then set it back to `true`. ## Step 2: Use Langfuse Now that you have Langfuse running, you can start instrumenting your LLM application to capture traces and manage your prompts. Let's see how! ### Monitor Any Application Langfuse is model agnostic and can be used to trace any application. Follow the [get-started guide](https://langfuse.com/docs) in Langfuse documentation to see how you can instrument your code. Langfuse maintains native integrations with many popular LLM frameworks, including [Langchain](https://langfuse.com/docs/integrations/langchain/tracing), [LlamaIndex](https://langfuse.com/docs/integrations/llama-index/get-started) and [OpenAI](https://langfuse.com/docs/integrations/openai/python/get-started) and offers Python and JS/TS SDKs to instrument your code. Langfuse also offers various API endpoints to ingest data and has been integrated by other open source projects such as [Langflow](https://langfuse.com/docs/integrations/langflow), [Dify](https://langfuse.com/docs/integrations/dify) and [Haystack](https://langfuse.com/docs/integrations/haystack/get-started). ### Example 1: Trace Calls to Inference Providers As a simple example, here's how to trace LLM calls to [Inference Providers](https://huggingface.co/docs/inference-providers/en/index) using the Langfuse Python SDK. Be sure to first configure your `LANGFUSE_HOST`, `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` environment variables, and make sure you've [authenticated with your Hugging Face account](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication). ```python from langfuse.openai import openai from huggingface_hub import get_token client = openai.OpenAI( base_url="https://router.huggingface.co/hf-inference/models/meta-llama/Llama-3.3-70B-Instruct/v1", api_key=get_token(), ) messages = [{"role": "user", "content": "What is observability for LLMs?"}] response = client.chat.completions.create( model="meta-llama/Llama-3.3-70B-Instruct", messages=messages, max_tokens=100, ) ``` ### Example 2: Monitor a Gradio Application We created a Gradio template space that shows how to create a simple chat application using a Hugging Face model and trace model calls and user feedback in Langfuse - without leaving Hugging Face. To get started, [duplicate this Gradio template space](https://huggingface.co/spaces/langfuse/langfuse-gradio-example-template?duplicate=true) and follow the instructions in the [README](https://huggingface.co/spaces/langfuse/langfuse-gradio-example-template/blob/main/README.md). ## Step 3: View Traces in Langfuse Once you have instrumented your application, and ingested traces or user feedback into Langfuse, you can view your traces in Langfuse. 
![Example trace with Gradio](https://langfuse.com/images/cookbook/huggingface/huggingface-gradio-example-trace.png) _[Example trace in the Langfuse UI](https://langfuse-langfuse-template-space.hf.space/project/cm4r1ajtn000a4co550swodxv/traces/9cdc12fb-71bf-4074-ab0b-0b8d212d839f?timestamp=2024-12-20T12%3A12%3A50.089Z&view=preview)_ ## Additional Resources and Support - [Langfuse documentation](https://langfuse.com/docs) - [Langfuse GitHub repository](https://github.com/langfuse/langfuse) - [Langfuse Discord](https://langfuse.com/discord) - [Langfuse template Space](https://huggingface.co/spaces/langfuse/langfuse-template-space) For more help, open a support thread on [GitHub discussions](https://langfuse.com/discussions) or [open an issue](https://github.com/langfuse/langfuse/issues). ### Gated datasets https://huggingface.co/docs/hub/datasets-gated.md # Gated datasets To give more control over how datasets are used, the Hub allows dataset authors to enable **access requests** for their datasets. When enabled, users must agree to share their contact information (username and email address) with the dataset authors to access the dataset files. Dataset authors can configure this request with additional fields. A dataset with access requests enabled is called a **gated dataset**. Access requests are always granted to individual users rather than to entire organizations. A common use case of gated datasets is to provide access to early research datasets before the wider release. ## Manage gated datasets as a dataset author To enable access requests, go to the dataset settings page. By default, the dataset is not gated. Click on **Enable Access request** in the top-right corner. By default, access to the dataset is automatically granted to the user when requesting it. This is referred to as **automatic approval**. In this mode, any user can access your dataset once they've shared their personal information with you. If you want to manually approve which users can access your dataset, you must set it to **manual approval**. When this is the case, you will notice more options: - **Add access** allows you to search for a user and grant them access even if they did not request it. - **Notification frequency** lets you configure when to get notified if new users request access. It can be set to once a day or real-time. By default, an email is sent to your primary email address. For datasets hosted under an organization, emails are by default sent to the first 5 admins of the organization. In both cases (user or organization) you can set a different email address in the **Notifications email** field. ### Review access requests Once access requests are enabled, you have full control over who can access your dataset, whether the approval mode is manual or automatic. You can review and manage requests either from the UI or via the API. #### From the UI You can review who has access to your gated dataset from its settings page by clicking on the **Review access requests** button. This will open a modal with 3 lists of users: - **pending**: the list of users waiting for approval to access your dataset. This list is empty unless you've selected **manual approval**. You can either **Accept** or **Reject** each request. If a request is rejected, the user cannot access your dataset and cannot request access again. - **accepted**: the complete list of users with access to your dataset. You can choose to **Reject** access at any time for any user, whether the approval mode is manual or automatic.
You can also **Cancel** the approval, which will move the user to the *pending* list. - **rejected**: the list of users you've manually rejected. Those users cannot access your dataset. If they go to your dataset repository, they will see a message *Your request to access this repo has been rejected by the repo's authors*. #### Via the API You can automate the approval of access requests by using the API. You must pass a `token` with `write` access to the gated repository. To generate a token, go to [your user settings](https://huggingface.co/settings/tokens). | Method | URI | Description | Headers | Payload | | ------ | --- | ----------- | ------- | ------- | | `GET` | `/api/datasets/{repo_id}/user-access-request/pending` | Retrieve the list of pending requests. | `{"authorization": "Bearer $token"}` | | | `GET` | `/api/datasets/{repo_id}/user-access-request/accepted` | Retrieve the list of accepted requests. | `{"authorization": "Bearer $token"}` | | | `GET` | `/api/datasets/{repo_id}/user-access-request/rejected` | Retrieve the list of rejected requests. | `{"authorization": "Bearer $token"}` | | | `POST` | `/api/datasets/{repo_id}/user-access-request/handle` | Change the status of a given access request to `status`. | `{"authorization": "Bearer $token"}` | `{"status": "accepted"/"rejected"/"pending", "user": "username", "rejectionReason": "Optional rejection reason that will be visible to the user (max 200 characters)."}` | | `POST` | `/api/datasets/{repo_id}/user-access-request/grant` | Allow a specific user to access your repo. | `{"authorization": "Bearer $token"}` | `{"user": "username"}` | The base URL for the HTTP endpoints above is `https://huggingface.co`. **NEW!** Those endpoints are now officially supported in our Python client `huggingface_hub`. List the access requests to your dataset with [`list_pending_access_requests`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_pending_access_requests), [`list_accepted_access_requests`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_accepted_access_requests) and [`list_rejected_access_requests`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_rejected_access_requests). You can also accept, cancel and reject access requests with [`accept_access_request`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.accept_access_request), [`cancel_access_request`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.cancel_access_request), [`reject_access_request`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.reject_access_request). Finally, you can grant access to a user with [`grant_access`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.grant_access). A short usage sketch of these helpers is shown after the next subsection. ### Download access report You can download a report of all access requests for a gated dataset with the **download user access report** button. Click on it to download a JSON file with a list of users. For each entry, you have: - **user**: the user id. Example: *julien-c*. - **fullname**: name of the user on the Hub. Example: *Julien Chaumond*. - **status**: status of the request. Either `"pending"`, `"accepted"` or `"rejected"`. - **email**: email of the user. - **time**: datetime when the user initially made the request.
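As an illustration, here is a minimal sketch of how the `huggingface_hub` helpers mentioned above could be used to review pending requests for a gated dataset. The repository name is a placeholder, and attribute names may vary slightly across `huggingface_hub` versions.

```python
from huggingface_hub import HfApi

# Requires a token with write access to the gated dataset (e.g. set up via `hf auth login`)
api = HfApi()
repo_id = "username/my-gated-dataset"  # hypothetical gated dataset repository

# Review the users currently waiting for approval
for request in api.list_pending_access_requests(repo_id, repo_type="dataset"):
    print(request.username, request.email, request.timestamp)
    # Accept each pending request (use reject_access_request to deny instead)
    api.accept_access_request(repo_id, request.username, repo_type="dataset")
```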
### Customize requested information By default, users landing on your gated dataset will be asked to share their contact information (email and username) by clicking the **Agree and send request to access repo** button. If you want to request more user information to provide access, you can configure additional fields. This information will be accessible from the **Settings** tab. To do so, add an `extra_gated_fields` property to your [dataset card metadata](./datasets-cards#dataset-card-metadata) containing a list of key/value pairs. The *key* is the name of the field and *value* its type or an object with a `type` field. The list of field types is: - `text`: a single-line text field. - `checkbox`: a checkbox field. - `date_picker`: a date picker field. - `country`: a country dropdown. The list of countries is based on the [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) standard. - `select`: a dropdown with a list of options. The list of options is defined in the `options` field. Example: `options: ["option 1", "option 2", {label: "option3", value: "opt3"}]`. Finally, you can also personalize the message displayed to the user with the `extra_gated_prompt` extra field. Here is an example of customized request form where the user is asked to provide their company name and country and acknowledge that the dataset is for non-commercial use only. ```yaml --- extra_gated_prompt: "You agree to not use the dataset to conduct experiments that cause harm to human subjects." extra_gated_fields: Company: text Country: country Specific date: date_picker I want to use this dataset for: type: select options: - Research - Education - label: Other value: other I agree to use this dataset for non-commercial use ONLY: checkbox --- ``` In some cases, you might also want to modify the default text in the gate heading, description, and button. For those use cases, you can modify `extra_gated_heading`, `extra_gated_description` and `extra_gated_button_content` like this: ```yaml --- extra_gated_heading: "Acknowledge license to accept the repository" extra_gated_description: "Our team may take 2-3 days to process your request" extra_gated_button_content: "Acknowledge license" --- ``` ## Manage gated datasets as an organization (Enterprise Hub) [Enterprise Hub](https://huggingface.co/docs/hub/en/enterprise-hub) subscribers can create a Gating Group Collection to grant (or reject) access to all the models and datasets in a collection at once. More information about Gating Group Collections can be found in [our dedicated doc](https://huggingface.co/docs/hub/en/enterprise-hub-gating-group-collections). ## Access gated datasets as a user As a user, if you want to use a gated dataset, you will need to request access to it. This means that you must be logged in to a Hugging Face user account. Requesting access can only be done from your browser. Go to the dataset on the Hub and you will be prompted to share your information: By clicking on **Agree**, you agree to share your username and email address with the dataset authors. In some cases, additional fields might be requested. To help the dataset authors decide whether to grant you access, try to fill out the form as completely as possible. Once the access request is sent, there are two possibilities. If the approval mechanism is automatic, you immediately get access to the dataset files. Otherwise, the requests have to be approved manually by the authors, which can take more time. 
> [!WARNING] > The dataset authors have complete control over dataset access. In particular, they can decide at any time to block your access to the dataset without prior notice, regardless of the approval mechanism, even if your request has already been approved. ### Download files To download files from a gated dataset, you'll need to be authenticated. In the browser, this is automatic as long as you are logged in with your account. If you are using a script, you will need to provide a [user token](./security-tokens). In the Hugging Face Python ecosystem (`transformers`, `diffusers`, `datasets`, etc.), you can log in on your machine using the [`huggingface_hub`](/docs/huggingface_hub/index) library by running the following in your terminal: ```bash hf auth login ``` Alternatively, you can programmatically log in using `login()` in a notebook or a script: ```python >>> from huggingface_hub import login >>> login() ``` You can also provide the `token` parameter to most loading methods in the libraries (`from_pretrained`, `hf_hub_download`, `load_dataset`, etc.), directly from your scripts. For more details about how to log in, check out the [login guide](/docs/huggingface_hub/quick-start#login). ### Restricting Access for EU Users For gated datasets, you can add an additional layer of access control to specifically restrict users from European Union countries. This is useful if your dataset's license or terms of use prohibit its distribution in the EU. To enable this, add the `extra_gated_eu_disallowed: true` property to your dataset card's metadata. **Important:** This feature will only activate if your dataset is already gated. If `gated: false` or the property is not set, this restriction will not apply. ```yaml --- license: mit gated: true extra_gated_eu_disallowed: true --- ``` The system identifies a user's location based on their IP address. ### How to handle URL parameters in Spaces https://huggingface.co/docs/hub/spaces-handle-url-parameters.md # How to handle URL parameters in Spaces You can use URL query parameters as a data sharing mechanism, for instance to be able to deep-link into an app with a specific state. On a Space page (`https://huggingface.co/spaces/<user>/<app>`), the actual application page (`https://*.hf.space/`) is embedded in an iframe. The query string and the hash attached to the parent page URL are propagated to the embedded app on initial load, so the embedded app can read these values without special consideration. In contrast, updating the query string and the hash of the parent page URL from the embedded app is slightly more complex. If you want to do this in a Docker or static Space, you need to add the following JS code, which sends a message with a `queryString` and/or `hash` key to the parent page. ```js const queryString = "..."; const hash = "..."; window.parent.postMessage({ queryString, hash, }, "https://huggingface.co"); ``` **This is only for Docker or static Spaces.** For Streamlit apps, Spaces automatically syncs the URL parameters. Gradio apps can read the query parameters from the Spaces page, but do not sync updated URL parameters with the parent page. Note that the URL parameters of the parent page are propagated to the embedded app *only* on the initial load. So `location.hash` in the embedded app will not change even if the parent URL hash is updated using this method. An example of this method can be found in this static Space, [`whitphx/static-url-param-sync-example`](https://huggingface.co/spaces/whitphx/static-url-param-sync-example).
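Since Streamlit Spaces sync the URL parameters automatically, reading and updating them inside the app only requires the regular Streamlit API. Below is a minimal sketch, assuming a recent Streamlit version that exposes `st.query_params`; the `lang` parameter is purely illustrative.

```python
import streamlit as st

# Read a query parameter propagated from the parent Space page,
# e.g. ...?lang=fr on initial load
lang = st.query_params.get("lang", "en")
st.write(f"Current language: {lang}")

# Updating st.query_params changes the app URL, which Spaces keeps in sync
if st.button("Switch to French"):
    st.query_params["lang"] = "fr"
```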
### Video Dataset https://huggingface.co/docs/hub/datasets-video.md # Video Dataset This guide will show you how to configure your dataset repository with video files. A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub. Additional information about your videos - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`/`metadata.parquet`). Alternatively, videos can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format. ## Only videos If your dataset only consists of one column with videos, you can simply store your video files at the root: ``` my_dataset_repository/ ├── 1.mp4 ├── 2.mp4 ├── 3.mp4 └── 4.mp4 ``` or in a subdirectory: ``` my_dataset_repository/ └── videos ├── 1.mp4 ├── 2.mp4 ├── 3.mp4 └── 4.mp4 ``` Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including MP4, MOV and AVI. ``` my_dataset_repository/ └── videos ├── 1.mp4 ├── 2.mov └── 3.avi ``` If you have several splits, you can put your videos into directories named accordingly: ``` my_dataset_repository/ ├── train │   ├── 1.mp4 │   └── 2.mp4 └── test ├── 3.mp4 └── 4.mp4 ``` See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits. ## Additional columns If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like [video generation](https://huggingface.co/tasks/text-to-video) or [object detection](https://huggingface.co/tasks/object-detection). ``` my_dataset_repository/ └── train ├── 1.mp4 ├── 2.mp4 ├── 3.mp4 ├── 4.mp4 └── metadata.csv ``` Your `metadata.csv` file must have a `file_name` column which links video files with their metadata: ```csv file_name,text 1.mp4,an animation of a green pokemon with red eyes 2.mp4,a short video of a green and yellow toy with a red nose 3.mp4,a red and white ball shows an angry look on its face 4.mp4,a cartoon ball is smiling ``` You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`: ```jsonl {"file_name": "1.mp4","text": "an animation of a green pokemon with red eyes"} {"file_name": "2.mp4","text": "a short video of a green and yellow toy with a red nose"} {"file_name": "3.mp4","text": "a red and white ball shows an angry look on its face"} {"file_name": "4.mp4","text": "a cartoon ball is smiling"} ``` And for bigger datasets or if you are interested in advanced data retrieval features, you can use a [Parquet](https://parquet.apache.org/) file `metadata.parquet`. 
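As an illustration, here is a small sketch of how a `metadata.jsonl` file like the one above could be generated from a folder of videos; the captions dictionary and directory layout are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical captions for the videos stored in my_dataset_repository/train/
captions = {
    "1.mp4": "an animation of a green pokemon with red eyes",
    "2.mp4": "a short video of a green and yellow toy with a red nose",
}

train_dir = Path("my_dataset_repository/train")
with open(train_dir / "metadata.jsonl", "w") as f:
    for video_path in sorted(train_dir.glob("*.mp4")):
        # `file_name` must reference the video file relative to the metadata file
        record = {"file_name": video_path.name, "text": captions.get(video_path.name, "")}
        f.write(json.dumps(record) + "\n")
```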
## Relative paths Metadata file must be located either in the same directory with the videos it is linked to, or in any parent directory, like in this example: ``` my_dataset_repository/ └── train ├── videos │   ├── 1.mp4 │   ├── 2.mp4 │   ├── 3.mp4 │   └── 4.mp4 └── metadata.csv ``` In this case, the `file_name` column must be a full relative path to the videos, not just the filename: ```csv file_name,text videos/1.mp4,an animation of a green pokemon with red eyes videos/2.mp4,a short video of a green and yellow toy with a red nose videos/3.mp4,a red and white ball shows an angry look on its face videos/4.mp4,a cartoon ball is smiling ``` Metadata files cannot be put in subdirectories of a directory with the videos. More generally, any column named `file_name` or `*_file_name` should contain the full relative path to the videos. ## Video classification For video classification datasets, you can also use a simple setup: use directories to name the video classes. Store your video files in a directory structure like: ``` my_dataset_repository/ ├── green │   ├── 1.mp4 │   └── 2.mp4 └── red ├── 3.mp4 └── 4.mp4 ``` The dataset created with this structure contains two columns: `video` and `label` (with values `green` and `red`). You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information): ``` my_dataset_repository/ ├── test │   ├── green │   │   └── 2.mp4 │   └── red │   └── 4.mp4 └── train ├── green │   └── 1.mp4 └── red └── 3.mp4 ``` You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header: ```yaml configs: - config_name: default # Name of the dataset subset, if applicable. drop_labels: true ``` ## Large scale datasets ### WebDataset format The [WebDataset](./datasets-webdataset) format is well suited for large scale video datasets. It consists of TAR archives containing videos and their metadata and is optimized for streaming. It is useful if you have a large number of videos and to get streaming data loaders for large scale training. ``` my_dataset_repository/ ├── train-0000.tar ├── train-0001.tar ├── ... └── train-1023.tar ``` To make a WebDataset TAR archive, create a directory containing the videos and metadata files to be archived and create the TAR archive using e.g. the `tar` command. The usual size per archive is generally around 1GB. Make sure each video and metadata pair share the same file prefix, for example: ``` train-0000/ ├── 000.mp4 ├── 000.json ├── 001.mp4 ├── 001.json ├── ... ├── 999.mp4 └── 999.json ``` Note that for user convenience and to enable the [Dataset Viewer](./data-studio), every dataset hosted in the Hub is automatically converted to Parquet format up to 5GB. Since videos can be quite large, the URLs to the videos are stored in the converted Parquet data without the video bytes themselves. Read more about it in the [Parquet format](./data-studio#access-the-parquet-files) documentation. ### Libraries https://huggingface.co/docs/hub/models-libraries.md # Libraries The Hub has support for dozens of libraries in the Open Source ecosystem. Thanks to the `huggingface_hub` Python library, it's easy to enable sharing your models on the Hub. The Hub supports many libraries, and we're working on expanding this support. 
We're happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward. The table below summarizes the supported libraries and their level of integration. Find all our supported libraries in [the model-libraries.ts file](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts). | Library | Description | Inference API | Widgets | Download from Hub | Push to Hub | |-----------------------------------------------------------------------------|--------------------------------------------------------------------------------------|---|---:|---|---| | [Adapters](./adapters) | A unified Transformers add-on for parameter-efficient and modular fine-tuning. | ✅ | ✅ | ✅ | ✅ | | [AllenNLP](./allennlp) | An open-source NLP research library, built on PyTorch. | ✅ | ✅ | ✅ | ❌ | | [Asteroid](./asteroid) | PyTorch-based audio source separation toolkit | ✅ | ✅ | ✅ | ❌ | | [BERTopic](./bertopic) | BERTopic is a topic modeling library for text and images | ✅ | ✅ | ✅ | ✅ | | [Diffusers](./diffusers) | A modular toolbox for inference and training of diffusion models | ✅ | ✅ | ✅ | ✅ | | [docTR](https://github.com/mindee/doctr) | Models and datasets for OCR-related tasks in PyTorch & TensorFlow | ✅ | ✅ | ✅ | ❌ | | [ESPnet](./espnet) | End-to-end speech processing toolkit (e.g. TTS) | ✅ | ✅ | ✅ | ❌ | | [fastai](./fastai) | Library to train fast and accurate models with state-of-the-art outputs. | ✅ | ✅ | ✅ | ✅ | | [Keras](./keras) | Open-source multi-backend deep learning framework, with support for JAX, TensorFlow, and PyTorch. | ❌ | ❌ | ✅ | ✅ | | [KerasNLP](https://keras.io/guides/keras_nlp/upload/) | Natural language processing library built on top of Keras that works natively with TensorFlow, JAX, or PyTorch. | ❌ | ❌ | ✅ | ✅ | | [TF-Keras](./tf-keras) (legacy) | Legacy library that uses a consistent and simple API to build models leveraging TensorFlow and its ecosystem. | ❌ | ❌ | ✅ | ✅ | | [Flair](./flair) | Very simple framework for state-of-the-art NLP. | ✅ | ✅ | ✅ | ✅ | | [MBRL-Lib](https://github.com/facebookresearch/mbrl-lib) | PyTorch implementations of MBRL Algorithms. | ❌ | ❌ | ✅ | ✅ | | [MidiTok](https://github.com/Natooz/MidiTok) | Tokenizers for symbolic music / MIDI files. | ❌ | ❌ | ✅ | ✅ | | [ML-Agents](./ml-agents) | Enables games and simulations made with Unity to serve as environments for training intelligent agents. | ❌ | ❌ | ✅ | ✅ | | [MLX](./mlx) | Model training and serving framework on Apple silicon made by Apple. | ❌ | ❌ | ✅ | ✅ | | [NeMo](https://github.com/NVIDIA/NeMo) | Conversational AI toolkit built for researchers | ✅ | ✅ | ✅ | ❌ | | [OpenCLIP](./open_clip) | Library for open-source implementation of OpenAI's CLIP | ❌ | ❌ | ✅ | ✅ | | [PaddleNLP](./paddlenlp) | Easy-to-use and powerful NLP library built on PaddlePaddle | ✅ | ✅ | ✅ | ✅ | | [PEFT](./peft) | Cutting-edge Parameter Efficient Fine-tuning Library | ✅ | ✅ | ✅ | ✅ | | [Pyannote](https://github.com/pyannote/pyannote-audio) | Neural building blocks for speaker diarization. 
| ❌ | ❌ | ✅ | ❌ | | [PyCTCDecode](https://github.com/kensho-technologies/pyctcdecode) | Language model supported CTC decoding for speech recognition | ❌ | ❌ | ✅ | ❌ | | [Pythae](https://github.com/clementchadebec/benchmark_VAE) | Unified framework for Generative Autoencoders in Python | ❌ | ❌ | ✅ | ✅ | | [RL-Baselines3-Zoo](./rl-baselines3-zoo) | Training framework for Reinforcement Learning, using [Stable Baselines3](https://github.com/DLR-RM/stable-baselines3).| ❌ | ✅ | ✅ | ✅ | | [Sample Factory](./sample-factory) | Codebase for high throughput asynchronous reinforcement learning. | ❌ | ✅ | ✅ | ✅ | | [Sentence Transformers](./sentence-transformers) | Compute dense vector representations for sentences, paragraphs, and images. | ✅ | ✅ | ✅ | ✅ | | [SetFit](./setfit) | Efficient few-shot text classification with Sentence Transformers | ✅ | ✅ | ✅ | ✅ | | [spaCy](./spacy) | Advanced Natural Language Processing in Python and Cython. | ✅ | ✅ | ✅ | ✅ | | [SpanMarker](./span_marker) | Familiar, simple and state-of-the-art Named Entity Recognition. | ✅ | ✅ | ✅ | ✅ | | [Scikit Learn (using skops)](https://skops.readthedocs.io/en/stable/) | Machine Learning in Python. | ✅ | ✅ | ✅ | ✅ | | [Speechbrain](./speechbrain) | A PyTorch Powered Speech Toolkit. | ✅ | ✅ | ✅ | ❌ | | [Stable-Baselines3](./stable-baselines3) | Set of reliable implementations of deep reinforcement learning algorithms in PyTorch | ❌ | ✅ | ✅ | ✅ | | [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) | Real-time state-of-the-art speech synthesis architectures. | ❌ | ❌ | ✅ | ❌ | | [Timm](./timm) | Collection of image models, scripts, pretrained weights, etc. | ✅ | ✅ | ✅ | ✅ | | [Transformers](./transformers) | State-of-the-art Natural Language Processing for PyTorch, TensorFlow, and JAX | ✅ | ✅ | ✅ | ✅ | | [Transformers.js](./transformers-js) | State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server! | ❌ | ❌ | ✅ | ❌ | | [Unity Sentis](./unity-sentis) | Inference engine for the Unity 3D game engine | ❌ | ❌ | ❌ | ❌ | ### How can I add a new library to the Inference API? If you're interested in adding your library, please reach out to us! Read about it in [Adding a Library Guide](./models-adding-libraries). ### How to configure SAML SSO with Google Workspace https://huggingface.co/docs/hub/security-sso-google-saml.md # How to configure SAML SSO with Google Workspace In this guide, we will use Google Workspace as the SSO provider and with the Security Assertion Markup Language (SAML) protocol as our preferred identity protocol. We currently support SP-initiated and IdP-initiated authentication. User provisioning is part of Enterprise Plus's [Advanced SSO](./enterprise-hub-advanced-sso). > [!WARNING] > This feature is part of the Team & Enterprise plans. ### Step 1: Create SAML App in Google Workspace - In your Google Workspace admin console, navigate to `Admin` > `Apps` > `Web and mobile apps`. - Click `Add app` and then `Add custom SAML app`. - You must provide a name for your application in the "App name" field. - Click `Continue`. ### Step 2: Configure Hugging Face with Google's IdP Details - The next screen in the Google setup contains the SSO information for your application. - In your Hugging Face organization settings, go to the `SSO` tab and select the `SAML` protocol. - Copy the **SSO URL** from Google into the **Sign-on URL** field on Hugging Face. - Copy the **Certificate** from Google into the corresponding field on Hugging Face. 
The public certificate must have the following format: ``` -----BEGIN CERTIFICATE----- {certificate} -----END CERTIFICATE----- ``` - In the Google Workspace setup, click `Continue`. ### Step 3: Configure Google with Hugging Face's SP Details - In the "Service provider details" screen, you'll need the `Assertion Consumer Service URL` and `SP Entity ID` from your Hugging Face SSO settings. Copy them into the corresponding `ACS URL` and `Entity ID` fields in Google. - Ensure the following are set: - Check the **Signed response** box. - Name ID format: `EMAIL` - Name ID: `Basic Information > Primary email` - Click `Continue`. ### Step 4: Attribute Mapping - On the "Attribute mapping" screen, click `Add mapping` and configure the attributes you want to send. This step is optional and depends on whether you want to use [Role Mapping](./security-sso#role-mapping) or [Resource Group Mapping](./security-sso#resource-group-mapping) on Hugging Face. - Click `Finish`. ### Step 5: Test and Enable SSO > [!WARNING] > Before testing, ensure you have granted access to the application for the appropriate users in the Google Workspace admin console under the app's "User access" settings. The admin performing the test must have access. It may take a few minutes for user access changes to apply on Google Workspace. - Now, in your Hugging Face SSO settings, click on **"Update and Test SAML configuration"**. - You should be redirected to your Google login prompt. Once logged in, you'll be redirected to your organization's settings page. - A green check mark near the SAML selector will confirm that the test was successful. - Once the test is successful, you can enable SSO for your organization by clicking the "Enable" button. - Once enabled, members of your organization must complete the SSO authentication flow described in the [How does it work?](./security-sso#how-does-it-work) section. ### Agents on the Hub https://huggingface.co/docs/hub/agents.md # Agents on the Hub This page compiles all the libraries and tools Hugging Face offers for agentic workflows: - [HF MCP Server](./hf-mcp-server): Connect your MCP-compatible AI assistant directly to the Hugging Face Hub. - `tiny-agents`: A lightweight toolkit for MCP-powered agents, available in both JS (`@huggingface/tiny-agents`) and Python (`huggingface_hub`). - `Gradio MCP Server`: Easily create MCP servers from Gradio apps and Spaces. - `smolagents`: a Python library that enables you to run powerful agents in a few lines of code. ## HF MCP Server The official **Hugging Face MCP (Model Context Protocol) Server** enables seamless integration between the Hugging Face Hub and any MCP-compatible AI assistant—including VSCode, Cursor, and Claude Desktop. With the HF MCP Server, you can enhance your AI assistant's capabilities by connecting directly to the Hub's ecosystem. It comes with: - a curated set of **built-in tools** like Spaces and Papers Semantic Search, Model and Dataset exploration, etc - **MCP-compatible Gradio apps**: Connect to any [MCP-compatible Gradio app](https://huggingface.co/spaces?filter=mcp-server) built by the Hugging Face community #### Getting Started Visit [huggingface.co/settings/mcp](https://huggingface.co/settings/mcp) to configure your MCP client and get started. Read the dedicated one‑page guide: [HF MCP Server](./hf-mcp-server). > [!WARNING] > This feature is experimental ⚗️ and will continue to evolve. ## tiny-agents (JS and Python) NEW: tiny-agents now supports [AGENTS.md](https://agents.md/) standard. 
🥳 `tiny-agents` is a lightweight toolkit for running and building MCP-powered agents on top of the Hugging Face Inference Client + Model Context Protocol (MCP). It is available as a JS package `@huggingface/tiny-agents` and in the `huggingface_hub` Python package. ### @huggingface/tiny-agents (JS) The `@huggingface/tiny-agents` package offers a simple and straightforward CLI and a simple programmatic API for running and building MCP-powered agents in JS. **Getting Started** First, you need to install the package: ```bash npm install @huggingface/tiny-agents # or pnpm add @huggingface/tiny-agents ``` Then, you can your agent: ```bash npx @huggingface/tiny-agents [command] "agent/id" Usage: tiny-agents [flags] tiny-agents run "agent/id" tiny-agents serve "agent/id" Available Commands: run Run the Agent in command-line serve Run the Agent as an OpenAI-compatible HTTP server ``` You can load agents directly from the [tiny-agents](https://huggingface.co/datasets/tiny-agents/tiny-agents) Dataset, or specify a path to your own local agent configuration. **Advanced Usage** In addition to the CLI, you can use the `Agent` class for more fine-grained control. For lower-level interactions, use the `MCPClient` from the `@huggingface/mcp-client` package to connect directly to MCP servers and manage tool calls. Learn more about tiny-agents in the [huggingface.js documentation](https://huggingface.co/docs/huggingface.js/en/tiny-agents/README). ### huggingface_hub (Python) The `huggingface_hub` library is the easiest way to run MCP-powered agents in Python. It includes a high-level `tiny-agents` CLI as well as programmatic access via the `Agent` and `MCPClient` classes — all built to work with [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers/index), local LLMs, or any inference endpoint compatible with OpenAI's API specs. **Getting started** Install the latest version with MCP support: ```bash pip install "huggingface_hub[mcp]>=0.32.2" ``` Then, you can run your agent: ```bash > tiny-agents run --help Usage: tiny-agents run [OPTIONS] [PATH] COMMAND [ARGS]... Run the Agent in the CLI ╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ path [PATH] Path to a local folder containing an agent.json file or a built-in agent stored in the 'tiny-agents/tiny-agents' Hugging Face dataset │ │ (https://huggingface.co/datasets/tiny-agents/tiny-agents) │ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ --help Show this message and exit. │ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ``` The CLI pulls the config, connects to its MCP servers, prints the available tools, and waits for your prompt. **Advanced Usage** For more fine-grained control, use the `MCPClient` directly. This low-level interface extends `AsyncInferenceClient` and allows LLMs to call tools via the Model Context Protocol (MCP). 
It supports both local (`stdio`) and remote (`http`/`sse`) MCP servers, handles tool registration and execution, and streams results back to the model in real-time. Learn more in the [`huggingface_hub` MCP documentation](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/mcp). ### Custom Agents To create your own agent, simply create a folder (e.g., `my-agent/`) and define your agent’s configuration in an `agent.json` file. The following example shows a web-browsing agent configured to use the [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) model via the Nebius inference provider, equipped with a Playwright MCP server that lets it use a web browser: ```json { "model": "Qwen/Qwen2.5-72B-Instruct", "provider": "nebius", "servers": [ { "type": "stdio", "command": "npx", "args": ["@playwright/mcp@latest"] } ] } ``` To use a local LLM (such as [llama.cpp](https://github.com/ggerganov/llama.cpp), or [LM Studio](https://lmstudio.ai/)), just provide an `endpointUrl`: ```json { "model": "Qwen/Qwen3-32B", "endpointUrl": "http://localhost:1234/v1", "servers": [ { "type": "stdio", "command": "npx", "args": ["@playwright/mcp@latest"] } ] } ``` Optionally, add a `PROMPT.md` to customize the system prompt. > [!TIP] > Don't hesitate to contribute your agent to the community by opening a Pull Request in the [tiny-agents](https://huggingface.co/datasets/tiny-agents/tiny-agents) Hugging Face dataset. ## Gradio MCP Server / Tools You can build an MCP server in just a few lines of Python with Gradio. If you have an existing Gradio app or Space you'd like to use as an MCP server / tool, it's just a single-line change. To make a Gradio application an MCP server, simply pass in `mcp_server=True` when launching your demo as follows. ```python # pip install gradio import gradio as gr def generate_image(prompt: str): """ Generate an image based on a text prompt Args: prompt: a text string describing the image to generate """ pass demo = gr.Interface( fn=generate_image, inputs="text", outputs="image", title="Image Generator" ) demo.launch(mcp_server=True) ``` The MCP server will be available at `http://your-space-id.hf.space/gradio_api/mcp/sse` where your application is served. It will have a tool corresponding to each function in your Gradio app, with the tool description automatically generated from the docstrings of your functions. Lastly, add this to the settings of the MCP Client of your choice (e.g. Cursor). ```json { "mcpServers": { "gradio": { "url": "http://your-server:port/gradio_api/mcp/sse" } } } ``` This is very powerful because it lets the LLM use any Gradio application as a tool. You can find thousands of them on [Spaces](https://huggingface.co/spaces). Learn more [here](https://www.gradio.app/guides/building-mcp-server-with-gradio). ## smolagents [smolagents](https://github.com/huggingface/smolagents) is a lightweight library to cover all agentic use cases, from code-writing agents to computer use, in a few lines of code. It is model agnostic, supporting local models served with Hugging Face Transformers, as well as models offered with [Inference Providers](../inference-providers/index.md), and proprietary model providers. It offers a unique kind of agent: `CodeAgent`, an agent that writes its actions in Python code. It also supports the standard agent that writes actions in JSON blobs as most other agentic frameworks do, called `ToolCallingAgent`.
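As a quick illustration of the programmatic API, here is a minimal sketch of a `CodeAgent` (it assumes a recent smolagents release where the Inference Providers backend is exposed as `InferenceClientModel`; the model id and the search tool are only examples):

```python
# pip install smolagents
# (depending on the smolagents version, the search tool may require the "toolkit" extra)
from smolagents import CodeAgent, InferenceClientModel, DuckDuckGoSearchTool

# An Inference Providers-backed model; the model id below is illustrative.
model = InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")

# A CodeAgent writes its actions as Python code and can call the provided tools.
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=model)

agent.run("What is the current population of Tokyo, and what was it ten years ago?")
```

Swapping `CodeAgent` for `ToolCallingAgent` keeps the same interface, but the agent then emits its actions as JSON tool calls instead of Python code.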
To learn more about writing actions in code vs JSON, check out our [new short course on DeepLearning.AI](https://www.deeplearning.ai/short-courses/building-code-agents-with-hugging-face-smolagents/). If you want to avoid defining agents yourself, the easiest way to start an agent is through the CLI, using the `smolagent` command. ```bash smolagent "Plan a trip to Tokyo, Kyoto and Osaka between Mar 28 and Apr 7." \ --model-type "InferenceClientModel" \ --model-id "Qwen/Qwen2.5-Coder-32B-Instruct" \ --imports "pandas numpy" \ --tools "web_search" ``` Agents can be pushed to the Hugging Face Hub as Spaces. Check out all the cool agents people have built [here](https://huggingface.co/spaces?filter=smolagents&sort=likes). smolagents also supports MCP servers as tools, as follows: ```python # pip install --upgrade smolagents mcp from smolagents import MCPClient, CodeAgent, InferenceClientModel from mcp import StdioServerParameters import os model = InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct") server_parameters = StdioServerParameters( command="uvx", # Using uvx ensures dependencies are available args=["--quiet", "pubmedmcp@0.1.3"], env={"UV_PYTHON": "3.12", **os.environ}, ) with MCPClient(server_parameters) as tools: agent = CodeAgent(tools=tools, model=model, add_base_tools=True) agent.run("Please find the latest research on COVID-19 treatment.") ``` Learn more [in the documentation](https://huggingface.co/docs/smolagents/tutorials/tools#use-mcp-tools-with-mcpclient-directly). ### Streamlit Spaces https://huggingface.co/docs/hub/spaces-sdks-streamlit.md # Streamlit Spaces **Streamlit** gives users the freedom to build a full-featured web app with Python in a *reactive* way. Your code is rerun each time the state of the app changes. Streamlit is also great for data visualization and supports several charting libraries such as Bokeh, Plotly, and Altair. Read this [blog post](https://huggingface.co/blog/streamlit-spaces) about building and hosting Streamlit apps in Spaces. Selecting **Streamlit** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Space with the latest version of Streamlit by setting the `sdk` property to `streamlit` in your `README.md` file's YAML block. If you'd like to change the Streamlit version, you can edit the `sdk_version` property. To use Streamlit in a Space, select **Streamlit** as the SDK when you create a Space through the [**New Space** form](https://huggingface.co/new-space). This will create a repository with a `README.md` that contains the following properties in the YAML configuration block: ```yaml sdk: streamlit sdk_version: 1.25.0 # The latest supported version ``` You can edit the `sdk_version`, but note that issues may occur when you use an unsupported Streamlit version. Not all Streamlit versions are supported, so please refer to the [reference section](./spaces-config-reference) to see which versions are available. For in-depth information about Streamlit, refer to the [Streamlit documentation](https://docs.streamlit.io/). > [!WARNING] > Only port 8501 is allowed for Streamlit Spaces (default port). As a result, if you provide a `config.toml` file for your Space, make sure the default port is not overridden. ## Your First Streamlit Space: Hot Dog Classifier In the following sections, you'll learn the basics of creating a Space, configuring it, and deploying your code to it.
We'll create a **Hot Dog Classifier** Space with Streamlit that'll be used to demo the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which can detect whether a given picture contains a hot dog 🌭 You can find a completed version of this hosted at [NimaBoscarino/hotdog-streamlit](https://huggingface.co/spaces/NimaBoscarino/hotdog-streamlit). ## Create a new Streamlit Space We'll start by [creating a brand new Space](https://huggingface.co/new-space) and choosing **Streamlit** as our SDK. Hugging Face Spaces are Git repositories, meaning that you can work on your Space incrementally (and collaboratively) by pushing commits. Take a look at the [Getting Started with Repositories](./repositories-getting-started) guide to learn about how you can create and edit files before continuing. ## Add the dependencies For the **Hot Dog Classifier** we'll be using a [🤗 Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to run the model, so we need to start by installing a few dependencies. This can be done by creating a **requirements.txt** file in our repository, and adding the following dependencies to it: ``` transformers torch ``` The Spaces runtime will handle installing the dependencies! ## Create the Streamlit app To create the Streamlit app, make a new file in the repository called **app.py**, and add the following code: ```python import streamlit as st from transformers import pipeline from PIL import Image pipeline = pipeline(task="image-classification", model="julien-c/hotdog-not-hotdog") st.title("Hot Dog? Or Not?") file_name = st.file_uploader("Upload a hot dog candidate image") if file_name is not None: col1, col2 = st.columns(2) image = Image.open(file_name) col1.image(image, use_column_width=True) predictions = pipeline(image) col2.header("Probabilities") for p in predictions: col2.subheader(f"{ p['label'] }: { round(p['score'] * 100, 1)}%") ``` This Python script uses a [🤗 Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to load the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which is used by the Streamlit interface. The Streamlit app will expect you to upload an image, which it'll then classify as *hot dog* or *not hot dog*. Once you've saved the code to the **app.py** file, visit the **App** tab to see your app in action! ## Embed Streamlit Spaces on other webpages You can use the HTML `<iframe>` tag to embed a Streamlit Space as an inline frame on other webpages. Simply include the URL of your Space, ending with the `.hf.space` suffix. To find the URL of your Space, you can use the "Embed this Space" button from the Spaces options. For example, the demo above can be embedded in these docs with the following tag: ``` <iframe src="https://nimaboscarino-hotdog-streamlit.hf.space/?embed=true" frameborder="0" width="850" height="450" ></iframe> ``` Please note that we have added `?embed=true` to the URL, which activates the embed mode of the Streamlit app, removing some spacers and the footer for slim embeds. ## Embed Streamlit Spaces with auto-resizing IFrames Streamlit has supported automatic iframe resizing since [1.17.0](https://docs.streamlit.io/library/changelog#version-1170) so that the size of the parent iframe is automatically adjusted to fit the content volume of the embedded Streamlit application. It relies on the [`iFrame Resizer`](https://github.com/davidjbradshaw/iframe-resizer) library, for which you need to add a few lines of code, as in the following example where - `id` is set on the `<iframe />` element and is used to specify the auto-resize target.
- The `iFrame Resizer` is loaded via the `script` tag. - The `iFrameResize()` function is called with the ID of the target `iframe` element, so that its size changes automatically. We can pass options to the first argument of `iFrameResize()`. See [the document](https://github.com/davidjbradshaw/iframe-resizer/blob/master/docs/parent_page/options.md) for the details. ```html <iframe id="your-iframe-id" src="https://<space-subdomain>.hf.space" frameborder="0" width="850" height="450" ></iframe> <!-- load the iFrame Resizer library from a CDN so that iFrameResize() is available --> <script> iFrameResize({}, "#your-iframe-id") </script> ``` Additionally, you can check out [our documentation](./spaces-embed). ### Using PEFT at Hugging Face https://huggingface.co/docs/hub/peft.md # Using PEFT at Hugging Face 🤗 [Parameter-Efficient Fine-Tuning (PEFT)](https://huggingface.co/docs/peft/index) is a library for efficiently adapting pre-trained language models to various downstream applications without fine-tuning all the model’s parameters. ## Exploring PEFT on the Hub You can find PEFT models by filtering at the left of the [models page](https://huggingface.co/models?library=peft&sort=trending). ## Installation To get started, you can check out the [Quick Tour in the PEFT docs](https://huggingface.co/docs/peft/quicktour). To install, follow the [PEFT installation guide](https://huggingface.co/docs/peft/install). You can also use the following one-line install through pip: ``` $ pip install peft ``` ## Using existing models All PEFT models can be loaded from the Hub. To use a PEFT model you also need to load the base model that was fine-tuned, as shown below. Every fine-tuned model has the base model in its model card. ```py from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel, PeftConfig base_model = "mistralai/Mistral-7B-v0.1" adapter_model = "dfurman/Mistral-7B-Instruct-v0.2" model = AutoModelForCausalLM.from_pretrained(base_model) model = PeftModel.from_pretrained(model, adapter_model) tokenizer = AutoTokenizer.from_pretrained(base_model) model = model.to("cuda") model.eval() ``` Once loaded, you can pass your inputs to the tokenizer to prepare them, and call `model.generate()` in regular `transformers` fashion. ```py import torch inputs = tokenizer("Tell me the recipe for chocolate chip cookie", return_tensors="pt") with torch.no_grad(): outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10) print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]) ``` It outputs the following: ```text Tell me the recipe for chocolate chip cookie dough. 1. Preheat oven to 375 degrees F (190 degrees C). 2. In a large bowl, cream together 1/2 cup (1 stick) of butter or margarine, 1/2 cup granulated sugar, and 1/2 cup packed brown sugar. 3. Beat in 1 egg and 1 teaspoon vanilla extract. 4. Mix in 1 1/4 cups all-purpose flour. 5. Stir in 1/2 teaspoon baking soda and 1/2 teaspoon salt. 6. Fold in 3/4 cup semisweet chocolate chips. 7. Drop by ``` If you want to load a specific PEFT model, you can click `Use in PEFT` in the model card and you will be given a working snippet! ## Additional resources * PEFT [repository](https://github.com/huggingface/peft) * PEFT [docs](https://huggingface.co/docs/peft/index) * PEFT [models](https://huggingface.co/models?library=peft&sort=trending) ### Model Card components https://huggingface.co/docs/hub/model-cards-components.md # Model Card components **Model Card Components** are special elements that you can inject directly into your Model Card markdown to display powerful custom components in your model page.
These components are authored by us; feel free to share ideas about new Model Card components in [this discussion](https://huggingface.co/spaces/huggingface/HuggingDiscussions/discussions/17). ## The Gallery component The `<Gallery />` component can be used in your model card to showcase your generated images and videos. ### How to use it? 1. Update your Model Card [widget metadata](/docs/hub/models-widgets-examples#text-to-image) to add the media you want to showcase. ```yaml widget: - text: a girl wandering through the forest output: url: images/6CD03C101B7F6545EB60E9F48D60B8B3C2D31D42D20F8B7B9B149DD0C646C0C2.jpeg - text: a tiny witch child output: url: images/7B482E1FDB39DA5A102B9CD041F4A2902A8395B3835105C736C5AD9C1D905157.jpeg - text: an artist leaning over to draw something output: url: images/7CCEA11F1B74C8D8992C47C1C5DEA9BD6F75940B380E9E6EC7D01D85863AF718.jpeg ``` 2. Add the `<Gallery />` component to your card. The widget metadata will be used by the `<Gallery />` component to display the media with each associated prompt. ```md <Gallery /> ## Model description A very classic hand drawn cartoon style. ``` See result [here](https://huggingface.co/alvdansen/littletinies#little-tinies). > Hint: Support of Card Components through the GUI editor coming soon... ### Hugging Face Hub documentation https://huggingface.co/docs/hub/index.md # Hugging Face Hub documentation The Hugging Face Hub is a platform with over 2M models, 500k datasets, and 1M demo apps (Spaces), all open source and publicly available, where people can easily collaborate and build ML together. The Hub works as a central place where anyone can explore, experiment, collaborate, and build technology with Machine Learning. Are you ready to join the path towards open source Machine Learning? 🤗 ## What's the Hugging Face Hub? We are helping the community work together towards the goal of advancing Machine Learning 🔥. The Hugging Face Hub is a platform with over 2M models, 500k datasets, and 1M demos in which people can easily collaborate in their ML workflows. The Hub works as a central place where anyone can share, explore, discover, and experiment with open-source Machine Learning. No single company, including the Tech Titans, will be able to “solve AI” by themselves – the only way we'll achieve this is by sharing knowledge and resources in a community-centric approach.
We are building the largest open-source collection of models, datasets, and demos on the Hugging Face Hub to democratize and advance ML for everyone 🚀. We encourage you to read the [Code of Conduct](https://huggingface.co/code-of-conduct) and the [Content Guidelines](https://huggingface.co/content-guidelines) to familiarize yourself with the values that we expect our community members to uphold 🤗. ## What can you find on the Hub? The Hugging Face Hub hosts Git-based repositories, which are version-controlled buckets that can contain all your files. 💾 On it, you'll be able to upload and discover... - Models: _hosting the latest state-of-the-art models for LLM, text, vision, and audio tasks_ - Datasets: _featuring a wide variety of data for different domains and modalities_ - Spaces: _interactive apps for demonstrating ML models directly in your browser_ The Hub offers **versioning, commit history, diffs, branches, and over a dozen library integrations**! All repositories build on [Xet](./xet/index), a new technology for efficiently storing large files inside Git, intelligently splitting files into unique chunks and accelerating uploads and downloads. You can learn more about the features that all repositories share in the [**Repositories documentation**](./repositories). ## Models You can discover and use over a million open-source ML models shared by the community. To promote responsible model usage and development, model repos are equipped with [Model Cards](./model-cards) to inform users of each model's limitations and biases. Additional [metadata](./model-cards#model-card-metadata) such as their tasks, languages, and evaluation results can be included, and training metrics charts are even added if the repository contains [TensorBoard traces](./tensorboard). It's also easy to add an [**inference widget**](./models-widgets) to your model, allowing anyone to play with the model directly in the browser! For programmatic access, a serverless API is provided by [**Inference Providers**](./models-inference). To upload models to the Hub, or download models and integrate them into your work, explore the [**Models documentation**](./models). You can also choose from [**over a dozen libraries**](./models-libraries) such as 🤗 Transformers, Asteroid, and ESPnet that support the Hub. ## Datasets The Hub is home to over 500k public datasets in more than 8k languages that can be used for a broad range of tasks across NLP, Computer Vision, and Audio. The Hub makes it simple to find, download, and upload datasets. Datasets are accompanied by extensive documentation in the form of [**Dataset Cards**](./datasets-cards) and [**Data Studio**](./datasets-viewer) to let you explore the data directly in your browser. While many datasets are public, [**organizations**](./organizations) and individuals can create private datasets to comply with licensing or privacy requirements. You can learn more about [**Datasets here on the Hugging Face Hub documentation**](./datasets-overview). The [🤗 `datasets`](https://huggingface.co/docs/datasets/index) library allows you to programmatically interact with the datasets, so you can easily use datasets from the Hub in your projects. With a single line of code, you can access the datasets; even if they are too large to fit on your computer, you can use streaming to efficiently access the data. ## Spaces [Spaces](https://huggingface.co/spaces) is a simple way to host ML demo apps on the Hub.
They allow you to build your ML portfolio, showcase your projects at conferences or to stakeholders, and work collaboratively with other people in the ML ecosystem. We currently support two awesome Python SDKs (**[Gradio](https://gradio.app/)** and **[Streamlit](./spaces-sdks-streamlit)**) that let you build cool apps in a matter of minutes. Users can also create static Spaces, which are simple HTML/CSS/JavaScript pages, or deploy any Docker-based application. If you need GPU power for your demos, try [**ZeroGPU**](./spaces-zerogpu): it dynamically provides NVIDIA H200 GPUs, in real-time, only when needed. After you've explored a few Spaces (take a look at our [Space of the Week!](https://huggingface.co/spaces)), dive into the [**Spaces documentation**](./spaces-overview) to learn all about how you can create your own Space. You'll also be able to upgrade your Space to run on a GPU or other accelerated hardware. ⚡️ ## Organizations Companies, universities and non-profits are an essential part of the Hugging Face community! The Hub offers [**Organizations**](./organizations), which can be used to group accounts and manage datasets, models, and Spaces. Educators can also create collaborative organizations for students using [Hugging Face for Classrooms](https://huggingface.co/classrooms). An organization's repositories will be featured on the organization’s page and every member of the organization will have the ability to contribute to the repository. In addition to conveniently grouping all of an organization's work, the Hub allows admins to set roles to [**control access to repositories**](./organizations-security), and manage their organization's [payment method and billing info](https://huggingface.co/pricing). Machine Learning is more fun when collaborating! 🔥 [Explore existing organizations](https://huggingface.co/organizations), create a new organization [here](https://huggingface.co/organizations/new), and then visit the [**Organizations documentation**](./organizations) to learn more. ## Security The Hugging Face Hub supports security and access control features to give you the peace of mind that your code, models, and data are safe. Visit the [**Security**](./security) section in these docs to learn about: - User Access Tokens - Access Control for Organizations - Signing commits with GPG - Malware scanning ### Perform SQL operations https://huggingface.co/docs/hub/datasets-duckdb-sql.md # Perform SQL operations Performing SQL operations with DuckDB opens up a world of possibilities for querying datasets efficiently. Let's dive into some examples showcasing the power of DuckDB functions. For our demonstration, we'll explore a fascinating dataset. The [MMLU](https://huggingface.co/datasets/cais/mmlu) dataset is a multitask test containing multiple-choice questions spanning various knowledge domains. 
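The queries below are written for the DuckDB CLI, but they can also be run from Python. Here is a minimal sketch (assuming a recent `duckdb` release where `hf://` paths are resolved through the autoloaded `httpfs` extension; the query mirrors one used later in this section):

```python
# pip install duckdb pandas
import duckdb

# Count how many questions each subject contributes in the MMLU test split.
relation = duckdb.sql(
    """
    SELECT subject, COUNT(*) AS counts
    FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
    GROUP BY subject
    ORDER BY counts DESC
    """
)
print(relation.df().head())  # convert to a pandas DataFrame for inspection
```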
To preview the dataset, let's select a sample of 3 rows: ```bash FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3; ┌──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┐ │ question │ subject │ choices │ answer │ │ varchar │ varchar │ varchar[] │ int64 │ ├──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤ │ The model of light… │ conceptual_physics │ [wave model, particle model, Both of these, Neither of these] │ 1 │ │ A person who is lo… │ professional_psych… │ [his/her life scripts., his/her own feelings, attitudes, and beliefs., the emotional reactions and behaviors of the people he/she is interacting with.… │ 1 │ │ The thermic effect… │ nutrition │ [is substantially higher for carbohydrate than for protein, is accompanied by a slight decrease in body core temperature., is partly related to sympat… │ 2 │ └──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘ ``` This command retrieves a random sample of 3 rows from the dataset for us to examine. Let's start by examining the schema of our dataset. The following table outlines the structure of our dataset: ```bash DESCRIBE FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3; ┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐ │ column_name │ column_type │ null │ key │ default │ extra │ │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ ├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤ │ question │ VARCHAR │ YES │ │ │ │ │ subject │ VARCHAR │ YES │ │ │ │ │ choices │ VARCHAR[] │ YES │ │ │ │ │ answer │ BIGINT │ YES │ │ │ │ └─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘ ``` Next, let's analyze if there are any duplicated records in our dataset: ```bash SELECT *, COUNT(*) AS counts FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' GROUP BY ALL HAVING counts > 2; ┌──────────┬─────────┬───────────┬────────┬────────┐ │ question │ subject │ choices │ answer │ counts │ │ varchar │ varchar │ varchar[] │ int64 │ int64 │ ├──────────┴─────────┴───────────┴────────┴────────┤ │ 0 rows │ └──────────────────────────────────────────────────┘ ``` Fortunately, our dataset doesn't contain any duplicate records. 
Let's see the proportion of questions based on the subject in a bar representation: ```bash SELECT subject, COUNT(*) AS counts, BAR(COUNT(*), 0, (SELECT COUNT(*) FROM 'hf://datasets/cais/mmlu/all/test-*.parquet')) AS percentage FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' GROUP BY subject ORDER BY counts DESC; ┌──────────────────────────────┬────────┬────────────────────────────────────────────────────────────────────────────────┐ │ subject │ counts │ percentage │ │ varchar │ int64 │ varchar │ ├──────────────────────────────┼────────┼────────────────────────────────────────────────────────────────────────────────┤ │ professional_law │ 1534 │ ████████▋ │ │ moral_scenarios │ 895 │ █████ │ │ miscellaneous │ 783 │ ████▍ │ │ professional_psychology │ 612 │ ███▍ │ │ high_school_psychology │ 545 │ ███ │ │ high_school_macroeconomics │ 390 │ ██▏ │ │ elementary_mathematics │ 378 │ ██▏ │ │ moral_disputes │ 346 │ █▉ │ ├──────────────────────────────┴────────┴────────────────────────────────────────────────────────────────────────────────┤ │ 57 rows (8 shown) 3 columns │ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ ``` Now, let's prepare a subset of the dataset containing questions related to **nutrition** and create a mapping of questions to correct answers. Notice that we have the column **choices** from which we can get the correct answer using the **answer** column as an index. ```bash SELECT * FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' WHERE subject = 'nutrition' LIMIT 3; ┌──────────────────────┬───────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┐ │ question │ subject │ choices │ answer │ │ varchar │ varchar │ varchar[] │ int64 │ ├──────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤ │ Which foods tend t… │ nutrition │ [Meat, Confectionary, Fruits and vegetables, Potatoes] │ 2 │ │ In which one of th… │ nutrition │ [If the incidence rate of the disease falls., If survival time with the disease increases., If recovery of the disease is faster., If the population in which the… │ 1 │ │ Which of the follo… │ nutrition │ [The flavonoid class comprises flavonoids and isoflavonoids., The digestibility and bioavailability of isoflavones in soya food products are not changed by proce… │ 0 │ └──────────────────────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘ ``` ```bash SELECT question, choices[answer] AS correct_answer FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' WHERE subject = 'nutrition' LIMIT 3; ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────┐ │ question │ correct_answer │ │ varchar │ varchar │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────┤ │ Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)?\n │ Confectionary │ │ In which one of the following circumstances will the prevalence of a 
disease in the population increase, all else being constant?\n │ If the incidence rate of the disease falls. │ │ Which of the following statements is correct?\n │ │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────┘ ``` To ensure data cleanliness, let's remove any newline characters at the end of the questions and filter out any empty answers: ```bash SELECT regexp_replace(question, '\n', '') AS question, choices[answer] AS correct_answer FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' WHERE subject = 'nutrition' AND LENGTH(correct_answer) > 0 LIMIT 3; ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────┐ │ question │ correct_answer │ │ varchar │ varchar │ ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────┤ │ Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)? │ Confectionary │ │ In which one of the following circumstances will the prevalence of a disease in the population increase, all else being constant? │ If the incidence rate of the disease falls. │ │ Which vitamin is a major lipid-soluble antioxidant in cell membranes? │ Vitamin D │ └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────┘ ``` Finally, lets highlight some of the DuckDB functions used in this section: - `DESCRIBE`, returns the table schema. - `USING SAMPLE`, samples are used to randomly select a subset of a dataset. - `BAR`, draws a band whose width is proportional to (x - min) and equal to width characters when x = max. Width defaults to 80. - `string[begin:end]`, extracts a string using slice conventions. Missing begin or end arguments are interpreted as the beginning or end of the list respectively. Negative values are accepted. - `regexp_replace`, if the string contains the regexp pattern, replaces the matching part with replacement. - `LENGTH`, gets the number of characters in the string. > [!TIP] > There are plenty of useful functions available in DuckDB's [SQL functions overview](https://duckdb.org/docs/sql/functions/overview). The best part is that you can use them directly on Hugging Face datasets. ### Data Studio https://huggingface.co/docs/hub/data-studio.md # Data Studio Each dataset page includes a table with the contents of the dataset, arranged by pages of 100 rows. You can navigate between pages using the buttons at the bottom of the table. ## Inspect data distributions At the top of the columns you can see the graphs representing the distribution of their data. This gives you a quick insight on how balanced your classes are, what are the range and distribution of numerical data and lengths of texts, and what portion of the column data is missing. ## Filter by value If you click on a bar of a histogram from a numerical column, the dataset viewer will filter the data and show only the rows with values that fall in the selected range. Similarly, if you select one class from a categorical column, it will show only the rows from the selected category. 
## Search a word in the dataset You can search for a word in the dataset by typing it in the search bar at the top of the table. The search is case-insensitive and will match any row containing the word. The text is searched in the columns of `string`, even if the values are nested in a dictionary or a list. ## Run SQL queries on the dataset You can run SQL queries on the dataset in the browser using the SQL Console. This feature also leverages our [auto-conversion to Parquet](data-studio#access-the-parquet-files). For more information see our guide on [SQL Console](./datasets-viewer-sql-console). ## Share a specific row You can share a specific row by clicking on it, and then copying the URL in the address bar of your browser. For example https://huggingface.co/datasets/nyu-mll/glue/viewer/mrpc/test?p=2&row=241 will open the dataset studio on the MRPC dataset, on the test split, and on the 241st row. ## Large scale datasets The Dataset Viewer supports large scale datasets, but depending on the data format it may only show the first 5GB of the dataset: - For Parquet datasets: the Dataset Viewer shows the full dataset, but sorting, filtering and search are only enabled on the first 5GB. - For datasets >5GB in other formats (e.g. [WebDataset](https://github.com/webdataset/webdataset) or JSON Lines): the Dataset Viewer only shows the first 5GB, and sorting, filtering and search are enabled on these first 5GB. In this case, an informational message lets you know that the Viewer is partial. This should be a large enough sample to represent the full dataset accurately, let us know if you need a bigger sample. ## Access the parquet files To power the dataset viewer, the first 5GB of every dataset are auto-converted to the Parquet format (unless it was already a Parquet dataset). In the dataset viewer (for example, see [GLUE](https://huggingface.co/datasets/nyu-mll/glue)), you can click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/nyu-mll/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Please, refer to the [dataset viewer docs](/docs/datasets-server/parquet_process) to learn how to query the dataset parquet files with libraries such as Polars, Pandas or DuckDB. > [!TIP] > Parquet is a columnar storage format optimized for querying and processing large datasets. Parquet is a popular choice for big data processing and analytics and is widely used for data processing and machine learning. You can learn more about the advantages associated with this format in the documentation. ### Conversion bot When you create a new dataset, the [`parquet-converter` bot](https://huggingface.co/parquet-converter) notifies you once it converts the dataset to Parquet. The [discussion](./repositories-pull-requests-discussions) it opens in the repository provides details about the Parquet format and links to the Parquet files. ### Programmatic access You can also access the list of Parquet files programmatically using the [Hub API](./api#get-apidatasetsrepoidparquet); for example, endpoint [`https://huggingface.co/api/datasets/nyu-mll/glue/parquet`](https://huggingface.co/api/datasets/nyu-mll/glue/parquet) lists the parquet files of the `nyu-mll/glue` dataset. We also have a specific documentation about the [Dataset Viewer API](https://huggingface.co/docs/dataset-viewer), which you can call directly. That API lets you access the contents, metadata and basic statistics of all Hugging Face Hub datasets, and powers the Dataset viewer frontend. 
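To make the programmatic route concrete, here is a minimal sketch that calls the Parquet listing endpoint mentioned above with `requests` (the response is expected to map each subset and split to its Parquet file URLs; treat the exact shape as an assumption and check the API docs):

```python
# pip install requests
import requests

# List the auto-converted Parquet files of the nyu-mll/glue dataset via the Hub API.
response = requests.get("https://huggingface.co/api/datasets/nyu-mll/glue/parquet")
response.raise_for_status()

parquet_files = response.json()  # assumed shape: {subset: {split: [parquet_url, ...]}}
for subset, splits in parquet_files.items():
    for split, urls in splits.items():
        print(f"{subset}/{split}: {len(urls)} parquet file(s)")
```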
## Dataset preview For the biggest datasets, the page shows a preview of the first 100 rows instead of a full-featured viewer. This restriction only applies to datasets over 5GB that are not natively in Parquet format or that have not been auto-converted to Parquet. ## Embed the Dataset Viewer in a webpage You can embed the Dataset Viewer in your own webpage using an iframe. The URL to use is `https://huggingface.co/datasets/<namespace>/<dataset-name>/embed/viewer`, where `<namespace>` is the owner of the dataset and `<dataset-name>` is the name of the dataset. You can also pass other parameters like the subset, split, filter, search or selected row. For more information see our guide on [How to embed the Dataset Viewer in a webpage](./datasets-viewer-embed). ## Configure the Dataset Viewer To have a properly working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure. There is also an option to configure your dataset using YAML. For **private** datasets, the Dataset Viewer is enabled for [PRO users](https://huggingface.co/pricing) and [Team or Enterprise organizations](https://huggingface.co/enterprise). For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure). ### Academia Hub https://huggingface.co/docs/hub/academia-hub.md # Academia Hub > [!TIP] > Ask your university's IT or Procurement Team to get in touch from a university-affiliated email address to initiate the subscription process. Academia Hub is a program designed to provide students, researchers and faculty members the tools, community and infrastructure to support their projects in artificial intelligence, at scale. With Academia Hub, your institution joins a dedicated program built for higher education and labs, offering the Hub’s advanced features, streamlined administration, and academic-friendly pricing. #### Key features of Academia Hub ***Accessible pricing & credits*** - $10 per seat/month under an annual university contract - Each seat includes $2/month in compute credits for Inference Providers (with the option to add more) ***Advanced compute & hosting*** - ZeroGPU: 5× usage quota and highest GPU queue priority - Spaces Hosting: create ZeroGPU Spaces with H200 hardware - Spaces Dev Mode: fast iterations via SSH/VS Code ***Storage & data management*** - Increased public storage capacity for datasets and models. - 1TB of private repository storage per seat in the subscription (i.e. with 400 seats, your institution would have 400TB of included private storage) - Dataset Viewer: enable visualization even on private datasets ***Administration & security*** - Centralized administration: seat assignment, revocation, and management at scale - Seamless onboarding with academic email domains for secure and quick sign-up ***Collaboration & publishing*** - Larger collaboration capacity with higher quotas, priority queues, and governance tools - Community blog & social posts: publish research outputs, updates, and stories directly to the Hugging Face community ***Community & resources*** - Connect with peers and mentors across institutions - Access datasets, models, and projects tailored for academia #### How to get started 1. Check your eligibility (see below) 2. Ask your university’s IT or Procurement Team to get in touch to initiate the subscription process to the Academia Hub today. Academia Hub cannot be initiated by students themselves. 3.
When Academia Hub is enabled, any affiliated user will need to add their `@your-university-name.edu` email address (or other university domain) to their HF account. #### Eligibility **Who is Academia Hub for?** - **Students:** Unlock powerful features to learn about AI and Machine learning in the most efficient way. - **Researchers:** Collaborate with peers using the standard AI ecosystem of tools. - **Faculty members:** Enhance your classes' projects with PRO capabilities. **Requirements** - Must possess a valid university or college email address. - Open to all students regardless of discipline or level of study. - Pricing: Academia Hub is priced based on volume of usage and number of active users at your institution. ### Collections https://huggingface.co/docs/hub/collections.md # Collections Use Collections to group repositories from the Hub (Models, Datasets, Spaces and Papers) on a dedicated page. ![Collection page](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-intro.webp) Collections have many use cases: - Highlight specific repositories on your personal or organizational profile. - Separate key repositories from others for your profile visitors. - Showcase and share a complete project with its paper(s), dataset(s), model(s) and Space(s). - Bookmark things you find on the Hub in categories. - Have a dedicated page of curated things to share with others. - Gate a group of models/datasets (Enterprise Hub) This is just a list of possible uses, but remember that collections are just a way of grouping things, so use them in the way that best fits your use case. ## Creating a new collection There are several ways to create a collection: - For personal collections: Use the **+ New** button on your logged-in homepage (1). - For organization collections: Use the **+ New** button available on organizations page (2). ![New collection](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-new.webp) It's also possible to create a collection on the fly when adding the first item from a repository page, select **+ Create new collection** from the dropdown menu. You'll need to enter a title and short description for your collection to be created. ## Adding items to a collection There are 2 ways to add items to a collection: - From any repository page: Use the context menu available on any repository page then select **Add to collection** to add it to a collection (1). - From the collection page: If you know the name of the repository you want to add, use the **+ add to collection** option in the right-hand menu (2). ![Add items to collections](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-add.webp) It's possible to add external repositories to your collections, not just your own. ## Collaborating on collections Organization collections are a great way to build collections together. Any member of the organization can add, edit and remove items from the collection. Use the **history feature** to keep track of who has edited the collection. 
![Collection history](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-history.webp) ## Collection options ### Collection visibility ![Collections on profiles](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-profile.webp) **Public** collections appear at the top of your profile or organization page and can be viewed by anyone. The first 3 items in each collection are visible directly in the collection preview (1). To see more, the user must click to go to the collection page. Set your collection to **private** if you don't want it to be accessible via its URL (it will not be displayed on your profile/organization page). For organizations, private collections are only available to members of the organization. ### Gating Group Collections (Enterprise Hub) You can use a collection to [gate](https://huggingface.co/docs/hub/en/models-gated) all the models/datasets belonging to it, allowing you to grant (or reject) access to all of them at once. This feature is reserved for [Enterprise Hub](https://huggingface.co/docs/hub/en/enterprise-hub) subscribers: more information about Gating Group Collections can be found in [our dedicated doc](https://huggingface.co/docs/hub/en/enterprise-hub-gating-group-collections). ### Ordering your collections and their items You can use the drag and drop handles in the collections list (on the left side of your collections page) to change the order of your collections (1). The first two collections will be directly visible on your profile/organization pages. You can also sort repositories within a collection by dragging the handles next to each item (2). ![Collections sort](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-sort.webp) ### Deleting items from a collection To delete an item from a collection, click the trash icon in the menu that shows up on the right when you hover over an item (1). To delete the whole collection, click delete on the right-hand menu (2) - you'll need to confirm this action. ![Collection delete](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-delete.webp) ### Adding notes to collection's items It's possible to add a note to any item in a collection to give it more context (for others, or as a reminder to yourself). You can add notes by clicking the pencil icon when you hover over an item with your mouse. Notes are plain text and don't support markdown, to keep things clean and simple. URLs in notes are converted into clickable links. ![Collection note](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-note.webp) ### Adding images to a collection item Similarly, you can attach images to a collection item. This is useful for showcasing the output of a model, the content of a dataset, attaching an infographic for context, etc. To start adding images to your collection, you can click on the image icon in the contextual menu of an item. The menu shows up when you hover over an item with your mouse. ![Collection image icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collections-image-button.webp) Then, add images by dragging and dropping images from your computer. You can also click on the gray zone to select image files from your computer's file system. 
![Collection image drop zone with images](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collections-image-gallery.webp) You can re-order images by drag-and-dropping them. Clicking on an image will open it in full-screen mode. ![Collection image viewer](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collections-image-viewer.webp) ## Your feedback on collections We're working on improving collections, so if you find any bugs, have questions, or would like to suggest new features, please post a message in the [dedicated discussion](https://huggingface.co/spaces/huggingface/HuggingDiscussions/discussions/12). ### Uploading datasets https://huggingface.co/docs/hub/datasets-adding.md # Uploading datasets The [Hub](https://huggingface.co/datasets) is home to an extensive collection of community-curated and research datasets. We encourage you to share your dataset on the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away! Start by [creating a Hugging Face Hub account](https://huggingface.co/join) if you don't have one yet. ## Upload using the Hub UI The Hub's web-based interface allows users without any developer experience to upload a dataset. ### Create a repository A repository hosts all your dataset files, including the revision history, making it possible to store more than one dataset version. 1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset). 2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization. ### Upload dataset 1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, image and other data extensions such as `.csv`, `.mp3`, and `.jpg` (see the full list of [File formats](#file-formats)). 2. Drag and drop your dataset files. 3. After uploading your dataset files, they are stored in your dataset repository. ### Create a Dataset card Adding a Dataset card is super valuable for helping users find your dataset and understand how to use it responsibly. 1. Click on **Create Dataset Card** to create a [Dataset card](./datasets-cards). This button creates a `README.md` file in your repository. 2. At the top, you'll see the **Metadata UI** with several fields to select from such as license, language, and task categories. These are the most important tags for helping users discover your dataset on the Hub (when applicable). When you select an option for a field, it will be automatically added to the top of the dataset card. You can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), which list the complete set of allowed tags, including optional ones like `annotations_creators`, to help you choose the tags that are useful for your dataset (see the example metadata block after this list). 3. Write your dataset documentation in the Dataset Card to introduce your dataset to the community and help users understand what is inside: what are the use cases and limitations, where the data comes from, what are important ethical considerations, and any other relevant details.
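For illustration, here is a hedged sketch of the kind of YAML metadata block the Metadata UI writes at the top of the dataset card's `README.md` (the tags and values below are examples only; pick the ones that apply to your dataset):

```yaml
---
license: mit
language:
  - en
task_categories:
  - text-classification
annotations_creators:
  - crowdsourced
---
```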
You can click on the **Import dataset card template** link at the top of the editor to automatically create a dataset card template. For a detailed example of what a good Dataset card should look like, take a look at the [CNN DailyMail Dataset card](https://huggingface.co/datasets/cnn_dailymail). ## Using the `huggingface_hub` client library The rich feature set of the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](/docs/huggingface_hub/index) to learn more. ## Using other libraries Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/), [DuckDB](https://duckdb.org/), or [Daft](https://daft.ai/) can upload files to the Hub. See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information. ## Using Git Since dataset repos are Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets. ## File formats The Hub natively supports multiple file formats: - Parquet (.parquet) - CSV (.csv, .tsv) - JSON Lines, JSON (.jsonl, .json) - Arrow streaming format (.arrow) - Text (.txt) - Images (.png, .jpg, etc.) - Audio (.wav, .mp3, etc.) - PDF (.pdf) - [WebDataset](https://github.com/webdataset/webdataset) (.tar) It supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz). Image and audio files can also have additional metadata files. See the [Data files Configuration](./datasets-data-files-configuration#image-and-audio-datasets) on image and audio datasets, as well as the collections of [example datasets](https://huggingface.co/datasets-examples) for CSV, TSV and images. You may want to convert your files to these formats to benefit from all the Hub features. Other formats and structures may not be recognized by the Hub. ### Which file format should I use? For most types of datasets, **Parquet** is the recommended format due to its efficient compression, rich typing, and because a variety of tools support this format with optimized read and batched operations. Alternatively, CSV or JSON Lines/JSON can be used for tabular data (prefer JSON Lines for nested data). Although easy to parse compared to Parquet, these formats are not recommended for data larger than several GBs. For image and audio datasets, uploading raw files is the most practical for most use cases since it's easy to access individual files. For streaming large scale image and audio datasets, [WebDataset](https://github.com/webdataset/webdataset) should be preferred over raw image and audio files to avoid the overhead of accessing individual files. However, for more general use cases involving analytics, data filtering or metadata parsing, Parquet is the recommended option for large scale image and audio datasets. ### Data Studio The [Data Studio](./data-studio) is useful for seeing what the data actually looks like before you download it. It is enabled by default for all public datasets. It is also available for private datasets owned by a [PRO user](https://huggingface.co/pricing) or a [Team or Enterprise organization](https://huggingface.co/enterprise).
After uploading your dataset, make sure the Dataset Viewer correctly shows your data, or [Configure the Dataset Viewer](./datasets-viewer-configure). ## Large scale datasets The Hugging Face Hub supports large scale datasets, usually uploaded in Parquet (e.g. via `push_to_hub()` using [🤗 Datasets](/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.push_to_hub)) or [WebDataset](https://github.com/webdataset/webdataset) format. You can upload large scale datasets at high speed using the `huggingface_hub` library. See [how to upload a folder by chunks](/docs/huggingface_hub/guides/upload#upload-a-folder-by-chunks), the [tips and tricks for large uploads](/docs/huggingface_hub/guides/upload#tips-and-tricks-for-large-uploads) and the [repository storage limits and recommendations](./storage-limits). ### Using mlx-image at Hugging Face https://huggingface.co/docs/hub/mlx-image.md # Using mlx-image at Hugging Face [`mlx-image`](https://github.com/riccardomusmeci/mlx-image) is an image models library developed by [Riccardo Musmeci](https://github.com/riccardomusmeci) built on Apple [MLX](https://github.com/ml-explore/mlx). It tries to replicate the great [timm](https://github.com/huggingface/pytorch-image-models), but for MLX models. ## Exploring mlx-image on the Hub You can find `mlx-image` models by filtering using the `mlx-image` library name, like in [this query](https://huggingface.co/models?library=mlx-image&sort=trending). There's also an open [mlx-vision](https://huggingface.co/mlx-vision) community for contributors converting and publishing weights for MLX format. ## Installation ```bash pip install mlx-image ``` ## Models Model weights are available on the [`mlx-vision`](https://huggingface.co/mlx-vision) community on HuggingFace. To load a model with pre-trained weights: ```python from mlxim.model import create_model # loading weights from HuggingFace (https://huggingface.co/mlx-vision/resnet18-mlxim) model = create_model("resnet18") # pretrained weights loaded from HF # loading weights from local file model = create_model("resnet18", weights="path/to/resnet18/model.safetensors") ``` To list all available models: ```python from mlxim.model import list_models list_models() ``` ## ImageNet-1K Results Go to [results-imagenet-1k.csv](https://github.com/riccardomusmeci/mlx-image/blob/main/results/results-imagenet-1k.csv) to check every model converted to `mlx-image` and its performance on ImageNet-1K with different settings. > **TL;DR** performance is comparable to the original models from PyTorch implementations. ## Similarity to PyTorch and other familiar tools `mlx-image` tries to be as close as possible to PyTorch: - `DataLoader` -> you can define your own `collate_fn` and also use `num_workers` to speed up data loading - `Dataset` -> `mlx-image` already supports `LabelFolderDataset` (the good and old PyTorch `ImageFolder`) and `FolderDataset` (a generic folder with images in it) - `ModelCheckpoint` -> keeps track of the best model and saves it to disk (similar to PyTorchLightning). It also suggests early stopping ## Training Training is similar to PyTorch. 
Here's an example of how to train a model:

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlxim.model import create_model
from mlxim.data import LabelFolderDataset, DataLoader

train_dataset = LabelFolderDataset(
    root_dir="path/to/train",
    class_map={0: "class_0", 1: "class_1", 2: ["class_2", "class_3"]}
)
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)
model = create_model("resnet18") # pretrained weights loaded from HF
optimizer = optim.Adam(learning_rate=1e-3)

def train_step(model, inputs, targets):
    logits = model(inputs)
    loss = mx.mean(nn.losses.cross_entropy(logits, targets))
    return loss

model.train()
# Wrap the loss function once so it returns both the loss and the gradients
train_step_fn = nn.value_and_grad(model, train_step)
for epoch in range(10):
    for batch in train_loader:
        x, target = batch
        loss, grads = train_step_fn(model, x, target)
        optimizer.update(model, grads)
        mx.eval(model.state, optimizer.state)
```

## Additional Resources

* [mlx-image repository](https://github.com/riccardomusmeci/mlx-image)
* [mlx-vision community](https://huggingface.co/mlx-vision)

## Contact

If you have any questions, please email `riccardomusmeci92@gmail.com`.

### User Studies

https://huggingface.co/docs/hub/model-cards-user-studies.md

# User Studies

## Model Card Audiences and Use Cases

During our investigation into the landscape of model documentation tools (data cards, etc.), we noted how different stakeholders make use of existing infrastructure to create a kind of model card with information focused on their needed domain. One such example is 'business analysts', or those whose focus is on B2B as well as an internal-only audience. The static and more manual approach for this audience is using Confluence pages (*if PMs write the page, we are detaching the model creators from its theoretical consumption; if ML engineers write the page, they may tend to stress only a certain type of information.* [^1]), or a proposed combination of HTML (Jinja) templates, Metaflow classes and external API keys, in order to create model cards that include a perspective on the model information that is needed for their domain/use case.

We conducted a user study with the aim of validating a literature-informed model card structure and understanding which sections/areas rank as most important from the different stakeholders' perspectives.

The study aimed to validate the following components:

* **Model Card Layout**

  Our examination of the state of the art of model cards surveyed the recurring sections from the top ~100 downloaded models on the Hub that had model cards. From this analysis we catalogued the top recurring model card sections and recurring information; this, coupled with the structure of the Bloom model card, led us to the initial version of a standard model card structure. As we began to structure our user studies, two variations of model cards, both making use of the [initial model card structure](./model-card-annotated), were used as interactive demonstrations. The aim of these demos was to understand not only the different user perspectives on the visual elements of the model cards but also on the content presented to users. The desired outcome would enable us to further understand what makes a model card easier to read, while still providing some level of interactivity and presenting the information in an easily understandable, approachable manner.
* **Stakeholder Perspectives**

  As different people of varying technical backgrounds could be collaborating on a model, and subsequently on the model card, we sought to validate the need for different stakeholders' perspectives. Participants ranked the different sections of a model card first from the perspective of someone reading a model card and then as an author of a model card, based on which sections one would read first and how easy the different sections are to write. An ordering scheme (1 being the highest weight and 10 being the lowest) was applied to the sections that a user would usually read first in a model card and to the sections that a model card author would find easiest to write.

## Summary of Responses to the User Studies Survey

Our user studies provided further clarity on the sections that different user profiles/stakeholders would find more challenging or easier to write. The results illustrated below show that while the Bias, Risks and Limitations section ranks second for both model card writers and model card readers for *In what order do you write the model card* and *What section do you look at first*, respectively, it is also noted as the most challenging/longest section to write. This endorsed the need to further evaluate the Bias, Risks and Limitations section in order to assist with writing this critical section.

These templates were then used to generate model cards for the top 200 most downloaded Hugging Face (HF) models.

* We first began by pulling all Hugging Face models on the Hub and, in particular, subsections on Limitations and Bias ("Risks" subsections were largely not present).
* Based on the inputs that were most frequently used in models with a higher number of downloads, grouped by model type, the tool provides prompted text within the Bias, Risks and Limitations sections. We also prompt default text if the model type is not specified.

Using this information, we returned to our analysis of all model cards on the Hub, coupled with suggestions from other researchers and peers at HF and additional research on the type of prompted information we could provide to users while they are creating model cards. This default prompted text allowed us to satisfy the following aims:

1) For those who have not created model cards before, or who do not usually make a model card or any other type of model documentation for their models, the prompted text enables these users to easily create a model card. This in turn increased the number of model cards created.
2) For users who already write model cards, the prompted text invites them to add more to their model card, further developing the content/standard of model cards.

## User Study Details

We selected people from a variety of different backgrounds relevant to machine learning and model documentation. Below, we detail their demographics, the questions they were asked, and the corresponding insights from their responses. Full details on responses are available in [Appendix A](./model-card-appendix#appendix-a-user-study).
### Respondent Demographics * Tech & Regulatory Affairs Counsel * ML Engineer (x2) * Developer Advocate * Executive Assistant * Monetization Lead * Policy Manager/AI Researcher * Research Intern **What are the key pieces of information you want or need to know about a model when interacting with a machine learning model?** **Insight:** * Respondents prioritised information about the model task/domain (x3), training data/training procedure (x2), how to use the model (with code) (x2), bias and limitations, and the model licence ### Feedback on Specific Model Card Formats #### Format 1: **Current [distilbert/distilgpt2 model card](https://huggingface.co/distilbert/distilgpt2) on the Hub** **Insights:** * Respondents found this model card format to be concise, complete, and readable. * There was no consensus about the collapsible sections (some liked them and wanted more, some disliked them). * Some respondents said “Risks and Limitations” should go with “Out of Scope Uses” #### Format 2: **Nazneen Rajani's [Interactive Model Card space](https://huggingface.co/spaces/nazneen/interactive-model-cards)** **Insights:** * While a few respondents really liked this format, most found it overwhelming or as an overload of information. Several suggested this could be a nice tool to layer onto a base model card for more advanced audiences. #### Format 3: **Ezi Ozoani's [Semi-Interactive Model Card Space](https://huggingface.co/spaces/Ezi/ModelCardsAnalysis)** **Insights:** * Several respondents found this format overwhelming, but they generally found it less overwhelming than format 2. * Several respondents disagreed with the current layout and gave specific feedback about which sections should be prioritised within each column. ### Section Rankings *Ordered based on average ranking. Arrows are shown relative to the order of the associated section in the question on the survey.* **Insights:** * When writing model cards, respondents generally said they would write a model card in the same order in which the sections were listed in the survey question. * When ranking the sections of the model card by ease/quickness of writing, consensus was that the sections on uses and limitations and risks were the most difficult. * When reading model cards, respondents said they looked at the cards’ sections in an order that was close to – but not perfectly aligned with – the order in which the sections were listed in the survey question. ![user studies results 1](https://huggingface.co/datasets/huggingface/documentation-images/blob/main/hub/usaer-studes-responses(1).png) ![user studies results 2](https://huggingface.co/datasets/huggingface/documentation-images/blob/main/hub/user-studies-responses(2).png) > [!TIP] > [Checkout the Appendix](./model-card-appendix) Acknowledgements ================ We want to acknowledge and thank [Bibi Ofuya](https://www.figma.com/proto/qrPCjWfFz5HEpWqQ0PJSWW/Bibi's-Portfolio?page-id=0%3A1&node-id=1%3A28&viewport=243%2C48%2C0.2&scaling=min-zoom&starting-point-node-id=1%3A28) for her question creation and her guidance on user-focused ordering and presentation during the user studies. [^1]: See https://towardsdatascience.com/dag-card-is-the-new-model-card-70754847a111 --- **Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. 
https://huggingface.co/docs/hub/en/model-card-guidebook ### Docker Spaces https://huggingface.co/docs/hub/spaces-sdks-docker.md # Docker Spaces Spaces accommodate custom [Docker containers](https://docs.docker.com/get-started/) for apps outside the scope of Streamlit and Gradio. Docker Spaces allow users to go beyond the limits of what was previously possible with the standard SDKs. From FastAPI and Go endpoints to Phoenix apps and ML Ops tools, Docker Spaces can help in many different setups. ## Setting up Docker Spaces Selecting **Docker** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Space by setting the `sdk` property to `docker` in your `README.md` file's YAML block. Alternatively, given an existing Space repository, set `sdk: docker` inside the `YAML` block at the top of your Spaces **README.md** file. You can also change the default exposed port `7860` by setting `app_port: 7860`. Afterwards, you can create a usual `Dockerfile`. ```Yaml --- title: Basic Docker SDK Space emoji: 🐳 colorFrom: purple colorTo: gray sdk: docker app_port: 7860 --- ``` Internally you could have as many open ports as you want. For instance, you can install Elasticsearch inside your Space and call it internally on its default port 9200. If you want to expose apps served on multiple ports to the outside world, a workaround is to use a reverse proxy like Nginx to dispatch requests from the broader internet (on a single port) to different internal ports. ## Secrets and Variables Management You can manage a Space's environment variables in the Space Settings. Read more [here](./spaces-overview#managing-secrets). ### Variables #### Buildtime Variables are passed as `build-arg`s when building your Docker Space. Read [Docker's dedicated documentation](https://docs.docker.com/engine/reference/builder/#arg) for a complete guide on how to use this in the Dockerfile. ```Dockerfile # Declare your environment variables with the ARG directive ARG MODEL_REPO_NAME FROM python:latest # [...] # You can use them like environment variables RUN predict.py $MODEL_REPO_NAME ``` #### Runtime Variables are injected in the container's environment at runtime. ### Secrets #### Buildtime In Docker Spaces, the secrets management is different for security reasons. Once you create a secret in the [Settings tab](./spaces-overview#managing-secrets), you can expose the secret by adding the following line in your Dockerfile: For example, if `SECRET_EXAMPLE` is the name of the secret you created in the Settings tab, you can read it at build time by mounting it to a file, then reading it with `$(cat /run/secrets/SECRET_EXAMPLE)`. See an example below: ```Dockerfile # Expose the secret SECRET_EXAMPLE at buildtime and use its value as git remote URL RUN --mount=type=secret,id=SECRET_EXAMPLE,mode=0444,required=true \ git init && \ git remote add origin $(cat /run/secrets/SECRET_EXAMPLE) ``` ```Dockerfile # Expose the secret SECRET_EXAMPLE at buildtime and use its value as a Bearer token for a curl request RUN --mount=type=secret,id=SECRET_EXAMPLE,mode=0444,required=true \ curl test -H 'Authorization: Bearer $(cat /run/secrets/SECRET_EXAMPLE)' ``` #### Runtime Same as for public Variables, at runtime, you can access the secrets as environment variables. For example, in Python you would use `os.environ.get("SECRET_EXAMPLE")`. Check out this [example](https://huggingface.co/spaces/DockerTemplates/secret-example) of a Docker Space that uses secrets. ## Permissions The container runs with user ID 1000. 
To avoid permission issues, you should create a user and set its `WORKDIR` before any `COPY` or download.

```Dockerfile
# Set up a new user named "user" with user ID 1000
RUN useradd -m -u 1000 user

# Switch to the "user" user
USER user

# Set home to the user's home directory
ENV HOME=/home/user \
	PATH=/home/user/.local/bin:$PATH

# Set the working directory to the user's home directory
WORKDIR $HOME/app

# Try and run pip command after setting the user with `USER user` to avoid permission issues with Python
RUN pip install --no-cache-dir --upgrade pip

# Copy the current directory contents into the container at $HOME/app setting the owner to the user
COPY --chown=user . $HOME/app

# Download a checkpoint
RUN mkdir content
ADD --chown=user https:// content/
```

Always specify `--chown=user` with `ADD` and `COPY` to ensure the new files are owned by your user.

If you still face permission issues, you might need to use `chmod` or `chown` in your `Dockerfile` to grant the right permissions. For example, if you want to use the directory `/data`, you can do:

```Dockerfile
RUN mkdir -p /data
RUN chmod 777 /data
```

You should always avoid superfluous chowns.

> [!WARNING]
> Updating metadata for a file creates a new copy stored in the new layer. Therefore, a recursive chown can result in a very large image due to the duplication of all affected files.

Rather than fixing permissions by running `chown`:

```
COPY checkpoint .
RUN chown -R user checkpoint
```

you should always do:

```
COPY --chown=user checkpoint .
```

(the same goes for the `ADD` command)

## Data Persistence

The data written on disk is lost whenever your Docker Space restarts, unless you opt in for a [persistent storage](./spaces-storage) upgrade.

If you opt in for a persistent storage upgrade, you can use the `/data` directory to store data. This directory is mounted on a persistent volume, which means that the data written in this directory will be persisted across restarts.

At the moment, the `/data` volume is only available at runtime, i.e. you cannot use `/data` during the build step of your Dockerfile.

You can also use our Datasets Hub for specific cases, where you can store state and data in a git LFS repository. You can find an example of persistence [here](https://huggingface.co/spaces/Wauplin/space_to_dataset_saver), which uses the [`huggingface_hub` library](https://huggingface.co/docs/huggingface_hub/index) for programmatically uploading files to a dataset repository. This Space example, along with [this guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#scheduled-uploads), will help you decide which solution fits your data type best.

Finally, in some cases, you might want to use an external storage solution from your Space's code, such as an externally hosted DB, S3, etc.

### Docker container with GPU

You can run Docker containers with GPU support by using one of our GPU-flavored [Spaces Hardware](./spaces-gpus).

We recommend using the [`nvidia/cuda`](https://hub.docker.com/r/nvidia/cuda) image from Docker Hub as a base image, which comes with CUDA and cuDNN pre-installed.

During Docker buildtime, you don't have access to GPU hardware. Therefore, you should not try to run any GPU-related command during the build step of your Dockerfile. For example, you can't run `nvidia-smi` or `torch.cuda.is_available()` while building an image. Read more [here](https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#description).
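To tie these recommendations together, here is a rough sketch of a GPU-flavored Docker Space; the CUDA base image tag, `requirements.txt` and `app.py` are illustrative placeholders, not prescriptions:

```Dockerfile
# Illustrative sketch: a CUDA base image combined with the non-root user setup described above
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# The CUDA runtime image does not ship with Python, so install it first
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Non-root user with uid 1000, as required by Spaces
RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH
WORKDIR $HOME/app

COPY --chown=user requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY --chown=user . $HOME/app

# The app should listen on the Space's exposed port (7860 by default, configurable via app_port)
CMD ["python3", "app.py"]
```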
## Read More - [Full Docker demo example](spaces-sdks-docker-first-demo) - [List of Docker Spaces examples](spaces-sdks-docker-examples) - [Spaces Examples](https://huggingface.co/SpacesExamples) ### Advanced Topics https://huggingface.co/docs/hub/spaces-advanced.md # Advanced Topics ## Contents - [Using OpenCV in Spaces](./spaces-using-opencv) - [More ways to create Spaces](./spaces-more-ways-to-create) - [Managing Spaces with Github Actions](./spaces-github-actions) - [Managing Spaces with CircleCI Workflows](./spaces-circleci) - [Custom Python Spaces](./spaces-sdks-python) - [How to Add a Space to ArXiv](./spaces-add-to-arxiv) - [Cookie limitations in Spaces](./spaces-cookie-limitations) - [How to handle URL parameters in Spaces](./spaces-handle-url-parameters) - [How to get user status and plan in Spaces](./spaces-get-user-plan) ### Transforming your dataset https://huggingface.co/docs/hub/datasets-polars-operations.md # Transforming your dataset On this page we'll guide you through some of the most common operations used when doing data analysis. This is only a small subset of what's possible in Polars. For more information, please visit the [Documentation](https://docs.pola.rs/). For the example we will use the [Common Crawl statistics](https://huggingface.co/datasets/commoncrawl/statistics) dataset. These statistics include: number of pages, distribution of top-level domains, crawl overlaps, etc. For more detailed information and graphs please visit their [official statistics page](https://commoncrawl.github.io/cc-crawl-statistics/plots/tlds). ## Reading ```python import polars as pl df = pl.read_csv( "hf://datasets/commoncrawl/statistics/tlds.csv", try_parse_dates=True, ) df.head(3) ``` ```bash ┌─────┬────────┬───────────────────┬────────────┬───┬───────┬──────┬───────┬─────────┐ │ ┆ suffix ┆ crawl ┆ date ┆ … ┆ pages ┆ urls ┆ hosts ┆ domains │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str ┆ date ┆ ┆ i64 ┆ i64 ┆ f64 ┆ f64 │ ╞═════╪════════╪═══════════════════╪════════════╪═══╪═══════╪══════╪═══════╪═════════╡ │ 0 ┆ a.se ┆ CC-MAIN-2008-2009 ┆ 2009-01-12 ┆ … ┆ 18 ┆ 18 ┆ 1.0 ┆ 1.0 │ │ 1 ┆ a.se ┆ CC-MAIN-2009-2010 ┆ 2010-09-25 ┆ … ┆ 3462 ┆ 3259 ┆ 166.0 ┆ 151.0 │ │ 2 ┆ a.se ┆ CC-MAIN-2012 ┆ 2012-11-02 ┆ … ┆ 6957 ┆ 6794 ┆ 172.0 ┆ 150.0 │ └─────┴────────┴───────────────────┴────────────┴───┴───────┴──────┴───────┴─────────┘ ``` ## Selecting columns The dataset contains some columns we don't need. To remove them, we will use the `select` method: ```python df = df.select("suffix", "date", "tld", "pages", "domains") df.head(3) ``` ```bash ┌────────┬───────────────────┬────────────┬─────┬───────┬─────────┐ │ suffix ┆ crawl ┆ date ┆ tld ┆ pages ┆ domains │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ date ┆ str ┆ i64 ┆ f64 │ ╞════════╪═══════════════════╪════════════╪═════╪═══════╪═════════╡ │ a.se ┆ CC-MAIN-2008-2009 ┆ 2009-01-12 ┆ se ┆ 18 ┆ 1.0 │ │ a.se ┆ CC-MAIN-2009-2010 ┆ 2010-09-25 ┆ se ┆ 3462 ┆ 151.0 │ │ a.se ┆ CC-MAIN-2012 ┆ 2012-11-02 ┆ se ┆ 6957 ┆ 150.0 │ └────────┴───────────────────┴────────────┴─────┴───────┴─────────┘ ``` ## Filtering We can filter the dataset using the `filter` method. 
This method accepts complex expressions, but let's start simple by filtering based on the crawl date: ```python import datetime df = df.filter(pl.col("date") >= datetime.date(2020, 1, 1)) ``` You can combine multiple predicates with `&` or `|` operators: ```python df = df.filter( (pl.col("date") >= datetime.date(2020, 1, 1)) | pl.col("crawl").str.contains("CC") ) ``` ## Transforming In order to add new columns to the dataset, use `with_columns`. In the example below we calculate the total number of pages per domain and add a new column `pages_per_domain` using the `alias` method. The entire statement within `with_columns` is called an expression. Read more about expressions and how to use them in the [Polars user guide](https://docs.pola.rs/user-guide/expressions/) ```python df = df.with_columns( (pl.col("pages") / pl.col("domains")).alias("pages_per_domain") ) df.sample(3) ``` ```bash ┌────────┬─────────────────┬────────────┬─────┬───────┬─────────┬──────────────────┐ │ suffix ┆ crawl ┆ date ┆ tld ┆ pages ┆ domains ┆ pages_per_domain │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ date ┆ str ┆ i64 ┆ f64 ┆ f64 │ ╞════════╪═════════════════╪════════════╪═════╪═══════╪═════════╪══════════════════╡ │ net.bt ┆ CC-MAIN-2014-41 ┆ 2014-10-06 ┆ bt ┆ 4 ┆ 1.0 ┆ 4.0 │ │ org.mk ┆ CC-MAIN-2016-44 ┆ 2016-10-31 ┆ mk ┆ 1445 ┆ 430.0 ┆ 3.360465 │ │ com.lc ┆ CC-MAIN-2016-44 ┆ 2016-10-31 ┆ lc ┆ 1 ┆ 1.0 ┆ 1.0 │ └────────┴─────────────────┴────────────┴─────┴───────┴─────────┴──────────────────┘ ``` ## Aggregation & Sorting In order to aggregate data together you can use the `group_by`, `agg` and `sort` methods. Within the aggregation context you can combine expressions to create powerful statements which are still easy to read. First, we aggregate all the data to the top-level domain `tld` per scraped date: ```python df = df.group_by("tld", "date").agg( pl.col("pages").sum(), pl.col("domains").sum(), ) ``` Now we can calculate several statistics per top level domain: - Number of unique scrape dates - Average number of domains in the scraped period - Average growth rate in terms of number of pages ```python df = df.group_by("tld").agg( pl.col("date").unique().count().alias("number_of_scrapes"), pl.col("domains").mean().alias("avg_number_of_domains"), pl.col("pages").sort_by("date").pct_change().mean().alias("avg_page_growth_rate"), ) df = df.sort("avg_number_of_domains", descending=True) df.head(10) ``` ```bash ┌─────┬───────────────────┬───────────────────────┬─────────────────────────────────┐ │ tld ┆ number_of_scrapes ┆ avg_number_of_domains ┆ avg_percent_change_in_number_o… │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ u32 ┆ f64 ┆ f64 │ ╞═════╪═══════════════════╪═══════════════════════╪═════════════════════════════════╡ │ com ┆ 101 ┆ 1.9571e7 ┆ 0.022182 │ │ de ┆ 101 ┆ 1.8633e6 ┆ 0.5232 │ │ org ┆ 101 ┆ 1.5049e6 ┆ 0.019604 │ │ net ┆ 101 ┆ 1.5020e6 ┆ 0.021002 │ │ cn ┆ 101 ┆ 1.1101e6 ┆ 0.281726 │ │ ru ┆ 101 ┆ 1.0561e6 ┆ 0.416303 │ │ uk ┆ 101 ┆ 827453.732673 ┆ 0.065299 │ │ nl ┆ 101 ┆ 710492.623762 ┆ 1.040096 │ │ fr ┆ 101 ┆ 615471.594059 ┆ 0.419181 │ │ jp ┆ 101 ┆ 615391.455446 ┆ 0.246162 │ └─────┴───────────────────┴───────────────────────┴─────────────────────────────────┘ ``` ### Using spaCy at Hugging Face https://huggingface.co/docs/hub/spacy.md # Using spaCy at Hugging Face `spaCy` is a popular library for advanced Natural Language Processing used widely across industry. 
`spaCy` makes it easy to use and train pipelines for tasks like named entity recognition, text classification, part-of-speech tagging and more, and lets you build powerful applications to process and analyze large volumes of text.

## Exploring spaCy models in the Hub

The official models from `spaCy` 3.3 are in the `spaCy` [Organization Page](https://huggingface.co/spacy). Anyone in the community can also share their `spaCy` models, which you can find by filtering at the left of the [models page](https://huggingface.co/models?library=spacy).

All models on the Hub come with useful features:

1. An automatically generated model card with label scheme, metrics, components, and more.
2. An evaluation section at the top right where you can look at the metrics.
3. Metadata tags that help with discoverability and contain information such as license and language.
4. An interactive widget you can use to play with the model directly in the browser.
5. An Inference API that allows you to make inference requests.

## Using existing models

All `spaCy` models from the Hub can be directly installed using pip install.

```bash
pip install "en_core_web_sm @ https://huggingface.co/spacy/en_core_web_sm/resolve/main/en_core_web_sm-any-py3-none-any.whl"
```

To find the link of interest, you can go to a repository with a `spaCy` model. When you open the repository, you can click `Use in spaCy` and you will be given a working snippet that you can use to install and load the model!

Once installed, you can load the model like any other spaCy pipeline.

```python
# Using spacy.load().
import spacy
nlp = spacy.load("en_core_web_sm")

# Importing as module.
import en_core_web_sm
nlp = en_core_web_sm.load()
```

## Sharing your models

### Using the spaCy CLI (recommended)

The `spacy-huggingface-hub` library extends spaCy's native CLI so people can easily push their packaged models to the Hub.

You can install spacy-huggingface-hub from pip:

```bash
pip install spacy-huggingface-hub
```

You can then check if the command has been registered successfully:

```bash
python -m spacy huggingface-hub --help
```

To push with the CLI, you can use the `huggingface-hub push` command as seen below.

```bash
python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo] [--verbose]
```

| Argument             | Type         | Description                                                                                                                    |
| -------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------ |
| `whl_path`           | str / `Path` | The path to the `.whl` file packaged with [`spacy package`](https://spacy.io/api/cli#package).                                 |
| `--org`, `-o`        | str          | Optional name of organization to which the pipeline should be uploaded.                                                        |
| `--msg`, `-m`        | str          | Commit message to use for update. Defaults to `"Update spaCy pipeline"`.                                                       |
| `--local-repo`, `-l` | str / `Path` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory.  |
| `--verbose`, `-V`    | bool         | Output additional info for debugging, e.g. the full generated hub metadata.                                                    |

You can then upload any pipeline packaged with [`spacy package`](https://spacy.io/api/cli#package). Make sure to set `--build wheel` to output a binary `.whl` file. The uploader will read all metadata from the pipeline package, including the auto-generated pretty `README.md` and the model details available in the `meta.json`.
```bash
hf auth login
python -m spacy package ./en_ner_fashion ./output --build wheel
cd ./output/en_ner_fashion-0.0.0/dist
python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
```

In just a minute, you can get your packaged model on the Hub, try it out directly in the browser, and share it with the rest of the community. All the required metadata will be uploaded for you and you even get a cool model card.

The command will output two things:

* Where to find your repo on the Hub! For example, https://huggingface.co/spacy/en_core_web_sm
* And how to install the pipeline directly from the Hub!

### From a Python script

You can use the `push` function from Python. It returns a dictionary containing the `"url"` and `"whl_url"` of the published model and the wheel file, which you can later install with `pip install`.

```py
from spacy_huggingface_hub import push

result = push("./en_ner_fashion-0.0.0-py3-none-any.whl")
print(result["url"])
```

## Additional resources

* spacy-huggingface-hub [library](https://github.com/explosion/spacy-huggingface-hub).
* Launch [blog post](https://huggingface.co/blog/spacy)
* spaCy v3.1 [Announcement](https://explosion.ai/blog/spacy-v3-1#huggingface-hub)
* spaCy [documentation](https://spacy.io/universe/project/spacy-huggingface-hub/)

### Spaces Dev Mode: Seamless development in Spaces

https://huggingface.co/docs/hub/spaces-dev-mode.md

# Spaces Dev Mode: Seamless development in Spaces

> [!WARNING]
> This feature is still in Beta stage.

> [!WARNING]
> The Spaces Dev Mode is part of PRO or Team & Enterprise plans.

## Spaces Dev Mode

Spaces Dev Mode is a feature that eases the debugging of your application and makes iterating on Spaces faster.

Whenever you commit some changes to your Space repo, the underlying Docker image gets rebuilt, and then a new virtual machine is provisioned to host the new container.

The Dev Mode allows you to update your Space much quicker by overriding the Docker image.

The Dev Mode Docker image starts your application as a sub-process, allowing you to restart it without stopping the Space container itself. It also starts a VS Code server and an SSH server in the background for you to connect to the Space.

The ability to connect to the running Space unlocks several use cases:

- You can make changes to the app code without the Space rebuilding every time
- You can debug a running application and monitor resources live

Overall it makes developing and experimenting with Spaces much faster by skipping the Docker image rebuild phase.

## Interface

Once the Dev Mode is enabled on your Space, you should see a modal like the following.

The application does not restart automatically when you change the code. For your changes to appear in the Space, you need to use the `Refresh` button that will restart the app.

If you're using the Gradio SDK, or if your application is Python-based, note that requirements are not installed automatically. You will need to manually run `pip install` from VS Code or SSH.

### SSH connection and VS Code

The Dev Mode allows you to connect to your Space's Docker container using SSH.

Instructions to connect are listed in the Dev Mode controls modal.

You will need to add your machine's SSH public key to [your user account](https://huggingface.co/settings/keys) to be able to connect to the Space using SSH. Check out the [Git over SSH](./security-git-ssh#add-a-ssh-key-to-your-account) documentation for more detailed instructions.
You can also use a local install of VS Code to connect to the Space container. To do so, you will need to install the [SSH Remote](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) extension. ### Persisting changes The changes you make when Dev Mode is enabled are not persisted to the Space repo automatically. By default, they will be discarded when Dev Mode is disabled or when the Space goes to sleep. If you wish to persist changes made while Dev Mode is enabled, you need to use `git` from inside the Space container (using VS Code or SSH). For example: ```shell # Add changes and commit them git add . git commit -m "Persist changes from Dev Mode" # Push the commit to persist them in the repo git push ``` The modal will display a warning if you have uncommitted or unpushed changes in the Space: ## Enabling Dev Mode You can enable the Dev Mode on your Space from the web interface. You can also create a Space with the dev mode enabled: ## Limitations Dev Mode is currently not available for static Spaces. Docker Spaces also have some additional requirements. ### Docker Spaces Dev Mode is supported for Docker Spaces. However, your Space needs to comply with the following rules for Dev Mode to work properly. 1. The following packages must be installed: - `bash` (required to establish SSH connections) - `curl`, `wget` and `procps` (required by the VS Code server process) - `git` and `git-lfs` to be able to commit and push changes from your Dev Mode environment 2. Your application code must be located in the `/app` folder for the Dev Mode daemon to be able to detect changes. 3. The `/app` folder must be owned by the user with uid `1000` to allow you to make changes to the code. 4. The Dockerfile must contain a `CMD` instruction for startup. Checkout [Docker's documentation](https://docs.docker.com/reference/dockerfile/#cmd) about the `CMD` instruction for more details. Dev Mode works well when the base image is debian-based (eg, ubuntu). More exotic linux distros (eg, alpine) are not tested and Dev Mode is not guaranteed to work on them. ### Example of compatible Dockerfiles This is an example of a Dockerfile compatible with Spaces Dev Mode. It installs the required packages with `apt-get`, along with a couple more for developer convenience (namely: `top`, `vim` and `nano`). It then starts a NodeJS application from `/app`. ```Dockerfile FROM node:19-slim RUN apt-get update && \ apt-get install -y \ bash \ git git-lfs \ wget curl procps \ htop vim nano && \ rm -rf /var/lib/apt/lists/* WORKDIR /app COPY --link ./ /app RUN npm i RUN chown 1000 /app USER 1000 CMD ["node", "index.js"] ``` There are several examples of Dev Mode compatible Docker Spaces in this organization. Feel free to duplicate them in your namespace! Example Python app (FastAPI HTTP server): https://huggingface.co/spaces/dev-mode-explorers/dev-mode-python Example Javascript app (Express.js HTTP server): https://huggingface.co/spaces/dev-mode-explorers/dev-mode-javascript ## Feedback You can share your feedback on Spaces Dev Mode directly on the HF Hub: https://huggingface.co/spaces/dev-mode-explorers/README/discussions ### Model(s) Release Checklist https://huggingface.co/docs/hub/model-release-checklist.md # Model(s) Release Checklist The [Hugging Face Hub](https://huggingface.co/models) is the go-to platform for sharing machine learning models. A well-executed release can boost your model's visibility and impact. 
This section covers **essential** steps for a concise, informative, and user-friendly model release. ## ⏳ Preparing Your Model for Release ### Upload Model Weights When uploading models to the Hub, follow these best practices: - **Use separate repositories for different model weights**: Create individual repositories for each variant of the same architecture. This lets you group them into a [collection](https://huggingface.co/docs/hub/en/collections), which are easier to navigate than directory listings. It also improves visibility because each model has its own URL (`hf.co/org/model-name`), makes search easier, and provides download counts for each one of your models. A great example is the recent [Qwen3-VL collection](https://huggingface.co/collections/Qwen/qwen3-vl) which features various variants of the VL architecture. - **Prefer [`safetensors`](https://huggingface.co/docs/safetensors/en/index) over `pickle` for weight serialization.**: `safetensors` is safer and faster than Python’s `pickle` or `pth`. If you have a `.bin` pickle file, use the [weight conversion tool](https://huggingface.co/docs/safetensors/en/convert-weights) to convert it. ### Write a Comprehensive Model Card A well-crafted model card (the `README.md` in your repository) is essential for discoverability, reproducibility, and effective sharing. Make sure to cover: 1. **Metadata Configuration**: The [metadata section](https://huggingface.co/docs/hub/model-cards#model-card-metadata) (YAML) at the top of your model card is key for search and categorization. Include: ```yaml --- pipeline_tag: text-generation # Specify the task library_name: transformers # Specify the library language: - en # List languages your model supports license: apache-2.0 # Specify a license datasets: - username/dataset # List datasets used for training base_model: username/base-model # If applicable (your model is a fine-tune, quantized, merged version of another model) tags: # Add extra tags which would make the repo searchable using the tag - tag1 - tag2 --- ``` If you create the `README.md` in the Web UI, you’ll see a form with the most important metadata fields we recommend 🤗. | ![metadata template on the hub ui](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/metadata-template.png) | | :--: | | Metadata Form on the Hub UI | 2. **Detailed Model Description**: Provide a clear explanation of what your model does, its architecture, and its intended use cases. Help users quickly decide if it fits their needs. 3. **Usage Examples**: Provide clear, copy-and-run code snippets for inference, fine-tuning, or other common tasks. Keep edits needed by users to a minimum. *Bonus*: Add a well-structured `notebook.ipynb` in the repo showing inference or fine-tuning, so users can open it in [Google Colab and Kaggle Notebooks](https://huggingface.co/docs/hub/en/notebooks) directly. | ![colab and kaggle button](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/colab-kaggle.png) | | :--: | | Google and Kaggle Usage Buttons | 4. **Technical Specifications**: Include training parameters, hardware needs, and other details that help users run the model effectively. 5. **Performance Metrics**: Share benchmarks and evaluation results. Include quantitative metrics and qualitative examples to show strengths and limitations. 6. **Limitations and Biases**: Document known limitations, biases, and ethical considerations so users can make informed choices. 
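If you maintain many repositories, you can also draft the card and its metadata programmatically. A minimal sketch with the `huggingface_hub` library (the repository name and field values below are placeholders) might look like this:

```python
from huggingface_hub import ModelCard, ModelCardData

# Structured metadata that ends up in the YAML block at the top of README.md
card_data = ModelCardData(
    language="en",
    license="apache-2.0",
    library_name="transformers",
    pipeline_tag="text-generation",
)

# Assemble the card: YAML front matter followed by free-form Markdown sections
content = f"""---
{card_data.to_yaml()}
---

# My Model

Describe what the model does, how it was trained, and its limitations here.
"""

ModelCard(content).push_to_hub("username/model-name")
```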
To make the process more seamless, click **Import model card template** to pre-fill the `README.md`s with placeholders. | ![model card template button on the hub ui](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/model-card-template-button.png) | ![model card template on the hub](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/model-card-template.png) | |:--: | :--: | | The button to import the model card template | A section of the imported template | ### Enhance Model Discoverability and Usability To maximize reach and usability: 1. **Library Integration**: Add support for one of the many [libraries integrated with the Hugging Face Hub](https://huggingface.co/docs/hub/models-libraries) (such as `transformers`, `diffusers`, `sentence-transformers`, `timm`). This integration significantly increases your model's accessibility and provides users with code snippets for working with your model. For example, to specify that your model works with the `transformers` library: ```yaml --- library_name: transformers --- ``` | ![code snippet tab](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/code-snippet.png) | | :--: | | Code snippet tab | You can also [register your own model library](https://huggingface.co/docs/hub/en/models-adding-libraries) or add Hub support to your library and codebase, so the users know how to download model weights from the Hub. We wrote an extensive guide on uploading best practices [here](https://huggingface.co/docs/hub/models-uploading). > [!NOTE] > Using a registered library also allows you to track downloads of your model over time. 2. **Correct Metadata**: - **Pipeline Tag:** Choose the correct [pipeline tag](https://huggingface.co/docs/hub/model-cards#specifying-a-task--pipelinetag-) so your model shows up in the right searches and widgets. Examples of common pipeline tags: - `text-generation` - For language models that generate text - `text-to-image` - For text-to-image generation models - `image-text-to-text` - For vision-language models (VLMs) that generate text - `text-to-speech` - For models that generate audio from text - **License:** License information is crucial for users to understand how they can use the model. 3. **Research Papers**: If your model has associated papers, cite them in the model card. They will be [cross-linked automatically](https://huggingface.co/docs/hub/model-cards#linking-a-paper). ```markdown ## References * [Model Paper](https://arxiv.org/abs/xxxx.xxxxx) ``` 4. **Collections**: If you're releasing multiple related models or variants, organize them into a [collection](https://huggingface.co/docs/hub/collections). Collections help users discover related models and understand relationships across versions. 5. **Demos**: Create a [Hugging Face Space](https://huggingface.co/docs/hub/spaces) with an interactive demo. This lets users try your model without writing code. You can also [link the model](https://huggingface.co/docs/hub/spaces-config-reference) from the Space to make it appear on the model page UI. ```markdown ## Demo Try this model directly in your browser: [Space Demo](https://huggingface.co/spaces/username/model-demo) ``` When you create a demo, download the model from its Hub repository (not external sources like Google Drive). This cross-links artifacts and improves visibility 6. 
**Quantized Versions**: Consider uploading quantized versions (for example, GGUF) in a separate repository to improve accessibility for users with limited compute. Link these versions using the [`base_model` metadata field](https://huggingface.co/docs/hub/model-cards#specifying-a-base-model) on the quantized model cards, and document performance differences.

```yaml
---
base_model: username/original-model
base_model_relation: quantized
---
```

| ![model tree showcasing relations](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/model-tree.png) |
| :--: |
| Model tree showing quantized versions |

7. **Linking Datasets on the Model Page**: Link datasets in your metadata so they appear directly on your model page.

```yaml
---
datasets:
- username/dataset
- username/dataset-2
---
```

8. **New Model Version**: If your model is an update of an existing one, specify it on the older model's card. This will [display a banner](https://huggingface.co/docs/hub/en/model-cards#specifying-a-new-version) on the older page linking to the update.

```yaml
---
new_version: username/updated-model
---
```

9. **Visual Examples**: For image or video generation models, include examples directly on your model page using the [`Gallery` card component](https://huggingface.co/docs/hub/en/model-cards-components#the-gallery-component).

```markdown
![Example 1](./images/example1.png)
![Example 2](./images/example2.png)
```

10. **Carbon Emissions**: If possible, specify the [carbon emissions](https://huggingface.co/docs/hub/model-cards-co2) from training.

```yaml
---
co2_eq_emissions:
  emissions: 123.45
  source: "CodeCarbon"
  training_type: "pre-training"
  geographical_location: "US-East"
  hardware_used: "8xA100 GPUs"
---
```

### Access Control and Visibility

1. **Visibility Settings**: When ready to share your model, switch it to public in your [model settings](https://huggingface.co/docs/hub/repositories-settings). Before doing so, double-check all documentation and code examples to ensure they're accurate and complete.

2. **Gated Access**: If your model needs controlled access, use the [gated access feature](https://huggingface.co/docs/hub/models-gated) and clearly state the conditions users must meet. This is important for models with dual-use concerns or commercial restrictions.

## 🏁 After Releasing Your Model

A successful model release extends beyond the initial publication. To maintain quality and maximize impact:

### Maintenance and Community Engagement

1. **Verify Functionality**: After release, test all code snippets in a clean environment to confirm they work as expected. This ensures users can run your model without errors or confusion. For example, if your model is a `transformers`-compatible LLM:

```python
from transformers import pipeline

# This should run without errors
pipe = pipeline("text-generation", model="your-username/your-model")
result = pipe("Your test prompt")
print(result)
```

2. **Share Share Share**: Most users discover models through social media, chat channels (like Slack or Discord), or newsletters. Share your model links in these spaces, and also add them to your website or GitHub repositories. The more visits and likes your model receives, the higher it appears on the [Hugging Face Trending section](https://huggingface.co/models?sort=trending), bringing even more visibility.

3. **Community Interaction**: Use the Community tab to answer questions, address feedback, and resolve issues promptly.
Clarify confusion, accept helpful suggestions, and close off-topic threads to keep discussions focused.

### Tracking Usage and Impact

1. **Usage Metrics**: [Track downloads](https://huggingface.co/docs/hub/en/models-download-stats) and likes to understand your model's reach and adoption. You can view total download metrics in your model's settings.

2. **Review Community Contributions**: Regularly check your model's repository for contributions from other users. Community pull requests and discussions can provide useful feedback, ideas, and opportunities for collaboration.

## 🏢 Enterprise Features

A [Hugging Face Team & Enterprise](https://huggingface.co/enterprise) subscription offers additional capabilities for teams and organizations:

1. **Access Control**: Set [resource groups](https://huggingface.co/docs/hub/security-resource-groups) to manage access for specific teams or users. This ensures the right permissions and secure collaboration across your organization.

2. **Storage Region**: Choose the data storage region (US or EU) for your model files to meet regional data regulations and compliance requirements.

3. **Advanced Analytics**: Use [Enterprise Analytics features](https://huggingface.co/docs/hub/enterprise-hub-analytics) to gain deeper insights into model usage patterns, downloads, and adoption trends across your organization.

4. **Extended Storage**: Access additional private storage capacity to host more models and larger artifacts as your model portfolio expands.

5. **Organization Blog Posts**: Enterprise organizations can now [publish blog articles directly on Hugging Face](https://huggingface.co/blog/huggingface/blog-articles-for-orgs). This lets you share model releases, research updates, and announcements with the broader community, all from your organization's profile.

By following these guidelines and examples, you'll make your model release on Hugging Face clear, useful, and impactful. This helps your work reach more people, strengthens the AI community, and increases your model's visibility. We can't wait to see what you share next! 🤗

### Embed your Space in another website

https://huggingface.co/docs/hub/spaces-embed.md

# Embed your Space in another website

Once your Space is up and running, you might wish to embed it in a website or in your blog. Embedding or sharing your Space is a great way to allow your audience to interact with your work and demonstrations without requiring any setup on their side. To embed a Space, its visibility needs to be public.

## Direct URL

A Space is assigned a unique URL you can use to share your Space or embed it in a website. This URL is of the form: `"https://<space-subdomain>.hf.space"`. For instance, the Space [NimaBoscarino/hotdog-gradio](https://huggingface.co/spaces/NimaBoscarino/hotdog-gradio) has the corresponding URL of `"https://nimaboscarino-hotdog-gradio.hf.space"`. The subdomain is unique and only changes if you move or rename your Space.

Your Space is always served from the root of this subdomain.

You can find the Space URL along with example snippets of how to embed it directly from the options menu:

## Embedding with IFrames

The default embedding method for a Space is using IFrames.
Add the following element at the location in your HTML where you want to embed your Space:

```html
<iframe
	src="https://<space-subdomain>.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>
```

For instance, using the [NimaBoscarino/hotdog-gradio](https://huggingface.co/spaces/NimaBoscarino/hotdog-gradio) Space:

## Embedding with WebComponents

If the Space you wish to embed is Gradio-based, you can use Web Components to embed your Space. WebComponents are faster than IFrames and automatically adjust to your web page so that you do not need to configure `width` or `height` for your element.

First, you need to import the Gradio JS library that corresponds to the Gradio version in the Space by adding the corresponding script tag to your HTML (you can copy it from the Space's embed options or from the Gradio documentation linked below). Then, add a `gradio-app` element where you want to embed your Space.

```html
<gradio-app src="https://<space-subdomain>.hf.space"></gradio-app>
```

Check out the [Gradio documentation](https://www.gradio.app/guides/sharing-your-app#embedding-hosted-spaces) for more details.

### Analytics

https://huggingface.co/docs/hub/enterprise-hub-analytics.md

# Analytics

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

## Publisher Analytics Dashboard

Track all your repository activity with a detailed downloads overview that shows total downloads for all the Models and Datasets published by your organization.

Toggle between "All Time" and "Last Month" views to gain insights across your repositories over different periods.

### Per-repo breakdown

Explore the metrics of individual repositories with the per-repository drill-down table. Utilize the built-in search feature to quickly locate specific repositories. Each row also features a time-series graph that illustrates the trend of downloads over time.

## Export Publisher Analytics as CSV

Download a comprehensive CSV file containing analytics for all your repositories, including model and dataset download activity.

### Response Structure

The CSV file is made up of daily download records for each of your models and datasets.

```csv
repoType,repoName,total,timestamp,downloads
model,huggingface/CodeBERTa-small-v1,4362460,2021-01-22T00:00:00.000Z,4
model,huggingface/CodeBERTa-small-v1,4362460,2021-01-23T00:00:00.000Z,7
model,huggingface/CodeBERTa-small-v1,4362460,2021-01-24T00:00:00.000Z,2
dataset,huggingface/documentation-images,2167284,2021-11-27T00:00:00.000Z,3
dataset,huggingface/documentation-images,2167284,2021-11-28T00:00:00.000Z,18
dataset,huggingface/documentation-images,2167284,2021-11-29T00:00:00.000Z,7
```

### Repository Object Structure

Each record in the CSV contains:

- `repoType`: The type of repository (e.g., "model", "dataset")
- `repoName`: Full repository name including organization (e.g., "huggingface/documentation-images")
- `total`: Cumulative number of downloads for this repository
- `timestamp`: ISO 8601 formatted date (UTC)
- `downloads`: Number of downloads for that day

Records are ordered chronologically and provide a daily granular view of download activity for each repository.

### ZenML on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-zenml.md

# ZenML on Spaces

[ZenML](https://github.com/zenml-io/zenml) is an extensible, open-source MLOps framework for creating portable, production-ready MLOps pipelines. It's built for Data Scientists, ML Engineers, and MLOps Developers to collaborate as they develop to production.

ZenML offers a simple and flexible syntax, is cloud- and tool-agnostic, and has interfaces/abstractions catered toward ML workflows. With ZenML you'll have all your favorite tools in one place, so you can tailor a workflow that caters to your specific needs.
The ZenML Huggingface Space allows you to get up and running with a deployed version of ZenML with just a few clicks. Within a few minutes, you'll have this default ZenML dashboard deployed and ready for you to connect to from your local machine. In the sections that follow, you'll learn to deploy your own instance of ZenML and use it to view and manage your machine learning pipelines right from the Hub. ZenML on Huggingface Spaces is a **self-contained application completely hosted on the Hub using Docker**. The diagram below illustrates the complete process. ![ZenML on HuggingFace Spaces -- default deployment](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/zenml/hf_spaces_chart.png) Visit [the ZenML documentation](https://docs.zenml.io/) to learn more about its features and how to get started with running your machine learning pipelines through your Huggingface Spaces deployment. You can check out [some small sample examples](https://github.com/zenml-io/zenml/tree/main/examples) of ZenML pipelines to get started or take your pick of some more complex production-grade projects at [the ZenML Projects repository](https://github.com/zenml-io/zenml-projects). ZenML integrates with many of your favorite tools out of the box, [including Huggingface](https://zenml.io/integrations/huggingface) of course! If there's something else you want to use, we're built to be extensible and you can easily make it work with whatever your custom tool or workflow is. ## ⚡️ Deploy ZenML on Spaces You can deploy ZenML on Spaces with just a few clicks: To set up your ZenML app, you need to specify three main components: the Owner (either your personal account or an organization), a Space name, and the Visibility (a bit lower down the page). Note that the space visibility needs to be set to 'Public' if you wish to connect to the ZenML server from your local machine. ![Choose the ZenML Docker template](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/zenml/choose_space.png) You have the option here to select a higher tier machine to use for your server. The advantage of selecting a paid CPU instance is that it is not subject to auto-shutdown policies and thus will stay up as long as you leave it up. In order to make use of a persistent CPU, you'll likely want to create and set up a MySQL database to connect to (see below). To personalize your Space's appearance, such as the title, emojis, and colors, navigate to "Files and Versions" and modify the metadata in your README.md file. Full information on Spaces configuration parameters can be found on the HuggingFace [documentation reference guide](https://huggingface.co/docs/hub/spaces-config-reference). After creating your Space, you'll notice a 'Building' status along with logs displayed on the screen. When this switches to 'Running', your Space is ready for use. If the ZenML login UI isn't visible, try refreshing the page. In the upper-right hand corner of your space you'll see a button with three dots which, when you click on it, will offer you a menu option to "Embed this Space". (See [the HuggingFace documentation](https://huggingface.co/docs/hub/spaces-embed) for more details on this feature.) Copy the "Direct URL" shown in the box that you can now see on the screen. This should look something like this: `https://-.hf.space`. Open that URL and use our default login to access the dashboard (username: 'default', password: (leave it empty)). 
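For reference, the appearance settings mentioned above (title, emoji, and colors) live in the YAML block at the top of the Space's `README.md`. A minimal sketch, with illustrative values, looks like this:

```yaml
---
title: My ZenML Server
emoji: 🧘
colorFrom: purple
colorTo: green
sdk: docker
---
```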
## Connecting to your ZenML Server from your Local Machine

Once you have your ZenML server up and running, you can connect to it from your local machine. To do this, you'll need to get your Space's 'Direct URL' (see above).

> [!WARNING]
> Your Space's URL will only be available and usable for connecting from your
> local machine if the visibility of the space is set to 'Public'.

You can use the 'Direct URL' to connect to your ZenML server from your local machine with the following CLI command (after installing ZenML, and using your own Direct URL instead of the placeholder):

```shell
zenml connect --url '<DIRECT_URL>' --username='default' --password=''
```

You can also use the Direct URL in your browser to use the ZenML dashboard as a fullscreen application (i.e. without the HuggingFace Spaces wrapper around it).

> [!WARNING]
> The ZenML dashboard will currently not work when viewed from within the Huggingface
> webpage (i.e. wrapped in the main `https://huggingface.co/...` website). This is on
> account of a limitation in how cookies are handled between ZenML and Huggingface.
> You **must** view the dashboard from the 'Direct URL' (see above).

## Extra Configuration Options

By default the ZenML application will be configured to use a SQLite non-persistent database. If you want to use a persistent database, you can configure this by amending the `Dockerfile` in your Space's root directory. For full details on the various parameters you can change, see [our reference documentation](https://docs.zenml.io/getting-started/deploying-zenml/docker#zenml-server-configuration-options) on configuring ZenML when deployed with Docker.

> [!TIP]
> If you are using the space just for testing and experimentation, you don't need
> to make any changes to the configuration. Everything will work out of the box.

You can also use an external secrets backend together with your HuggingFace Spaces as described in [our documentation](https://docs.zenml.io/getting-started/deploying-zenml/docker#zenml-server-configuration-options). You should be sure to use HuggingFace's inbuilt 'Repository secrets' functionality to configure any secrets you need to use in your `Dockerfile` configuration. [See the documentation](https://huggingface.co/docs/hub/spaces-sdks-docker#secret-management) for more details on how to set this up.

> [!WARNING]
> If you wish to use a cloud secrets backend together with ZenML for secrets
> management, **you must take the following minimal security precautions** on your ZenML Server on the
> Dashboard:
>
> - change your password on the `default` account that you get when you start. You
> can do this from the Dashboard or via the CLI.
> - create a new user account with a password and assign it the `admin` role. This
> can also be done from the Dashboard (by 'inviting' a new user) or via the CLI.
> - reconnect to the server using the new user account and password as described
> above, and use this new user account as your working account.
>
> This is because the default user created by the
> HuggingFace Spaces deployment process has no password assigned to it and as the
> Space is publicly accessible (since the Space is public) *potentially anyone
> could access your secrets without this extra step*. To change your password
> navigate to the Settings page by clicking the button in the upper right hand
> corner of the Dashboard and then click 'Update Password'.

## Upgrading your ZenML Server on HF Spaces

The default space will use the latest version of ZenML automatically.
If you want to update your version, you can simply select the 'Factory reboot' option within the 'Settings' tab of the space. Note that this will wipe any data contained within the space and so if you are not using a MySQL persistent database (as described above) you will lose any data contained within your ZenML deployment on the space. You can also configure the space to use an earlier version by updating the `Dockerfile`'s `FROM` import statement at the very top. ## Next Steps As a next step, check out our [Starter Guide to MLOps with ZenML](https://docs.zenml.io/starter-guide/pipelines) which is a series of short practical pages on how to get going quickly. Alternatively, check out [our `quickstart` example](https://github.com/zenml-io/zenml/tree/main/examples/quickstart) which is a full end-to-end example of many of the features of ZenML. ## 🤗 Feedback and support If you are having trouble with your ZenML server on HuggingFace Spaces, you can view the logs by clicking on the "Open Logs" button at the top of the space. This will give you more context of what's happening with your server. If you have suggestions or need specific support for anything else which isn't working, please [join the ZenML Slack community](https://zenml.io/slack-invite/) and we'll be happy to help you out! ### Datasets https://huggingface.co/docs/hub/datasets.md # Datasets The Hugging Face Hub is home to a growing collection of datasets that span a variety of domains and tasks. These docs will guide you through interacting with the datasets on the Hub, uploading new datasets, exploring the datasets contents, and using datasets in your projects. This documentation focuses on the datasets functionality in the Hugging Face Hub and how to use the datasets with supported libraries. For detailed information about the 🤗 Datasets python package, visit the [🤗 Datasets documentation](/docs/datasets/index). ## Contents - [Datasets Overview](./datasets-overview) - [Dataset Cards](./datasets-cards) - [Gated Datasets](./datasets-gated) - [Uploading Datasets](./datasets-adding) - [Downloading Datasets](./datasets-downloading) - [Libraries](./datasets-libraries) - [Dataset Viewer](./datasets-viewer) - [Data files Configuration](./datasets-data-files-configuration) ### Team & Enterprise plans https://huggingface.co/docs/hub/enterprise-hub.md # Team & Enterprise plans > [!TIP] > Subscribe to a Team or Enterprise plan to get access to advanced features for your organization. Team & Enterprise organization plans add advanced capabilities to organizations, enabling safe, compliant and managed collaboration for companies and teams on Hugging Face. 
In this section we will document the following Enterprise Hub features: - [Single Sign-On (SSO)](./enterprise-sso) - [Advanced Single Sign-On (SSO)](./enterprise-hub-advanced-sso) - [User Provisioning (SCIM)](./enterprise-hub-scim) - [Audit Logs](./audit-logs) - [Storage Regions](./storage-regions) - [Data Studio for Private datasets](./enterprise-hub-datasets) - [Resource Groups](./security-resource-groups) - [Advanced Compute Options](./advanced-compute-options) - [Advanced Security](./enterprise-hub-advanced-security) - [Tokens Management](./enterprise-hub-tokens-management) - [Publisher Analytics](./enterprise-hub-analytics) - [Gating Group Collections](./enterprise-hub-gating-group-collections) - [Network Security](./enterprise-hub-network-security) - [Higher Rate limits](./rate-limits) Finally, Team & Enterprise plans include vastly more [included public storage](./storage-limits), as well as 1TB of [private storage](./storage-limits) per seat in the subscription, i.e. if your organization has 40 members, then you have 40TB included storage for your private models and datasets. ### Advanced Topics https://huggingface.co/docs/hub/models-advanced.md # Advanced Topics ## Contents - [Integrate your library with the Hub](./models-adding-libraries) - [Adding new tasks to the Hub](./models-tasks) - [GGUF format](./gguf) - [DDUF format](./dduf) ### Organizations https://huggingface.co/docs/hub/organizations.md # Organizations The Hugging Face Hub offers **Organizations**, which can be used to group accounts and manage datasets, models, and Spaces. The Hub also allows admins to set user roles to [**control access to repositories**](./organizations-security) and manage their organization's [payment method and billing info](https://huggingface.co/pricing). If an organization needs to track user access to a dataset or a model due to licensing or privacy issues, an organization can enable [user access requests](./datasets-gated). Note: Use the context switcher in your org settings to quickly switch between your account and your orgs. ## Contents - [Managing Organizations](./organizations-managing) - [Organization Cards](./organizations-cards) - [Access Control in Organizations](./organizations-security) ## Next: Power up your organization - [Team & Enterprise Plans](./enterprise-hub) ### Widgets https://huggingface.co/docs/hub/models-widgets.md # Widgets ## What's a widget? Many model repos have a widget that allows anyone to run inferences directly in the browser. These widgets are powered by [Inference Providers](https://huggingface.co/docs/inference-providers), which provide developers streamlined, unified access to hundreds of machine learning models, backed by our serverless inference partners. Here are some examples of current popular models: - [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) - State-of-the-art open-weights conversational model - [Flux Kontext](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev) - Open-weights transformer model for image editing - [Falconsai's NSFW Detection](https://huggingface.co/Falconsai/nsfw_image_detection) - Image content moderation - [ResembleAI's Chatterbox](https://huggingface.co/ResembleAI/chatterbox) - Production-grade open source text-to-speech model. You can explore more models and their widgets on the [models page](https://huggingface.co/models?inference_provider=all&sort=trending) or try them interactively in the [Inference Playground](https://huggingface.co/playground). 
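Since widgets are backed by Inference Providers, the same models can also be called programmatically. As a rough sketch (assuming you are logged in with `hf auth login` or have a token set, and that a provider is currently serving the model):

```python
from huggingface_hub import InferenceClient

# Uses your Hugging Face token from the environment / `hf auth login`.
client = InferenceClient()

# DeepSeek V3 is one of the widget-enabled models mentioned above; any
# conversational model served by a provider works the same way.
response = client.chat_completion(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```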
## Enabling a widget

Widgets are displayed when the model is hosted by at least one Inference Provider, ensuring optimal performance and reliability for the model's inference. Providers autonomously choose and control what models they deploy.

The type of widget displayed (text-generation, text-to-image, etc.) is inferred from the model's `pipeline_tag`, a special tag that the Hub tries to compute automatically for all models. The only exception is for the `conversational` widget, which is shown on models with a `pipeline_tag` of either `text-generation` or `image-text-to-text`, as long as they're also tagged as `conversational`. We choose to expose **only one** widget per model for simplicity.

For some libraries, such as `transformers`, the model type can be inferred automatically from configuration files (`config.json`). The architecture can determine the type: for example, `AutoModelForTokenClassification` corresponds to `token-classification`. If you're interested in this, you can see pseudo-code in [this gist](https://gist.github.com/julien-c/857ba86a6c6a895ecd90e7f7cab48046).

For most other use cases, we use the model tags to determine the model task type. For example, if there is `tag: text-classification` in the [model card metadata](./model-cards), the inferred `pipeline_tag` will be `text-classification`.

**You can always manually override your pipeline type with `pipeline_tag: xxx` in your [model card metadata](./model-cards#model-card-metadata).** (You can also use the metadata GUI editor to do this.)

### How can I control my model's widget example input?

You can specify the widget input in the model card metadata section:

```yaml
widget:
- text: "This new restaurant has amazing food and great service!"
  example_title: "Positive Review"
- text: "I'm really disappointed with this product. Poor quality and overpriced."
  example_title: "Negative Review"
- text: "The weather is nice today."
  example_title: "Neutral Statement"
```

You can provide more than one example input. In the examples dropdown menu of the widget, they will appear as `Example 1`, `Example 2`, etc. Optionally, you can supply `example_title` as well.

```yaml
widget:
- text: "Is this review positive or negative? Review: Best cast iron skillet you will ever buy."
  example_title: "Sentiment analysis"
- text: "Barack Obama nominated Hillary Clinton as his secretary of state on Monday. He chose her because she had ..."
  example_title: "Coreference resolution"
- text: "On a shelf, there are five books: a gray book, a red book, a purple book, a blue book, and a black book ..."
  example_title: "Logic puzzles"
- text: "The two men running to become New York City's next mayor will face off in their first debate Wednesday night ..."
  example_title: "Reading comprehension"
```

Moreover, you can specify non-text example inputs in the model card metadata. Refer [here](./models-widgets-examples) for a complete list of sample input formats for all widget types. For vision & audio widget types, provide example inputs with `src` rather than `text`.
For example, allow users to choose from two sample audio files for automatic speech recognition tasks by: ```yaml widget: - src: https://example.org/somewhere/speech_samples/sample1.flac example_title: Speech sample 1 - src: https://example.org/somewhere/speech_samples/sample2.flac example_title: Speech sample 2 ``` Note that you can also include example files in your model repository and use them as: ```yaml widget: - src: https://huggingface.co/username/model_repo/resolve/main/sample1.flac example_title: Custom Speech Sample 1 ``` But even more convenient, if the file lives in the corresponding model repo, you can just use the filename or file path inside the repo: ```yaml widget: - src: sample1.flac example_title: Custom Speech Sample 1 ``` or if it was nested inside the repo: ```yaml widget: - src: nested/directory/sample1.flac ``` We provide example inputs for some languages and most widget types in [default-widget-inputs.ts file](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/default-widget-inputs.ts). If some examples are missing, we welcome PRs from the community to add them! ## Example outputs As an extension to example inputs, for each widget example, you can also optionally describe the corresponding model output, directly in the `output` property. This is useful when the model is not yet supported by Inference Providers, so that the model page can still showcase how the model works and what results it gives. For instance, for an [automatic-speech-recognition](./models-widgets-examples#automatic-speech-recognition) model: ```yaml widget: - src: sample1.flac output: text: "Hello my name is Julien" ``` The `output` property should be a YAML dictionary that represents the output format from Inference Providers. For a model that outputs text, see the example above. For a model that outputs labels (like a [text-classification](./models-widgets-examples#text-classification) model for instance), output should look like this: ```yaml widget: - text: "I liked this movie" output: - label: POSITIVE score: 0.8 - label: NEGATIVE score: 0.2 ``` Finally, for a model that outputs an image, audio, or any other kind of asset, the output should include a `url` property linking to either a file name or path inside the repo or a remote URL. For example, for a text-to-image model: ```yaml widget: - text: "picture of a futuristic tiger, artstation" output: url: images/tiger.jpg ``` We can also surface the example outputs in the Hugging Face UI, for instance, for a text-to-image model to display a gallery of cool image generations. ## Widget Availability and Provider Support Not all models have widgets available. Widget availability depends on: 1. **Task Support**: The model's task must be supported by at least one provider in the Inference Providers network 2. **Provider Availability**: At least one provider must be serving the specific model 3. **Model Configuration**: The model must have proper metadata and configuration files To view the full list of supported tasks, check out [our dedicated documentation page](https://huggingface.co/docs/inference-providers/tasks/index). The list of all providers and the tasks they support is available in [this documentation page](https://huggingface.co/docs/inference-providers/index#partners). For models without provider support, you can still showcase functionality using [example outputs](#example-outputs) in your model card. 
You can also click _Ask for provider support_ directly on the model page to encourage providers to serve the model, given there is enough community interest. ## Exploring Models with the Inference Playground Before integrating models into your applications, you can test them interactively with the [Inference Playground](https://huggingface.co/playground). The playground allows you to: - Test different [chat completion models](https://huggingface.co/models?inference_provider=all&sort=trending&other=conversational) with custom prompts - Compare responses across different models - Experiment with inference parameters like temperature, max tokens, and more - Find the perfect model for your specific use case The playground uses the same Inference Providers infrastructure that powers the widgets, so you can expect similar performance and capabilities when you integrate the models into your own applications. ### Using 🤗 `transformers` at Hugging Face https://huggingface.co/docs/hub/transformers.md # Using 🤗 `transformers` at Hugging Face 🤗 `transformers` is a library maintained by Hugging Face and the community, for state-of-the-art Machine Learning for Pytorch, TensorFlow and JAX. It provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. We are a bit biased, but we really like 🤗 `transformers`! ## Exploring 🤗 transformers in the Hub There are over 630,000 `transformers` models in the Hub which you can find by filtering at the left of [the models page](https://huggingface.co/models?library=transformers&sort=downloads). You can find models for many different tasks: * Extracting the answer from a context ([question-answering](https://huggingface.co/models?library=transformers&pipeline_tag=question-answering&sort=downloads)). * Creating summaries from a large text ([summarization](https://huggingface.co/models?library=transformers&pipeline_tag=summarization&sort=downloads)). * Classify text (e.g. as spam or not spam, [text-classification](https://huggingface.co/models?library=transformers&pipeline_tag=text-classification&sort=downloads)). * Generate a new text with models such as GPT ([text-generation](https://huggingface.co/models?library=transformers&pipeline_tag=text-generation&sort=downloads)). * Identify parts of speech (verb, subject, etc.) or entities (country, organization, etc.) in a sentence ([token-classification](https://huggingface.co/models?library=transformers&pipeline_tag=token-classification&sort=downloads)). * Transcribe audio files to text ([automatic-speech-recognition](https://huggingface.co/models?library=transformers&pipeline_tag=automatic-speech-recognition&sort=downloads)). * Classify the speaker or language in an audio file ([audio-classification](https://huggingface.co/models?library=transformers&pipeline_tag=audio-classification&sort=downloads)). * Detect objects in an image ([object-detection](https://huggingface.co/models?library=transformers&pipeline_tag=object-detection&sort=downloads)). * Segment an image ([image-segmentation](https://huggingface.co/models?library=transformers&pipeline_tag=image-segmentation&sort=downloads)). * Do Reinforcement Learning ([reinforcement-learning](https://huggingface.co/models?library=transformers&pipeline_tag=reinforcement-learning&sort=downloads))! You can try out the models directly in the browser if you want to test them out without downloading them thanks to the in-browser widgets! 
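As a sketch of how one of the tasks listed above can be tried in code rather than in the browser (the checkpoint is just an example; any model with the corresponding `pipeline_tag` works):

```python
from transformers import pipeline

# Example text-classification checkpoint; filter the Hub by task to pick another.
classifier = pipeline(
    "text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("I really enjoyed using this library."))
# [{'label': 'POSITIVE', 'score': ...}]
```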
## Transformers repository files

A [Transformers](https://hf.co/docs/transformers/index) model repository generally contains model files and preprocessor files.

### Model

- The **`config.json`** file stores details about the model architecture such as the number of hidden layers, vocabulary size, number of attention heads, the dimensions of each head, and more. This metadata is the model blueprint.
- The **`model.safetensors`** file stores the model's pretrained layers and weights. For large models, the safetensors file is sharded to limit the amount of memory required to load it. Browse the **`model.safetensors.index.json`** file to see which safetensors file the model weights are being loaded from.

  ```json
  {
    "metadata": {
      "total_size": 16060522496
    },
    "weight_map": {
      "lm_head.weight": "model-00004-of-00004.safetensors",
      "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
      ...
    }
  }
  ```

  You can also visualize this mapping by clicking on the ↗ button on the model card.

  [Safetensors](https://hf.co/docs/safetensors/index) is a safer and faster serialization format - compared to [pickle](./security-pickle#use-your-own-serialization-format) - for storing model weights. You may encounter weights pickled in formats such as **`bin`**, **`pth`**, or **`ckpt`**, but **`safetensors`** is increasingly adopted in the model ecosystem as a better alternative.

- A model may also have a **`generation_config.json`** file which stores details about how to generate text, such as whether to sample, the top tokens to sample from, the temperature, and the special tokens for starting and stopping generation.

### Preprocessor

- The **`tokenizer_config.json`** file stores the special tokens added by a model. These special tokens signal many things to a model such as the beginning of a sentence, specific formatting for chat templates, or indicating an image. This file also shows the maximum input sequence length the model can accept, the preprocessor class, and the outputs it returns.
- The **`tokenizer.json`** file stores the model's learned vocabulary.
- The **`special_tokens_map.json`** is a mapping of the special tokens. For example, in [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/special_tokens_map.json), the beginning of string token is `<|begin_of_text|>`.

> [!TIP]
> For other modalities, the `tokenizer_config.json` file is replaced by `preprocessor_config.json`.

## Using existing models

All `transformers` models are one line away from being used! Depending on how you want to use them, you can use the high-level API with the `pipeline` function, or use `AutoModel` for more control.

```py
# With pipeline, just specify the task and the model id from the Hub.
from transformers import pipeline
pipe = pipeline("text-generation", model="distilbert/distilgpt2")

# If you want more control, you will need to define the tokenizer and model.
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
```

You can also load a model from a specific version (based on commit hash, tag name, or branch) as follows:

```py
model = AutoModel.from_pretrained(
    "julien-c/EsperBERTo-small",
    revision="v2.0.1"  # tag name, or branch name, or commit hash
)
```

If you want to see how to load a specific model, you can click `Use in Transformers` and you will be given a working snippet to load it!
If you need further information about the model architecture, you can also click the "Read model documentation" link at the bottom of the snippet.

## Sharing your models

To read all about sharing models with `transformers`, please head over to the [Share a model](https://huggingface.co/docs/transformers/model_sharing) guide in the official documentation.

Many classes in `transformers`, such as the models and tokenizers, have a `push_to_hub` method that allows you to easily upload the files to a repository.

```py
# Pushing model to your own account
model.push_to_hub("my-awesome-model")

# Pushing your tokenizer
tokenizer.push_to_hub("my-awesome-model")

# Pushing all things after training
trainer.push_to_hub()
```

There is much more you can do, so we suggest reviewing the [Share a model](https://huggingface.co/docs/transformers/model_sharing) guide.

## Additional resources

* Transformers [library](https://github.com/huggingface/transformers).
* Transformers [docs](https://huggingface.co/docs/transformers/index).
* Share a model [guide](https://huggingface.co/docs/transformers/model_sharing).

### How to configure OIDC SSO with Okta

https://huggingface.co/docs/hub/security-sso-okta-oidc.md

# How to configure OIDC SSO with Okta

In this guide, we will use Okta as the SSO provider, with the OpenID Connect (OIDC) protocol as our preferred identity protocol.

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

### Step 1: Create a new application in your Identity Provider

Open a new tab/window in your browser and sign in to your Okta account. Navigate to "Admin/Applications" and click the "Create App Integration" button.

Then choose an "OIDC - OpenID Connect" application, select the application type "Web Application" and click "Create".

### Step 2: Configure your application in Okta

Open a new tab/window in your browser and navigate to the SSO section of your organization's settings. Select the OIDC protocol.

Copy the "Redirection URI" from the organization's settings on Hugging Face, and paste it in the "Sign-in redirect URI" field on Okta. The URL looks like this: `https://huggingface.co/organizations/[organizationIdentifier]/oidc/consume`.

You can leave the optional Sign-out redirect URIs blank. Save your new application.

### Step 3: Finalize configuration on Hugging Face

In your Okta application, under "General", find the following fields:

- Client ID
- Client secret
- Issuer URL

You will need these to finalize the SSO setup on Hugging Face.

The Okta Issuer URL is generally a URL like `https://tenantId.okta.com`; you can refer to their [guide](https://support.okta.com/help/s/article/What-is-theIssuerlocated-under-the-OpenID-Connect-ID-Token-app-settings-used-for?language=en_US) for more details.

In the SSO section of your organization's settings on Hugging Face, copy-paste these values from Okta:

- Client ID
- Client Secret

You can now click on "Update and Test OIDC configuration" to save the settings.

You should be redirected to your SSO provider (IdP) login prompt. Once logged in, you'll be redirected to your organization's settings page.

A green check mark near the OIDC selector will attest that the test was successful.

### Step 4: Enable SSO in your organization

Now that Single Sign-On is configured and tested, you can enable it for members of your organization by clicking on the "Enable" button.

Once enabled, members of your organization must complete the SSO authentication flow described in the [How does it work?](./security-sso#how-does-it-work) section.
### Spaces Settings

https://huggingface.co/docs/hub/spaces-settings.md

# Spaces Settings

You can configure your Space's appearance and other settings inside the `YAML` block at the top of the **README.md** file at the root of the repository. For example, if you want to create a Space with Gradio named `Demo Space` with a yellow to orange gradient thumbnail:

```yaml
---
title: Demo Space
emoji: 🤗
colorFrom: yellow
colorTo: orange
sdk: gradio
app_file: app.py
pinned: false
---
```

For additional settings, refer to the [Reference](./spaces-config-reference) section.

### Using Stanza at Hugging Face

https://huggingface.co/docs/hub/stanza.md

# Using Stanza at Hugging Face

`stanza` is a collection of accurate and efficient tools for the linguistic analysis of many human languages. Starting from raw text to syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to languages of your choosing.

## Exploring Stanza in the Hub

You can find `stanza` models by filtering at the left of the [models page](https://huggingface.co/models?library=stanza&sort=downloads). You can find over 70 models for different languages!

All models on the Hub come with the following features:

1. An automatically generated model card with a brief description and metadata tags that help with discoverability.
2. An interactive widget you can use to play with the model directly in the browser (for named entity recognition and part of speech).
3. An Inference API that allows you to make inference requests (for named entity recognition and part of speech).

## Using existing models

The `stanza` library automatically downloads models from the Hub. You can use `stanza.Pipeline` to download the model from the Hub and do inference.

```python
import stanza

nlp = stanza.Pipeline('en') # download the English model and initialize an English neural pipeline
doc = nlp("Barack Obama was born in Hawaii.") # run annotation over a sentence
```

## Sharing your models

To add new official Stanza models, you can follow the process to [add a new language](https://stanfordnlp.github.io/stanza/new_language.html) and then [share your models with the Stanza team](https://stanfordnlp.github.io/stanza/new_language.html#contributing-back-to-stanza). You can also find the official script to upload models to the Hub [here](https://github.com/stanfordnlp/huggingface-models/blob/main/hugging_stanza.py).

## Additional resources

* `stanza` [docs](https://stanfordnlp.github.io/stanza/).

### Dask

https://huggingface.co/docs/hub/datasets-dask.md

# Dask

[Dask](https://www.dask.org/?utm_source=hf-docs) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem. In particular, we can use [Dask DataFrame](https://docs.dask.org/en/stable/dataframe.html?utm_source=hf-docs) to scale up pandas workflows. Dask DataFrame parallelizes pandas to handle large tabular data. It closely mirrors the pandas API, making it simple to transition from testing on a single dataset to processing the full dataset. Dask is particularly effective with Parquet, the default format on Hugging Face Datasets, as it supports rich data types, efficient columnar filtering, and compression.

A good practical use case for Dask is running data processing or model inference on a dataset in a distributed manner. See, for example, [Coiled's](https://www.coiled.io/?utm_source=hf-docs) excellent blog post on [Scaling AI-Based Data Processing with Hugging Face + Dask](https://huggingface.co/blog/dask-scaling).
## Read and Write Since Dask uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub. First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using: ``` hf auth login ``` Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using: ```python from huggingface_hub import HfApi HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") ``` Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in Dask. Dask DataFrame supports distributed writing to Parquet on Hugging Face, which uses commits to track dataset changes: ```python import dask.dataframe as dd df.to_parquet("hf://datasets/username/my_dataset") # or write in separate directories if the dataset has train/validation/test splits df_train.to_parquet("hf://datasets/username/my_dataset/train") df_valid.to_parquet("hf://datasets/username/my_dataset/validation") df_test .to_parquet("hf://datasets/username/my_dataset/test") ``` Since this creates one commit per file, it is recommended to squash the history after the upload: ```python from huggingface_hub import HfApi HfApi().super_squash_history(repo_id=repo_id, repo_type="dataset") ``` This creates a dataset repository `username/my_dataset` containing your Dask dataset in Parquet format. You can reload it later: ```python import dask.dataframe as dd df = dd.read_parquet("hf://datasets/username/my_dataset") # or read from separate directories if the dataset has train/validation/test splits df_train = dd.read_parquet("hf://datasets/username/my_dataset/train") df_valid = dd.read_parquet("hf://datasets/username/my_dataset/validation") df_test = dd.read_parquet("hf://datasets/username/my_dataset/test") ``` For more information on the Hugging Face paths and how they are implemented, please refer to the [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system). ## Process data To process a dataset in parallel using Dask, you can first define your data processing function for a pandas DataFrame or Series, and then use the Dask `map_partitions` function to apply this function to all the partitions of a dataset in parallel: ```python def dummy_count_words(texts): return pd.Series([len(text.split(" ")) for text in texts]) ``` or a similar function using pandas string methods (faster): ```python def dummy_count_words(texts): return texts.str.count(" ") ``` In pandas you can use this function on a text column: ```python # pandas API df["num_words"] = dummy_count_words(df.text) ``` And in Dask you can run this function on every partition: ```python # Dask API: run the function on every partition df["num_words"] = df.text.map_partitions(dummy_count_words, meta=int) ``` Note that you also need to provide `meta` which is the type of the pandas Series or DataFrame in the output of your function. This is needed because Dask DataFrame uses a lazy API. Since Dask will only run the data processing once `.compute()` is called, it needs the `meta` argument to know the type of the new column in the meantime. ## Predicate and Projection Pushdown When reading Parquet data from Hugging Face, Dask automatically leverages the metadata in Parquet files to skip entire files or row groups if they are not needed. 
For example, if you apply a filter (predicate) on a Hugging Face Dataset in Parquet format or if you select a subset of the columns (projection), Dask will read the metadata of the Parquet files to discard the parts that are not needed without downloading them.

This is possible thanks to a [reimplementation of the Dask DataFrame API](https://docs.coiled.io/blog/dask-dataframe-is-fast.html?utm_source=hf-docs) to support query optimization, which makes Dask faster and more robust.

For example, this subset of FineWeb-Edu contains many Parquet files. If you filter the dataset to keep the text from recent CC dumps, Dask will skip most of the files and only download the data that match the filter:

```python
import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT/*.parquet")

# Dask will skip the files or row groups that don't
# match the query without downloading them.
df = df[df.dump >= "CC-MAIN-2023"]
```

Dask will also read only the required columns for your computation and skip the rest. For example, if you drop a column late in your code, it will not bother to load it early on in the pipeline if it's not needed. This is useful when you want to manipulate a subset of the columns or for analytics:

```python
# Dask will download the 'dump' and 'token_count' columns needed
# for the filtering and computation and skip the other columns.
df.token_count.mean().compute()
```

## Client

Most features in `dask` are optimized for a cluster or a local `Client` to launch the parallel computations:

```python
import dask.dataframe as dd
from distributed import Client

if __name__ == "__main__":  # needed for creating new processes
    client = Client()
    df = dd.read_parquet(...)
    ...
```

For local usage, the `Client` uses a Dask `LocalCluster` with multiprocessing by default. You can manually configure the multiprocessing of `LocalCluster` with:

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8, threads_per_worker=8)
client = Client(cluster)
```

Note that if you use the default threaded scheduler locally without `Client`, a DataFrame can become slower after certain operations (more details [here](https://github.com/dask/dask-expr/issues/1181)).

Find more information on setting up a local or cloud cluster in the [Deploying Dask documentation](https://docs.dask.org/en/latest/deploying.html).

### Using BERTopic at Hugging Face

https://huggingface.co/docs/hub/bertopic.md

# Using BERTopic at Hugging Face

[BERTopic](https://github.com/MaartenGr/BERTopic) is a topic modeling framework that leverages 🤗 transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

BERTopic supports all kinds of topic modeling techniques: Guided, Supervised, Semi-supervised, Manual, Multi-topic distributions, Hierarchical, Class-based, Dynamic, Online/Incremental, Multimodal, Multi-aspect, Text Generation/LLM, Zero-shot (new!), Merge Models (new!), and Seed Words (new!).

## Exploring BERTopic on the Hub

You can find BERTopic models by filtering at the left of the [models page](https://huggingface.co/models?library=bertopic&sort=trending).

BERTopic models hosted on the Hub have a model card with useful information about the models. Thanks to the BERTopic Hugging Face Hub integration, you can load BERTopic models with a few lines of code. You can also deploy these models using [Inference Endpoints](https://huggingface.co/inference-endpoints).
## Installation To get started, you can follow the [BERTopic installation guide](https://github.com/MaartenGr/BERTopic#installation). You can also use the following one-line install through pip: ```bash pip install bertopic ``` ## Using Existing Models All BERTopic models can easily be loaded from the Hub: ```py from bertopic import BERTopic topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia") ``` Once loaded, you can use BERTopic's features to predict the topics for new instances: ```py topic, prob = topic_model.transform("This is an incredible movie!") topic_model.topic_labels_[topic] ``` Which gives us the following topic: ```text 64_rating_rated_cinematography_film ``` ## Sharing Models When you have created a BERTopic model, you can easily share it with others through the Hugging Face Hub. To do so, we can make use of the `push_to_hf_hub` function that allows us to directly push the model to the Hugging Face Hub: ```python from bertopic import BERTopic # Train model topic_model = BERTopic().fit(my_docs) # Push to HuggingFace Hub topic_model.push_to_hf_hub( repo_id="MaartenGr/BERTopic_ArXiv", save_ctfidf=True ) ``` Note that the saved model does not include the dimensionality reduction and clustering algorithms. Those are removed since they are only necessary to train the model and find relevant topics. Inference is done through a straightforward cosine similarity between the topic and document embeddings. This not only speeds up the model but allows us to have a tiny BERTopic model that we can work with. ## Additional Resources * [BERTopic repository](https://github.com/MaartenGr/BERTopic) * [BERTopic docs](https://maartengr.github.io/BERTopic/) * [BERTopic models in the Hub](https://huggingface.co/models?library=bertopic&sort=trending) ### Resource groups https://huggingface.co/docs/hub/enterprise-hub-resource-groups.md # Resource groups > [!WARNING] > This feature is part of the Team & Enterprise plans. Resource Groups allow organizations to enforce fine-grained access control to their repositories. This feature allows organization administrators to: - Group related repositories together for better organization - Control member access at a group level rather than individual repository level - Assign different permission roles (read, contributor, write, admin) to team members - Keep private repositories visible only to authorized group members - Enable multiple teams to work independently within the same organization This Enterprise Hub feature helps organizations manage complex team structures and maintain proper access control over their repositories. [Getting started with Resource Groups →](./security-resource-groups) ### Webhooks https://huggingface.co/docs/hub/webhooks.md # Webhooks > [!TIP] > Webhooks are now publicly available! Webhooks are a foundation for MLOps-related features. They allow you to listen for new changes on specific repos or to all repos belonging to particular set of users/organizations (not just your repos, but any repo). You can use them to auto-convert models, build community bots, or build CI/CD for your models, datasets, and Spaces (and much more!). 
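As a rough sketch of what a Webhook handler can look like (the payload fields and the `X-Webhook-Secret` header are described in the sections below; FastAPI is just one possible web framework, and `MY_SECRET` is a placeholder for the secret you configure):

```python
from typing import Optional

from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()
MY_SECRET = "replace-me"  # placeholder: the secret set in your Webhook settings

@app.post("/webhook")
async def handle_webhook(request: Request, x_webhook_secret: Optional[str] = Header(None)):
    # Reject calls that don't carry the secret configured for the Webhook (see "Webhook secret" below).
    if x_webhook_secret != MY_SECRET:
        raise HTTPException(status_code=401, detail="Invalid secret")
    payload = await request.json()
    # `event` and `repo` are always present in the payload (see the payload reference below).
    print(payload["event"], payload["repo"]["name"])
    return {"ok": True}
```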
The documentation for Webhooks is below – or you can also browse our **guides** showcasing a few possible use cases of Webhooks: - [Fine-tune a new model whenever a dataset gets updated (Python)](./webhooks-guide-auto-retrain) - [Create a discussion bot on the Hub, using a LLM API (NodeJS)](./webhooks-guide-discussion-bot) - [Create metadata quality reports (Python)](./webhooks-guide-metadata-review) - and more to come… ## Create your Webhook You can create new Webhooks and edit existing ones in your Webhooks [settings](https://huggingface.co/settings/webhooks): ![Settings of an individual webhook](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhook-settings.png) Webhooks can watch for repos updates, Pull Requests, discussions, and new comments. It's even possible to create a Space to react to your Webhooks! ## Webhook Payloads After registering a Webhook, you will be notified of new events via an `HTTP POST` call on the specified target URL. The payload is encoded in JSON. You can view the history of payloads sent in the activity tab of the webhook settings page, it's also possible to replay past webhooks for easier debugging: ![image.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhook-activity.png) As an example, here is the full payload when a Pull Request is opened: ```json { "event": { "action": "create", "scope": "discussion" }, "repo": { "type": "model", "name": "openai-community/gpt2", "id": "621ffdc036468d709f17434d", "private": false, "url": { "web": "https://huggingface.co/openai-community/gpt2", "api": "https://huggingface.co/api/models/openai-community/gpt2" }, "owner": { "id": "628b753283ef59b5be89e937" } }, "discussion": { "id": "6399f58518721fdd27fc9ca9", "title": "Update co2 emissions", "url": { "web": "https://huggingface.co/openai-community/gpt2/discussions/19", "api": "https://huggingface.co/api/models/openai-community/gpt2/discussions/19" }, "status": "open", "author": { "id": "61d2f90c3c2083e1c08af22d" }, "num": 19, "isPullRequest": true, "changes": { "base": "refs/heads/main" } }, "comment": { "id": "6399f58518721fdd27fc9caa", "author": { "id": "61d2f90c3c2083e1c08af22d" }, "content": "Add co2 emissions information to the model card", "hidden": false, "url": { "web": "https://huggingface.co/openai-community/gpt2/discussions/19#6399f58518721fdd27fc9caa" } }, "webhook": { "id": "6390e855e30d9209411de93b", "version": 3 } } ``` ### Event The top-level properties `event` is always specified and used to determine the nature of the event. It has two sub-properties: `event.action` and `event.scope`. `event.scope` will be one of the following values: - `"repo"` - Global events on repos. Possible values for the associated `action`: `"create"`, `"delete"`, `"update"`, `"move"`. - `"repo.content"` - Events on the repo's content, such as new commits or tags. It triggers on new Pull Requests as well due to the newly created reference/commit. The associated `action` is always `"update"`. - `"repo.config"` - Events on the config: update Space secrets, update settings, update DOIs, disabled or not, etc. The associated `action` is always `"update"`. - `"discussion"` - Creating a discussion or Pull Request, updating the title or status, and merging. Possible values for the associated `action`: `"create"`, `"delete"`, `"update"`. - `"discussion.comment"` - Creating, updating, and hiding a comment. Possible values for the associated `action`: `"create"`, `"update"`. 
More scopes can be added in the future. To handle unknown events, your webhook handler can consider any action on a narrowed scope to be an `"update"` action on the broader scope.

For example, if the `"repo.config.dois"` scope is added in the future, any event with that scope can be considered by your webhook handler as an `"update"` action on the `"repo.config"` scope.

### Repo

In the current version of webhooks, the top-level property `repo` is always specified, as events can always be associated with a repo. For example, consider the following value:

```json
"repo": {
  "type": "model",
  "name": "some-user/some-repo",
  "id": "6366c000a2abcdf2fd69a080",
  "private": false,
  "url": {
    "web": "https://huggingface.co/some-user/some-repo",
    "api": "https://huggingface.co/api/models/some-user/some-repo"
  },
  "headSha": "c379e821c9c95d613899e8c4343e4bfee2b0c600",
  "tags": [
    "license:other",
    "has_space"
  ],
  "owner": {
    "id": "61d2000c3c2083e1c08af22d"
  }
}
```

`repo.headSha` is the sha of the latest commit on the repo's `main` branch. It is only sent when `event.scope` starts with `"repo"`, not on community events like discussions and comments.

### Code changes

On code changes, the top-level property `updatedRefs` is specified on repo events. It is an array of references that have been updated. Here is an example value:

```json
"updatedRefs": [
  {
    "ref": "refs/heads/main",
    "oldSha": "ce9a4674fa833a68d5a73ec355f0ea95eedd60b7",
    "newSha": "575db8b7a51b6f85eb06eee540738584589f131c"
  },
  {
    "ref": "refs/tags/test",
    "oldSha": null,
    "newSha": "575db8b7a51b6f85eb06eee540738584589f131c"
  }
]
```

Newly created references will have `oldSha` set to `null`. Deleted references will have `newSha` set to `null`.

You can react to new commits on specific pull requests, new tags, or new branches.

### Config changes

When the top-level property `event.scope` is `"repo.config"`, the `updatedConfig` property is specified. It is an object containing the updated config. Here is an example value:

```json
"updatedConfig": {
  "private": false
}
```

or

```json
"updatedConfig": {
  "xetEnabled": true
}
```

or, when the updated config key is not supported by the webhook:

```json
"updatedConfig": {}
```

For now only `private` and `xetEnabled` are supported. If you would benefit from more config keys being present here, please let us know at website@huggingface.co.

### Discussions and Pull Requests

The top-level property `discussion` is specified on community events (discussions and Pull Requests). The `discussion.isPullRequest` property is a boolean indicating if the discussion is also a Pull Request (on the Hub, a PR is a special type of discussion). Here is an example value:

```json
"discussion": {
  "id": "639885d811ae2bad2b7ba461",
  "title": "Hello!",
  "url": {
    "web": "https://huggingface.co/some-user/some-repo/discussions/3",
    "api": "https://huggingface.co/api/models/some-user/some-repo/discussions/3"
  },
  "status": "open",
  "author": {
    "id": "61d2000c3c2083e1c08af22d"
  },
  "isPullRequest": true,
  "changes": {
    "base": "refs/heads/main"
  },
  "num": 3
}
```

### Comment

The top-level property `comment` is specified when a comment is created (including on discussion creation) or updated.
Here is an example value: ```json "comment": { "id": "6398872887bfcfb93a306f18", "author": { "id": "61d2000c3c2083e1c08af22d" }, "content": "This adds an env key", "hidden": false, "url": { "web": "https://huggingface.co/some-user/some-repo/discussions/4#6398872887bfcfb93a306f18" } } ``` ## Webhook secret Setting a Webhook secret is useful to make sure payloads sent to your Webhook handler URL are actually from Hugging Face. If you set a secret for your Webhook, it will be sent along as an `X-Webhook-Secret` HTTP header on every request. Only ASCII characters are supported. > [!TIP] > It's also possible to add the secret directly in the handler URL. For example, setting it as a query parameter: https://example.com/webhook?secret=XXX. > > This can be helpful if accessing the HTTP headers of the request is complicated for your Webhook handler. ## Rate limiting Each Webhook is limited to 1,000 triggers per 24 hours. You can view your usage in the Webhook settings page in the "Activity" tab. If you need to increase the number of triggers for your Webhook, upgrade to PRO, Team or Enterprise and contact us at website@huggingface.co. ## Developing your Webhooks If you do not have an HTTPS endpoint/URL, you can try out public tools for webhook testing. These tools act as catch-all (capture all requests) sent to them and give 200 OK status code. [Beeceptor](https://beeceptor.com/) is one tool you can use to create a temporary HTTP endpoint and review the incoming payload. Another such tool is [Webhook.site](https://webhook.site/). Additionally, you can route a real Webhook payload to the code running locally on your machine during development. This is a great way to test and debug for faster integrations. You can do this by exposing your localhost port to the Internet. To be able to go this path, you can use [ngrok](https://ngrok.com/) or [localtunnel](https://theboroer.github.io/localtunnel-www/). ## Debugging Webhooks You can easily find recently generated events for your webhooks. Open the activity tab for your webhook. There you will see the list of recent events. ![image.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhook-payload.png) Here you can review the HTTP status code and the payload of the generated events. Additionally, you can replay these events by clicking on the `Replay` button! Note: When changing the target URL or secret of a Webhook, replaying an event will send the payload to the updated URL. ## FAQ ##### Can I define webhooks on my organization vs my user account? No, this is not currently supported. ##### How can I subscribe to all events on HF (or across a whole repo type, like on all models)? This is not currently exposed to end users but we can toggle this for you if you send an email to website@huggingface.co. ### Advanced Single Sign-On (SSO) https://huggingface.co/docs/hub/enterprise-hub-advanced-sso.md # Advanced Single Sign-On (SSO) > [!WARNING] > This feature is part of the Enterprise Plus plan. Advanced Single Sign-On (SSO) capabilities extend the standard [SSO features](./security-sso) available in the Enterprise Hub, offering enhanced control and automation for user management and access across the entire Hugging Face platform for your organization members. ## User Provisioning Advanced SSO introduces automated user provisioning, which simplifies the onboarding and offboarding of users. 
* **Just-In-Time (JIT) Provisioning**: When a user from your organization attempts to log in to Hugging Face for the first time via SSO, an account can be automatically created for them if one doesn't already exist. Their profile information and role mappings can be populated based on attributes from your IdP. * **System for Cross-domain Identity Management (SCIM)**: For more robust user lifecycle management, SCIM allows your IdP to communicate user identity information to Hugging Face. This enables automatic creation, updates (e.g., name changes, role changes), and deactivation of user accounts on Hugging Face as changes occur in your IdP. This ensures that user access is always up-to-date with their status in your organization. Learn more about how to set up and manage SCIM in our [dedicated guide](./enterprise-hub-scim). ## Global SSO Enforcement Beyond gating access to specific organizational content, Advanced SSO can be configured to make your IdP the mandatory authentication route for all your organization's members interacting with any part of the Hugging Face platform. Your organization's members will be required to authenticate via your IdP for all Hugging Face services, not just when accessing private or organizational repositories. This feature is particularly beneficial for organizations requiring a higher degree of control, security, and automation in managing their users on Hugging Face. ## Limitations on Managed User Accounts > [!WARNING] > Important Considerations for Managed Accounts. To ensure organizational control and data governance, user accounts provisioned and managed via Advanced SSO ("managed user accounts") have specific limitations: * **No Public Content Creation**: Managed user accounts cannot create public content on the Hugging Face platform. This includes, but is not limited to, public models, datasets, or Spaces. All content created by these accounts is restricted to within your organization or private visibility. * **No External Collaboration**: Managed user accounts are restricted from collaborating outside of your Hugging Face organization. This means they cannot, for example, join other organizations, contribute to repositories outside their own organization. These restrictions are in place to maintain the integrity and security boundaries defined by your enterprise. If members of your organization require the ability to create public content or collaborate more broadly on the Hugging Face platform, they will need to do so using a separate, personal Hugging Face account that is not managed by your organization's Advanced SSO. ### 🟧 Label Studio on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-label-studio.md # 🟧 Label Studio on Spaces [Label Studio](https://labelstud.io) is an [open-source data labeling platform](https://github.com/heartexlabs/label-studio) for labeling, annotating, and exploring many different data types. Additionally, Label Studio includes a powerful [machine learning interface](https://labelstud.io/guide/ml.html) that can be used for new model training, active learning, supervised learning, and many other training techniques. This guide will teach you how to deploy Label Studio for data labeling and annotation within the Hugging Face Hub. 
You can use the default configuration of Label Studio as a self-contained application hosted completely on the Hub using Docker for demonstration and evaluation purposes, or you can attach your own database and cloud storage to run a fully-featured, production-ready application on Spaces.

## ⚡️ Deploy Label Studio on Spaces

You can deploy Label Studio on Spaces with just a few clicks:

Spaces requires you to define:

* An **Owner**: either your personal account or an organization you're a part of.
* A **Space name**: the name of the Space within the account in which you're creating it.
* The **Visibility**: _private_ if you want the Space to be visible only to you or your organization, or _public_ if you want it to be visible to other users or applications using the Label Studio API (suggested).

## 🚀 Using the Default Configuration

By default, Label Studio is installed in Spaces with a configuration that uses local storage for the application database to store configuration, account credentials, and project information. Labeling tasks and data items are also held in local storage.

> [!WARNING]
> Storage in Hugging Face Spaces is ephemeral, and the data you store in the default
> configuration can be lost in a reboot or reset of the Space. Because of this,
> we strongly encourage you to use the default configuration only for testing and
> demonstration purposes.

After launching Label Studio, you will be presented with the standard login screen. You can start by creating a new account using your email address and logging in with your new credentials. After logging in, Label Studio will periodically warn you that the storage is ephemeral and data could be lost if your Space is restarted. You will also be presented with a prompt from Heidi, the helpful Label Studio mascot, to create a new project to start labeling your data. To get started, check out the Label Studio ["Zero to One" tutorial](https://labelstud.io/blog/introduction-to-label-studio-in-hugging-face-spaces/) with a guide on how to build an annotation interface for sentiment analysis.

## 🛠️ Configuring a Production-Ready Instance of Label Studio

To make your Space production-ready, you will need to make three configuration changes:

* Disable the unrestricted creation of new accounts.
* Enable persistence by attaching an external database.
* Attach cloud storage for labeling tasks.

### Disable Unrestricted Creation of New Accounts

The default configuration on Label Studio allows for the unrestricted creation of new accounts for anyone who has the URL for your application. You can [restrict signups](https://labelstud.io/guide/signup.html#Restrict-signup-for-local-deployments) by adding the following configuration secrets to your Space **Settings**.

* `LABEL_STUDIO_DISABLE_SIGNUP_WITHOUT_LINK`: Setting this value to `true` will disable unrestricted account creation.
* `LABEL_STUDIO_USERNAME`: This is the username of the account that you will use as the first user in your Label Studio Space. It should be a valid email address.
* `LABEL_STUDIO_PASSWORD`: The password that will be associated with the first user account.

Restart the Space to apply these settings. The ability to create new accounts from the login screen will be disabled. To create new accounts, you will need to invite new users in the `Organization` settings in the Label Studio application.

### Enable Configuration Persistence

By default, this Space stores all project configuration and data annotations in local storage with SQLite.
If the Space is reset, all configuration and annotation data in the Space will be lost. You can enable configuration persistence by [connecting an external Postgres database to your Space](https://labelstud.io/guide/storedata.html#PostgreSQL-database), guaranteeing that all project and annotation settings are preserved. Set the following secret variables to match your own hosted instance of Postgres. We strongly recommend setting these as secrets to prevent leaking information about your database service to the public in your Space definition. * `DJANGO_DB`: Set this to `default`. * `POSTGRE_NAME`: Set this to the name of the Postgres database. * `POSTGRE_USER`: Set this to the Postgres username. * `POSTGRE_PASSWORD`: Set this to the password for your Postgres user. * `POSTGRE_HOST`: Set this to the host that your Postgres database is running on. * `POSTGRE_PORT`: Set this to the port that your Postgres database is running on. * `STORAGE_PERSISTENCE`: Set this to `1` to remove the warning about ephemeral storage. Restart the Space to apply these settings. Information about users, projects, and annotations will be stored in the database, and will be reloaded by Label Studio if the Space is restarted or reset. ### Enable Cloud Storage By default, the only data storage enabled for this Space is local. In the case of a Space reset, all data will be lost. To enable permanent storage, you must enable a [cloud storage connector](https://labelstud.io/guide/storage.html). Choose the appropriate cloud connector and configure the secrets for it. #### Amazon S3 * `STORAGE_TYPE`: Set this to `s3`. * `STORAGE_AWS_ACCESS_KEY_ID`: Set this to your AWS access key ID. * `STORAGE_AWS_SECRET_ACCESS_KEY`: Set this to your AWS secret access key. * `STORAGE_AWS_BUCKET_NAME`: Set this to the name of your S3 bucket. * `STORAGE_AWS_REGION_NAME`: Set this to the AWS region of your bucket. * `STORAGE_AWS_FOLDER`: Set this to an empty string. #### Google Cloud Storage * `STORAGE_TYPE`: Set this to `gcs`. * `STORAGE_GCS_BUCKET_NAME`: Set this to the name of your GCS bucket. * `STORAGE_GCS_PROJECT_ID`: Set this to your Google Cloud project ID. * `STORAGE_GCS_FOLDER`: Set this to an empty string. * `GOOGLE_APPLICATION_CREDENTIALS`: Set this to `/opt/heartex/secrets/key.json`. #### Azure Blob Storage * `STORAGE_TYPE`: Set this to `azure`. * `STORAGE_AZURE_ACCOUNT_NAME`: Set this to your Azure storage account name. * `STORAGE_AZURE_ACCOUNT_KEY`: Set this to your Azure storage account key. * `STORAGE_AZURE_CONTAINER_NAME`: Set this to the name of your Azure blob container. * `STORAGE_AZURE_FOLDER`: Set this to an empty string. ## 🤗 Next Steps, Feedback, and Support To get started with Label Studio, check out the Label Studio ["Zero to One" tutorial](https://labelstud.io/blog/introduction-to-label-studio-in-hugging-face-spaces/), which walks you through an example sentiment analysis annotation project. You can find a full set of resources about Label Studio and the Label Studio community at the [Label Studio Home Page](https://labelstud.io). This includes [full documentation](https://labelstud.io/guide/), an [interactive playground](https://labelstud.io/playground/) for trying out different annotation interfaces, and links to join the [Label Studio Slack Community](https://slack.labelstudio.heartex.com/?source=spaces). ### How to configure SAML SSO with Okta https://huggingface.co/docs/hub/security-sso-okta-saml.md # How to configure SAML SSO with Okta In this guide, we will use Okta as the SSO provider, with the Security Assertion Markup Language (SAML) protocol as our preferred identity protocol. We currently support SP-initiated and IdP-initiated authentication. User provisioning is part of Enterprise Plus's [Advanced SSO](./enterprise-hub-advanced-sso). > [!WARNING] > This feature is part of the Team & Enterprise plans.
### Step 1: Create a new application in your Identity Provider Open a new tab/window in your browser and sign in to your Okta account. Navigate to "Admin/Applications" and click the "Create App Integration" button. Then choose an "SAML 2.0" application and click "Create". ### Step 2: Configure your application on Okta Open a new tab/window in your browser and navigate to the SSO section of your organization's settings. Select the SAML protocol. Copy the "Assertion Consumer Service URL" from the organization's settings on Hugging Face, and paste it in the "Single sign-on URL" field on Okta. The URL looks like this: `https://huggingface.co/organizations/[organizationIdentifier]/saml/consume`. On Okta, set the following settings: - Set Audience URI (SP Entity Id) to match the "SP Entity ID" value on Hugging Face. - Set Name ID format to EmailAddress. - Under "Show Advanced Settings", verify that Response and Assertion Signature are set to: Signed. Save your new application. ### Step 3: Finalize configuration on Hugging Face In your Okta application, under "Sign On/Settings/More details", find the following fields: - Sign-on URL - Public certificate - SP Entity ID You will need them to finalize the SSO setup on Hugging Face. In the SSO section of your organization's settings, copy-paste these values from Okta: - Sign-on URL - SP Entity ID - Public certificate The public certificate must have the following format: ``` -----BEGIN CERTIFICATE----- {certificate} -----END CERTIFICATE----- ``` You can now click on "Update and Test SAML configuration" to save the settings. You should be redirected to your SSO provider (IdP) login prompt. Once logged in, you'll be redirected to your organization's settings page. A green check mark near the SAML selector will attest that the test was successful. ### Step 4: Enable SSO in your organization Now that Single Sign-On is configured and tested, you can enable it for members of your organization by clicking on the "Enable" button. Once enabled, members of your organization must complete the SSO authentication flow described in the [How does it work?](./security-sso#how-does-it-work) section. ### Using GPU Spaces https://huggingface.co/docs/hub/spaces-gpus.md # Using GPU Spaces You can upgrade your Space to use a GPU accelerator using the _Settings_ button in the top navigation bar of the Space. You can even request a free upgrade if you are building a cool demo for a side project! > [!TIP] > Longer-term, we would also like to expose non-GPU hardware, like HPU, IPU or TPU. If you have a specific AI hardware you'd like to run on, please let us know (website at huggingface.co). As soon as your Space is running on GPU you can see which hardware it’s running on directly from this badge: ## Hardware Specs In the following tables, you can see the Specs for the different upgrade options. ### CPU | **Hardware** | **CPU** | **Memory** | **GPU Memory** | **Disk** | **Hourly Price** | |----------------------- |-------------- |------------- |---------------- |---------- | ----------------- | | CPU Basic | 2 vCPU | 16 GB | - | 50 GB | Free! 
| | CPU Upgrade | 8 vCPU | 32 GB | - | 50 GB | $0.03 | ### GPU | **Hardware** | **CPU** | **Memory** | **GPU Memory** | **Disk** | **Hourly Price** | |----------------------- |-------------- |------------- |---------------- |---------- | ----------------- | | Nvidia T4 - small | 4 vCPU | 15 GB | 16 GB | 50 GB | $0.40 | | Nvidia T4 - medium | 8 vCPU | 30 GB | 16 GB | 100 GB | $0.60 | | 1x Nvidia L4 | 8 vCPU | 30 GB | 24 GB | 400 GB | $0.80 | | 4x Nvidia L4 | 48 vCPU | 186 GB | 96 GB | 3200 GB | $3.80 | | 1x Nvidia L40S | 8 vCPU | 62 GB | 48 GB | 380 GB | $1.80 | | 4x Nvidia L40S | 48 vCPU | 382 GB | 192 GB | 3200 GB | $8.30 | | 8x Nvidia L40S | 192 vCPU | 1534 GB | 384 GB | 6500 GB | $23.50 | | Nvidia A10G - small | 4 vCPU | 14 GB | 24 GB | 110 GB | $1.00 | | Nvidia A10G - large | 12 vCPU | 46 GB | 24 GB | 200 GB | $1.50 | | 2x Nvidia A10G - large | 24 vCPU | 92 GB | 48 GB | 1000 GB | $3.00 | | 4x Nvidia A10G - large | 48 vCPU | 184 GB | 96 GB | 2000 GB | $5.00 | | Nvidia A100 - large | 12 vCPU | 142 GB | 80 GB | 1000 GB | $2.50 | | Nvidia H100 | 23 vCPU | 240 GB | 80 GB | 3000 GB | $4.50 | | 8x Nvidia H100 | 184 vCPU | 1920 GB | 640 GB | 24 TB | $36.00 | ## Configure hardware programmatically You can programmatically configure your Space hardware using `huggingface_hub`. This allows for a wide range of use cases where you need to dynamically assign GPUs. Check out [this guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage_spaces) for more details. ## Framework specific requirements[[frameworks]] Most Spaces should run out of the box after a GPU upgrade, but sometimes you'll need to install CUDA versions of the machine learning frameworks you use. Please, follow this guide to ensure your Space takes advantage of the improved hardware. ### PyTorch You'll need to install a version of PyTorch compatible with the built-in CUDA drivers. Adding the following two lines to your `requirements.txt` file should work: ``` --extra-index-url https://download.pytorch.org/whl/cu113 torch ``` You can verify whether the installation was successful by running the following code in your `app.py` and checking the output in your Space logs: ```Python import torch print(f"Is CUDA available: {torch.cuda.is_available()}") # True print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}") # Tesla T4 ``` Many frameworks automatically use the GPU if one is available. This is the case for the Pipelines in 🤗 `transformers`, `fastai` and many others. In other cases, or if you use PyTorch directly, you may need to move your models and data to the GPU to ensure computation is done on the accelerator and not on the CPU. You can use PyTorch's `.to()` syntax, for example: ```Python model = load_pytorch_model() model = model.to("cuda") ``` ### JAX If you use JAX, you need to specify the URL that contains CUDA compatible packages. Please, add the following lines to your `requirements.txt` file: ``` -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html jax[cuda11_pip] jaxlib ``` After that, you can verify the installation by printing the output from the following code and checking it in your Space logs. ```Python import jax print(f"JAX devices: {jax.devices()}") # JAX devices: [StreamExecutorGpuDevice(id=0, process_index=0)] print(f"JAX device type: {jax.devices()[0].device_kind}") # JAX device type: Tesla T4 ``` ### Tensorflow The default `tensorflow` installation should recognize the CUDA device. 
Just add `tensorflow` to your `requirements.txt` file and use the following code in your `app.py` to verify in your Space logs. ```Python import tensorflow as tf print(tf.config.list_physical_devices('GPU')) # [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')] ``` ## Billing Billing on Spaces is based on hardware usage and is computed by the minute: you get charged for every minute the Space runs on the requested hardware, regardless of whether the Space is used. During a Space's lifecycle, it is only billed when the Space is actually `Running`. This means that there is no cost during build or startup. If a running Space starts to fail, it will be automatically suspended and the billing will stop. Spaces running on free hardware are suspended automatically if they are not used for an extended period of time (e.g. two days). Upgraded Spaces run indefinitely by default, even if there is no usage. You can change this behavior by [setting a custom "sleep time"](#sleep-time) in the Space's settings. To interrupt the billing on your Space, you can change the Hardware to CPU basic, or [pause](#pause) it. Additional information about billing can be found in the [dedicated Hub-wide section](./billing). ### Community GPU Grants Do you have an awesome Space but need help covering the GPU hardware upgrade costs? We love helping out those with an innovative Space so please feel free to apply for a community GPU grant and see if yours makes the cut! This application can be found in your Space hardware repo settings in the lower left corner under "sleep time settings": ![Community GPU Grant](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/ask-for-community-grant.png) ## Set a custom sleep time[[sleep-time]] If your Space runs on the default `cpu-basic` hardware, it will go to sleep if inactive for more than a set time (currently, 48 hours). Anyone visiting your Space will restart it automatically. If you want your Space never to deactivate or if you want to set a custom sleep time, you need to upgrade to paid hardware. By default, an upgraded Space will never go to sleep. However, you can use this setting for your upgraded Space to become idle (`stopped` stage) when it's unused 😴. You are not going to be charged for the upgraded hardware while it is asleep. The Space will 'wake up' or get restarted once it receives a new visitor. The following interface will then be available in your Spaces hardware settings: The following options are available: ## Pausing a Space[[pause]] You can `pause` a Space from the repo settings. A "paused" Space means that the Space is on hold and will not use resources until manually restarted, and only the owner of a paused Space can restart it. Paused time is not billed. ### Secrets Scanning https://huggingface.co/docs/hub/security-secrets.md # Secrets Scanning It is important to manage [your secrets (env variables) properly](./spaces-overview#managing-secrets). The most common way people expose their secrets to the outside world is by hard-coding their secrets in their code files directly, which makes it possible for a malicious user to utilize your secrets and services your secrets have access to. 
For example, this is what a compromised `app.py` file might look like:
```py
import numpy as np
import scipy as sp

api_key = "sw-xyz1234567891213"

def call_inference(prompt: str) -> str:
    result = call_api(prompt, api_key)
    return result
```
To prevent this issue, we run [TruffleHog](https://trufflesecurity.com/trufflehog) on each push you make. TruffleHog scans for hard-coded secrets, and we will send you an email upon detection. You'll only receive emails for verified secrets, which are the ones that have been confirmed to work for authentication against their respective providers. Note, however, that unverified secrets are not necessarily harmless or invalid: verification can fail due to technical reasons, such as in the case of a network error. TruffleHog can verify secrets across multiple services; it is not restricted to Hugging Face tokens. You can opt out of these email notifications in [your settings](https://huggingface.co/settings/notifications). ### Custom Python Spaces https://huggingface.co/docs/hub/spaces-sdks-python.md # Custom Python Spaces > [!TIP] > Spaces now support arbitrary Dockerfiles so you can host any Python app directly using [Docker Spaces](./spaces-sdks-docker). While not an official workflow, you are able to run your own Python + interface stack in Spaces by selecting Gradio as your SDK and serving a frontend on port `7860`. See the [templates](https://huggingface.co/templates#spaces) for examples. Spaces are served in iframes, which by default restrict links from opening in the parent page. The simplest solution is to open them in a new window:
```HTML
<a href="https://huggingface.co" rel="noopener" target="_blank">Spaces</a>
```
Usually, the height of Spaces is automatically adjusted when using the Gradio library interface. However, if you provide your own frontend in the Gradio SDK and the content height is larger than the viewport, you'll need to add an [iFrame Resizer script](https://cdnjs.com/libraries/iframe-resizer), so the content is scrollable in the iframe:
```HTML
<!-- iFrame Resizer content-window script; take the current version and URL from the cdnjs page linked above -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.2/iframeResizer.contentWindow.min.js"></script>
```
As an example, here is the same Space with and without the script: - https://huggingface.co/spaces/ronvolutional/http-server - https://huggingface.co/spaces/ronvolutional/iframe-test ### SQL Console: Query Hugging Face datasets in your browser https://huggingface.co/docs/hub/datasets-viewer-sql-console.md # SQL Console: Query Hugging Face datasets in your browser You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the Data Studio. To learn more about the SQL Console, see the SQL Console blog post. Through the SQL Console, you can: - Run [DuckDB SQL queries](https://duckdb.org/docs/sql/query_syntax/select) on the dataset (_check out [SQL Snippets](https://huggingface.co/spaces/cfahlgren1/sql-snippets) for useful queries_) - Share results of the query with others via a link (_check out [this example](https://huggingface.co/datasets/gretelai/synthetic-gsm8k-reflection-405b?sql_console=true&sql=FROM+histogram%28%0A++train%2C%0A++topic%2C%0A++bin_count+%3A%3D+10%0A%29)_) - Download the results of the query to a Parquet or CSV file - Embed the results of the query in your own webpage using an iframe - Query datasets with natural language > [!TIP] > You can also use DuckDB locally through the CLI to query the dataset via the `hf://` protocol; a minimal local sketch is shown below. See the DuckDB Datasets documentation for more information.
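As a local counterpart to the SQL Console, here is a minimal sketch of querying a Hub-hosted dataset with DuckDB's Python client over the `hf://` protocol. The dataset and Parquet file path are illustrative placeholders; substitute the files of the dataset you are exploring.

```python
import duckdb  # pip install duckdb

# Query a Hub-hosted Parquet file directly over the hf:// protocol.
# The repo and file path below are placeholders, not a real dataset layout.
con = duckdb.connect()
result = con.sql(
    "SELECT * FROM 'hf://datasets/<username>/<dataset>/<path/to/file.parquet>' LIMIT 10"
)
print(result.df())  # materialize the result as a pandas DataFrame
```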
The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI. ## Examples ### Filtering The SQL Console makes filtering datasets really easy. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses whose reasoning chain is longer than 10 characters, you can use the following query, which filters on the length of the reasoning:
```sql
SELECT *
FROM train
WHERE LENGTH(reasoning_chains) > 10;
```
### Histogram Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values. For example, to plot a histogram of the `Rating` column in the [Lichess/chess-puzzles](https://huggingface.co/datasets/Lichess/chess-puzzles) dataset, you can use the following query: Learn more about the `histogram` function and parameters here.
```sql
FROM histogram(train, Rating)
```
### Regex Matching One of the most powerful features of DuckDB is the deep support for regular expressions. You can use the `regexp` function to match patterns in your data. Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the [GeneralReasoning/GeneralThought-195k](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-195K) dataset for instructions that contain markdown code blocks. Learn more about the DuckDB regex functions here.
````sql
SELECT *
FROM train
WHERE regexp_matches(model_answer, '```')
LIMIT 10;
````
### Leakage Detection Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set. Learn more about leakage detection here.
```sql
WITH overlapping_rows AS (
    SELECT COUNT(*) AS overlap_count
    FROM (
        SELECT * FROM train
        INTERSECT
        SELECT * FROM test
    ) overlapping
),
total_unique_rows AS (
    SELECT COUNT(*) AS total_count
    FROM (
        SELECT * FROM train
        UNION
        SELECT * FROM test
    ) combined
)
SELECT
    overlap_count,
    total_count,
    CASE WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count) ELSE 0 END AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
```
### Spaces ZeroGPU: Dynamic GPU Allocation for Spaces https://huggingface.co/docs/hub/spaces-zerogpu.md # Spaces ZeroGPU: Dynamic GPU Allocation for Spaces ZeroGPU is a shared infrastructure that optimizes GPU usage for AI models and demos on Hugging Face Spaces. It dynamically allocates and releases NVIDIA H200 GPUs as needed, offering: 1. **Free GPU Access**: Enables cost-effective GPU usage for Spaces. 2. **Multi-GPU Support**: Allows Spaces to leverage multiple GPUs concurrently on a single application. Unlike traditional single-GPU allocations, ZeroGPU's efficient system lowers barriers for developers, researchers, and organizations to deploy AI models by maximizing resource utilization and power efficiency. ## Using and hosting ZeroGPU Spaces - **Using existing ZeroGPU Spaces** - ZeroGPU Spaces are available to use for free to all users. (Visit [the curated list](https://huggingface.co/spaces/enzostvs/zero-gpu-spaces)). - [PRO users](https://huggingface.co/subscribe/pro) get 7x more daily usage quota and highest priority in GPU queues when using any ZeroGPU Spaces.
- **Hosting your own ZeroGPU Spaces** - Personal accounts: [Subscribe to PRO](https://huggingface.co/settings/billing/subscription) to access ZeroGPU in the hardware options when creating a new Gradio SDK Space. - Organizations: [Subscribe to a Team or Enterprise plan](https://huggingface.co/enterprise) to enable ZeroGPU Spaces for all organization members. ## Technical Specifications - **GPU Type**: Nvidia H200 slice - **Available VRAM**: 70GB per workload ## Compatibility ZeroGPU Spaces are designed to be compatible with most PyTorch-based GPU Spaces. While compatibility is enhanced for high-level Hugging Face libraries like `transformers` and `diffusers`, users should be aware that: - Currently, ZeroGPU Spaces are exclusively compatible with the **Gradio SDK**. - ZeroGPU Spaces may have limited compatibility compared to standard GPU Spaces. - Unexpected issues may arise in some scenarios. ### Supported Versions - **Gradio**: 4+ - **PyTorch**: Almost all versions from **2.1.0** to **latest** are supported (full list: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.2, 2.4.0, 2.5.1, 2.6.0, 2.7.1, 2.8.0) - **Python**: 3.10.13 ## Getting started with ZeroGPU To utilize ZeroGPU in your Space, follow these steps: 1. Make sure the ZeroGPU hardware is selected in your Space settings. 2. Import the `spaces` module. 3. Decorate GPU-dependent functions with `@spaces.GPU`. This decoration process allows the Space to request a GPU when the function is called and release it upon completion. ### Example Usage
```python
import gradio as gr  # used below to build the demo UI
import spaces
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(...)
pipe.to('cuda')

@spaces.GPU
def generate(prompt):
    return pipe(prompt).images

gr.Interface(
    fn=generate,
    inputs=gr.Text(),
    outputs=gr.Gallery(),
).launch()
```
Note: The `@spaces.GPU` decorator is designed to be effect-free in non-ZeroGPU environments, ensuring compatibility across different setups. ## Duration Management For functions expected to exceed the default 60 seconds of GPU runtime, you can specify a custom duration:
```python
@spaces.GPU(duration=120)
def generate(prompt):
    return pipe(prompt).images
```
This sets the maximum function runtime to 120 seconds. Specifying shorter durations for quicker functions will improve queue priority for Space visitors. ### Dynamic duration `@spaces.GPU` also supports dynamic durations. Instead of directly passing a duration, simply pass a callable that takes the same inputs as your decorated function and returns a duration value:
```python
def get_duration(prompt, steps):
    step_duration = 3.75
    return steps * step_duration

@spaces.GPU(duration=get_duration)
def generate(prompt, steps):
    return pipe(prompt, num_inference_steps=steps).images
```
## Compilation ZeroGPU does not support `torch.compile`, but you can use PyTorch **ahead-of-time** compilation (requires torch `2.8+`). Check out this [blog post](https://huggingface.co/blog/zerogpu-aoti) for a complete guide on ahead-of-time compilation on ZeroGPU. ## Usage Tiers GPU usage is subject to **daily** quotas, per account tier: | Account type | Daily GPU quota | Queue priority | | ------------------------------ | ---------------- | --------------- | | Unauthenticated | 2 minutes | Low | | Free account | 3.5 minutes | Medium | | PRO account | 25 minutes | Highest | | Team organization member | 25 minutes | Highest | | Enterprise organization member | 45 minutes | Highest | > [!NOTE] > Remaining quota directly impacts priority in ZeroGPU queues.
## Hosting Limitations - **Personal accounts ([PRO subscribers](https://huggingface.co/subscribe/pro))**: Maximum of 10 ZeroGPU Spaces. - **Organization accounts ([Enterprise Hub](https://huggingface.co/enterprise))**: Maximum of 50 ZeroGPU Spaces. By leveraging ZeroGPU, developers can create more efficient and scalable Spaces, maximizing GPU utilization while minimizing costs. ## Recommendations If your demo uses a large model, we recommend using optimizations like ahead-of-time compilation and flash-attention 3. You can learn how to leverage these with ZeroGPU in [this post](https://huggingface.co/blog/zerogpu-aoti). These optimizations will help you maximize the advantages of ZeroGPU hours and provide a better user experience. ## Feedback You can share your feedback on Spaces ZeroGPU directly on the HF Hub: https://huggingface.co/spaces/zero-gpu-explorers/README/discussions ### Argilla on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-argilla.md # Argilla on Spaces Argilla is a free and open source tool to build and iterate on data for AI. It can be deployed on the Hub in a few clicks, with Hugging Face OAuth enabled. This enables other HF users to join your Argilla server to annotate datasets, perfect for running community annotation initiatives! With Argilla you can: - Configure datasets for collecting human feedback with a growing number of questions (Label, NER, Ranking, Rating, free text, etc.) - Use model outputs/predictions to evaluate them or to speed up the annotation process. - Explore, find, and label the most interesting/critical subsets using Argilla's search and semantic similarity features. - Pull and push datasets from the Hugging Face Hub for versioning and model training. The best place to get started with Argilla on Spaces is [this guide](http://docs.argilla.io/latest/getting_started/quickstart/). ### Security https://huggingface.co/docs/hub/security.md # Security The Hugging Face Hub offers several security features to ensure that your code and data are secure. Beyond offering [private repositories](./repositories-settings#private-repositories) for models, datasets, and Spaces, the Hub supports access tokens, resource groups, MFA, commit signatures, malware scanning, and more. Hugging Face is GDPR compliant. If a contract or specific data storage is something you'll need, we recommend taking a look at our [Enterprise Hub Support](https://huggingface.co/support). Hugging Face can also offer Business Associate Addendums or GDPR data processing agreements through an [Enterprise Plan](https://huggingface.co/pricing). Hugging Face is also [SOC2 Type 2 certified](https://us.aicpa.org/interestareas/frc/assuranceadvisoryservices/aicpasoc2report.html), meaning we provide security certification to our customers and actively monitor and patch any security weaknesses. For any other security questions, please feel free to send us an email at security@huggingface.co.
## Contents - [User Access Tokens](./security-tokens) - [Two-Factor Authentication (2FA)](./security-2fa) - [Git over SSH](./security-git-ssh) - [Signing commits with GPG](./security-gpg) - [Single Sign-On (SSO)](./security-sso) - [Advanced Access Control (Resource Groups)](./security-resource-groups) - [Malware Scanning](./security-malware) - [Pickle Scanning](./security-pickle) - [Secrets Scanning](./security-secrets) - [Third-party scanner: Protect AI](./security-protectai) - [Third-party scanner: JFrog](./security-jfrog) ### Getting Started with Repositories https://huggingface.co/docs/hub/repositories-getting-started.md # Getting Started with Repositories This beginner-friendly guide will help you get the basic skills you need to create and manage your repository on the Hub. Each section builds on the previous one, so feel free to choose where to start! ## Requirements This document shows how to handle repositories through the web interface as well as through the terminal. There are no requirements if working with the UI. If you want to work with the terminal, please follow these installation instructions. If you do not have `git` available as a CLI command yet, you will need to [install Git](https://git-scm.com/downloads) for your platform. You will also need to [install Git-Xet](./xet/using-xet-storage#git-xet), which will be used to handle large files such as images and model weights. > [!TIP] > To be able to download and upload large files from Git, you need to install the [Git Xet](./xet/using-xet-storage#git) extension. To be able to push your code to the Hub, you'll need to authenticate somehow. The easiest way to do this is by installing the [`huggingface_hub` CLI](https://huggingface.co/docs/huggingface_hub/index) and running the login command: ```bash python -m pip install huggingface_hub hf auth login ``` **The content in the Getting Started section of this document is also available as a video!** ## Creating a repository Using the Hub's web interface you can easily create repositories, add files (even large ones!), explore models, visualize diffs, and much more. There are three kinds of repositories on the Hub, and in this guide you'll be creating a **model repository** for demonstration purposes. For information on creating and managing models, datasets, and Spaces, refer to their respective documentation. 1. To create a new repository, visit [huggingface.co/new](http://huggingface.co/new): 2. Specify the owner of the repository: this can be either you or any of the organizations you’re affiliated with. 3. Enter your model’s name. This will also be the name of the repository. 4. Specify whether you want your model to be public or private. 5. Specify the license. You can leave the *License* field blank for now. To learn about licenses, visit the [**Licenses**](repositories-licenses) documentation. After creating your model repository, you should see a page like this: Note that the Hub prompts you to create a *Model Card*, which you can learn about in the [**Model Cards documentation**](./model-cards). Including a Model Card in your model repo is best practice, but since we're only making a test repo at the moment we can skip this. ## Adding files to a repository (Web UI) To add files to your repository via the web UI, start by selecting the **Files** tab, navigating to the desired directory, and then clicking **Add file**. You'll be given the option to create a new file or upload a file directly from your computer. 
### Creating a new file Choosing to create a new file will take you to the following editor screen, where you can choose a name for your file, add content, and save your file with a message that summarizes your changes. Instead of directly committing the new file to your repo's `main` branch, you can select `Open as a pull request` to create a [Pull Request](./repositories-pull-requests-discussions). ### Uploading a file If you choose _Upload file_ you'll be able to choose a local file to upload, along with a message summarizing your changes to the repo. As with creating new files, you can select `Open as a pull request` to create a [Pull Request](./repositories-pull-requests-discussions) instead of adding your changes directly to the `main` branch of your repo. ## Adding files to a repository (terminal)[[terminal]] ### Cloning repositories Downloading repositories to your local machine is called *cloning*. You can use the following commands to load your repo and navigate to it:
```bash
git clone https://huggingface.co/<your-username>/<your-model-name>
cd <your-model-name>
```
Or for a dataset repo:
```bash
git clone https://huggingface.co/datasets/<your-username>/<your-dataset-name>
cd <your-dataset-name>
```
You can clone over SSH with the following command:
```bash
git clone git@hf.co:<your-username>/<your-model-name>
cd <your-model-name>
```
You'll need to add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes or access private repositories. ### Set up Now you can add any files you want to the repository! 🔥 Do you have files larger than 10MB? Those files should be tracked with [`git-xet`](./xet/using-xet-storage#git-xet), which you can initialize with:
```bash
git xet install
```
When you use Hugging Face to create a repository, Hugging Face automatically provides a list of common file extensions for common Machine Learning large files in the `.gitattributes` file, which `git-xet` uses to efficiently track changes to your large files. However, you might need to add new extensions if your file types are not already handled. You can do so with `git xet track "*.your_extension"`. ### Pushing files You can use Git to save new files and any changes to already existing files as a bundle of changes called a *commit*, which can be thought of as a "revision" to your project. To create a commit, you have to `add` the files to let Git know that you're planning on saving the changes and then `commit` those changes. In order to sync the new commit with the Hugging Face Hub, you then `push` the commit to the Hub.
```bash
# Create any files you like! Then...
git add .
git commit -m "First model version"  # You can choose any descriptive message
git push
```
And you're done! You can check your repository on Hugging Face with all the recently added files. For example, in the screenshot below the user added a number of files. Note that some files in this example have a size of `1.04 GB`, so the repo uses Xet to track them. > [!TIP] > If you cloned the repository with HTTP, you might be asked to fill in your username and password on every push operation. The simplest way to avoid repetition is to [switch to SSH](#cloning-repositories) instead of HTTP. Alternatively, if you have to use HTTP, you might find it helpful to set up a [git credential helper](https://git-scm.com/docs/gitcredentials#_avoiding_repetition) to autofill your username and password. ## Viewing a repo's history Every time you go through the `add`-`commit`-`push` cycle, the repo will keep track of every change you've made to your files.
The UI allows you to explore the model files and commits and to see the difference (also known as *diff*) introduced by each commit. To see the history, you can click on the **History: X commits** link. You can click on an individual commit to see what changes that commit introduced: ### Using TensorBoard https://huggingface.co/docs/hub/tensorboard.md # Using TensorBoard TensorBoard provides tooling for tracking and visualizing metrics as well as visualizing models. All repositories that contain TensorBoard traces have an automatic tab with a hosted TensorBoard instance for anyone to check it out without any additional effort! ## Exploring TensorBoard models on the Hub Over 52k repositories have TensorBoard traces on the Hub. You can find them by filtering at the left of the [models page](https://huggingface.co/models?filter=tensorboard). As an example, if you go to the [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) repository, there is a **Metrics** tab. If you select it, you'll see a hosted TensorBoard instance. ## Adding your TensorBoard traces The Hub automatically detects TensorBoard traces (such as `tfevents`). Once you push your TensorBoard files to the Hub, they will automatically start an instance. ## Additional resources * TensorBoard [documentation](https://www.tensorflow.org/tensorboard). ### Displaying carbon emissions for your model https://huggingface.co/docs/hub/model-cards-co2.md # Displaying carbon emissions for your model ## Why is it beneficial to calculate the carbon emissions of my model? Training ML models is often energy-intensive and can produce a substantial carbon footprint, as described by [Strubell et al.](https://arxiv.org/abs/1906.02243). It's therefore important to *track* and *report* the emissions of models to get a better idea of the environmental impacts of our field. ## What information should I include about the carbon footprint of my model? If you can, you should include information about: - where the model was trained (in terms of location) - the hardware used -- e.g. GPU, TPU, or CPU, and how many - training type: pre-training or fine-tuning - the estimated carbon footprint of the model, calculated in real-time with the [Code Carbon](https://github.com/mlco2/codecarbon) package or after training using the [ML CO2 Calculator](https://mlco2.github.io/impact/). ## Carbon footprint metadata You can add the carbon footprint data to the model card metadata (in the README.md file). The structure of the metadata should be:
```yaml
---
co2_eq_emissions:
  emissions: number (in grams of CO2)
  source: "source of the information, either directly from AutoTrain, code carbon or from a scientific article documenting the model"
  training_type: "pre-training or fine-tuning"
  geographical_location: "as granular as possible, for instance Quebec, Canada or Brooklyn, NY, USA. To check your compute's electricity grid, you can check out https://app.electricitymap.org."
  hardware_used: "how much compute and what kind, e.g. 8 v100 GPUs"
---
```
## How is the carbon footprint of my model calculated? 🌎 Considering the computing hardware, location, usage, and training time, you can estimate how much CO2 the model produced. The math is pretty simple! ➕ First, you take the *carbon intensity* of the electric grid used for the training -- this is how much CO2 is produced per kWh of electricity used.
The carbon intensity depends on the location of the hardware and the [energy mix](https://electricitymap.org/) used at that location -- whether it's renewable energy like solar 🌞, wind 🌬️ and hydro 💧, or non-renewable energy like coal ⚫ and natural gas 💨. The more renewable energy gets used for training, the less carbon-intensive it is! Then, you take the power consumption of the GPU during training using the `pynvml` library. Finally, you multiply the power consumption and carbon intensity by the training time of the model, and you have an estimate of the CO2 emission. Keep in mind that this isn't an exact number because other factors come into play -- like the energy used for data center heating and cooling -- which will increase carbon emissions. But this will give you a good idea of the scale of CO2 emissions that your model is producing! To add **Carbon Emissions** metadata to your models: 1. If you are using **AutoTrain**, this is tracked for you 🔥 2. Otherwise, use a tracker like Code Carbon in your training code, then specify
```yaml
co2_eq_emissions:
  emissions: 1.2345
```
in your model card metadata, where `1.2345` is the emissions value in **grams**. To learn more about the carbon footprint of Transformers, check out the [video](https://www.youtube.com/watch?v=ftWlj4FBHTg), part of the Hugging Face Course! ### Distilabel https://huggingface.co/docs/hub/datasets-distilabel.md # Distilabel Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers. Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback. ## What do people build with distilabel? The Argilla community uses distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel). - The [1M OpenHermesPreferences](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences that have been generated using the [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) LLM. It is a great example of how you can use distilabel to scale up dataset development. - The [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) was used to fine-tune the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B). This dataset was built by combining human curation in Argilla with AI feedback from distilabel, leading to an improved version of the Intel Orca dataset and outperforming models fine-tuned on the original dataset. - The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) is an example of how anyone can create a synthetic dataset for a specific task, which after curation and evaluation can be used for fine-tuning custom LLMs.
## Prerequisites First [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login):
```bash
hf auth login
```
Make sure you have `distilabel` installed:
```bash
pip install -U "distilabel[vllm]"
```
## Distilabel pipelines Distilabel pipelines can be built with any number of interconnected steps or tasks. The output of one step or task is fed as input to another. A series of steps can be chained together to build complex data processing and generation pipelines with LLMs. The input of each step is a batch of data, containing a list of dictionaries, where each dictionary represents a row of the dataset, and the keys are the column names. To feed data from and to the Hugging Face Hub, we've defined a `Distiset` class as an abstraction of a `datasets.DatasetDict`. ## Distiset as dataset object A Pipeline in distilabel returns a special type of Hugging Face `datasets.DatasetDict` which is called `Distiset`. The Pipeline can output multiple subsets in the Distiset, which is a dictionary-like object with one entry per subset. A Distiset can then be pushed seamlessly to the Hugging Face Hub, with all the subsets in the same repository. ## Load data from the Hub to a Distiset To showcase an example of loading data from the Hub, we will reproduce the [Prometheus 2 paper](https://arxiv.org/pdf/2405.01535) and use the PrometheusEval task implemented in distilabel. Prometheus 2 and the PrometheusEval task cover direct assessment and pairwise ranking, i.e. assessing the quality of a single isolated response for a given instruction with or without a reference answer, and assessing the quality of one response against another for a given instruction with or without a reference answer, respectively. We will use these tasks on the [HuggingFaceH4/instruction-dataset](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset) dataset loaded from the Hub, which was created by the Hugging Face H4 team.
```python
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromHub
from distilabel.steps.tasks import PrometheusEval

if __name__ == "__main__":
    with Pipeline(name="prometheus") as pipeline:
        load_dataset = LoadDataFromHub(
            name="load_dataset",
            repo_id="HuggingFaceH4/instruction-dataset",
            split="test",
            output_mappings={"prompt": "instruction", "completion": "generation"},
        )

        task = PrometheusEval(
            name="task",
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
            ),
            mode="absolute",
            rubric="factual-validity",
            reference=False,
            num_generations=1,
            group_generations=False,
        )

        keep_columns = KeepColumns(
            name="keep_columns",
            columns=["instruction", "generation", "feedback", "result", "model_name"],
        )

        load_dataset >> task >> keep_columns
```
Then we need to call `pipeline.run` with the runtime parameters so that the pipeline can be launched and data can be stored in the `Distiset` object.
```python
distiset = pipeline.run(
    parameters={
        task.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 1024,
                    "temperature": 0.7,
                },
            },
        },
    },
)
```
## Push a distilabel Distiset to the Hub Push the `Distiset` to a Hugging Face repository, where each one of the subsets will correspond to a different configuration:
```python
import os

distiset.push_to_hub(
    "my-org/my-dataset",
    commit_message="Initial commit",
    private=False,
    token=os.getenv("HF_TOKEN"),
)
```
## 📚 Resources - [🚀 Distilabel Docs](https://distilabel.argilla.io/latest/) - [🚀 Distilabel Docs - distiset](https://distilabel.argilla.io/latest/sections/how_to_guides/advanced/distiset/) - [🚀 Distilabel Docs - prometheus](https://distilabel.argilla.io/1.2.0/sections/pipeline_samples/papers/prometheus/) - [🆕 Introducing distilabel](https://argilla.io/blog/introducing-distilabel-1/) ### marimo on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-marimo.md # marimo on Spaces [marimo](https://github.com/marimo-team/marimo) is a reactive notebook for Python that models notebooks as dataflow graphs. When you run a cell or interact with a UI element, marimo automatically runs affected cells (or marks them as stale), keeping code and outputs consistent and preventing bugs before they happen. Every marimo notebook is stored as pure Python, executable as a script, and deployable as an app. Key features: - ⚡️ **reactive:** run a cell, and marimo reactively runs all dependent cells or marks them as stale - 🖐️ **interactive:** bind sliders, tables, plots, and more to Python — no callbacks required - 🔬 **reproducible:** no hidden state, deterministic execution, built-in package management - 🏃 **executable:** execute as a Python script, parametrized by CLI args - 🛜 **shareable:** deploy as an interactive web app or slides, run in the browser via WASM - 🛢️ **designed for data:** query dataframes and databases with SQL, filter and search dataframes ## Deploying marimo apps on Spaces To get started with marimo on Spaces, click the button below: This will start building your Space using marimo's Docker template. If successful, you should see an application similar to the [marimo introduction notebook](https://huggingface.co/spaces/marimo-team/marimo-app-template). ## Customizing your marimo app When you create a marimo Space, you'll get a few key files to help you get started: ### 1. app.py This is your main marimo notebook file that defines your app's logic. marimo notebooks are pure Python files that use the `@app.cell` decorator to define cells. To learn more about building notebooks and apps, see [the marimo documentation](https://docs.marimo.io). As your app grows, you can organize your code into modules and import them into your main notebook. A minimal illustrative `app.py` sketch is shown after this file overview. ### 2. Dockerfile The Dockerfile for a marimo app is minimal since marimo has few system dependencies. The key requirements are: - It installs the dependencies listed in `requirements.txt` (using `uv`) - It creates a non-root user for security - It runs the app using `marimo run app.py` You may need to modify this file if your application requires additional system dependencies, permissions, or other CLI flags. ### 3. requirements.txt The Space will automatically install dependencies listed in the `requirements.txt` file. At minimum, you must include `marimo` in this file. You will want to add any other required packages your app needs. The marimo Space template provides a basic setup that you can extend based on your needs.
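To make the `app.py` structure concrete, here is a minimal sketch of a marimo notebook file of the kind the template expects. The cell contents are illustrative; you would normally generate and edit this file with `marimo edit app.py` rather than writing it by hand.

```python
# app.py — a minimal, illustrative marimo notebook
import marimo

app = marimo.App()


@app.cell
def _():
    import marimo as mo
    return (mo,)


@app.cell
def _(mo):
    # A UI element; interacting with it reactively re-runs dependent cells.
    n = mo.ui.slider(1, 10, value=3, label="Repetitions")
    n
    return (n,)


@app.cell
def _(mo, n):
    # This cell depends on `n`, so it re-runs whenever the slider changes.
    mo.md("hello " * n.value)
    return


if __name__ == "__main__":
    app.run()
```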
When deployed, your notebook will run in "app mode", which hides the code cells and only shows the interactive outputs - perfect for sharing with end users. You can opt to include the code cells in your app by adding `--include-code` to the `marimo run` command in the Dockerfile. ## Additional Resources and Support - [marimo documentation](https://docs.marimo.io) - [marimo GitHub repository](https://github.com/marimo-team/marimo) - [marimo Discord](https://marimo.io/discord) - [marimo template Space](https://huggingface.co/spaces/marimo-team/marimo-app-template) ## Troubleshooting If you encounter issues: 1. Make sure your notebook runs locally in app mode using `marimo run app.py` 2. Check that all required packages are listed in `requirements.txt` 3. Verify the port configuration matches (7860 is the default for Spaces) 4. Check Space logs for any Python errors For more help, visit the [marimo Discord](https://marimo.io/discord) or [open an issue](https://github.com/marimo-team/marimo/issues). ### Using sample-factory at Hugging Face https://huggingface.co/docs/hub/sample-factory.md # Using sample-factory at Hugging Face [`sample-factory`](https://github.com/alex-petrenko/sample-factory) is a codebase for high throughput asynchronous reinforcement learning. It has integrations with the Hugging Face Hub to share models with evaluation results and training metrics. ## Exploring sample-factory in the Hub You can find `sample-factory` models by filtering at the left of the [models page](https://huggingface.co/models?library=sample-factory). All models on the Hub come with useful features: 1. An automatically generated model card with a description, a training configuration, and more. 2. Metadata tags that help with discoverability. 3. Evaluation results to compare with other models. 4. A video widget where you can watch your agent performing. ## Install the library To install the `sample-factory` library, install the package with `pip install sample-factory`. SF is known to work on Linux and macOS. There is no Windows support at this time. ## Loading models from the Hub ### Using load_from_hub To download a model from the Hugging Face Hub to use with Sample-Factory, use the `load_from_hub` script:
```
python -m sample_factory.huggingface.load_from_hub -r <repo_id> -d <train_dir>
```
The command line arguments are: - `-r`: The repo ID for the HF repository to download from. The repo ID should be in the format `<username>/<repo_name>` - `-d`: An optional argument to specify the directory to save the experiment to. Defaults to `./train_dir`, which will save the repo to `./train_dir/<repo_name>` ### Download Model Repository Directly Hugging Face repositories can be downloaded directly using `git clone`:
```
git clone git@hf.co:<repo_id>
# example: git clone git@hf.co:bigscience/bloom
```
## Using Downloaded Models with Sample-Factory After downloading the model, you can run the models in the repo with the enjoy script corresponding to your environment.
For example, if you are downloading a `mujoco-ant` model, it can be run with:
```
python -m sf_examples.mujoco.enjoy_mujoco --algo=APPO --env=mujoco_ant --experiment=<experiment_name> --train_dir=./train_dir
```
Note that you may have to specify `--train_dir` if your local `train_dir` has a different path from the one in `cfg.json`. ## Sharing your models ### Using push_to_hub If you want to upload without generating evaluation metrics or a replay video, you can use the `push_to_hub` script:
```
python -m sample_factory.huggingface.push_to_hub -r <hf_username>/<hf_repo_name> -d <experiment_dir>
```
The command line arguments are: - `-r`: The repo_id to save on HF Hub. This is the same as `hf_repository` in the enjoy script and must be in the form `<hf_username>/<hf_repo_name>` - `-d`: The full path to your experiment directory to upload ### Using enjoy.py You can upload your models to the Hub using your environment's `enjoy` script with the `--push_to_hub` flag. Uploading using `enjoy` can also generate evaluation metrics and a replay video. The evaluation metrics are generated by running your model on the specified environment for a number of episodes and reporting the mean and std reward of those runs. Other relevant command line arguments are: - `--hf_repository`: The repository to push to. Must be of the form `<hf_username>/<hf_repo_name>`. The model will be saved to `https://huggingface.co/<hf_username>/<hf_repo_name>` - `--max_num_episodes`: Number of episodes to evaluate on before uploading. Used to generate evaluation metrics. It is recommended to use multiple episodes to generate an accurate mean and std. - `--max_num_frames`: Number of frames to evaluate on before uploading. An alternative to `max_num_episodes` - `--no_render`: A flag that disables rendering and showing the environment steps. It is recommended to set this flag to speed up the evaluation process. You can also save a video of the model during evaluation to upload to the Hub with the `--save_video` flag - `--video_frames`: The number of frames to be rendered in the video. Defaults to `-1`, which renders an entire episode - `--video_name`: The name of the video to save as. If `None`, will save to `replay.mp4` in your experiment directory For example:
```
python -m sf_examples.mujoco_examples.enjoy_mujoco --algo=APPO --env=mujoco_ant --experiment=<experiment_name> --train_dir=./train_dir --max_num_episodes=10 --push_to_hub --hf_username=<username> --hf_repository=<hf_repo_name> --save_video --no_render
```
### JupyterLab on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-jupyter.md # JupyterLab on Spaces [JupyterLab](https://jupyter.org/) is a web-based interactive development environment for Jupyter notebooks, code, and data. It is a great tool for data science and machine learning, and it is widely used by the community. With Hugging Face Spaces, you can deploy your own JupyterLab instance and use it for development directly from the Hugging Face website. ## ⚡️ Deploy a JupyterLab instance on Spaces You can deploy JupyterLab on Spaces with just a few clicks. First, go to [this link](https://huggingface.co/new-space?template=SpacesExamples/jupyterlab) or click the button below: Spaces requires you to define: * An **Owner**: either your personal account or an organization you're a part of. * A **Space name**: the name of the Space within the account in which you're creating it. * The **Visibility**: _private_ if you want the Space to be visible only to you or your organization, or _public_ if you want it to be visible to other users. * The **Hardware**: the hardware you want to use for your JupyterLab instance. This goes from CPUs to H100s.
* You can optionally configure a `JUPYTER_TOKEN` password to protect your JupyterLab workspace. When unspecified, defaults to `huggingface`. We strongly recommend setting this up if your Space is public or if the Space is in an organization. Storage in Hugging Face Spaces is ephemeral, and the data you store in the default configuration can be lost in a reboot or reset of the Space. We recommend to save your work to a remote location or to use persistent storage for your data. ### Setting up persistent storage To set up persistent storage on the Space, you go to the Settings page of your Space and choose one of the options: `small`, `medium` and `large`. Once persistent storage is set up, the JupyterLab image gets mounted in `/data`. ## Read more - [HF Docker Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker) If you have any feedback or change requests, please don't hesitate to reach out to the owners on the [Feedback Discussion](https://huggingface.co/spaces/SpacesExamples/jupyterlab/discussions/3). ## Acknowledgments This template was created by [camenduru](https://twitter.com/camenduru) and [nateraw](https://huggingface.co/nateraw), with contributions from [osanseviero](https://huggingface.co/osanseviero) and [azzr](https://huggingface.co/azzr). ### Using 🤗 Datasets https://huggingface.co/docs/hub/datasets-usage.md # Using 🤗 Datasets Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the [**Use this dataset** button](https://huggingface.co/datasets/nyu-mll/glue?library=datasets) to copy the code to load a dataset. First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using: ``` hf auth login ``` And then you can load a dataset from the Hugging Face Hub using ```python from datasets import load_dataset dataset = load_dataset("username/my_dataset") # or load the separate splits if the dataset has train/validation/test splits train_dataset = load_dataset("username/my_dataset", split="train") valid_dataset = load_dataset("username/my_dataset", split="validation") test_dataset = load_dataset("username/my_dataset", split="test") ``` You can also upload datasets to the Hugging Face Hub: ```python my_new_dataset.push_to_hub("username/my_new_dataset") ``` This creates a dataset repository `username/my_new_dataset` containing your Dataset in Parquet format, that you can reload later. For more information about using 🤗 Datasets, check out the [tutorials](/docs/datasets/tutorial) and [how-to guides](/docs/datasets/how_to) available in the 🤗 Datasets documentation. ### Uploading models https://huggingface.co/docs/hub/models-uploading.md # Uploading models To upload models to the Hub, you'll need to create an account at [Hugging Face](https://huggingface.co/join). Models on the Hub are [Git-based repositories](./repositories), which give you versioning, branches, discoverability and sharing features, integration with dozens of libraries, and more! You have control over what you want to upload to your repository, which could include checkpoints, configs, and any other files. You can link repositories with an individual user, such as [osanseviero/fashion_brands_patterns](https://huggingface.co/osanseviero/fashion_brands_patterns), or with an organization, such as [facebook/bart-large-xsum](https://huggingface.co/facebook/bart-large-xsum). Organizations can collect models related to a company, community, or library! 
If you choose an organization, the model will be featured on the organization’s page, and every member of the organization will have the ability to contribute to the repository. You can create a new organization [here](https://huggingface.co/organizations/new). > **_NOTE:_** Models do NOT need to be compatible with the Transformers/Diffusers libraries to get download metrics. Any custom model is supported. Read more below! There are several ways to upload models so that they are nicely integrated into the Hub and get [download metrics](models-download-stats), described below. - If your model is designed for a library that has [built-in support](#upload-from-a-library-with-built-in-support), you can use the methods provided by the library. Custom models that use `trust_remote_code=True` can also leverage these methods. - If your model is a custom PyTorch model, you can leverage the [`PyTorchModelHubMixin` class](#upload-a-pytorch-model-using-huggingfacehub), which adds `from_pretrained` and `push_to_hub` to any `nn.Module` class, just like models in the Transformers, Diffusers and Timm libraries. - In addition to programmatic uploads, you can always use the [web interface](#using-the-web-interface) or [the git command line](#using-git). Once your model is uploaded, we suggest adding a [Model Card](./model-cards) to your repo to document your model and make it more discoverable. Example [repository](https://huggingface.co/LiheYoung/depth_anything_vitl14) that leverages [PyTorchModelHubMixin](#upload-a-pytorch-model-using-huggingfacehub). Downloads are shown on the right. ## Using the web interface To create a brand new model repository, visit [huggingface.co/new](http://huggingface.co/new). Then follow these steps: 1. In the "Files and versions" tab, select "Add File" and specify "Upload File": 2. From there, select a file from your computer to upload and leave a helpful commit message to know what you are uploading: 3. Afterwards, click **Commit changes** to upload your model to the Hub! 4. Inspect files and history You can check your repository with all the recently added files! The UI allows you to explore the model files and commits and to see the diff introduced by each commit: 5. Add metadata You can add metadata to your model card. You can specify: * the type of task this model is for, enabling widgets and the Inference API. * the library used (`transformers`, `spaCy`, etc.) * the language * the dataset * metrics * license * a lot more! Read more about model tags [here](./model-cards#model-card-metadata). 6. Add TensorBoard traces Any repository that contains TensorBoard traces (filenames that contain `tfevents`) is categorized with the [`TensorBoard` tag](https://huggingface.co/models?filter=tensorboard). As a convention, we suggest that you save traces under the `runs/` subfolder. The "Training metrics" tab then makes it easy to review charts of the logged variables, like the loss or the accuracy. Models trained with 🤗 Transformers will generate [TensorBoard traces](https://huggingface.co/docs/transformers/main_classes/callback#transformers.integrations.TensorBoardCallback) by default if [`tensorboard`](https://pypi.org/project/tensorboard/) is installed. ## Upload from a library with built-in support First check if your model is from a library that has built-in support to push to/load from the Hub, like Transformers, Diffusers, Timm, Asteroid, etc.: https://huggingface.co/docs/hub/models-libraries.
Below we'll show how easy this is for a library like Transformers: ```python from transformers import BertConfig, BertModel config = BertConfig() model = BertModel(config) model.push_to_hub("nielsr/my-awesome-bert-model") # reload model = BertModel.from_pretrained("nielsr/my-awesome-bert-model") ``` Some libraries, like Transformers, support loading [code from the Hub](https://huggingface.co/docs/transformers/custom_models). This is a way to make your model work with Transformers using the `trust_remote_code=True` flag. You may want to consider this option instead of a full-fledged library integration. ## Upload a PyTorch model using huggingface_hub In case your model is a (custom) PyTorch model, you can leverage the `PyTorchModelHubMixin` [class](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) available in the [huggingface_hub](https://github.com/huggingface/huggingface_hub) Python library. It is a minimal class which adds `from_pretrained` and `push_to_hub` capabilities to any `nn.Module`, along with download metrics. Here is how to use it (assuming you have run `pip install huggingface_hub`): ```python import torch import torch.nn as nn from huggingface_hub import PyTorchModelHubMixin class MyModel( nn.Module, PyTorchModelHubMixin, # optionally, you can add metadata which gets pushed to the model card repo_url="your-repo-url", pipeline_tag="text-to-image", license="mit", ): def __init__(self, num_channels: int, hidden_size: int, num_classes: int): super().__init__() self.param = nn.Parameter(torch.rand(num_channels, hidden_size)) self.linear = nn.Linear(hidden_size, num_classes) def forward(self, x): return self.linear(x + self.param) # create model config = {"num_channels": 3, "hidden_size": 32, "num_classes": 10} model = MyModel(**config) # save locally model.save_pretrained("my-awesome-model") # push to the hub model.push_to_hub("your-hf-username/my-awesome-model") # reload model = MyModel.from_pretrained("your-hf-username/my-awesome-model") ``` As you can see, the only requirement is that your model inherits from `PyTorchModelHubMixin`. All instance attributes will be automatically serialized to a `config.json` file. Note that the `__init__` method can only take arguments which are JSON serializable. Python dataclasses are supported. This comes with automated download metrics, meaning that you'll be able to see how many times the model is downloaded, the same way they are available for models integrated natively in the Transformers, Diffusers or Timm libraries. With this mixin class, each separate checkpoint is stored on the Hub in a single repository consisting of 2 files: - a `pytorch_model.bin` or `model.safetensors` file containing the weights - a `config.json` file which is a serialized version of the model configuration. This class is used for counting download metrics: every time a user calls `from_pretrained` to load a `config.json`, the count goes up by one. See [this guide](https://huggingface.co/docs/hub/models-download-stats) regarding automated download metrics. It's recommended to add a model card to each checkpoint so that people can read what the model is about, have a link to the paper, etc. Visit [the huggingface_hub's documentation](https://huggingface.co/docs/huggingface_hub/guides/integrations) to learn more. Alternatively, you can also simply upload files or folders to the Hub programmatically: https://huggingface.co/docs/huggingface_hub/guides/upload.
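As an illustration, a minimal sketch of that programmatic route using `HfApi` from `huggingface_hub` could look like the following (the repo id and local folder path are placeholders; see the linked upload guide for the full API):

```python
from huggingface_hub import HfApi

api = HfApi()

# Create the repository if it doesn't exist yet (no-op if it already does)
api.create_repo(repo_id="your-hf-username/my-awesome-model", repo_type="model", exist_ok=True)

# Upload the contents of a local folder (weights, config, README, ...) in a single commit
api.upload_folder(
    folder_path="./my-awesome-model",
    repo_id="your-hf-username/my-awesome-model",
    repo_type="model",
    commit_message="Upload model files",
)
```

Both `upload_file` and `upload_folder` create a commit on the Hub, so your upload history stays versioned like any other Git change.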
## Using Git Finally, since model repos are just Git repositories, you can also use Git to push your model files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started#terminal) to learn about using the `git` CLI to commit and push your models. ### GGUF usage with llama.cpp https://huggingface.co/docs/hub/gguf-llamacpp.md # GGUF usage with llama.cpp > [!TIP] > You can now deploy any llama.cpp compatible GGUF on Hugging Face Endpoints, read more about it [here](https://huggingface.co/docs/inference-endpoints/en/others/llamacpp_container) Llama.cpp allows you to download and run inference on a GGUF simply by providing the Hugging Face repo path and the file name. llama.cpp downloads the model checkpoint and automatically caches it. The location of the cache is defined by the `LLAMA_CACHE` environment variable; read more about it [here](https://github.com/ggerganov/llama.cpp/pull/7826). You can install llama.cpp through brew (works on Mac and Linux), or you can build it from source. There are also pre-built binaries and Docker images that you can [check in the official documentation](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage). ### Option 1: Install with brew / winget ```bash brew install llama.cpp ``` or, on Windows, via winget ```bash winget install llama.cpp ``` ### Option 2: build from source Step 1: Clone llama.cpp from GitHub. ``` git clone https://github.com/ggerganov/llama.cpp ``` Step 2: Move into the llama.cpp folder and build it. You can also add hardware-specific flags (for example, `-DGGML_CUDA=1` for Nvidia GPUs). ``` cd llama.cpp cmake -B build # optionally, add -DGGML_CUDA=ON to activate CUDA cmake --build build --config Release ``` Note: for other hardware support (for example, AMD ROCm, Intel SYCL), please refer to [llama.cpp's build guide](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) Once installed, you can use `llama-cli` or `llama-server` as follows: ```bash llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 ``` Note: You can explicitly add `-no-cnv` to run the CLI in raw completion mode (non-chat mode). Additionally, you can invoke an OpenAI-spec chat completions endpoint directly using the llama.cpp server: ```bash llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 ``` After running the server, you can call the endpoint as below: ```bash curl http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer no-key" \ -d '{ "messages": [ { "role": "system", "content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests." }, { "role": "user", "content": "Write a limerick about Python exceptions" } ] }' ``` Replace the `-hf` value with any valid Hugging Face Hub repo name - off you go! 🦙 ### Using Asteroid at Hugging Face https://huggingface.co/docs/hub/asteroid.md # Using Asteroid at Hugging Face `asteroid` is a PyTorch toolkit for audio source separation. It enables fast experimentation on common datasets with support for a large range of datasets and recipes to reproduce papers. ## Exploring Asteroid in the Hub You can find `asteroid` models by filtering at the left of the [models page](https://huggingface.co/models?filter=asteroid). All models on the Hub come with the following features: 1. An automatically generated model card with a description, training configuration, metrics, and more. 2.
Metadata tags that help with discoverability and contain information such as licenses and datasets. 3. An interactive widget you can use to play with the model directly in the browser. 4. An Inference API that allows you to make inference requests. ## Using existing models For a full guide on loading pre-trained models, we recommend checking out the [official guide](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md). All model classes (`BaseModel`, `ConvTasNet`, etc.) have a `from_pretrained` method that lets you load models from the Hub. ```py from asteroid.models import ConvTasNet model = ConvTasNet.from_pretrained('mpariente/ConvTasNet_WHAM_sepclean') ``` If you want to see how to load a specific model, you can click `Use in Asteroid` and you will be given a working snippet to load it! ## Sharing your models At the moment there is no automatic method to upload your models to the Hub, but the process to upload them is documented in the [official guide](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md#share-your-models). All the recipes create all the needed files to upload a model to the Hub. The process usually involves the following steps: 1. Create and clone a model repository. 2. Move files from the recipe output to the repository (model card, model file, TensorBoard traces). 3. Push the files (`git add` + `git commit` + `git push`). Once you do this, you can try out your model directly in the browser and share it with the rest of the community. ## Additional resources * Asteroid [website](https://asteroid-team.github.io/). * Asteroid [library](https://github.com/asteroid-team/asteroid). * Integration [docs](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md). ### Pull requests and Discussions https://huggingface.co/docs/hub/repositories-pull-requests-discussions.md # Pull requests and Discussions Hub Pull requests and Discussions allow users to make community contributions to repositories. Pull requests and discussions work the same for all the repo types. At a high level, the aim is to build a simpler version of other git hosts' (like GitHub's) PRs and Issues: - no forks are involved: contributors push to a special `ref` branch directly on the source repo. - there's no hard distinction between discussions and PRs: they are essentially the same so they are displayed in the same lists. - they are streamlined for ML (i.e. models/datasets/spaces repos), not arbitrary repos. _Note: Pull Requests and Discussions can be enabled or disabled from the [repository settings](./repositories-settings#disabling-discussions--pull-requests)_ ## List By going to the community tab in any repository, you can see all Discussions and Pull requests. You can also filter to only see the ones that are open. ## View The Discussion page allows you to see the comments from different users. If it's a Pull Request, you can see all the changes by going to the Files changed tab. ## Editing a Discussion / Pull request title If you opened a PR or discussion, are the author of the repository, or have write access to it, you can edit the discussion title by clicking on the pencil button. ## Pin a Discussion / Pull Request If you have write access to a repository, you can pin discussions and Pull Requests. Pinned discussions appear at the top of all the discussions.
## Lock a Discussion / Pull Request If you have write access to a repository, you can lock discussions or Pull Requests. Once a discussion is locked, previous comments are still visible and users won't be able to add new comments. ## Comment editing and moderation If you wrote a comment or have write access to the repository, you can edit the content of the comment from the contextual menu in the top-right corner of the comment box. Once the comment has been edited, a new link will appear above the comment. This link shows the edit history. You can also hide a comment. Hiding a comment is irreversible, and nobody will be able to see its content or edit it anymore. Read also [moderation](./moderation) to see how to report an abusive comment. ## Can I use Markdown and LaTeX in my comments and discussions? Yes! You can use Markdown to add formatting to your comments. Additionally, you can use LaTeX for mathematical typesetting; your formulas will be rendered with [KaTeX](https://katex.org/) before being parsed in Markdown. For LaTeX equations, you have to use the following delimiters: - `$$ ... $$` for display mode - `\\(...\\)` for inline mode (no space between the slashes and the parenthesis). ## How do I manage Pull requests locally? Let's assume your PR number is 42. ```bash git fetch origin refs/pr/42:pr/42 git checkout pr/42 # Do your changes git add . git commit -m "Add your change" git push origin pr/42:refs/pr/42 ``` ### Draft mode Draft mode is the default status when opening a new Pull request from scratch in "Advanced mode". With this status, other contributors know that your Pull request is a work in progress and it cannot be merged. When your branch is ready, just hit the "Publish" button to change the status of the Pull request to "Open". Note that once published you cannot go back to draft mode. ## Pull requests advanced usage ### Where in the git repo are changes stored? Our Pull requests do not use forks and branches, but instead custom "branches" called `refs` that are stored directly on the source repo. [Git References](https://git-scm.com/book/en/v2/Git-Internals-Git-References) are the internal machinery of git which already stores tags and branches. The advantage of using custom refs (like `refs/pr/42` for instance) instead of branches is that they're not fetched (by default) by people (including the repo "owner") cloning the repo, but they can still be fetched on demand. ### Fetching all Pull requests: for git magicians 🧙‍♀️ You can tweak your local **refspec** to fetch all Pull requests: 1. Fetch ```bash git fetch origin refs/pr/*:refs/remotes/origin/pr/* ``` 2. Create a local branch tracking the ref ```bash git checkout pr/{PR_NUMBER} # for example: git checkout pr/42 ``` 3. If you make local changes, push to the PR ref: ```bash git push origin pr/{PR_NUMBER}:refs/pr/{PR_NUMBER} # for example: git push origin pr/42:refs/pr/42 ``` ### Gated models https://huggingface.co/docs/hub/models-gated.md # Gated models To give more control over how models are used, the Hub allows model authors to enable **access requests** for their models. When enabled, users must agree to share their contact information (username and email address) with the model authors to access the model files. Model authors can configure this request with additional fields. A model with access requests enabled is called a **gated model**. Access requests are always granted to individual users rather than to entire organizations.
A common use case of gated models is to provide access to early research models before the wider release. ## Manage gated models as a model author To enable access requests, go to the model settings page. By default, the model is not gated. Click on **Enable Access request** in the top-right corner. By default, access to the model is automatically granted to the user when requesting it. This is referred to as **automatic approval**. In this mode, any user can access your model once they've shared their personal information with you. If you want to manually approve which users can access your model, you must set it to **manual approval**. When this is the case, you will notice more options: - **Add access** allows you to search for a user and grant them access even if they did not request it. - **Notification frequency** lets you configure when to get notified if new users request access. It can be set to once a day or real-time. By default, an email is sent to your primary email address. For models hosted under an organization, emails are by default sent to the first 5 admins of the organization. In both cases (user or organization) you can set a different email address in the **Notifications email** field. ### Review access requests Once access requests are enabled, you have full control over who can access your model, whether the approval mode is manual or automatic. You can review and manage requests either from the UI or via the API. #### From the UI You can review who has access to your gated model from its settings page by clicking on the **Review access requests** button. This will open a modal with 3 lists of users: - **pending**: the list of users waiting for approval to access your model. This list is empty unless you've selected **manual approval**. You can either **Accept** or **Reject** the request. If the request is rejected, the user cannot access your model and cannot request access again. - **accepted**: the complete list of users with access to your model. You can choose to **Reject** access at any time for any user, whether the approval mode is manual or automatic. You can also **Cancel** the approval, which will move the user to the *pending* list. - **rejected**: the list of users you've manually rejected. Those users cannot access your models. If they go to your model repository, they will see a message *Your request to access this repo has been rejected by the repo's authors*. #### Via the API You can automate the approval of access requests by using the API. You must pass a `token` with `write` access to the gated repository. To generate a token, go to [your user settings](https://huggingface.co/settings/tokens). | Method | URI | Description | Headers | Payload | ------ | --- | ----------- | ------- | ------- | | `GET` | `/api/models/{repo_id}/user-access-request/pending` | Retrieve the list of pending requests. | `{"authorization": "Bearer $token"}` | | | `GET` | `/api/models/{repo_id}/user-access-request/accepted` | Retrieve the list of accepted requests. | `{"authorization": "Bearer $token"}` | | | `GET` | `/api/models/{repo_id}/user-access-request/rejected` | Retrieve the list of rejected requests. | `{"authorization": "Bearer $token"}` | | | `POST` | `/api/models/{repo_id}/user-access-request/handle` | Change the status of a given access request to `status`.
| `{"authorization": "Bearer $token"}` | `{"status": "accepted"/"rejected"/"pending", "user": "username", "rejectionReason": "Optional rejection reason that will be visible to the user (max 200 characters)."}` | | `POST` | `/api/models/{repo_id}/user-access-request/grant` | Allow a specific user to access your repo. | `{"authorization": "Bearer $token"}` | `{"user": "username"} ` | The base URL for the HTTP endpoints above is `https://huggingface.co`. **NEW!** Those endpoints are now officially supported in our Python client `huggingface_hub`. List the access requests to your model with [`list_pending_access_requests`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_pending_access_requests), [`list_accepted_access_requests`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_accepted_access_requests) and [`list_rejected_access_requests`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_rejected_access_requests). You can also accept, cancel and reject access requests with [`accept_access_request`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.accept_access_request), [`cancel_access_request`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.cancel_access_request), [`reject_access_request`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.reject_access_request). Finally, you can grant access to a user with [`grant_access`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.grant_access). ### Download access report You can download a report of all access requests for a gated model with the **download user access report** button. Click on it to download a json file with a list of users. For each entry, you have: - **user**: the user id. Example: *julien-c*. - **fullname**: name of the user on the Hub. Example: *Julien Chaumond*. - **status**: status of the request. Either `"pending"`, `"accepted"` or `"rejected"`. - **email**: email of the user. - **time**: datetime when the user initially made the request. ### Customize requested information By default, users landing on your gated model will be asked to share their contact information (email and username) by clicking the **Agree and send request to access repo** button. If you want to collect more user information, you can configure additional fields. This information will be accessible from the **Settings** tab. To do so, add an `extra_gated_fields` property to your [model card metadata](./model-cards#model-card-metadata) containing a list of key/value pairs. The *key* is the name of the field and *value* its type or an object with a `type` field. The list of field types is: - `text`: a single-line text field. - `checkbox`: a checkbox field. - `date_picker`: a date picker field. - `country`: a country dropdown. The list of countries is based on the [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) standard. - `select`: a dropdown with a list of options. The list of options is defined in the `options` field. Example: `options: ["option 1", "option 2", {label: "option3", value: "opt3"}]`. Finally, you can also personalize the message displayed to the user with the `extra_gated_prompt` extra field. 
Here is an example of customized request form where the user is asked to provide their company name and country and acknowledge that the model is for non-commercial use only. ```yaml --- extra_gated_prompt: "You agree to not use the model to conduct experiments that cause harm to human subjects." extra_gated_fields: Company: text Country: country Specific date: date_picker I want to use this model for: type: select options: - Research - Education - label: Other value: other I agree to use this model for non-commercial use ONLY: checkbox --- ``` In some cases, you might also want to modify the default text in the gate heading, description, and button. For those use cases, you can modify `extra_gated_heading`, `extra_gated_description` and `extra_gated_button_content` like this: ```yaml --- extra_gated_heading: "Acknowledge license to accept the repository" extra_gated_description: "Our team may take 2-3 days to process your request" extra_gated_button_content: "Acknowledge license" --- ``` ### Example use cases of programmatically managing access requests Here are a few interesting use cases of programmatically managing access requests for gated repos we've seen organically emerge in the community. As a reminder, the model repo needs to be set to manual approval, otherwise users get access to it automatically. Possible use cases of programmatic management include: - If you have advanced user request screening requirements (for advanced compliance requirements, etc) or you wish to handle the user requests outside the Hub. - An example for this was Meta's [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) initial release where users had to request access on a Meta website. - You can ask users for their HF username in your access flow, and then use a script to programmatically accept user requests on the Hub based on your set of conditions. - If you want to condition access to a model based on completing a payment flow (note that the actual payment flow happens outside of the Hub). - Here's an [example repo](https://huggingface.co/Trelis/openchat_3.5-function-calling-v3) from TrelisResearch that uses this use case. - [@RonanMcGovern](https://huggingface.co/RonanMcGovern) has posted a [video about the flow](https://www.youtube.com/watch?v=2OT2SI5auQU) and tips on how to implement it. ## Manage gated models as an organization (Enterprise Hub) [Enterprise Hub](https://huggingface.co/docs/hub/en/enterprise-hub) subscribers can create a Gating Group Collection to grant (or reject) access to all the models and datasets in a collection at once. More information about Gating Group Collections can be found in [our dedicated doc](https://huggingface.co/docs/hub/en/enterprise-hub-gating-group-collections). ## Access gated models as a user As a user, if you want to use a gated model, you will need to request access to it. This means that you must be logged in to a Hugging Face user account. Requesting access can only be done from your browser. Go to the model on the Hub and you will be prompted to share your information: By clicking on **Agree**, you agree to share your username and email address with the model authors. In some cases, additional fields might be requested. To help the model authors decide whether to grant you access, try to fill out the form as completely as possible. Once the access request is sent, there are two possibilities. If the approval mechanism is automatic, you immediately get access to the model files. 
Otherwise, the requests have to be approved manually by the authors, which can take more time. > [!WARNING] > The model authors have complete control over model access. In particular, they can decide at any time to block your access to the model without prior notice, regardless of the approval mechanism, even if your request has already been approved. ### Download files To download files from a gated model, you'll need to be authenticated. In the browser, this is automatic as long as you are logged in with your account. If you are using a script, you will need to provide a [user token](./security-tokens). In the Hugging Face Python ecosystem (`transformers`, `diffusers`, `datasets`, etc.), you can log in on your machine using the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/index) library by running in your terminal: ```bash hf auth login ``` Alternatively, you can programmatically log in using `login()` in a notebook or a script: ```python >>> from huggingface_hub import login >>> login() ``` You can also provide the `token` parameter to most loading methods in the libraries (`from_pretrained`, `hf_hub_download`, `load_dataset`, etc.), directly from your scripts. For more details about how to log in, check out the [login guide](https://huggingface.co/docs/huggingface_hub/quick-start#login). ### Restricting Access for EU Users For gated models, you can add an additional layer of access control to specifically restrict users from European Union countries. This is useful if your model's license or terms of use prohibit its distribution in the EU. To enable this, add the `extra_gated_eu_disallowed: true` property to your model card's metadata. **Important:** This feature will only activate if your model is already gated. If `gated: false` or the property is not set, this restriction will not apply. ```yaml --- license: mit gated: true extra_gated_eu_disallowed: true --- ``` The system identifies a user's location based on their IP address. ### Moderation https://huggingface.co/docs/hub/moderation.md # Moderation > [!TIP] > Check out the [Code of Conduct](https://huggingface.co/code-of-conduct) and the [Content Guidelines](https://huggingface.co/content-guidelines). ## Reporting a repository To report a repository, you can click the three dots at the top right of a repository. Afterwards, you can click "Report the repository". This will allow you to explain the reason behind the report (ethical issue, legal issue, not working, or other) and add a description for the report. Once you do this, a **public discussion** will be opened. ## Reporting a comment To report a comment, you can click the three dots at the top right of a comment. That will submit a request for the Hugging Face team to review. ### Run with Docker https://huggingface.co/docs/hub/spaces-run-with-docker.md # Run with Docker You can use Docker to run most Spaces locally. To view instructions to download and run Spaces' Docker images, click on the "Run with Docker" button on the top-right corner of your Space page: ## Login to the Docker registry Some Spaces will require you to log in to Hugging Face's Docker registry. To do so, you'll need to provide: - Your Hugging Face username as `username` - A User Access Token as `password`. Generate one [here](https://huggingface.co/settings/tokens).
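For example, assuming the registry host shown in your Space's "Run with Docker" instructions is `registry.hf.space`, the login step could look roughly like this (username and token are placeholders):

```bash
# Log in to the Spaces Docker registry before pulling a private image.
# The registry host comes from the "Run with Docker" snippet of your Space
# (assumed here to be registry.hf.space).
docker login registry.hf.space
# Username: <your Hugging Face username>
# Password: <a User Access Token from https://huggingface.co/settings/tokens>
```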
### Using Stable-Baselines3 at Hugging Face https://huggingface.co/docs/hub/stable-baselines3.md # Using Stable-Baselines3 at Hugging Face `stable-baselines3` is a set of reliable implementations of reinforcement learning algorithms in PyTorch. ## Exploring Stable-Baselines3 in the Hub You can find Stable-Baselines3 models by filtering at the left of the [models page](https://huggingface.co/models?library=stable-baselines3). All models on the Hub come with useful features: 1. An automatically generated model card with a description, a training configuration, and more. 2. Metadata tags that help for discoverability. 3. Evaluation results to compare with other models. 4. A video widget where you can watch your agent performing. ## Install the library To install the `stable-baselines3` library, you need to install two packages: - `stable-baselines3`: the Stable-Baselines3 library. - `huggingface-sb3`: additional code to load and upload Stable-Baselines3 models from/to the Hub. ``` pip install stable-baselines3 pip install huggingface-sb3 ``` ## Using existing models You can simply download a model from the Hub using the `load_from_hub` function: ``` from huggingface_sb3 import load_from_hub checkpoint = load_from_hub( repo_id="sb3/demo-hf-CartPole-v1", filename="ppo-CartPole-v1.zip", ) ``` You need to define two parameters: - `repo_id`: the name of the Hugging Face repo you want to download. - `filename`: the file you want to download. ## Sharing your models You can easily upload your models using two different functions: 1. `package_to_hub()`: save the model, evaluate it, generate a model card and record a replay video of your agent before pushing the complete repo to the Hub. ``` package_to_hub(model=model, model_name="ppo-LunarLander-v2", model_architecture="PPO", env_id=env_id, eval_env=eval_env, repo_id="ThomasSimonini/ppo-LunarLander-v2", commit_message="Test commit") ``` You need to define seven parameters: - `model`: your trained model. - `model_name`: the name of the model. - `model_architecture`: name of the architecture of your model (DQN, PPO, A2C, SAC...). - `env_id`: name of the environment. - `eval_env`: environment used to evaluate the agent. - `repo_id`: the name of the Hugging Face repo you want to create or update. It’s `/`. - `commit_message`: the commit message. 2. `push_to_hub()`: simply push a file to the Hub ``` push_to_hub( repo_id="ThomasSimonini/ppo-LunarLander-v2", filename="ppo-LunarLander-v2.zip", commit_message="Added LunarLander-v2 model trained with PPO", ) ``` You need to define three parameters: - `repo_id`: the name of the Hugging Face repo you want to create or update. It’s `/`. - `filename`: the file you want to push to the Hub. - `commit_message`: the commit message. ## Additional resources * Hugging Face Stable-Baselines3 [documentation](https://github.com/huggingface/huggingface_sb3#hugging-face--x-stable-baselines3-v20) * Stable-Baselines3 [documentation](https://stable-baselines3.readthedocs.io/en/master/) ### Managing Spaces with CircleCI Workflows https://huggingface.co/docs/hub/spaces-circleci.md # Managing Spaces with CircleCI Workflows You can keep your app in sync with your GitHub repository with a **CircleCI workflow**. [CircleCI](https://circleci.com) is a continuous integration and continuous delivery (CI/CD) platform that helps automate the software development process.
A [CircleCI workflow](https://circleci.com/docs/workflows/) is a set of automated tasks defined in a configuration file, orchestrated by CircleCI, to streamline the process of building, testing, and deploying software applications. *Note: For files larger than 10MB, Spaces requires Git-LFS. If you don't want to use Git-LFS, you may need to review your files and check your history. Use a tool like [BFG Repo-Cleaner](https://rtyley.github.io/bfg-repo-cleaner/) to remove any large files from your history. BFG Repo-Cleaner will keep a local copy of your repository as a backup.* First, set up your GitHub repository and Spaces app together. Add your Spaces app as an additional remote to your existing Git repository. ```bash git remote add space https://huggingface.co/spaces/HF_USERNAME/SPACE_NAME ``` Then force push to sync everything for the first time: ```bash git push --force space main ``` Next, set up a [CircleCI workflow](https://circleci.com/docs/workflows/) to push your `main` git branch to Spaces. In the example below: * Replace `HF_USERNAME` with your username and `SPACE_NAME` with your Space name. * [Create a context in CircleCI](https://circleci.com/docs/contexts/) and add an env variable into it called *HF_PERSONAL_TOKEN* (you can give it any name, use the key you create in place of HF_PERSONAL_TOKEN) and the value as your Hugging Face API token. You can find your Hugging Face API token under **API Tokens** on [your Hugging Face profile](https://huggingface.co/settings/tokens). ```yaml version: 2.1 workflows: main: jobs: - sync-to-huggingface: context: - HuggingFace filters: branches: only: - main jobs: sync-to-huggingface: docker: - image: alpine resource_class: small steps: - run: name: install git command: apk update && apk add openssh-client git - checkout - run: name: push to Huggingface hub command: | git config user.email "" git config user.name "" git push -f https://HF_USERNAME:${HF_PERSONAL_TOKEN}@huggingface.co/spaces/HF_USERNAME/SPACE_NAME main ``` ### Annotated Model Card Template https://huggingface.co/docs/hub/model-card-annotated.md # Annotated Model Card Template ## Template [modelcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md) ## Directions Fully filling out a model card requires input from a few different roles. (One person may have more than one role.) We’ll refer to these roles as the **developer**, who writes the code and runs training; the **sociotechnic**, who is skilled at analyzing the interaction of technology and society long-term (this includes lawyers, ethicists, sociologists, or rights advocates); and the **project organizer**, who understands the overall scope and reach of the model, can roughly fill out each part of the card, and who serves as a contact person for model card updates. * The **developer** is necessary for filling out [Training Procedure](#training-procedure-optional) and [Technical Specifications](#technical-specifications-optional). They are also particularly useful for the “Limitations” section of [Bias, Risks, and Limitations](#bias-risks-and-limitations). They are responsible for providing [Results](#results) for the Evaluation, and ideally work with the other roles to define the rest of the Evaluation: [Testing Data, Factors & Metrics](#testing-data-factors--metrics). 
* The **sociotechnic** is necessary for filling out “Bias” and “Risks” within [Bias, Risks, and Limitations](#bias-risks-and-limitations), and particularly useful for “Out of Scope Use” within [Uses](#uses). * The **project organizer** is necessary for filling out [Model Details](#model-details) and [Uses](#uses). They might also fill out [Training Data](#training-data). Project organizers could also be in charge of [Citation](#citation-optional), [Glossary](#glossary-optional), [Model Card Contact](#model-card-contact), [Model Card Authors](#model-card-authors-optional), and [More Information](#more-information-optional). _Instructions are provided below, in italics._ Template variable names appear in `monospace`. --- # Model Name **Section Overview:** Provide the model name and a 1-2 sentence summary of what the model is. `model_id` `model_summary` # Table of Contents **Section Overview:** Provide this with links to each section, to enable people to easily jump around/use the file in other locations with the preserved TOC/print out the content/etc. # Model Details **Section Overview:** This section provides basic information about what the model is, its current status, and where it came from. It should be useful for anyone who wants to reference the model. ## Model Description `model_description` _Provide basic details about the model. This includes the architecture, version, if it was introduced in a paper, if an original implementation is available, and the creators. Any copyright should be attributed here. General information about training procedures, parameters, and important disclaimers can also be mentioned in this section._ * **Developed by:** `developers` _List (and ideally link to) the people who built the model._ * **Funded by:** `funded_by` _List (and ideally link to) the funding sources that financially, computationally, or otherwise supported or enabled this model._ * **Shared by [optional]:** `shared_by` _List (and ideally link to) the people/organization making the model available online._ * **Model type:** `model_type` _You can name the “type” as:_ _1. Supervision/Learning Method_ _2. Machine Learning Type_ _3. Modality_ * **Language(s)** [NLP]: `language` _Use this field when the system uses or processes natural (human) language._ * **License:** `license` _Name and link to the license being used._ * **Finetuned From Model [optional]:** `base_model` _If this model has another model as its base, link to that model here._ ## Model Sources [optional] * **Repository:** `repo` * **Paper [optional]:** `paper` * **Demo [optional]:** `demo` _Provide sources for the user to directly see the model and its details. Additional kinds of resources – training logs, lessons learned, etc. – belong in the [More Information](#more-information-optional) section. If you include one thing for this section, link to the repository._ # Uses **Section Overview:** This section addresses questions around how the model is intended to be used in different applied contexts, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model. Note this section is not intended to include the license usage details. For that, link directly to the license. ## Direct Use `direct_use` _Explain how the model can be used without fine-tuning, post-processing, or plugging into a pipeline. 
An example code snippet is recommended._ ## Downstream Use [optional] `downstream_use` _Explain how this model can be used when fine-tuned for a task or when plugged into a larger ecosystem or app. An example code snippet is recommended._ ## Out-of-Scope Use `out_of_scope_use` _List how the model may foreseeably be misused (used in a way it will not work for) and address what users ought not do with the model._ # Bias, Risks, and Limitations **Section Overview:** This section identifies foreseeable harms, misunderstandings, and technical and sociotechnical limitations. It also provides information on warnings and potential mitigations. Bias, risks, and limitations can sometimes be inseparable/refer to the same issues. Generally, bias and risks are sociotechnical, while limitations are technical: - A **bias** is a stereotype or disproportionate performance (skew) for some subpopulations. - A **risk** is a socially-relevant issue that the model might cause. - A **limitation** is a likely failure mode that can be addressed following the listed Recommendations. `bias_risks_limitations` _What are the known or foreseeable issues stemming from this model?_ ## Recommendations `bias_recommendations` _What are recommendations with respect to the foreseeable issues? This can include everything from “downsample your image” to filtering explicit content._ # Training Details **Section Overview:** This section provides information to describe and replicate training, including the training data, the speed and size of training elements, and the environmental impact of training. This relates heavily to the [Technical Specifications](#technical-specifications-optional) as well, and content here should link to that section when it is relevant to the training procedure. It is useful for people who want to learn more about the model inputs and training footprint. It is relevant for anyone who wants to know the basics of what the model is learning. ## Training Data `training_data` _Write 1-2 sentences on what the training data is. Ideally this links to a Dataset Card for further information. Links to documentation related to data pre-processing or additional filtering may go here as well as in [More Information](#more-information-optional)._ ## Training Procedure [optional] ### Preprocessing `preprocessing` _Detail tokenization, resizing/rewriting (depending on the modality), etc._ ### Speeds, Sizes, Times `speeds_sizes_times` _Detail throughput, start/end time, checkpoint sizes, etc._ # Evaluation **Section Overview:** This section describes the evaluation protocols, what is being measured in the evaluation, and provides the results. Evaluation ideally has at least two parts, with one part looking at quantitative measurement of general performance ([Testing Data, Factors & Metrics](#testing-data-factors--metrics)), such as may be done with benchmarking; and another looking at performance with respect to specific social safety issues ([Societal Impact Assessment](#societal-impact-assessment-optional)), such as may be done with red-teaming. You can also specify your model's evaluation results in a structured way in the model card metadata. Results are parsed by the Hub and displayed in a widget on the model page. See https://huggingface.co/docs/hub/model-cards#evaluation-results. 
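For illustration, a minimal sketch of such structured results in the card's YAML metadata might look like the following (the model name, dataset, and score are placeholders; see the linked model cards documentation for the full schema):

```yaml
model-index:
- name: my-model                 # placeholder model name
  results:
  - task:
      type: text-classification  # task type for this evaluation
    dataset:
      type: glue                 # placeholder dataset identifier
      name: GLUE (SST-2)
      config: sst2
      split: validation
    metrics:
    - type: accuracy
      value: 0.91                # placeholder score
```

The Hub parses this `model-index` block and renders it as the evaluation results widget on the model page.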
## Testing Data, Factors & Metrics _Evaluation is ideally **disaggregated** with respect to different factors, such as task, domain and population subgroup; and calculated with metrics that are most meaningful for foreseeable contexts of use. Equal evaluation performance across different subgroups is said to be "fair" across those subgroups; target fairness metrics should be decided based on which errors are more likely to be problematic in light of the model use. However, this section is most commonly used to report aggregate evaluation performance on different task benchmarks._ ### Testing Data `testing_data` _Describe testing data or link to its Dataset Card._ ### Factors `testing_factors` _What are the foreseeable characteristics that will influence how the model behaves? Evaluation should ideally be disaggregated across these factors in order to uncover disparities in performance._ ### Metrics `testing_metrics` _What metrics will be used for evaluation?_ ## Results `results` _Results should be based on the Factors and Metrics defined above._ ### Summary `results_summary` _What do the results say? This can function as a kind of tl;dr for general audiences._ ## Societal Impact Assessment [optional] _Use this free text section to explain how this model has been evaluated for risk of societal harm, such as for child safety, NCII, privacy, and violence. This might take the form of answers to the following questions:_ - _Is this model safe for kids to use? Why or why not?_ - _Has this model been tested to evaluate risks pertaining to non-consensual intimate imagery (including CSEM)?_ - _Has this model been tested to evaluate risks pertaining to violent activities, or depictions of violence? What were the results?_ _Quantitative numbers on each issue may also be provided._ # Model Examination [optional] **Section Overview:** This is an experimental section some developers are beginning to add, where work on explainability/interpretability may go. `model_examination` # Environmental Impact **Section Overview:** Summarizes the information necessary to calculate environmental impacts such as electricity usage and carbon emissions. * **Hardware Type:** `hardware_type` * **Hours used:** `hours_used` * **Cloud Provider:** `cloud_provider` * **Compute Region:** `cloud_region` * **Carbon Emitted:** `co2_emitted` _Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700)._ # Technical Specifications [optional] **Section Overview:** This section includes details about the model objective and architecture, and the compute infrastructure. It is useful for people interested in model development. Writing this section usually requires the model developer to be directly involved. ## Model Architecture and Objective `model_specs` ## Compute Infrastructure `compute_infrastructure` ### Hardware `hardware_requirements` _What are the minimum hardware requirements, e.g. processing, storage, and memory requirements?_ ### Software `software` # Citation [optional] **Section Overview:** The developers’ preferred citation for this model. This is often a paper. ### BibTeX `citation_bibtex` ### APA `citation_apa` # Glossary [optional] **Section Overview:** This section defines common terms and how metrics are calculated. 
`glossary` _Clearly define terms in order to be accessible across audiences._ # More Information [optional] **Section Overview:** This section provides links to writing on dataset creation, technical specifications, lessons learned, and initial results. `more_information` # Model Card Authors [optional] **Section Overview:** This section lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction. `model_card_authors` # Model Card Contact **Section Overview:** Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors `model_card_contact` # How to Get Started with the Model **Section Overview:** Provides a code snippet to show how to use the model. `get_started_code` --- **Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### Repositories https://huggingface.co/docs/hub/repositories.md # Repositories Models, Spaces, and Datasets are hosted on the Hugging Face Hub as [Git repositories](https://git-scm.com/about), which means that version control and collaboration are core elements of the Hub. In a nutshell, a repository (also known as a **repo**) is a place where code and assets can be stored to back up your work, share it with the community, and work in a team. Unlike other collaboration platforms, our Git repositories are optimized for Machine Learning and AI files – large binary files, usually in specific file formats like Parquet and Safetensors, and up to [Terabyte-scale sizes](https://huggingface.co/blog/from-files-to-chunks)! To achieve this, we built [Xet](./xet/index), a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads. In these pages, you will go over the basics of getting started with Git and Xet and interacting with repositories on the Hub. Once you get the hang of it, you can explore the best practices and next steps that we've compiled for effective repository usage. ## Contents - [Getting Started with Repositories](./repositories-getting-started) - [Settings](./repositories-settings) - [Storage Limits](./storage-limits) - [Storage Backend (Xet)](./xet/index) - [Pull Requests & Discussions](./repositories-pull-requests-discussions) - [Pull Requests advanced usage](./repositories-pull-requests-discussions#pull-requests-advanced-usage) - [Collections](./collections) - [Notifications](./notifications) - [Webhooks](./webhooks) - [Next Steps](./repositories-next-steps) - [Licenses](./repositories-licenses) ### Webhook guide: Setup an automatic metadata quality review for models and datasets https://huggingface.co/docs/hub/webhooks-guide-metadata-review.md # Webhook guide: Setup an automatic metadata quality review for models and datasets > [!TIP] > Webhooks are now publicly available! This guide will walk you through creating a system that reacts to changes to a user's or organization's models or datasets on the Hub and creates a 'metadata review' for the changed repository. ## What are we building and why? Before we dive into the technical details involved in this particular workflow, we'll quickly outline what we're creating and why. 
[Model cards](https://huggingface.co/docs/hub/model-cards) and [dataset cards](https://huggingface.co/docs/hub/datasets-cards) are essential tools for documenting machine learning models and datasets. The Hugging Face Hub uses a `README.md` file containing a [YAML](https://en.wikipedia.org/wiki/YAML) header block to generate model and dataset cards. This `YAML` section defines metadata relating to the model or dataset. For example: ```yaml --- language: - "List of ISO 639-1 code for your language" - lang1 - lang2 tags: - tag1 - tag2 license: "any valid license identifier" datasets: - dataset1 --- ``` This metadata contains essential information about your model or dataset for potential users. The license, for example, defines the terms under which a model or dataset can be used. Hub users can also use the fields defined in the `YAML` metadata as filters for identifying models or datasets that fit specific criteria. Since the metadata defined in this block is essential for potential users of our models and datasets, it is important that we complete this section. In a team or organization setting, users pushing models and datasets to the Hub may have differing familiarity with the importance of this YAML metadata block. While someone in a team could take on the responsibility of reviewing this metadata, there may instead be some automation we can do to help us with this problem. The result will be a metadata review report automatically posted or updated when a repository on the Hub changes. For our metadata quality, this system works similarly to [CI/CD](https://en.wikipedia.org/wiki/CI/CD). ![Metadata review](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/003-metadata-review/metadata-report-screenshot.png) You can also find an example review [here](https://huggingface.co/datasets/davanstrien/test_webhook/discussions/1#63d932fe19aa7b8ed2718b3f). ## Using the Hub Client Library to create a model review card The `huggingface_hub` is a Python library that allows you to interact with the Hub. We can use this library to [download model and dataset cards](https://huggingface.co/docs/huggingface_hub/how-to-model-cards) from the Hub using the `DatasetCard.load` or `ModelCard.load` methods. In particular, we'll use these methods to load a Python dictionary, which contains the metadata defined in the `YAML` of our model or dataset card. We'll create a small Python function to wrap these methods and do some exception handling. ```python from huggingface_hub import DatasetCard, ModelCard from huggingface_hub.utils import EntryNotFoundError def load_repo_card_metadata(repo_type, repo_name): if repo_type == "dataset": try: return DatasetCard.load(repo_name).data.to_dict() except EntryNotFoundError: return {} if repo_type == "model": try: return ModelCard.load(repo_name).data.to_dict() except EntryNotFoundError: return {} ``` This function will return a Python dictionary containing the metadata associated with the repository (or an empty dictionary if there is no metadata). ```python {'license': 'afl-3.0'} ``` ## Creating our metadata review report Once we have a Python dictionary containing the metadata associated with a repository, we'll create a 'report card' for our metadata review. In this particular instance, we'll review our metadata by defining some metadata fields for which we want values. For example, we may want to ensure that the `license` field has always been completed. 
To rate our metadata, we'll count which metadata fields are present out of our desired fields and return a percentage score based on the coverage of the required metadata fields we want to see values for. Since we have a Python dictionary containing our metadata, we can loop through this dictionary to check if our desired keys are there. If a desired metadata field (a key in our dictionary) is missing, we'll assign the value as `None`. ```python def create_metadata_key_dict(card_data, repo_type: str): shared_keys = ["tags", "license"] if repo_type == "model": model_keys = ["library_name", "datasets", "metrics", "co2", "pipeline_tag"] shared_keys.extend(model_keys) keys = shared_keys return {key: card_data.get(key) for key in keys} if repo_type == "dataset": # [...] ``` This function will return a dictionary containing keys representing the metadata fields we require for our model or dataset. The dictionary values will either include the metadata entered for that field or `None` if that metadata field is missing in the `YAML`. ```python {'tags': None, 'license': 'afl-3.0', 'library_name': None, 'datasets': None, 'metrics': None, 'co2': None, 'pipeline_tag': None} ``` Once we have this dictionary, we can create our metadata report. In the interest of brevity, we won't include the complete code here, but the Hugging Face Spaces [repository](https://huggingface.co/spaces/librarian-bot/webhook_metadata_reviewer/blob/main/main.py) for this Webhook contains the full code. We create one function which produces a markdown table giving a prettier version of the data we have in our metadata coverage dictionary. ```python def create_metadata_breakdown_table(desired_metadata_dictionary): # [...] return tabulate( table_data, tablefmt="github", headers=("Metadata Field", "Provided Value") ) ``` We also have a Python function that generates a score (representing the percentage of the desired metadata fields present) ```python def calculate_grade(desired_metadata_dictionary): # [...] return round(score, 2) ``` and a Python function that creates a markdown report for our metadata review. This report contains both the score and metadata table, along with some explanation of what the report contains. ```python def create_markdown_report( desired_metadata_dictionary, repo_name, repo_type, score, update: bool = False ): # [...] return report ``` ## How to post the review automatically? We now have a markdown-formatted metadata review report. We'll use the `huggingface_hub` library to post this review. We define a function that takes the Webhook data received from the Hub, parses it, and creates the metadata report. Depending on whether a report has previously been created, the function creates a new report or updates an existing metadata review thread. ```python def create_or_update_report(data): if parsed_post := parse_webhook_post(data): repo_type, repo_name = parsed_post else: return Response("Unable to parse webhook data", status_code=400) # [...] return True ``` > [!TIP] > `:=` is the Python syntax for an assignment expression, an operator added to the Python language in version 3.8 (colloquially known as the walrus operator). People have mixed opinions on this syntax, and it doesn't change how Python evaluates your code if you don't use it. You can read more about this operator in this Real Python article. ## Creating a Webhook to respond to changes on the Hub We've now got the core functionality for creating a metadata review report for a model or dataset.
The next step is to use Webhooks to respond to changes automatically. ## Create a Webhook in your user profile First, create your Webhook by going to https://huggingface.co/settings/webhooks. - Input a few target repositories that your Webhook will listen to (you will likely want to limit this to your own repositories or the repositories of the organization you belong to). - Input a secret to make your Webhook more secure (if you don't know what to choose for this, you may want to use a [password generator](https://1password.com/password-generator/) to generate a sufficiently long random string for your secret). - We can pass a dummy URL for the `Webhook URL` parameter for now. Your Webhook will look like this: ![webhook settings](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/003-metadata-review/webhook-settings.png) ## Create a new Bot user profile This guide creates a separate user account that will post the metadata reviews. ![Bot user account](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/003-metadata-review/librarian-bot-profile.png) > [!TIP] > When creating a bot that will interact with other users on the Hub, we ask that you clearly label the account as a "Bot" (see profile screenshot). ## Create a Webhook listener We now need some way of listening to Webhook events. There are many possible tools you can use to listen to Webhook events. Many existing services, such as [Zapier](https://zapier.com/) and [IFTTT](https://ifttt.com), can use Webhooks to trigger actions (for example, they could post a tweet every time a model is updated). In this case, we'll implement our Webhook listener using [FastAPI](https://fastapi.tiangolo.com/). [FastAPI](https://fastapi.tiangolo.com/) is a Python web framework. We'll use FastAPI to create a Webhook listener. In particular, we need to implement a route that accepts `POST` requests on `/webhook`. For authentication, we'll compare the `X-Webhook-Secret` header with a `WEBHOOK_SECRET` secret that can be passed to our [Docker container at runtime](./spaces-sdks-docker#runtime). ```python from fastapi import FastAPI, Request, Response import os KEY = os.environ.get("WEBHOOK_SECRET") app = FastAPI() @app.post("/webhook") async def webhook(request: Request): if request.method == "POST": if request.headers.get("X-Webhook-Secret") != KEY: return Response("Invalid secret", status_code=401) data = await request.json() result = create_or_update_report(data) return "Webhook received!" if result else result ``` The above function will receive Webhook events and creates or updates the metadata review report for the changed repository. ## Use Spaces to deploy our Webhook app Our [main.py](https://huggingface.co/spaces/librarian-bot/webhook_metadata_reviewer/blob/main/main.py) file contains all the code we need for our Webhook app. To deploy it, we'll use a [Space](./spaces-overview). For our Space, we'll use Docker to run our app. The [Dockerfile](https://huggingface.co/spaces/librarian-bot/webhook_metadata_reviewer/blob/main/Dockerfile) copies our app file, installs the required dependencies, and runs the application. To populate the `KEY` variable, we'll also set a `WEBHOOK_SECRET` secret for our Space with the secret we generated earlier. You can read more about Docker Spaces [here](./spaces-sdks-docker). Finally, we need to update the URL in our Webhook settings to the URL of our Space. We can get our Space’s “direct URL” from the contextual menu. 
Click on “Embed this Space” and copy the “Direct URL”. ![direct url](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/003-metadata-review/direct-url.png) Once we have this URL, we can pass this to the `Webhook URL` parameter in our Webhook settings. Our bot should now start posting reviews when monitored repositories change! ## Conclusion and next steps We now have an automatic metadata review bot! Here are some ideas for how you could build on this guide: - The metadata review done by our bot was relatively crude; you could add more complex rules for reviewing metadata. - You could use the full `README.md` file for doing the review. - You may want to define 'rules' which are particularly important for your organization and use a webhook to check these are followed. If you build a metadata quality app using Webhooks, please tag me @davanstrien; I would love to know about it! ### Access control in organizations https://huggingface.co/docs/hub/organizations-security.md # Access control in organizations > [!TIP] > You can set up [Single Sign-On (SSO)](./security-sso) to be able to map access control rules from your organization's Identity Provider. > [!TIP] > Advanced and more fine-grained access control can be achieved with [Resource Groups](./security-resource-groups). > > The Resource Group feature is part of the Team & Enterprise plans. Members of organizations can have four different roles: `read`, `contributor`, `write`, or `admin`: - `read`: read-only access to the Organization's repos and metadata/settings (eg, the Organization's profile, members list, API token, etc). - `contributor`: additional write rights to the subset of the Organization's repos that were created by the user. I.e., users can create repos and _then_ modify only those repos. This is similar to the `write` role, but scoped to repos _created_ by the user. - `write`: write rights to all the Organization's repos. Users can create, delete, or rename any repo in the Organization namespace. A user can also edit and delete files from the browser editor and push content with `git`. - `admin`: in addition to write rights on repos, admin members can update the Organization's profile, refresh the Organization's API token, and manage Organization members. As an organization `admin`, go to the **Members** section of the org settings to manage roles for users. ## Viewing members' email address > [!WARNING] > This feature is part of the Team & Enterprise plans. You may be able to view the email addresses of members of your organization. The visibility of the email addresses depends on the organization's SSO configuration, or verified organization status. - By [verifying an email domain](./organizations-managing#organization-email-domain) for your organization, you can view the email addresses of members with a matching email domain. - If SSO is configured for your organization, you can view the email address for each of your organization members by setting `Matching email domains` in the SSO configuration ## Managing Access Tokens with access to my organization See [Tokens Management](./enterprise-hub-tokens-management) ### Network Security https://huggingface.co/docs/hub/enterprise-hub-network-security.md # Network Security > [!WARNING] > This feature is part of the Enterprise Plus plan. 
## Define your organization IP Ranges You can list the IP addresses of your organization's outbound traffic to apply for higher rate limits and/or to enforce authenticated access to Hugging Face from your corporate network. The outbound IP address ranges are defined in CIDR format. For example, `52.219.168.0/24` or `2600:1f69:7400::/40`. You can set multiple ranges, one per line. ## Higher Hub Rate Limits Most of the actions on the Hub have limits; for example, users are limited to creating a certain number of repositories per day. Enterprise Plus automatically gives your users the highest rate limits possible for every action. Additionally, once your IP ranges are set, enabling the "Higher Hub Rate Limits" option allows your organization to benefit from the highest HTTP rate limits on the Hub API, unlocking large volumes of model or dataset downloads. For more information about rate limits, see the [Hub Rate limits](./rate-limits) documentation. ## Restrict organization access to your IP ranges only This option restricts access to your organization's resources to only requests coming from your defined IP ranges. No one can access your organization's resources from outside your IP ranges. The rules also apply to access tokens. When enabled, this option unlocks additional nested security settings below. ### Require login for users in your IP ranges When this option is enabled, anyone visiting Hugging Face from your corporate network must be logged in and belong to your organization (requires a manual verification when IP ranges have changed). If enabled, you can optionally define a content access policy. All public pages will show the following message if access is unauthenticated: ### Content Access Policy Define a fine-grained Content Access Policy by blocking certain sections of the Hugging Face Hub. For example, you can block your organization's members from accessing Spaces by adding `/spaces/*` to the blocked URLs. When users of your organization navigate to a page that matches the URL pattern, they'll be presented with the following page: To define Blocked URLs, enter URL patterns, without the domain name, one per line: The Allowed URLs field lets you define exceptions to the blocking rules, for example by allowing a specific URL within a Blocked URLs pattern, such as `/spaces/meta-llama/*`. ### Docker Spaces Examples https://huggingface.co/docs/hub/spaces-sdks-docker-examples.md # Docker Spaces Examples We gathered some example demos in the [Spaces Examples](https://huggingface.co/SpacesExamples) organization. Please check them out!
* Dummy FastAPI app: https://huggingface.co/spaces/DockerTemplates/fastapi_dummy * FastAPI app serving a static site and using `transformers`: https://huggingface.co/spaces/DockerTemplates/fastapi_t5 * Phoenix app for https://huggingface.co/spaces/DockerTemplates/single_file_phx_bumblebee_ml * HTTP endpoint in Go with query parameters https://huggingface.co/spaces/XciD/test-docker-go?q=Adrien * Shiny app written in Python https://huggingface.co/spaces/elonmuskceo/shiny-orbit-simulation * Genie.jl app in Julia https://huggingface.co/spaces/nooji/GenieOnHuggingFaceSpaces * Argilla app for data labelling and curation: https://huggingface.co/spaces/argilla/live-demo and [write-up about hosting Argilla on Spaces](./spaces-sdks-docker-argilla) by [@dvilasuero](https://huggingface.co/dvilasuero) 🎉 * JupyterLab and VSCode: https://huggingface.co/spaces/DockerTemplates/docker-examples by [@camenduru](https://twitter.com/camenduru) and [@nateraw](https://hf.co/nateraw). * Zeno app for interactive model evaluation: https://huggingface.co/spaces/zeno-ml/diffusiondb and [instructions for setup](https://zenoml.com/docs/deployment#hugging-face-spaces) * Gradio App: https://huggingface.co/spaces/sayakpaul/demo-docker-gradio ### Query datasets https://huggingface.co/docs/hub/datasets-duckdb-select.md # Query datasets Querying datasets is a fundamental step in data analysis. Here, we'll guide you through querying datasets using various methods. There are [several ways](https://duckdb.org/docs/data/parquet/overview.html) to select your data. Using the `FROM` syntax: ```bash FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' SELECT city, country, region LIMIT 3; ┌────────────────┬─────────────┬───────────────┐ │ city │ country │ region │ │ varchar │ varchar │ varchar │ ├────────────────┼─────────────┼───────────────┤ │ Kabul │ Afghanistan │ Southern Asia │ │ Kandahar │ Afghanistan │ Southern Asia │ │ Mazar-e Sharif │ Afghanistan │ Southern Asia │ └────────────────┴─────────────┴───────────────┘ ``` Using the `SELECT` and `FROM` syntax: ```bash SELECT city, country, region FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' USING SAMPLE 3; ┌──────────┬─────────┬────────────────┐ │ city │ country │ region │ │ varchar │ varchar │ varchar │ ├──────────┼─────────┼────────────────┤ │ Wenzhou │ China │ Eastern Asia │ │ Valdez │ Ecuador │ South America │ │ Aplahoue │ Benin │ Western Africa │ └──────────┴─────────┴────────────────┘ ``` Count all JSONL files matching a glob pattern: ```bash SELECT COUNT(*) FROM 'hf://datasets/jamescalam/world-cities-geo/*.jsonl'; ┌──────────────┐ │ count_star() │ │ int64 │ ├──────────────┤ │ 9083 │ └──────────────┘ ``` You can also query Parquet files using the `read_parquet` function (or its alias `parquet_scan`). This function, along with other [parameters](https://duckdb.org/docs/data/parquet/overview.html#parameters), provides flexibility in handling Parquet files specially if they dont have a `.parquet` extension. Let's explore these functions using the auto-converted Parquet files from the same dataset. 
Select using [read_parquet](https://duckdb.org/docs/guides/file_formats/query_parquet.html) function: ```bash SELECT * FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet') LIMIT 3; ┌────────────────┬─────────────┬───────────────┬───────────┬────────────┬────────────┬────────────────────┬───────────────────┬────────────────────┐ │ city │ country │ region │ continent │ latitude │ longitude │ x │ y │ z │ │ varchar │ varchar │ varchar │ varchar │ double │ double │ double │ double │ double │ ├────────────────┼─────────────┼───────────────┼───────────┼────────────┼────────────┼────────────────────┼───────────────────┼────────────────────┤ │ Kabul │ Afghanistan │ Southern Asia │ Asia │ 34.5166667 │ 69.1833344 │ 1865.546409629258 │ 4906.785732164055 │ 3610.1012966606136 │ │ Kandahar │ Afghanistan │ Southern Asia │ Asia │ 31.61 │ 65.6999969 │ 2232.782351694877 │ 4945.064042683584 │ 3339.261233224765 │ │ Mazar-e Sharif │ Afghanistan │ Southern Asia │ Asia │ 36.7069444 │ 67.1122208 │ 1986.5057687360124 │ 4705.51748048584 │ 3808.088900172991 │ └────────────────┴─────────────┴───────────────┴───────────┴────────────┴────────────┴────────────────────┴───────────────────┴────────────────────┘ ``` Read all files that match a glob pattern and include a filename column specifying which file each row came from: ```bash SELECT city, country, filename FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet', filename = true) LIMIT 3; ┌────────────────┬─────────────┬───────────────────────────────────────────────────────────────────────────────┐ │ city │ country │ filename │ │ varchar │ varchar │ varchar │ ├────────────────┼─────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Kabul │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ │ Kandahar │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ │ Mazar-e Sharif │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ └────────────────┴─────────────┴───────────────────────────────────────────────────────────────────────────────┘ ``` ## Get metadata and schema The [parquet_metadata](https://duckdb.org/docs/data/parquet/metadata.html) function can be used to query the metadata contained within a Parquet file. 
```bash SELECT * FROM parquet_metadata('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet'); ┌───────────────────────────────────────────────────────────────────────────────┬──────────────┬────────────────────┬─────────────┐ │ file_name │ row_group_id │ row_group_num_rows │ compression │ │ varchar │ int64 │ int64 │ varchar │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼────────────────────┼─────────────┤ │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ 0 │ 1000 │ SNAPPY │ │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ 0 │ 1000 │ SNAPPY │ │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ 0 │ 1000 │ SNAPPY │ └───────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────────┘ ``` Fetch the column names and column types: ```bash DESCRIBE SELECT * FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet'; ┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐ │ column_name │ column_type │ null │ key │ default │ extra │ │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ ├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤ │ city │ VARCHAR │ YES │ │ │ │ │ country │ VARCHAR │ YES │ │ │ │ │ region │ VARCHAR │ YES │ │ │ │ │ continent │ VARCHAR │ YES │ │ │ │ │ latitude │ DOUBLE │ YES │ │ │ │ │ longitude │ DOUBLE │ YES │ │ │ │ │ x │ DOUBLE │ YES │ │ │ │ │ y │ DOUBLE │ YES │ │ │ │ │ z │ DOUBLE │ YES │ │ │ │ └─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘ ``` Fetch the internal schema (excluding the file name): ```bash SELECT * EXCLUDE (file_name) FROM parquet_schema('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet'); ┌───────────┬────────────┬─────────────┬─────────────────┬──────────────┬────────────────┬───────┬───────────┬──────────┬──────────────┐ │ name │ type │ type_length │ repetition_type │ num_children │ converted_type │ scale │ precision │ field_id │ logical_type │ │ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │ int64 │ int64 │ int64 │ varchar │ ├───────────┼────────────┼─────────────┼─────────────────┼──────────────┼────────────────┼───────┼───────────┼──────────┼──────────────┤ │ schema │ │ │ REQUIRED │ 9 │ │ │ │ │ │ │ city │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │ │ country │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │ │ region │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │ │ continent │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │ │ latitude │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │ │ longitude │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │ │ x │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │ │ y │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │ │ z │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │ ├───────────┴────────────┴─────────────┴─────────────────┴──────────────┴────────────────┴───────┴───────────┴──────────┴──────────────┤ ``` ## Get statistics The `SUMMARIZE` command can be used to get various aggregates over a query (min, max, approx_unique, avg, std, q25, q50, q75, count). It returns these statistics along with the column name, column type, and the percentage of NULL values. 
```bash SUMMARIZE SELECT latitude, longitude FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet'; ┌─────────────┬─────────────┬──────────────┬─────────────┬───────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬───────┬─────────────────┐ │ column_name │ column_type │ min │ max │ approx_unique │ avg │ std │ q25 │ q50 │ q75 │ count │ null_percentage │ │ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │ varchar │ varchar │ varchar │ varchar │ int64 │ decimal(9,2) │ ├─────────────┼─────────────┼──────────────┼─────────────┼───────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼───────┼─────────────────┤ │ latitude │ DOUBLE │ -54.8 │ 67.8557214 │ 7324 │ 22.5004568364307 │ 26.770454684690925 │ 6.089858461951687 │ 29.321258648324747 │ 44.90191158328915 │ 9083 │ 0.00 │ │ longitude │ DOUBLE │ -175.2166595 │ 179.3833313 │ 7802 │ 14.699333721953098 │ 63.93672742608224 │ -6.877990418604821 │ 19.12963979385393 │ 43.873513093419966 │ 9083 │ 0.00 │ └─────────────┴─────────────┴──────────────┴─────────────┴───────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴───────┴─────────────────┘ ``` ### Panel on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-panel.md # Panel on Spaces [Panel](https://panel.holoviz.org/) is an open-source Python library that lets you easily build powerful tools, dashboards and complex applications entirely in Python. It has a batteries-included philosophy, putting the PyData ecosystem, powerful data tables and much more at your fingertips. High-level reactive APIs and lower-level callback based APIs ensure you can quickly build exploratory applications, but you aren’t limited if you build complex, multi-page apps with rich interactivity. Panel is a member of the [HoloViz](https://holoviz.org/) ecosystem, your gateway into a connected ecosystem of data exploration tools. Visit [Panel documentation](https://panel.holoviz.org/) to learn more about making powerful applications. ## 🚀 Deploy Panel on Spaces You can deploy Panel on Spaces with just a few clicks: There are a few key parameters you need to define: the Owner (either your personal account or an organization), a Space name, and Visibility. In case you intend to execute computationally intensive deep learning models, consider upgrading to a GPU to boost performance. Once you have created the Space, it will start out in “Building” status, which will change to “Running” once your Space is ready to go. ## ⚡️ What will you see? When your Space is built and ready, you will see this image classification Panel app which will let you fetch a random image and run the OpenAI CLIP classifier model on it. Check out our [blog post](https://blog.holoviz.org/building_an_interactive_ml_dashboard_in_panel.html) for a walkthrough of this app. ## 🛠️ How to customize and make your own app? The Space template will populate a few files to get your app started: Three files are important: ### 1. app.py This file defines your Panel application code. You can start by modifying the existing application or replace it entirely to build your own application. To learn more about writing your own Panel app, refer to the [Panel documentation](https://panel.holoviz.org/). ### 2. 
Dockerfile The Dockerfile contains a sequence of commands that Docker will execute to construct and launch an image as a container that your Panel app will run in. Typically, to serve a Panel app, we use the command `panel serve app.py`. In this specific file, we divide the command into a list of strings. Furthermore, we must define the address and port because Hugging Face will expect to serve your application on port 7860. Additionally, we need to specify the `allow-websocket-origin` flag to enable the connection to the server's websocket. ### 3. requirements.txt This file defines the required packages for our Panel app. When using Space, dependencies listed in the requirements.txt file will be automatically installed. You have the freedom to modify this file by removing unnecessary packages or adding additional ones that are required for your application. Feel free to make the necessary changes to ensure your app has the appropriate packages installed. ## 🌐 Join Our Community The Panel community is vibrant and supportive, with experienced developers and data scientists eager to help and share their knowledge. Join us and connect with us: - [Discord](https://discord.gg/aRFhC3Dz9w) - [Discourse](https://discourse.holoviz.org/) - [Twitter](https://twitter.com/Panel_Org) - [LinkedIn](https://www.linkedin.com/company/panel-org) - [Github](https://github.com/holoviz/panel) ### FiftyOne https://huggingface.co/docs/hub/datasets-fiftyone.md # FiftyOne FiftyOne is an open-source toolkit for curating, visualizing, and managing unstructured visual data. The library streamlines data-centric workflows, from finding low-confidence predictions to identifying poor-quality samples and uncovering hidden patterns in your data. The library supports all sorts of visual data, from images and videos to PDFs, point clouds, and meshes. FiftyOne accommodates object detections, keypoints, polylines, and custom schemas. FiftyOne is integrated with the Hugging Face Hub so that you can load and share FiftyOne datasets directly from the Hub. 🚀 Try the FiftyOne 🤝 Hugging Face Integration in [Colab](https://colab.research.google.com/drive/1l0kzfbJ2wtUw1EGS1tq1PJYoWenMlihp?usp=sharing)! ## Prerequisites First [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login): ```bash hf auth login ``` Make sure you have `fiftyone>=0.24.0` installed: ```bash pip install -U fiftyone ``` ## Loading Visual Datasets from the Hub With `load_from_hub()` from FiftyOne's Hugging Face utils, you can load: - Any FiftyOne dataset uploaded to the hub - Most image-based datasets stored in Parquet files (which is the standard for datasets uploaded to the hub via the `datasets` library) ### Loading FiftyOne datasets from the Hub Any dataset pushed to the hub in one of FiftyOne’s [supported common formats](https://docs.voxel51.com/user_guide/dataset_creation/datasets.html#supported-import-formats) should have all of the necessary configuration info in its dataset repo on the hub, so you can load the dataset by specifying its `repo_id`. 
As an example, to load the [VisDrone detection dataset](https://huggingface.co/datasets/Voxel51/VisDrone2019-DET): ```python import fiftyone as fo from fiftyone.utils.huggingface import load_from_hub ## load from the hub dataset = load_from_hub("Voxel51/VisDrone2019-DET") ## visualize in app session = fo.launch_app(dataset) ``` ![FiftyOne VisDrone dataset](https://cdn-uploads.huggingface.co/production/uploads/63127e2495407887cb79c5ea/0eKxe_GSsBjt8wMjT9qaI.jpeg) You can [customize the download process](https://docs.voxel51.com/integrations/huggingface.html#configuring-the-download-process), including the number of samples to download, the name of the created dataset object, or whether or not it is persisted to disk. You can list all the available FiftyOne datasets on the Hub using: ```python from huggingface_hub import HfApi api = HfApi() api.list_datasets(tags="fiftyone") ``` ### Loading Parquet Datasets from the Hub with FiftyOne You can also use the `load_from_hub()` function to load datasets from Parquet files. Type conversions are handled for you, and images are downloaded from URLs if necessary. With this functionality, [you can load](https://docs.voxel51.com/integrations/huggingface.html#basic-examples) any of the following: - [FiftyOne-Compatible Image Classification Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-classification-datasets-665dfd51020d8b66a56c9b6f), like [Food101](https://huggingface.co/datasets/food101) and [ImageNet-Sketch](https://huggingface.co/datasets/imagenet_sketch) - [FiftyOne-Compatible Object Detection Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-object-detection-datasets-665e0279c94ae552c7159a2b) like [CPPE-5](https://huggingface.co/datasets/cppe-5) and [WIDER FACE](https://huggingface.co/datasets/wider_face) - [FiftyOne-Compatible Segmentation Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-segmentation-datasets-665e15b6ddb96a4d7226a380) like [SceneParse150](https://huggingface.co/datasets/scene_parse_150) and [Sidewalk Semantic](https://huggingface.co/datasets/segments/sidewalk-semantic) - [FiftyOne-Compatible Image Captioning Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-captioning-datasets-665e16e29350244c06084505) like [COYO-700M](https://huggingface.co/datasets/kakaobrain/coyo-700m) and [New Yorker Caption Contest](https://huggingface.co/datasets/jmhessel/newyorker_caption_contest) - [FiftyOne-Compatible Visual Question-Answering Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-vqa-datasets-665e16424ecc8a718156248a) like [TextVQA](https://huggingface.co/datasets/textvqa) and [ScienceQA](https://huggingface.co/datasets/derek-thomas/ScienceQA) As an example, we can load the first 1,000 samples from the [WikiArt dataset](https://huggingface.co/datasets/huggan/wikiart) into FiftyOne with: ```python import fiftyone as fo from fiftyone.utils.huggingface import load_from_hub dataset = load_from_hub( "huggan/wikiart", ## repo_id format="parquet", ## for Parquet format classification_fields=["artist", "style", "genre"], ## columns to treat as classification labels max_samples=1000, # number of samples to load name="wikiart", # name of the dataset in FiftyOne ) ``` ![WikiArt Dataset](https://cdn-uploads.huggingface.co/production/uploads/63127e2495407887cb79c5ea/PCqCvTlNTG5SLtcK5fwuQ.jpeg) ## Pushing FiftyOne Datasets to the Hub You can push a dataset to the hub with: ```python import fiftyone as fo import
fiftyone.zoo as foz from fiftyone.utils.huggingface import push_to_hub ## load example dataset dataset = foz.load_zoo_dataset("quickstart") ## push to hub push_to_hub(dataset, "my-hf-dataset") ``` When you call `push_to_hub()`, the dataset will be uploaded to the repo with the specified repo name under your username, and the repo will be created if necessary. A [Dataset Card](./datasets-cards) will automatically be generated and populated with instructions for loading the dataset from the hub. You can upload a thumbnail image/gif to appear on the Dataset Card with the `preview_path` argument. Here’s an example using many of these arguments, which would upload the first three samples of FiftyOne's [Quickstart Video](https://docs.voxel51.com/user_guide/dataset_zoo/datasets.html#quickstart-video) dataset to the private repo `username/my-quickstart-video-dataset` with tags, an MIT license, a description, and a preview image: ```python dataset = foz.load_zoo_dataset("quickstart-video", max_samples=3) push_to_hub( dataset, "my-quickstart-video-dataset", tags=["video", "tracking"], license="mit", description="A dataset of video samples for tracking tasks", private=True, preview_path="" ) ``` ## 📚 Resources - [🚀 Code-Along Colab Notebook](https://colab.research.google.com/drive/1l0kzfbJ2wtUw1EGS1tq1PJYoWenMlihp?usp=sharing) - [🗺️ User Guide for FiftyOne Datasets](https://docs.voxel51.com/user_guide/using_datasets.html#) - [🤗 FiftyOne 🤝 Hub Integration Docs](https://docs.voxel51.com/integrations/huggingface.html#huggingface-hub) - [🤗 FiftyOne 🤝 Transformers Integration Docs](https://docs.voxel51.com/integrations/huggingface.html#transformers-library) - [🧩 FiftyOne Hugging Face Hub Plugin](https://github.com/voxel51/fiftyone-huggingface-plugins) ### Daft https://huggingface.co/docs/hub/datasets-daft.md # Daft [Daft](https://daft.ai/) is a high-performance data engine providing simple and reliable data processing for any modality and scale. Daft has native support for reading from and writing to Hugging Face datasets. ## Getting Started To get started, pip install `daft` with the `huggingface` feature: ```bash pip install 'daft[huggingface]' ``` ## Read Daft is able to read datasets directly from the Hugging Face Hub using the [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface) function or via the `hf://datasets/` protocol. ### Reading an Entire Dataset Using [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface), you can easily load a dataset. ```python import daft df = daft.read_huggingface("username/dataset_name") ``` This will read the entire dataset into a DataFrame. ### Reading Specific Files Not only can you read entire datasets, but you can also read individual files from a dataset repository.
Using a read function that takes in a path (such as [`daft.read_parquet()`](https://docs.daft.ai/en/stable/api/io/#daft.read_parquet), [`daft.read_csv()`](https://docs.daft.ai/en/stable/api/io/#daft.read_csv), or [`daft.read_json()`](https://docs.daft.ai/en/stable/api/io/#daft.read_json)), specify a Hugging Face dataset path via the `hf://datasets/` prefix: ```python import daft # read a specific Parquet file df = daft.read_parquet("hf://datasets/username/dataset_name/file_name.parquet") # or a csv file df = daft.read_csv("hf://datasets/username/dataset_name/file_name.csv") # or a set of Parquet files using a glob pattern df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet") ``` ## Write Daft is able to write Parquet files to a Hugging Face dataset repository using [`daft.DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_deltalake). Daft supports [Content-Defined Chunking](https://huggingface.co/blog/parquet-cdc) and [Xet](https://huggingface.co/blog/xet-on-the-hub) for faster, deduplicated writes. Basic usage: ```python import daft df: daft.DataFrame = ... df.write_huggingface("username/dataset_name") ``` See the [`DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface) API page for more info. ## Authentication The `token` parameter in [`daft.io.HuggingFaceConfig`](https://docs.daft.ai/en/stable/api/config/#daft.io.HuggingFaceConfig) can be used to specify a Hugging Face access token for requests that require authentication (e.g. reading private dataset repositories or writing to a dataset repository). Example of loading a dataset with a specified token: ```python from daft.io import IOConfig, HuggingFaceConfig io_config = IOConfig(hf=HuggingFaceConfig(token="your_token")) df = daft.read_parquet("hf://datasets/username/dataset_name", io_config=io_config) ``` ### Tabby on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-tabby.md # Tabby on Spaces [Tabby](https://tabby.tabbyml.com) is an open-source, self-hosted AI coding assistant. With Tabby, every team can set up its own LLM-powered code completion server with ease. In this guide, you will learn how to deploy your own Tabby instance and use it for development directly from the Hugging Face website. ## Your first Tabby Space In this section, you will learn how to deploy a Tabby Space and use it for yourself or your organization. ### Deploy Tabby on Spaces You can deploy Tabby on Spaces with just a few clicks: [![Deploy on HF Spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/spaces/TabbyML/tabby-template-space?duplicate=true) You need to define the Owner (your personal account or an organization), a Space name, and the Visibility. To secure the api endpoint, we're configuring the visibility as Private. ![Duplicate Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/duplicate-space.png) You’ll see the *Building status*. Once it becomes *Running*, your Space is ready to go. If you don’t see the Tabby Swagger UI, try refreshing the page. ![Swagger UI](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/swagger-ui.png) > [!TIP] > If you want to customize the title, emojis, and colors of your space, go to "Files and Versions" and edit the metadata of your README.md file. 
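For reference, the metadata block at the top of a Docker Space's `README.md` looks roughly like the following (the values below are illustrative; keep the `sdk` and any port settings from the template you duplicated):

```yaml
---
title: My Tabby Space
emoji: 🐱
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---
```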
### Your Tabby Space URL Once Tabby is up and running, for a space link such as https://huggingface.co/spaces/TabbyML/tabby, the direct URL will be https://tabbyml-tabby.hf.space. This URL provides access to a stable Tabby instance in full-screen mode and serves as the API endpoint for IDE/Editor Extensions to talk with. ### Connect VSCode Extension to Space backend 1. Install the [VSCode Extension](https://marketplace.visualstudio.com/items?itemName=TabbyML.vscode-tabby). 2. Open the file located at `~/.tabby-client/agent/config.toml`. Uncomment both the `[server]` section and the `[server.requestHeaders]` section. * Set the endpoint to the Direct URL you found in the previous step, which should look something like `https://UserName-SpaceName.hf.space`. * As the Space is set to **Private**, it is essential to configure the authorization header for accessing the endpoint. You can obtain a token from the [Access Tokens](https://huggingface.co/settings/tokens) page. ![Agent Config](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/agent-config.png) 3. You'll notice a ✓ icon indicating a successful connection. ![Tabby Connected](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/tabby-connected.png) 4. You've completed the setup, now enjoy tabbing! ![Code Completion](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/code-completion.png) You can also utilize Tabby extensions in other IDEs, such as [JetBrains](https://plugins.jetbrains.com/plugin/22379-tabby). ## Feedback and support If you have improvement suggestions or need specific support, please join [Tabby Slack community](https://join.slack.com/t/tabbycommunity/shared_invite/zt-1xeiddizp-bciR2RtFTaJ37RBxr8VxpA) or reach out on [Tabby’s GitHub repository](https://github.com/TabbyML/tabby). ### Using OpenCV in Spaces https://huggingface.co/docs/hub/spaces-using-opencv.md # Using OpenCV in Spaces In order to use OpenCV in your Gradio or Python Spaces, you'll need to make the Space install both the Python and Debian dependencies. This means adding `python3-opencv` to the `packages.txt` file, and adding `opencv-python` to the `requirements.txt` file. If those files don't exist, you'll need to create them. To see an example, [see this Gradio project](https://huggingface.co/spaces/templates/gradio_opencv/tree/main). ### WebDataset https://huggingface.co/docs/hub/datasets-webdataset.md # WebDataset [WebDataset](https://github.com/webdataset/webdataset) is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader. ## The WebDataset format A WebDataset file is a TAR archive containing a series of data files. All successive data files with the same prefix are considered to be part of the same example (e.g., an image/audio file and its label or metadata). Labels and metadata can be in a `.json` file, in a `.txt` (for a caption, a description), or in a `.cls` (for a class index). A large scale WebDataset is made of many files called shards, where each shard is a TAR archive. Each shard is often ~1GB but the full dataset can be multiple terabytes! ## Multimodal support WebDataset is designed for multimodal datasets, i.e. for image, audio and/or video datasets. Indeed, since media files tend to be quite big, WebDataset's sequential I/O enables large reads and buffering, resulting in the best data loading speed.
Here is a non-exhaustive list of supported data formats: - image: jpeg, png, tiff - audio: mp3, m4a, wav, flac - video: mp4, mov, avi - other: npy, npz The full list evolves over time and depends on the implementation. For example, you can find which formats the `webdataset` package supports in the source code [here](https://github.com/webdataset/webdataset/blob/main/src/webdataset/autodecode.py). ## Streaming Streaming TAR archives is fast because it reads contiguous chunks of data. It can be orders of magnitude faster than reading separate data files one by one. WebDataset streaming offers high-speed performance both when reading from disk and from cloud storage, which makes it an ideal format to feed to a DataLoader: For example here is how to stream the [timm/imagenet-12k-wds](https://huggingface.co/datasets/timm/imagenet-12k-wds) dataset directly from Hugging Face: First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using: ``` hf auth login ``` And then you can stream the dataset with WebDataset: ```python >>> import webdataset as wds >>> from huggingface_hub import get_token >>> from torch.utils.data import DataLoader >>> hf_token = get_token() >>> url = "https://huggingface.co/datasets/timm/imagenet-12k-wds/resolve/main/imagenet12k-train-{{0000..1023}}.tar" >>> url = f"pipe:curl -s -L {url} -H 'Authorization:Bearer {hf_token}'" >>> dataset = wds.WebDataset(url).decode() >>> dataloader = DataLoader(dataset, batch_size=64, num_workers=4) ``` ## Shuffle Generally, datasets in WebDataset formats are already shuffled and ready to feed to a DataLoader. But you can still reshuffle the data with WebDataset's approximate shuffling. In addition to shuffling the list of shards, WebDataset uses a buffer to shuffle a dataset without any cost to speed: To shuffle a list of sharded files and randomly sample from the shuffle buffer: ```python >>> buffer_size = 1000 >>> dataset = ( ... wds.WebDataset(url, shardshuffle=True) ... .shuffle(buffer_size) ... .decode() ... ) ``` ### Argilla https://huggingface.co/docs/hub/datasets-argilla.md # Argilla Argilla is a collaboration tool for AI engineers and domain experts who need to build high quality datasets for their projects. ![image](https://github.com/user-attachments/assets/0e6ce1d8-65ca-4211-b4ba-5182f88168a0) Argilla can be used for collecting human feedback for a wide variety of AI projects like traditional NLP (text classification, NER, etc.), LLMs (RAG, preference tuning, etc.), or multimodal models (text to image, etc.). Argilla's programmatic approach lets you build workflows for continuous evaluation and model improvement. The goal of Argilla is to ensure your data work pays off by quickly iterating on the right data and models. ## What do people build with Argilla? The community uses Argilla to create amazing open-source [datasets](https://huggingface.co/datasets?library=library:argilla&sort=trending) and [models](https://huggingface.co/models?other=distilabel). ### Open-source datasets and models Argilla contributed some models and datasets to open-source too. - [Cleaned UltraFeedback dataset](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) used to fine-tune the [Notus](https://huggingface.co/argilla/notus-7b-v1) and [Notux](https://huggingface.co/argilla/notux-8x7b-v1) models. The original UltraFeedback dataset was curated using Argilla UI filters to find and report a bug in the original data generation code. 
Based on this data curation process, Argilla built this new version of the UltraFeedback dataset and fine-tuned Notus, outperforming Zephyr on several benchmarks. - [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) used to fine-tune the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B). This dataset was built by combining human curation in Argilla with AI feedback from distilabel, leading to an improved version of the Intel Orca dataset and outperforming models fine-tuned on the original dataset. ### Examples Use cases AI teams from companies like [the Red Cross](https://510.global/), [Loris.ai](https://loris.ai/) and [Prolific](https://www.prolific.com/) use Argilla to improve the quality and efficiency of AI projects. They shared their experiences in our [AI community meetup](https://lu.ma/embed-checkout/evt-IQtRiSuXZCIW6FB). - AI for good: [the Red Cross presentation](https://youtu.be/ZsCqrAhzkFU?feature=shared) showcases how the Red Cross domain experts and AI team collaborated by classifying and redirecting requests from refugees of the Ukrainian crisis to streamline the support processes of the Red Cross. - Customer support: during [the Loris meetup](https://youtu.be/jWrtgf2w4VU?feature=shared) they showed how their AI team uses unsupervised and few-shot contrastive learning to help them quickly validate and gain labelled samples for a huge amount of multi-label classifiers. - Research studies: [the showcase from Prolific](https://youtu.be/ePDlhIxnuAs?feature=shared) announced their integration with our platform. They use it to actively distribute data collection projects among their annotating workforce. This allows Prolific to quickly and efficiently collect high-quality data for research studies. ## Prerequisites First [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login): ```bash hf auth login ``` Make sure you have `argilla>=2.0.0` installed: ```bash pip install -U argilla ``` Lastly, you will need to deploy the Argilla server and UI, which can be done [easily on the Hugging Face Hub](https://argilla-io.github.io/argilla/latest/getting_started/quickstart/#run-the-argilla-server). ## Importing and exporting datasets and records This guide shows how to import and export your dataset to the Hugging Face Hub. In Argilla, you can import/export two main components of a dataset: - The dataset's complete configuration defined in `rg.Settings`. This is useful if your want to share your feedback task or restore it later in Argilla. - The records stored in the dataset, including `Metadata`, `Vectors`, `Suggestions`, and `Responses`. This is useful if you want to use your dataset's records outside of Argilla. ### Push an Argilla dataset to the Hugging Face Hub You can push a dataset from Argilla to the Hugging Face Hub. This is useful if you want to share your dataset with the community or version control it. You can push the dataset to the Hugging Face Hub using the `rg.Dataset.to_hub` method. ```python import argilla as rg client = rg.Argilla(api_url="", api_key="") dataset = client.datasets(name="my_dataset") dataset.to_hub(repo_id="") ``` #### With or without records The example above will push the dataset's `Settings` and records to the hub. If you only want to push the dataset's configuration, you can set the `with_records` parameter to `False`. 
This is useful if you're just interested in a specific dataset template or you want to make changes in the dataset settings and/or records. ```python dataset.to_hub(repo_id="", with_records=False) ``` ### Pull an Argilla dataset from the Hugging Face Hub You can pull a dataset from the Hugging Face Hub to Argilla. This is useful if you want to restore a dataset and its configuration. You can pull the dataset from the Hugging Face Hub using the `rg.Dataset.from_hub` method. ```python import argilla as rg client = rg.Argilla(api_url="", api_key="") dataset = rg.Dataset.from_hub(repo_id="") ``` The `rg.Dataset.from_hub` method loads the configuration and records from the dataset repo. If you only want to load records, you can pass a `datasets.Dataset` object to the `rg.Dataset.log` method. This enables you to configure your own dataset and reuse existing Hub datasets. #### With or without records The example above will pull the dataset's `Settings` and records from the hub. If you only want to pull the dataset's configuration, you can set the `with_records` parameter to `False`. This is useful if you're just interested in a specific dataset template or you want to make changes in the dataset settings and/or records. ```python dataset = rg.Dataset.from_hub(repo_id="", with_records=False) ``` With the dataset's configuration you could then make changes to the dataset. For example, you could adapt the dataset's settings for a different task: ```python dataset.settings.questions = [rg.TextQuestion(name="answer")] ``` You could then log the dataset's records using the `load_dataset` method of the `datasets` package and pass the dataset to the `rg.Dataset.log` method. ```python hf_dataset = load_dataset("") dataset.log(hf_dataset) ``` ## 📚 Resources - [🚀 Argilla Docs](https://argilla-io.github.io/argilla/) - [🚀 Argilla Docs - import export guides](https://argilla-io.github.io/argilla/latest/how_to_guides/import_export/) ### Downloading models https://huggingface.co/docs/hub/models-downloading.md # Downloading models ## Integrated libraries If a model on the Hub is tied to a [supported library](./models-libraries), loading the model can be done in just a few lines. For information on accessing the model, you can click on the "Use in _Library_" button on the model page to see how to do so. For example, `distilbert/distilgpt2` shows how to do so with 🤗 Transformers below. ## Using the Hugging Face Client Library You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to create, delete, update and retrieve information from repos. For example, to download the `HuggingFaceH4/zephyr-7b-beta` model from the command line, run ```bash hf download HuggingFaceH4/zephyr-7b-beta ``` See the [CLI download documentation](https://huggingface.co/docs/huggingface_hub/en/guides/cli#download-an-entire-repository) for more information. You can also integrate this into your own library. For example, you can quickly load a Scikit-learn model with a few lines. 
```py from huggingface_hub import hf_hub_download import joblib REPO_ID = "YOUR_REPO_ID" FILENAME = "sklearn_model.joblib" model = joblib.load( hf_hub_download(repo_id=REPO_ID, filename=FILENAME) ) ``` ## Using Git Since all models on the Model Hub are Xet-backed Git repositories, you can clone the models locally by [installing git-xet](./xet/using-xet-storage#git-xet) and running: ```bash git xet install git lfs install git clone git@hf.co: # example: git clone git@hf.co:bigscience/bloom ``` If you have write-access to the particular model repo, you'll also have the ability to commit and push revisions to the model. Add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes and/or access private repos. ## Faster downloads If you are running on a machine with high bandwidth, you can speed up downloads by allowing `hf_xet` to run on all CPU cores. `hf_xet` is a Rust-based package leveraging the new [Xet storage backend](https://huggingface.co/docs/hub/en/xet/index) to optimize file transfers with chunk-based deduplication. `hf_xet` is enabled by default but with lower performances to avoid bloating available CPU and bandwidth, which could degrade UX. ```bash pip install -U huggingface_hub HF_XET_HIGH_PERFORMANCE=1 hf download ... ``` ### How to configure SCIM with Okta https://huggingface.co/docs/hub/security-sso-okta-scim.md # How to configure SCIM with Okta This guide explains how to set up SCIM user and group provisioning between Okta and your Hugging Face organization using SCIM. > [!WARNING] > This feature is part of the Enterprise Plus plan. ### Step 1: Get SCIM configuration from Hugging Face 1. Navigate to your organization's settings page on Hugging Face. 2. Go to the **SSO** tab, then click on the **SCIM** sub-tab. 3. Copy the **SCIM Tenant URL**. You will need this for the Okta configuration. 4. Click **Generate an access token**. A new SCIM token will be generated. Copy this token immediately and store it securely, as you will not be able to see it again. ### Step 2: Enter Admin Credentials 1. In Okta, go to **Applications** and select your Hugging Face app. 2. Go to the **General** tab and click **Edit** on App Settings 3. For the Provisioning option select **SCIM**, click **Save** 4. Go to the **Provisioning** tab and click **Edit**. 5. Enter the **SCIM Tenant URL** as the SCIM connector base URL. 6. Enter **userName** for Unique identifier field for users. 7. Select all necessary actions for Supported provisioning actions. 8. Select **HTTP Header** for Authentication Mode. 9. Enter the **Access Token** you generated as the Authorization Bearer Token. 10. Click **Test Connector Configuration** to verify the connection. 11. Save your changes. ### Step 3: Configure Provisioning 1. In the **Provisioning** tab, click **To App** from the side nav. 2. Click **Edit** and check to Enable all the features you need, i.e. Create, Update, Delete Users. 3. Click **Save** at the bottom. ### Step 4: Configure Attribute Mappings 1. While still in the **Provisioning** tab scroll down to Attribute Mappings section 2. The default attribute mappings often require adjustments for robust provisioning. We recommend using the following configuration. You can delete attributes that are not here: ### Step 5: Assign Users or Groups 1. Visit the **Assignments** tab, click **Assign** 2. Click **Assign to People** or **Assign to Groups** 3. After finding the User or Group that needs to be assigned, click **Assign** next to their name 4. 
In the mapping modal the Username needs to be edited to comply with the following rules. > [!WARNING] > > Only regular characters and `-` are accepted in the Username. > `--` (double dash) is forbidden. > `-` cannot start or end the name. > Digit-only names are not accepted. > Minimum length is 2 and maximum length is 42. > Username has to be unique within your org. > 5. Scroll down and click **Save and Go Back** 6. Click **Done** 7. Confirm that users or groups are created, updated, or deactivated in your Hugging Face organization as expected. ### How to get a user's plan and status in Spaces https://huggingface.co/docs/hub/spaces-get-user-plan.md # How to get a user's plan and status in Spaces From inside a Space's iframe, you can check if a user is logged in or not on the main site, and if they have a PRO subscription or if one of their orgs has a paid subscription. ```js window.addEventListener("message", (event) => { if (event.data.type === "USER_PLAN") { console.log("plan", event.data.plan); } }) window.parent.postMessage({ type: "USER_PLAN_REQUEST" }, "https://huggingface.co"); ``` `event.data.plan` will be of type: ```ts { user: "anonymous", org: undefined } | { user: "pro" | "free", org: undefined | "team" | "enterprise" | "plus" | "academia" } ``` You will get both the user's status (logged out = `"anonymous"`) and their plan. ## Examples - https://huggingface.co/spaces/huggingfacejs/plan ### Image Dataset https://huggingface.co/docs/hub/datasets-image.md # Image Dataset This guide will show you how to configure your dataset repository with image files. You can find accompanying examples of repositories in this [Image datasets examples collection](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65). A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub. Additional information about your images - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`/`metadata.parquet`). Alternatively, images can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format. ## Only images If your dataset only consists of one column with images, you can simply store your image files at the root: ``` my_dataset_repository/ ├── 1.jpg ├── 2.jpg ├── 3.jpg └── 4.jpg ``` or in a subdirectory: ``` my_dataset_repository/ └── images ├── 1.jpg ├── 2.jpg ├── 3.jpg └── 4.jpg ``` Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including PNG, JPEG, TIFF and WebP. ``` my_dataset_repository/ └── images ├── 1.jpg ├── 2.png ├── 3.tiff └── 4.webp ``` If you have several splits, you can put your images into directories named accordingly: ``` my_dataset_repository/ ├── train │   ├── 1.jpg │   └── 2.jpg └── test ├── 3.jpg └── 4.jpg ``` See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits. ## Additional columns If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like [text captioning](https://huggingface.co/tasks/image-to-text) or [object detection](https://huggingface.co/tasks/object-detection). 
``` my_dataset_repository/ └── train ├── 1.jpg ├── 2.jpg ├── 3.jpg ├── 4.jpg └── metadata.csv ``` Your `metadata.csv` file must have a `file_name` column which links image files with their metadata: ```csv file_name,text 1.jpg,a drawing of a green pokemon with red eyes 2.jpg,a green and yellow toy with a red nose 3.jpg,a red and white ball with an angry look on its face 4.jpg,a cartoon ball with a smile on its face ``` You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`: ```jsonl {"file_name": "1.jpg","text": "a drawing of a green pokemon with red eyes"} {"file_name": "2.jpg","text": "a green and yellow toy with a red nose"} {"file_name": "3.jpg","text": "a red and white ball with an angry look on its face"} {"file_name": "4.jpg","text": "a cartoon ball with a smile on its face"} ``` And for bigger datasets or if you are interested in advanced data retrieval features, you can use a [Parquet](https://parquet.apache.org/) file `metadata.parquet`. ## Relative paths Metadata file must be located either in the same directory with the images it is linked to, or in any parent directory, like in this example: ``` my_dataset_repository/ └── train ├── images │   ├── 1.jpg │   ├── 2.jpg │   ├── 3.jpg │   └── 4.jpg └── metadata.csv ``` In this case, the `file_name` column must be a full relative path to the images, not just the filename: ```csv file_name,text images/1.jpg,a drawing of a green pokemon with red eyes images/2.jpg,a green and yellow toy with a red nose images/3.jpg,a red and white ball with an angry look on its face images/4.jpg,a cartoon ball with a smile on it's face ``` Metadata files cannot be put in subdirectories of a directory with the images. More generally, any column named `file_name` or `*_file_name` should contain the full relative path to the images. ## Image classification For image classification datasets, you can also use a simple setup: use directories to name the image classes. Store your image files in a directory structure like: ``` my_dataset_repository/ ├── green │   ├── 1.jpg │   └── 2.jpg └── red ├── 3.jpg └── 4.jpg ``` The dataset created with this structure contains two columns: `image` and `label` (with values `green` and `red`). You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information): ``` my_dataset_repository/ ├── test │   ├── green │   │   └── 2.jpg │   └── red │   └── 4.jpg └── train ├── green │   └── 1.jpg └── red └── 3.jpg ``` You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header: ```yaml configs: - config_name: default # Name of the dataset subset, if applicable. drop_labels: true ``` ## Large scale datasets ### WebDataset format The [WebDataset](./datasets-webdataset) format is well suited for large scale image datasets (see [timm/imagenet-12k-wds](https://huggingface.co/datasets/timm/imagenet-12k-wds) for example). It consists of TAR archives containing images and their metadata and is optimized for streaming. It is useful if you have a large number of images and to get streaming data loaders for large scale training. ``` my_dataset_repository/ ├── train-0000.tar ├── train-0001.tar ├── ... 
└── train-1023.tar ``` To make a WebDataset TAR archive, create a directory containing the images and metadata files to be archived and create the TAR archive using e.g. the `tar` command. The usual size per archive is generally around 1GB. Make sure each image and metadata pair share the same file prefix, for example: ``` train-0000/ ├── 000.jpg ├── 000.json ├── 001.jpg ├── 001.json ├── ... ├── 999.jpg └── 999.json ``` Note that for user convenience and to enable the [Dataset Viewer](./data-studio), every dataset hosted in the Hub is automatically converted to Parquet format up to 5GB. Read more about it in the [Parquet format](./data-studio#access-the-parquet-files) documentation. ### Parquet format Instead of uploading the images and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file. This is useful if you have a large number of images, if you want to embed multiple image columns, or if you want to store additional information about the images in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV. ``` my_dataset_repository/ └── train.parquet ``` Parquet files with image data can be created using `pandas` or the `datasets` library. To create Parquet files with image data in `pandas`, you can use [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods) and `df.to_parquet()`. In `datasets`, you can set the column type to `Image()` and use the `ds.to_parquet(...)` method or `ds.push_to_hub(...)`. You can find a guide on loading image datasets in `datasets` [here](/docs/datasets/image_load). Alternatively you can manually set the image type of Parquet created using other tools. First, make sure your image columns are of type _struct_, with a binary field `"bytes"` for the image data and a string field `"path"` for the image file name or path. Then you should specify the feature types of the columns directly in YAML in the README header, for example: ```yaml dataset_info: features: - name: image dtype: image - name: caption dtype: string ``` Note that Parquet is recommended for small images (<1MB per image) and small row groups (100 rows per row group, which is what `datasets` uses for images). For larger images it is recommended to use the WebDataset format, or to share the original image files (optionally with metadata files, and following the [repositories recommendations and limits](https://huggingface.co/docs/hub/en/storage-limits) for storage and number of files). ### GGUF usage with GPT4All https://huggingface.co/docs/hub/gguf-gpt4all.md # GGUF usage with GPT4All [GPT4All](https://gpt4all.io/) is an open-source LLM application developed by [Nomic](https://nomic.ai/). Version 2.7.2 introduces a brand new, experimental feature called `Model Discovery`. `Model Discovery` provides a built-in way to search for and download GGUF models from the Hub. To get started, open GPT4All and click `Download Models`. From here, you can use the search bar to find a model. After you have selected and downloaded a model, you can go to `Settings` and provide an appropriate prompt template in the GPT4All format (`%1` and `%2` placeholders). Then from the main page, you can select the model from the list of installed models and start a conversation. 
### Billing https://huggingface.co/docs/hub/billing.md # Billing At Hugging Face, we build a collaboration platform for the ML community (i.e., the Hub) and monetize by providing advanced features and simple access to compute for AI. Any feedback or support request related to billing is welcome at billing@huggingface.co ## Team and Enterprise subscriptions We offer advanced security and compliance features for organizations through our Team or Enterprise plans, which include [Single Sign-On](./enterprise-sso), [Advanced Access Control](./enterprise-hub-resource-groups) for repositories, control over your data location, higher [storage capacity](./storage-limits) for public and private repositories, and more. Team and Enterprise plans are billed like a typical subscription. They renew automatically, but you can choose to cancel at any time in the organization's billing settings. You can pay for a Team subscription with a credit card or your AWS account, or upgrade to Enterprise via an annual contract. Upon renewal, the number of seats in your subscription will be updated to match the number of members of your organization. Private repository storage above the [included storage](./storage-limits) will be billed along with your subscription renewal. ## PRO subscription The PRO subscription unlocks essential features for serious users, including: - Higher [storage capacity](./storage-limits) for public and private repositories - Higher bandwidth and API [rate limits](./rate-limits) - Included credits for [Inference Providers](/docs/inference-providers/) - Higher tier for ZeroGPU Spaces usage - Ability to create ZeroGPU Spaces and use Dev Mode - Ability to publish Social Posts and Community Blogs - Leverage the [Data Studio](./data-studio) on private datasets - Run and schedule serverless [CPU/ GPU Jobs](https://huggingface.co/docs/huggingface_hub/en/guides/jobs) View the full list of benefits at https://huggingface.co/pro then subscribe over at https://huggingface.co/subscribe/pro Similarly to the Enterprise Hub subscription, PRO subscriptions are billed like a typical subscription. The subscription renews automatically for you. You can choose to cancel the subscription at anytime in your billing settings: https://huggingface.co/settings/billing You can only pay for the PRO subscription with a credit card. The subscription is billed separately from any pay-as-you-go compute usage. Private repository storage above the [included storage](./storage-limits) will be billed along with your subscription renewal. Note: PRO benefits are also included in the [Enterprise subscription](https://huggingface.co/enterprise). ## Pay-as-you-go private storage Above the included 1TB (or 1TB per seat) of private storage in PRO, Team, and Enterprise, private storage is invoiced at **$25/TB/month**, in 1TB increments. It is billed with the renewal invoices of your PRO, Team or Enterprise subscription. ## Compute Services on the Hub We also directly provide compute services with [Spaces](./spaces), [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) and [Inference Providers](https://huggingface.co/docs/inference-providers/index). While most of our compute services have a comprehensive free tier, users and organizations can pay to access more powerful hardware accelerators. The billing for our compute services is usage-based, meaning you only pay for what you use. 
You can monitor your usage at any time from your billing dashboard, located in your user's or organization's settings menu. Compute services usage is billed separately from PRO and Enterprise Hub subscriptions (and potential private storage). Invoices for compute services are issued at the beginning of each month. ## Available payment methods Hugging Face uses [Stripe](https://stripe.com) to securely process your payment information. Credit cards are the only payment method supported for Hugging Face compute services. You can add a credit card to your account from your billing settings. ### Billing thresholds & Invoicing When using a credit card as a payment method, you'll be billed for Hugging Face compute usage each time the accrued usage goes above a billing threshold for your user or organization. On the 1st of every month, Hugging Face issues an invoice for usage accrued during the prior month. Any usage that has yet to be charged will be charged at that time. For example, if your billing threshold is set at $100.00 and you incur $254.00 of usage during a given month, your credit card will be charged a total of three times during the month: - Once for usage between $0 and $100: $100 - Once for usage between $100 and $200: $100 - Once at the end of the month for the remaining $54: $54 Note: this will be detailed in your monthly invoice. You can view invoices and receipts for the last 3 months in your billing dashboard. ## Cloud providers partnerships We partner with cloud providers like [AWS](https://huggingface.co/blog/aws-partnership), [Azure](https://huggingface.co/blog/hugging-face-endpoints-on-azure), and [Google Cloud](https://huggingface.co/blog/llama31-on-vertex-ai) to make it easy for customers to use Hugging Face directly in their cloud of choice. These solutions and usage are billed directly by the cloud provider. Ultimately, we want people to have great options for using Hugging Face wherever they build ML-powered products. You also have the option to link your Hugging Face organization to your AWS account via [AWS Marketplace](https://aws.amazon.com/marketplace/pp/prodview-n6vsyhdjkfng2). Hugging Face compute service usage will then be included in your AWS bill. Read more in our [blog post](https://huggingface.co/blog/aws-marketplace). ## Support FAQ **Q. Why did I get charged $10 when I added my credit card? When will I get this back?** A. This amount is not actually charged; it is a temporary hold that should clear within a few business days. If you have more questions about the status of the hold, you can contact your bank for more information. **Q. My card was declined after adding it to my account. What’s up?** A. Please ensure the card supports 3D-Secure authentication and is properly configured for recurring online payments. We do not yet support credit cards issued in India while we work on making our system compliant with the latest RBI directives. Until we add support for Indian credit cards, you can: * Link an organization account to an AWS account in order to access pay-as-you-go features (Endpoints, Spaces, AutoTrain): [Hugging Face Platform on the AWS Marketplace: Pay with your AWS Account](https://huggingface.co/blog/aws-marketplace) * Use a credit card issued in another country **Q. When am I going to get my invoice for pay-as-you-go services?** A. We bill in arrears and issue invoices for the prior month’s usage, typically on the first of the month. So if you incurred billing usage in January, you’ll see the final payment processed and the invoice issued on February 1st. **Q.
Why did you charge me multiple times during the month?** A. If you’re new to HF and using our premium pay-as-you-go services, we’ll process a few billing threshold payments. Don’t worry: at the end of the billing period you’ll get an invoice for the total usage incurred during the month, and it will include these already-processed threshold payments. For more information, see https://huggingface.co/docs/hub/billing#billing-thresholds--invoicing. **Q. I need copies of my past invoices, where can I find these?** A. You can access up to the previous 3 months of invoices from the current month in your billing settings: https://huggingface.co/settings/billing. Click on the “End-of-period Invoice” link under that month’s “Payments & Invoices” and you’ll be able to download the invoice and the receipt. As an example, if it’s currently January, you’ll be able to access the previous months’ invoices: December, November, and October. You can also check your email, as we’ll send a copy of the invoice/receipt to the email address on the account. **Q. I need to update my credit card in my account. What should I do?** A. Head to https://huggingface.co/settings/billing/payment and update your payment method at any time. **Q. Oh no! My payment failed, what do I do to avoid a service interruption?** A. You can pay your bill with another payment method by clicking on the “pay online” link in the unpaid invoice. Click on the “End-of-period Invoice” link under that month’s “Payments & Invoices” and you’ll be able to pay online. You can also update your credit card at https://huggingface.co/settings/billing/payment. **Subscriptions** **Q. I need to pause my PRO subscription for a bit, where can I do this?** A. You can cancel your subscription at any time here: https://huggingface.co/settings/billing/subscription. Drop us a line at billing@huggingface.co with your feedback. **Q. My org has a Team or Enterprise subscription and I need to update the number of seats. How can I do this?** A. The number of seats is automatically adjusted at the time of the subscription renewal to reflect any increase in the number of members in the organization during the previous period. There’s no need to update the subscribed number of seats during the month or year, as it’s a flat-fee subscription. ### THE LANDSCAPE OF ML DOCUMENTATION TOOLS https://huggingface.co/docs/hub/model-card-landscape-analysis.md # THE LANDSCAPE OF ML DOCUMENTATION TOOLS The development of the model cards framework in 2018 was inspired by the major documentation framework efforts of Data Statements for Natural Language Processing ([Bender & Friedman, 2018](https://aclanthology.org/Q18-1041/)) and Datasheets for Datasets ([Gebru et al., 2018](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf)). Since model cards were proposed, a number of other tools have been introduced for documenting and evaluating various aspects of the machine learning development cycle. These tools, including model cards and related documentation efforts proposed prior to model cards, can be contextualised with regard to their focus (e.g., on which part of the ML system lifecycle does the tool focus?) and their intended audiences (e.g., who is the tool designed for?). In Figures 1-2 below, we summarise several prominent documentation tools along these dimensions, provide contextual descriptions of each tool, and link to examples.
We broadly classify the documentation tools as belong to the following groups: * **Data-focused**, including documentation tools focused on datasets used in the machine learning system lifecycle * **Models-and-methods-focused**, including documentation tools focused on machine learning models and methods; and * **Systems-focused**, including documentation tools focused on ML systems, including models, methods, datasets, APIs, and non AI/ML components that interact with each other as part of an ML system These groupings are not mutually exclusive; they do include overlapping aspects of the ML system lifecycle. For example, **system cards** focus on documenting ML systems that may include multiple models and datasets, and thus might include content that overlaps with data-focused or model-focused documentation tools. The tools described are a non-exhaustive list of documentation tools for the ML system lifecycle. In general, we included tools that were: * Focused on documentation of some (or multiple) aspects of the ML system lifecycle * Included the release of a template intended for repeated use, adoption, and adaption ## Summary of ML Documentation Tools ### Figure 1 | **Stage of ML System Lifecycle** | **Tool** | **Brief Description** | **Examples** | |:--------------------------------: |-------------------------------------------------------------------------------------------------------------------------------------------------------------------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | DATA | ***Datasheets*** [(Gebru et al., 2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) | “We recommend that every dataset be accompanied with a datasheet documenting its motivation, creation, composition, intended uses, distribution, maintenance, and other information.” | See, for example, [Ivy Lee’s repo](https://github.com/ivylee/model-cards-and-datasheets) with examples | | DATA | ***Data Statements*** [(Bender & Friedman, 2018)(Bender et al., 2021)](https://techpolicylab.uw.edu/wp-content/uploads/2021/11/Data_Statements_Guide_V2.pdf) | “A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.” | See [Data Statements for NLP Workshop](https://techpolicylab.uw.edu/events/event/data-statements-for-nlp/) | | DATA | ***Dataset Nutrition Labels*** [(Holland et al., 2018)](https://huggingface.co/papers/1805.03677) | “The Dataset Nutrition Label…is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset “ingredients” before AI model development.” | See [The Data Nutrition Label](https://datanutrition.org/labels/) | | DATA | ***Data Cards for NLP*** [(McMillan-Major et al., 2021)](https://huggingface.co/papers/2108.07374) | “We present two case 
studies of creating documentation templates and guides in natural language processing (NLP): the Hugging Face (HF) dataset hub[^1] and the benchmark for Generation and its Evaluation and Metrics (GEM). We use the term data card to refer to documentation for datasets in both cases. | See [(McMillan-Major et al., 2021)](https://huggingface.co/papers/2108.07374) | | DATA | ***Dataset Development Lifecycle Documentation Framework*** [(Hutchinson et al., 2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445918) | “We introduce a rigorous framework for dataset development transparency that supports decision-making and accountability. The framework uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle.” | See [(Hutchinson et al., 2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445918), Appendix A for templates | | DATA | ***Data Cards*** [(Pushkarna et al., 2021)](https://huggingface.co/papers/2204.01075) | “Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset’s lifecycle for responsible AI development. These summaries provide explanations of processes and rationales that shape the data and consequently the models.” | See the [Data Cards Playbook github](https://github.com/PAIR-code/datacardsplaybook/) | | DATA | ***CrowdWorkSheets*** [(Díaz et al., 2022)](https://huggingface.co/papers/2206.08931) | “We introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decisions points at various stages of the data annotation pipeline: task formulation, selection of annotators, plat- form and infrastructure choices, dataset analysis and evaluation, and dataset release and maintenance.” | See [(Díaz et al., 2022)](hhttps://huggingface.co/papers/2206.08931) | | MODELS AND METHODS | ***Model Cards*** [Mitchell et al. (2018)](https://huggingface.co/papers/1810.03993) | “Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions…that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information.” | See https://huggingface.co/models, the [Model Card Guidebook](https://huggingface.co/docs/hub/model-card-guidebook), and [Model Card Examples](https://huggingface.co/docs/hub/model-card-appendix#model-card-examples) | | MODELS AND METHODS | ***Value Cards*** [Shen et al. (2021)](https://dl.acm.org/doi/abs/10.1145/3442188.3445971) | “We present Value Cards, a deliberation-driven toolkit for bringing computer science students and practitioners the awareness of the social impacts of machine learning-based decision making systems….Value Cards encourages the investigations and debates towards different ML performance metrics and their potential trade-offs.” | See [Shen et al. (2021)](https://dl.acm.org/doi/abs/10.1145/3442188.3445971), Section 3.3 | | MODELS AND METHODS | ***Method Cards*** [Adkins et al. (2022)](https://dl.acm.org/doi/pdf/10.1145/3491101.3519724) | “We propose method cards to guide ML engineers through the process of model development…The information comprises both prescriptive and descriptive elements, putting the main focus on ensuring that ML engineers are able to use these methods properly.” | See [Adkins et al. 
(2022)](https://dl.acm.org/doi/pdf/10.1145/3491101.3519724), Appendix A | | MODELS AND METHODS | ***Consumer Labels for ML Models*** [Seifert et al. (2019)](https://ris.utwente.nl/ws/portalfiles/portal/158031484/Seifert2019_cogmi_consumer_labels_preprint.pdf) | “We propose to issue consumer labels for trained and published ML models. These labels primarily target machine learning lay persons, such as the operators of an ML system, the executors of decisions, and the decision subjects themselves” | See [Seifert et al. (2019)](https://ris.utwente.nl/ws/portalfiles/portal/158031484/Seifert2019_cogmi_consumer_labels_preprint.pdf) | | SYSTEMS | ***Factsheets*** [Arnold et al. (2019)](https://huggingface.co/papers/1808.07261) | “A FactSheet will contain sections on all relevant attributes of an AI service, such as intended use, performance, safety, and security. Performance will include appropriate accuracy or risk measures along with timing information.” | See [IBM’s AI Factsheets 360](https://aifs360.res.ibm.com) and [Hind et al., (2020)](https://dl.acm.org/doi/abs/10.1145/3334480.3383051) | | SYSTEMS | ***System Cards*** [Procope et al. (2022)](https://ai.facebook.com/research/publications/system-level-transparency-of-machine-learning) | “System Cards aims to increase the transparency of ML systems by providing stakeholders with an overview of different components of an ML system, how these components interact, and how different pieces of data and protected information are used by the system.” | See [Meta’s Instagram Feed Ranking System Card](https://ai.facebook.com/tools/system-cards/instagram-feed-ranking/) | | SYSTEMS | ***Reward Reports for RL*** [Gilbert et al. (2022)](https://huggingface.co/papers/2204.10817) | “We sketch a framework for documenting deployed learning systems, which we call Reward Reports…We outline Reward Reports as living documents that track updates to design choices and assumptions behind what a particular automated system is optimizing for. They are intended to track dynamic phenomena arising from system deployment, rather than merely static properties of models or data.” | See https://rewardreports.github.io | | SYSTEMS | ***Robustness Gym*** [Goel et al. (2021)](https://huggingface.co/papers/2101.04840) | “We identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks.” | See https://github.com/robustness-gym/robustness-gym | | SYSTEMS | ***ABOUT ML*** [Raji and Yang, (2019)](https://huggingface.co/papers/1912.06166) | “ABOUT ML (Annotation and Benchmarking on Understanding and Transparency of Machine Learning Lifecycles) is a multi-year, multi-stakeholder initiative led by PAI. This initiative aims to bring together a diverse range of perspectives to develop, test, and implement machine learning system documentation practices at scale.” | See [ABOUT ML’s resources library](https://partnershiponai.org/about-ml-resources-library/) | ### DATA-FOCUSED DOCUMENTATION TOOLS Several proposed documentation tools focus on datasets used in the ML system lifecycle, including to train, develop, validate, finetune, and evaluate machine learning models as part of continuous cycles. 
These tools generally focus on the many aspects of the data lifecycle (perhaps for a particular dataset, group of datasets, or more broadly), including how the data was assembled, collected, annotated and how it should be used. * Extending the concept of datasheets in the electronics industry, [Gebru et al. (2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) propose datasheets for datasets to document details related to a dataset’s creation, potential uses, and associated concerns. * [Bender and Friedman (2018)](https://aclanthology.org/Q18-1041/) propose data statements for natural language processing. [Bender, Friedman and McMillan-Major (2021)](https://techpolicylab.uw.edu/wp-content/uploads/2021/11/Data_Statements_Guide_V2.pdf) update the original data statements framework and provide resources including a guide for writing data statements and translating between the first version of the schema and the newer version[^2]. * [Holland et al. (2018)](https://huggingface.co/papers/1805.03677) propose data nutrition labels, akin to nutrition facts for foodstuffs and nutrition labels for privacy disclosures, as a tool for analyzing and making decisions about datasets. The Data Nutrition Label team released an updated design of and interface for the label in 2020 ([Chmielinski et al., 2020)](https://huggingface.co/papers/2201.03954)). * [McMillan-Major et al. (2021)](https://huggingface.co/papers/2108.07374) describe the development process and resulting templates for **data cards for NLP** in the form of data cards on the Hugging Face Hub[^3] and data cards for datasets that are part of the NLP benchmark for Generation and its Evaluation Metrics (GEM) environment[^4]. * [Hutchinson et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445918) describe the need for comprehensive dataset documentation, and drawing on software development practices, provide templates for documenting several aspects of the dataset development lifecycle (for the purposes of Tables 1 and 2, we refer to their framework as the **Dataset Development Lifecycle Documentation Framework**). * [Pushkarna et al. (2021)](https://huggingface.co/papers/2204.01075) propose the data cards as part of the **data card playbook**, a human-centered documentation tool focused on datasets used in industry and research. ### MODEL-AND-METHOD-FOCUSED DOCUMENTATION TOOLS Another set of documentation tools can be thought of as focusing on machine learning models and machine learning methods. These include: * [Mitchell et al. (2018)](https://huggingface.co/papers/1810.03993) propose **model cards** for model reporting to accompany trained ML models and document issues related to evaluation, use, and other issues * [Shen et al. (2021)](https://dl.acm.org/doi/abs/10.1145/3442188.3445971) propose **value cards** for teaching students and practitioners about values related to ML models * [Seifert et al. (2019)](https://ris.utwente.nl/ws/portalfiles/portal/158031484/Seifert2019_cogmi_consumer_labels_preprint.pdf) propose **consumer labels for ML models** to help non-experts using or affected by the model understand key issues related to the model. * [Adkins et al. (2022)](https://dl.acm.org/doi/pdf/10.1145/3491101.3519724) analyse aspects of descriptive documentation tools – which they consider to include **model cards** and data sheets – and argue for increased prescriptive tools for ML engineers. 
They propose method cards, focused on ML methods, and design primarily with technical stakeholders like model developers and reviewers in mind. * They envision the relationship between model cards and method cards, in part, by stating: “The sections and prompts we propose…[in the method card template] focus on ML methods that are sufficient to produce a proper ML model with defined input, output, and task. Examples for these are object detection methods such as Single-shot Detectors and language modelling methods such as Generative Pre-trained Transformers (GPT). *It is possible to create Model Cards for the models created using these methods*.” * They also state “While Model Cards and FactSheets put main focus on documenting existing models, Method Cards focus more on the underlying methodical and algorithmic choices that need to be considered when creating and training these models. *As a rough analogy, if Model Cards and FactSheets provide nutritional information about cooked meals, Method Cards provide the recipes*.” ### SYSTEM-FOCUSED DOCUMENTATION TOOLS Rather than focusing on particular models, datasets, or methods, system-focused documentation tools look at how models interact with each other, with datasets, methods, and with other ML components to form ML systems. * [Procope et al. (2022)](https://ai.facebook.com/research/publications/system-level-transparency-of-machine-learning) propose system cards to document and explain AI systems – potentially including multiple ML models, AI tools, and non-AI technologies – that work together to accomplish tasks. * [Arnold et al. (2019)](https://huggingface.co/papers/1808.07261) extend the idea of declarations of conformity for consumer products to AI services, proposing FactSheets to document aspects of “AI services” which are typically accessed through APIs and may be composed of multiple different ML models. [Hind et al. (2020)](https://dl.acm.org/doi/abs/10.1145/3334480.3383051) share reflections on building factsheets. * [Gilbert et al. (2022)](https://huggingface.co/papers/2204.10817) propose **Reward Reports for Reinforcement Learning** systems, recognizing the dynamic nature of ML systems and the need for documentation efforts to incorporate considerations of post-deployment performance, especially for reinforcement learning systems. * [Goel et al. (2021)](https://huggingface.co/papers/2101.04840) develop **Robustness Gym**, an evaluation toolkit for testing several aspects of deep neural networks in real-world systems, allowing for comparison across evaluation paradigms. * Through the [ABOUT ML project](https://partnershiponai.org/workstream/about-ml/) ([Raji and Yang, 2019](https://huggingface.co/papers/1912.06166)), the Partnership on AI is coordinating efforts across groups of stakeholders in the machine learning community to develop comprehensive, scalable documentation tools for ML systems. ## THE EVOLUTION OF MODEL CARDS Since the proposal for model cards by Mitchell et al. in 2018, model cards have been adopted and adapted by various organisations, including by major technology companies and startups developing and hosting machine learning models[^5], researchers describing new techniques[^6], and government stakeholders evaluating models for various projects[^7]. Model cards also appear as part of AI Ethics educational toolkits, and numerous organisations and developers have created implementations for automating or semi-automating the creation of model cards. 
Appendix A provides a set of examples of model cards for various types of ML models created by different organisations (including model cards for large language models), model card generation tools, and model card educational tools. ### MODEL CARDS ON THE HUGGING FACE HUB Since 2018, new platforms and mediums for hosting and sharing model cards have also emerged. For example, particularly relevant to this project, Hugging Face hosts model cards on the Hugging Face Hub as README files in the repositories associated with ML models. As a result, model cards figure as a prominent form of documentation for users of models on the Hugging Face Hub. As part of our analysis of model cards, we developed and proposed model cards for several dozen ML models on the Hugging Face Hub, using the Hub’s Pull Request (PR) and Discussion features to gather feedback on model cards, verify information included in model cards, and publish model cards for models on the Hugging Face Hub. At the time of writing this guidebook, all of Hugging Face’s models on the Hugging Face Hub have an associated model card on the Hub[^8]. The high number of models uploaded to the Hugging Face Hub (101,041 models at the time of writing) enabled us to explore the content within model cards on the Hub. We began by analysing the model cards of language models in order to identify patterns (e.g., repeated sections and subsections), with the aim of answering initial questions such as: 1) How many of these models have model cards? 2) What percent of downloads had an associated model card? From our analysis of all the models on the Hub, we noticed that most downloads come from the top 200 models. With a continued focus on large language models, ordered by most downloaded and starting with only the models that have model cards, we noted the most recurring sections within their respective model cards. While some headings within model cards may differ between models, we grouped the components/themes of each section within each model card and then mapped them to the most recurring section headings (mostly found in the top 200 downloaded models, and with the aid/guidance of the BLOOM model card). > [!TIP] > [Check out the User Studies](./model-cards-user-studies) > [!TIP] > [See Appendix](./model-card-appendix) [^1]: For each tool, descriptions are excerpted from the linked paper listed in the second column. [^2]: See https://techpolicylab.uw.edu/data-statements/ . [^3]: See https://techpolicylab.uw.edu/data-statements/ . [^4]: See https://techpolicylab.uw.edu/data-statements/ . [^5]: See, e.g., the Hugging Face Hub, Google Cloud’s Model Cards https://modelcards.withgoogle.com/about . [^6]: See Appendix A. [^7]: See GSA / US Census Bureau Collaboration on Model Card Generator. [^8]: By “Hugging Face models,” we mean models shared by Hugging Face, not another organisation, on the Hub. Formally, these are models without a ‘/’ in their model ID. --- **Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### Managing organizations https://huggingface.co/docs/hub/organizations-managing.md # Managing organizations ## Creating an organization Visit the [New Organization](https://hf.co/organizations/new) form to create an organization. ## Managing members New members can be added to an organization by visiting the **Organization settings** and clicking on the **Members** tab.
There, you'll be able to generate an invite link, add members individually, or send out email invitations in bulk. If the **Allow requests to join from the organization page** setting is enabled, you'll also be able to approve or reject any pending requests on the **Members** page. You can also revoke a user's membership or change their role on this page. ## Organization domain name Under the **Account** tab in the Organization settings, you can set an **Organization email domain**. Specifying a domain will allow any user with a matching email address on the Hugging Face Hub to join your organization. ## Leaving an organization Users can leave an organization by visiting their [organization settings](https://huggingface.co/settings/organizations) and clicking **Leave Organization** next to the organization they want to leave. Organization administrators can always remove users as explained above. ### Using Keras at Hugging Face https://huggingface.co/docs/hub/keras.md # Using Keras at Hugging Face Keras is an open-source, multi-backend deep learning framework with support for JAX, TensorFlow, and PyTorch. You can find more details about it on [keras.io](https://keras.io/). ## Exploring Keras in the Hub You can list `keras` models on the Hub by filtering by library name on the [models page](https://huggingface.co/models?library=keras&sort=downloads). Keras models on the Hub come with useful features when uploaded directly from the Keras library: 1. A generated model card with a description, a plot of the model, and more. 2. A download count to monitor the popularity of a model. 3. A code snippet to quickly get started with the model. ## Using existing models Keras is deeply integrated with the Hugging Face Hub. This means you can load and save models on the Hub directly from the library. To do that, you need to install a recent version of Keras and `huggingface_hub`. The `huggingface_hub` library is a lightweight Python client used by Keras to interact with the Hub. ``` pip install -U keras huggingface_hub ``` Once you have the library installed, you just need to use the regular `keras.saving.load_model` method and pass a Hugging Face path as its argument. An HF path is a `repo_id` prefixed by `hf://`, e.g. `"hf://keras-io/weather-prediction"`. Read more about `load_model` in the [Keras documentation](https://keras.io/api/models/model_saving_apis/model_saving_and_loading/#load_model-function). ```py import keras model = keras.saving.load_model("hf://Wauplin/mnist_example") ``` If you want to see how to load a specific model, you can click **Use this model** on the model page to get a working code snippet! ## Sharing your models Similarly to `load_model`, you can save and share a `keras` model on the Hub using `model.save()` with an HF path: ```py model = ... model.save("hf://your-username/your-model-name") ``` If the repository does not exist on the Hub, it will be created for you. The uploaded model contains a model card, a plot of the model, the `metadata.json` and `config.json` files, and a `model.weights.h5` file containing the model weights. By default, the repository will contain a minimal model card. Check out the [Model Card guide](https://huggingface.co/docs/hub/model-cards) to learn more about model cards and how to complete them. You can also programmatically update model cards using `huggingface_hub.ModelCard` (see [guide](https://huggingface.co/docs/huggingface_hub/guides/model-cards)). > [!TIP] > You might already be familiar with `.keras` files.
In fact, a `.keras` file is simply a zip file containing the `.json` and `model.weights.h5` files. When pushed to the Hub, the model is saved as an unzipped folder in order to let you navigate through the files. Note that if you manually upload a `.keras` file to a model repository on the Hub, the repository will automatically be tagged as `keras` but you won't be able to load it using `keras.saving.load_model`. ## Additional resources * Keras Developer [Guides](https://keras.io/guides/). * Keras [examples](https://keras.io/examples/). ### Adding a Sign-In with HF button to your Space https://huggingface.co/docs/hub/spaces-oauth.md # Adding a Sign-In with HF button to your Space You can enable a built-in sign-in flow in your Space by seamlessly creating and associating an [OAuth/OpenID connect](https://developer.okta.com/blog/2019/10/21/illustrated-guide-to-oauth-and-oidc) app so users can log in with their HF account. This enables new use cases for your Space. For instance, when combined with [Persistent Storage](https://huggingface.co/docs/hub/spaces-storage), a generative AI Space could allow users to log in to access their previous generations, only accessible to them. > [!TIP] > This guide will take you through the process of integrating a *Sign-In with HF* button into any Space. If you're seeking a fast and simple method to implement this in a **Gradio** Space, take a look at its [built-in integration](https://www.gradio.app/guides/sharing-your-app#o-auth-login-via-hugging-face). > [!TIP] > You can also use the HF OAuth flow to create a "Sign in with HF" flow in any website or App, outside of Spaces. [Read our general OAuth page](./oauth). ## Create an OAuth app All you need to do is add `hf_oauth: true` to your Space's metadata inside your `README.md` file. Here's an example of metadata for a Gradio Space: ```yaml title: Gradio Oauth Test emoji: 🏆 colorFrom: pink colorTo: pink sdk: gradio sdk_version: 3.40.0 python_version: 3.10.6 app_file: app.py hf_oauth: true # optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes. hf_oauth_expiration_minutes: 480 # optional, see "Scopes" below. "openid profile" is always included. hf_oauth_scopes: - read-repos - write-repos - manage-repos - inference-api # optional, restrict access to members of specific organizations hf_oauth_authorized_org: ORG_NAME hf_oauth_authorized_org: - ORG_NAME1 - ORG_NAME2 ``` You can check out the [configuration reference docs](./spaces-config-reference) for more information. This will add the following [environment variables](https://huggingface.co/docs/hub/spaces-overview#helper-environment-variables) to your space: - `OAUTH_CLIENT_ID`: the client ID of your OAuth app (public) - `OAUTH_CLIENT_SECRET`: the client secret of your OAuth app - `OAUTH_SCOPES`: scopes accessible by your OAuth app. - `OPENID_PROVIDER_URL`: The URL of the OpenID provider. The OpenID metadata will be available at [`{OPENID_PROVIDER_URL}/.well-known/openid-configuration`](https://huggingface.co/.well-known/openid-configuration). As for any other environment variable, you can use them in your code by using `os.getenv("OAUTH_CLIENT_ID")`, for example. ## Redirect URLs You can use any redirect URL you want, as long as it targets your Space. Note that `SPACE_HOST` is [available](https://huggingface.co/docs/hub/spaces-overview#helper-environment-variables) as an environment variable. For example, you can use `https://{SPACE_HOST}/login/callback` as a redirect URI. 
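As a quick illustration, here is a minimal Python sketch of reading the helper environment variables mentioned above from inside a Space and building a redirect URI. The callback path (`/login/callback`) is just the example used above, and the sample `SPACE_HOST` value in the comment is an assumption about its typical format.

```py
import os

# Read the OAuth helper environment variables injected into the Space.
client_id = os.getenv("OAUTH_CLIENT_ID")         # public client ID of the OAuth app
provider_url = os.getenv("OPENID_PROVIDER_URL")  # URL of the OpenID provider
space_host = os.getenv("SPACE_HOST")             # typically something like "user-my-space.hf.space"

# Any URL targeting your Space works as a redirect URI; this mirrors the example above.
redirect_uri = f"https://{space_host}/login/callback"
print(client_id, provider_url, redirect_uri)
```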
## Scopes The following scopes are always included for Spaces: - `openid`: Get the ID token in addition to the access token. - `profile`: Get the user's profile information (username, avatar, etc.) Those scopes are optional and can be added by setting `hf_oauth_scopes` in your Space's metadata: - `email`: Get the user's email address. - `read-billing`: Know whether the user has a payment method set up. - `read-repos`: Get read access to the user's personal repos. - `contribute-repos`: Can create repositories and access those created by this app. Cannot access any other repositories unless additional permissions are granted. - `write-repos`: Get write/read access to the user's personal repos. - `manage-repos`: Get full access to the user's personal repos. Also grants repo creation and deletion. - `inference-api`: Get access to the [Inference Providers](https://huggingface.co/docs/inference-providers/index), you will be able to make inference requests on behalf of the user. - `jobs`: Run [jobs](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs) - `webhooks`: Manage [webhooks](https://huggingface.co/docs/huggingface_hub/main/en/guides/webhooks) - `write-discussions`: Open discussions and Pull Requests on behalf of the user as well as interact with discussions (including reactions, posting/editing comments, closing discussions, ...). To open Pull Requests on private repos, you need to request the `read-repos` scope as well. ## Accessing organization resources By default, the oauth app does not need to access organization resources. But some scopes like `read-repos` or `read-billing` apply to organizations as well. The user can select which organizations to grant access to when authorizing the app. If you require access to a specific organization, you can add `orgIds=ORG_ID` as a query parameter to the OAuth authorization URL. You have to replace `ORG_ID` with the organization ID, which is available in the `organizations.sub` field of the userinfo response. ## Adding the button to your Space You now have all the information to add a "Sign-in with HF" button to your Space. Some libraries ([Python](https://github.com/lepture/authlib), [NodeJS](https://github.com/panva/node-openid-client)) can help you implement the OpenID/OAuth protocol. Gradio and huggingface.js also provide **built-in support**, making implementing the Sign-in with HF button a breeze; you can check out the associated guides with [gradio](https://www.gradio.app/guides/sharing-your-app#o-auth-login-via-hugging-face) and with [huggingface.js](https://huggingface.co/docs/huggingface.js/hub/README#oauth-login). Basically, you need to: - Redirect the user to `https://huggingface.co/oauth/authorize?redirect_uri={REDIRECT_URI}&scope=openid%20profile&client_id={CLIENT_ID}&state={STATE}`, where `STATE` is a random string that you will need to verify later. - Handle the callback on `/auth/callback` or `/login/callback` (or your own custom callback URL) and verify the `state` parameter. - Use the `code` query parameter to get an access token and id token from `https://huggingface.co/oauth/token` (POST request with `client_id`, `code`, `grant_type=authorization_code` and `redirect_uri` as form data, and with `Authorization: Basic {base64(client_id:client_secret)}` as a header). > [!WARNING] > You should use `target=_blank` on the button to open the sign-in page in a new tab, unless you run the space outside its `iframe`. Otherwise, you might encounter issues with cookies on some browsers. 
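If you are wiring up the flow by hand rather than relying on Gradio or huggingface.js, the steps listed above can be sketched in Python roughly as follows. This is a minimal, non-authoritative sketch using `requests`: the endpoints, query parameters, and form fields are the ones described above, while the web-framework plumbing (actually redirecting the user and reading the callback query string, then verifying `state`) is left to your app, and a library such as Authlib can handle most of this for you.

```py
import base64
import os
import secrets
from urllib.parse import urlencode

import requests

CLIENT_ID = os.getenv("OAUTH_CLIENT_ID")
CLIENT_SECRET = os.getenv("OAUTH_CLIENT_SECRET")
REDIRECT_URI = f"https://{os.getenv('SPACE_HOST')}/login/callback"


def login_url() -> tuple[str, str]:
    """Build the authorization URL to redirect the user to, plus the `state` to verify later."""
    state = secrets.token_urlsafe(16)
    params = urlencode(
        {
            "redirect_uri": REDIRECT_URI,
            "scope": "openid profile",
            "client_id": CLIENT_ID,
            "state": state,
        }
    )
    return f"https://huggingface.co/oauth/authorize?{params}", state


def exchange_code(code: str) -> dict:
    """Exchange the `code` query parameter received on the callback for an access token and ID token."""
    basic_auth = base64.b64encode(f"{CLIENT_ID}:{CLIENT_SECRET}".encode()).decode()
    response = requests.post(
        "https://huggingface.co/oauth/token",
        headers={"Authorization": f"Basic {basic_auth}"},
        data={
            "client_id": CLIENT_ID,
            "code": code,
            "grant_type": "authorization_code",
            "redirect_uri": REDIRECT_URI,
        },
    )
    response.raise_for_status()
    return response.json()  # contains the tokens returned by the provider
```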
## Examples: - [Gradio test app](https://huggingface.co/spaces/Wauplin/gradio-oauth-test) - [Hugging Chat (NodeJS/SvelteKit)](https://huggingface.co/spaces/huggingchat/chat-ui) - [Inference Widgets (Auth.js/SvelteKit)](https://huggingface.co/spaces/huggingfacejs/inference-widgets), uses the `inference-api` scope to make inference requests on behalf of the user. - [Client-Side in a Static Space (huggingface.js)](https://huggingface.co/spaces/huggingfacejs/client-side-oauth) - very simple JavaScript example. JS Code example: ```js import { oauthLoginUrl, oauthHandleRedirectIfPresent } from "@huggingface/hub"; const oauthResult = await oauthHandleRedirectIfPresent(); if (!oauthResult) { // If the user is not logged in, redirect to the login page window.location.href = await oauthLoginUrl(); } // You can use oauthResult.accessToken, oauthResult.userInfo among other things console.log(oauthResult); ``` ### Appendix https://huggingface.co/docs/hub/model-card-appendix.md # Appendix ## Appendix A: User Study _Full text responses to key questions_ ### How would you define model cards? ***Insight: Respondents had generally similar views of what model cards are: documentation focused on issues like training, use cases, and bias/limitations*** * Model cards are model descriptions, both of how they were trained, their use cases, and potential biases and limitations * Documents describing the essential features of a model in order for the reader/user to understand the artefact he/she has in front, the background/training, how it can be used, and its technical/ethical limitations. * They serve as a living artefact of models to document them. Model cards contain information that go from a high level description of what the specific model can be used to, to limitations, biases, metrics, and much more. They are used primarily to understand what the model does. * Model cards are to models what GitHub READMEs are to GitHub projects. It tells people all the information they need to know about the model. If you don't write one, nobody will use your model. * From what I understand, a model card uses certain benchmarks (geography, culture, sex, etc) to define both a model's usability and limitations. It's essentially a model's 'nutrition facts label' that can show how a model was created and educates others on its reusability. * Model cards are the metadata and documentation about the model, everything I need to know to use the model properly: info about the model, what paper introduced it, what dataset was it trained on or fine-tuned on, whom does it belong to, are there known risks and limitations with this model, any useful technical info. * IMO model cards are a brief presentation of a model which includes: * short summary of the architectural particularities of the model * describing the data it was trained on * what is the performance on reference datasets (accuracy and speed metrics if possible) * limitations * how to use it in the context of the Transformers library * source (original article, Github repo,...) * Easily accessible documentation that any background can read and learn about critical model components and social impact ### What do you like about model cards? * They are interesting to teach people about new models * As a non-technical guy, the possibility of getting to know the model, to understand the basics of it, it's an opportunity for the author to disclose its innovation in a transparent & explainable (i.e. trustworthy) way. 
* I like interactive model cards with visuals and widgets that allow me to try the model without running any code. * What I like about good model cards is that you can find all the information you need about that particular model. * Model cards are revolutionary to the world of AI ethics. It's one of the first tangible steps in mitigating/educating on biases in machine learning. They foster greater awareness and accountability! * Structured, exhaustive, the more info the better. * It helps to get an understanding of what the model is good (or bad) at. * Conciseness and accessibility ### What do you dislike about model cards? * Might get to technical and/or dense * They contain lots of information for different audiences (researchers, engineers, non engineers), so it's difficult to explore model cards with an intended use cases. * [NOTE: this comment could be addressed with toggle views for different audiences] * Good ones are time consuming to create. They are hard to test to make sure the information is up to date. Often times, model cards are formatted completely differently - so you have to sort of figure out how that certain individual has structured theirs. * [NOTE: this comment helps demonstrate the value of a standardized format and automation tools to make it easier to create model cards] * Without the help of the community to pitch in supplemental evals, model cards might be subject to inherent biases that the developer might not be aware of. It's early days for them, but without more thorough evaluations, a model card's information might be too limited. * Empty model cards. No license information - customers need that info and generally don't have it. * They are usually either too concise or too verbose. * writing them lol bless you ### Other key new insights * Model cards are best filled out when done by people with different roles: Technical specifications can generally only be filled out by the developers; ethical considerations throughout are generally best informed by people who tend to work on ethical issues. * Model users care a lot about licences -- specifically, whether a model can legally be used for a specific task. 
## Appendix B: Landscape Analysis _Overview of the state of model documentation in Machine Learning_ ### MODEL CARD EXAMPLES Examples of model cards and closely-related variants include: * Google Cloud: [Face Detection](https://modelcards.withgoogle.com/face-detection), [Object Detection](https://modelcards.withgoogle.com/object-detection) * Google Research: [ML Kit Vision Models](https://developers.google.com/s/results/ml-kit?q=%22Model%20Card%22), [Face Detection](https://sites.google.com/view/perception-cv4arvr/blazeface), [Conversation AI](https://github.com/conversationai/perspectiveapi/tree/main/model-cards) * OpenAI: [GPT-3](https://github.com/openai/gpt-3/blob/master/model-card.md), [GPT-2](https://github.com/openai/gpt-2/blob/master/model_card.md), [DALL-E dVAE](https://github.com/openai/DALL-E/blob/master/model_card.md), [CLIP](https://github.com/openai/CLIP-featurevis/blob/master/model-card.md) * [NVIDIA Model Cards](https://catalog.ngc.nvidia.com/models?filters=&orderBy=weightPopularASC&query=) * [Salesforce Model Cards](https://blog.salesforceairesearch.com/model-cards-for-ai-model-transparency/) * [Allen AI Model Cards](https://github.com/allenai/allennlp-models/tree/main/allennlp_models/modelcards) * [Co:here AI Model Cards](https://docs.cohere.ai/responsible-use/) * [Duke PULSE Model Card](https://arxiv.org/pdf/2003.03808.pdf) * [Stanford Dynasent](https://github.com/cgpotts/dynasent/blob/main/dynasent_modelcard.md) * [GEM Model Cards](https://gem-benchmark.com/model_cards) * Parl.AI: [Parl.AI sample model cards](https://github.com/facebookresearch/ParlAI/tree/main/docs/sample_model_cards), [BlenderBot 2.0 2.7B](https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/blenderbot2/model_card.md) * [Perspective API Model Cards](https://github.com/conversationai/perspectiveapi/tree/main/model-cards) * See https://github.com/ivylee/model-cards-and-datasheets for more examples! ### MODEL CARDS FOR LARGE LANGUAGE MODELS Large language models are often released with associated documentation. 
Large language models that have an associated model card (or related documentation tool) include: * [Big Science BLOOM model card](https://huggingface.co/bigscience/bloom) * [GPT-2 Model Card](https://github.com/openai/gpt-2/blob/master/model_card.md) * [GPT-3 Model Card](https://github.com/openai/gpt-3/blob/master/model-card.md) * [DALL-E 2 Preview System Card](https://github.com/openai/dalle-2-preview/blob/main/system-card.md) * [OPT-175B model card](https://arxiv.org/pdf/2205.01068.pdf) ### MODEL CARD GENERATION TOOLS Tools for programmatically or interactively generating model cards include: * [Salesforce Model Card Creation](https://help.salesforce.com/s/articleView?id=release-notes.rn_bi_edd_model_card.htm&type=5&release=232) * [TensorFlow Model Card Toolkit](https://ai.googleblog.com/2020/07/introducing-model-card-toolkit-for.html) * [Python library](https://pypi.org/project/model-card-toolkit/) * [GSA / US Census Bureau Collaboration on Model Card Generator](https://bias.xd.gov/resources/model-card-generator/) * [Parl.AI Auto Generation Tool](https://parl.ai/docs/tutorial_model_cards.html) * [VerifyML Model Card Generation Web Tool](https://www.verifyml.com) * [RMarkdown Template for Model Card as part of vetiver package](https://cran.r-project.org/web/packages/vetiver/vignettes/model-card.html) * [Databaseline ML Cards toolkit](https://databaseline.tech/ml-cards/) ### MODEL CARD EDUCATIONAL TOOLS Tools for understanding model cards and understanding how to create model cards include: * [Hugging Face Hub docs](https://huggingface.co/course/chapter4/4?fw=pt) * [Perspective API](https://developers.perspectiveapi.com/s/about-the-api-model-cards) * [Kaggle](https://www.kaggle.com/code/var0101/model-cards/tutorial) * [Code.org](https://studio.code.org/s/aiml-2021/lessons/8) * [UNICEF](https://unicef.github.io/inventory/data/model-card/) --- **Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### Advanced Security https://huggingface.co/docs/hub/enterprise-hub-advanced-security.md # Advanced Security > [!WARNING] > This feature is part of the Team & Enterprise plans. Enterprise Hub organizations can improve their security with advanced security controls for both members and repositories. ## Members Security Configure additional security settings to protect your organization: - **Two-Factor Authentication (2FA)**: Require all organization members to enable 2FA for enhanced account security. - **User Approval**: For organizations with a verified domain name, require admin approval for new users with matching email addresses. This adds a verified badge to your organization page. ## Repository Visibility Controls Manage the default visibility of repositories in your organization: - **Public by default**: New repositories are created with public visibility - **Private by default**: New repositories are created with private visibility. Note that changing this setting will not affect existing repositories. - **Private only**: Enforce private visibility for all new repositories, with only organization admins able to change visibility settings These settings help organizations maintain control of their ownership while enabling collaboration when needed. 
### DuckDB https://huggingface.co/docs/hub/datasets-duckdb.md # DuckDB [DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system. You can use the Hugging Face paths (`hf://`) to access data on the Hub: The [DuckDB CLI](https://duckdb.org/docs/api/cli/overview.html) (Command Line Interface) is a single, dependency-free executable. There are also other APIs available for running DuckDB, including Python, C++, Go, Java, Rust, and more. For additional details, visit their [clients](https://duckdb.org/docs/api/overview.html) page. > [!TIP] > For installation details, visit the [installation page](https://duckdb.org/docs/installation). Starting from version `v0.10.3`, the DuckDB CLI includes native support for accessing datasets on the Hugging Face Hub via URLs with the `hf://` scheme. Here are some features you can leverage with this powerful tool: - Query public datasets and your own gated and private datasets - Analyze datasets and perform SQL operations - Combine datasets and export it to different formats - Conduct vector similarity search on embedding datasets - Implement full-text search on datasets For a complete list of DuckDB features, visit the DuckDB [documentation](https://duckdb.org/docs/). To start the CLI, execute the following command in the installation folder: ```bash ./duckdb ``` ## Forging the Hugging Face URL To access Hugging Face datasets, use the following URL format: ```plaintext hf://datasets/{my-username}/{my-dataset}/{path_to_file} ``` - **my-username**, the user or organization of the dataset, e.g. `ibm` - **my-dataset**, the dataset name, e.g: `duorc` - **path_to_parquet_file**, the parquet file path which supports glob patterns, e.g `**/*.parquet`, to query all parquet files > [!TIP] > You can query auto-converted Parquet files using the @~parquet branch, which corresponds to the `refs/convert/parquet` revision. For more details, refer to the documentation at https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet. > > To reference the `refs/convert/parquet` revision of a dataset, use the following syntax: > > ```plaintext > hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file} > ``` > > Here is a sample URL following the above syntax: > > ```plaintext > hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/0000.parquet > ``` Let's start with a quick demo to query all the rows of a dataset: ```sql FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3; ``` Or using traditional SQL syntax: ```sql SELECT * FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3; ``` In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets. ### Evidence on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-evidence.md # Evidence on Spaces **Evidence** is an open-source framework designed for building data-driven applications, reports, and dashboards using SQL and Markdown. With Evidence, you can quickly create decision-support tools, reports, and interactive dashboards without relying on traditional drag-and-drop business intelligence (BI) platforms. Evidence enables you to: - Write reports and dashboards directly in Markdown with SQL-backed components. - Integrate data from multiple sources, including SQL databases and APIs. - Use templated pages to automatically generate multiple pages based on a single template. - Deploy reports seamlessly to various hosting solutions. 
Visit [Evidence’s documentation](https://docs.evidence.dev/) for guides, examples, and best practices for using Evidence to create data products. ## Deploy Evidence on Spaces You can deploy Evidence on Hugging Face Spaces with just a few clicks: Once created, the Space will display `Building` status. Refresh the page if the status doesn't automatically update to `Running`. Your Evidence app will automatically be deployed on Hugging Face Spaces. ## Editing your Evidence app from the CLI To edit your app, clone the Space and edit the files locally. ```bash git clone https://huggingface.co/spaces/your-username/your-space-name cd your-space-name npm install npm run sources npm run dev ``` You can then modify pages/index.md to change the content of your app. ## Editing your Evidence app from VS Code The easiest way to develop with Evidence is using the [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=Evidence.evidence-vscode): 1. Install the extension from the VS Code Marketplace 2. Open the Command Palette (Ctrl/Cmd + Shift + P) and enter `Evidence: Copy Existing Project` 3. Paste the URL of the Hugging Face Spaces Evidence app you'd like to copy (e.g. `https://huggingface.co/spaces/your-username/your-space-name`) and press Enter 4. Select the folder you'd like to clone the project to and press Enter 5. Press `Start Evidence` in the bottom status bar Check out the docs for [alternative install methods](https://docs.evidence.dev/getting-started/install-evidence), Github Codespaces, and alongside dbt. ## Learning More - [Docs](https://docs.evidence.dev/) - [Github](https://github.com/evidence-dev/evidence) - [Slack Community](https://slack.evidence.dev/) - [Evidence Home Page](https://www.evidence.dev) ### Audit Logs https://huggingface.co/docs/hub/audit-logs.md # Audit Logs > [!WARNING] > This feature is part of the Team & Enterprise plans. Audit Logs enable organization admins to easily review actions taken by members, including organization membership, repository settings and billing changes. ## Accessing Audit Logs Audit Logs are accessible through your organization settings. Each log entry includes: - Who performed the action - What type of action was taken - A description of the change - Location and anonymized IP address - Date and time of the action You can also download the complete audit log as a JSON file for further analysis. ## What Events Are Tracked? 
### Organization Management & Security - Core organization changes - Creation, deletion, and restoration - Name changes and settings updates - Security management - Security token rotation - Token approval system (enabling/disabling, authorization requests, approvals, denials, revocations) - SSO events (logins and joins) ### Membership and Access Control - Member lifecycle - Invitations (sending, accepting) and automatic joins - Adding and removing members - Role changes and departures - Join settings - Domain-based access - Automatic join configurations ### Content and Resource Management - Repository administration - Core actions (creation, deletion, moving, duplication) - Settings and configuration changes - Enabling/disabling repositories - DOI management - Resource group assignments - Collections - Creation and deletion events - Repository security - Secrets management (individual and bulk) - Variables handling (individual and bulk) - Spaces configuration - Storage modifications - Hardware settings - Sleep time adjustments ### Billing and AWS Integration - Payment management - Payment methods (adding/removing) - Customer account creation - AWS integration setup and removal - Subscription lifecycle - Starting and renewing - Updates and cancellations - Cancellation reversals ### Resource Groups - Administrative actions - Creation and deletion - Settings modifications - Member management - Adding and removing users - Role assignments and changes ### Combine datasets and export https://huggingface.co/docs/hub/datasets-duckdb-combine-and-export.md # Combine datasets and export In this section, we'll demonstrate how to combine two datasets and export the result. The first dataset is in CSV format, and the second dataset is in Parquet format. Let's start by examining our datasets: The first will be [TheFusion21/PokemonCards](https://huggingface.co/datasets/TheFusion21/PokemonCards): ```bash FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' LIMIT 3; ┌─────────┬──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┬───────┬─────────────────┐ │ id │ image_url │ caption │ name │ hp │ set_name │ │ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │ ├─────────┼──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┼───────┼─────────────────┤ │ pl3-1 │ https://images.pok… │ A Basic, SP Pokemon Card of type Darkness with the title Absol G and 70 HP of rarity Rare Holo from the set Supreme Victors. 
It has … │ Absol G │ 70 │ Supreme Victors │ │ ex12-1 │ https://images.pok… │ A Stage 1 Pokemon Card of type Colorless with the title Aerodactyl and 70 HP of rarity Rare Holo evolved from Mysterious Fossil from … │ Aerodactyl │ 70 │ Legend Maker │ │ xy5-1 │ https://images.pok… │ A Basic Pokemon Card of type Grass with the title Weedle and 50 HP of rarity Common from the set Primal Clash and the flavor text: It… │ Weedle │ 50 │ Primal Clash │ └─────────┴──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┴───────┴─────────────────┘ ``` And the second one will be [wanghaofan/pokemon-wiki-captions](https://huggingface.co/datasets/wanghaofan/pokemon-wiki-captions): ```bash FROM 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' LIMIT 3; ┌──────────────────────┬───────────┬──────────┬──────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ image │ name_en │ name_zh │ text_en │ text_zh │ │ struct(bytes blob,… │ varchar │ varchar │ varchar │ varchar │ ├──────────────────────┼───────────┼──────────┼──────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ {'bytes': \x89PNG\… │ abomasnow │ 暴雪王 │ Grass attributes,Blizzard King standing on two feet, with … │ 草属性,双脚站立的暴雪王,全身白色的绒毛,淡紫色的眼睛,几缕长条装的毛皮盖着它的嘴巴 │ │ {'bytes': \x89PNG\… │ abra │ 凯西 │ Super power attributes, the whole body is yellow, the head… │ 超能力属性,通体黄色,头部外形类似狐狸,尖尖鼻子,手和脚上都有三个指头,长尾巴末端带着一个褐色圆环 │ │ {'bytes': \x89PNG\… │ absol │ 阿勃梭鲁 │ Evil attribute, with white hair, blue-gray part without ha… │ 恶属性,有白色毛发,没毛发的部分是蓝灰色,头右边类似弓的角,红色眼睛 │ └──────────────────────┴───────────┴──────────┴──────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┘ ``` Now, let's try to combine these two datasets by joining on the `name` column: ```bash SELECT a.image_url , a.caption AS card_caption , a.name , a.hp , b.text_en as wiki_caption FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b ON LOWER(a.name) = b.name_en LIMIT 3; ┌──────────────────────┬──────────────────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ image_url │ card_caption │ name │ hp │ wiki_caption │ │ varchar │ varchar │ varchar │ int64 │ varchar │ ├──────────────────────┼──────────────────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ https://images.pok… │ A Stage 1 Pokemon … │ Aerodactyl │ 70 │ A Pokémon with rock attributes, gray body, blue pupils, purple inner wings, two sharp claws on the wings, jagged teeth, and an arrow-like … │ │ https://images.pok… │ A Basic Pokemon Ca… │ Weedle │ 50 │ Insect-like, caterpillar-like in appearance, with a khaki-yellow body, seven pairs of pink gastropods, a pink nose, a sharp poisonous need… │ │ https://images.pok… │ A Basic Pokemon Ca… │ Caterpie │ 50 │ Insect attributes, caterpillar appearance, green back, white abdomen, Y-shaped red antennae on the head, yellow spindle-shaped tail, two p… │ 
└──────────────────────┴──────────────────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ ``` We can export the result to a Parquet file using the `COPY` command: ```bash COPY (SELECT a.image_url , a.caption AS card_caption , a.name , a.hp , b.text_en as wiki_caption FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b ON LOWER(a.name) = b.name_en) TO 'output.parquet' (FORMAT PARQUET); ``` Let's validate the new Parquet file: ```bash SELECT COUNT(*) FROM 'output.parquet'; ┌──────────────┐ │ count_star() │ │ int64 │ ├──────────────┤ │ 9460 │ └──────────────┘ ``` > [!TIP] > You can also export to [CSV](https://duckdb.org/docs/guides/file_formats/csv_export), [Excel](https://duckdb.org/docs/guides/file_formats/excel_export > ) and [JSON](https://duckdb.org/docs/guides/file_formats/json_export > ) formats. Finally, let's push the resulting dataset to the Hub. You can use the Hub UI, the `huggingface_hub` client library and more to upload your Parquet file, see more information [here](./datasets-adding). And that's it! You've successfully combined two datasets, exported the result, and uploaded it to the Hugging Face Hub. ### Advanced Compute Options https://huggingface.co/docs/hub/advanced-compute-options.md # Advanced Compute Options > [!WARNING] > This feature is part of the Team & Enterprise plans. Enterprise Hub organizations gain access to advanced compute options to accelerate their machine learning journey. ## Host ZeroGPU Spaces in your organization ZeroGPU is a dynamic GPU allocation system that optimizes AI deployment on Hugging Face Spaces. By automatically allocating and releasing NVIDIA H200 GPU slices (70GB VRAM) as needed, organizations can efficiently serve their AI applications without dedicated GPU instances. **Key benefits for organizations** - **Free GPU Access**: Access powerful NVIDIA H200 GPUs at no additional cost through dynamic allocation - **Enhanced Resource Management**: Host up to 50 ZeroGPU Spaces for efficient team-wide AI deployment - **Simplified Deployment**: Easy integration with PyTorch-based models, Gradio apps, and other Hugging Face libraries - **Enterprise-Grade Infrastructure**: Access to high-performance NVIDIA H200 GPUs with 70GB VRAM per workload [Learn more about ZeroGPU →](https://huggingface.co/docs/hub/spaces-zerogpu) ### Repository Settings https://huggingface.co/docs/hub/repositories-settings.md # Repository Settings ## Private repositories You can choose a repository's visibility when you create it, and any repository that you own can have its visibility toggled between *public* and *private* in the **Settings** tab. Unless your repository is owned by an [organization](./organizations), you are the only user that can make changes to your repo or upload any code. Setting your visibility to *private* will: - Ensure your repo does not show up in other users' search results. - Other users who visit the URL of your private repo will receive a `404 - Repo not found` error. - Other users will not be able to clone your repo. ## Renaming or transferring a repo If you own a repository, you will be able to visit the **Settings** tab to manage its name and transfer ownership. Transferring or renaming a repo will automatically redirect the old URL to the new location, and will preserve download counts and likes. 
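Renames and transfers can also be done programmatically. Below is a minimal sketch using the `huggingface_hub` client library's `move_repo` helper, which corresponds to the move endpoint described in the Hub API docs; the repository identifiers below are placeholders, and you are assumed to be authenticated with a token that has sufficient rights.

```python
from huggingface_hub import move_repo

# Rename a model repo within your own namespace (placeholder names).
move_repo(from_id="my-username/old-name", to_id="my-username/new-name")

# Transfer a dataset repo from your account to an organization you belong to.
move_repo(
    from_id="my-username/my-dataset",
    to_id="my-org/my-dataset",
    repo_type="dataset",
)
```

As with the web UI, the old URL will redirect to the new location.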
There are limitations that depend on [your access level permissions](./organizations-security).

Moving can be used in these use cases ✅
- Renaming a repository within the same user account.
- Renaming a repository within the same organization. You must have "write" or "admin" rights in the organization.
- Transferring a repository from a user to an organization. You must be a member of the organization and have at least "contributor" rights.
- Transferring a repository from an organization to yourself. You must have "admin" rights in the organization.
- Transferring a repository from a source organization to another target organization. You must have "admin" rights in the source organization **and** at least "contributor" rights in the target organization.

Moving does not work in the following cases ❌
- Transferring a repository from an organization to another user who is not yourself.
- Transferring a repository from a source organization to another target organization if the user does not have both "admin" rights in the source organization **and** at least "contributor" rights in the target organization.
- Transferring a repository from user A to user B.

If these are use cases you need help with, please send us an email at **website at huggingface.co**.

## Disabling Discussions / Pull Requests

You can disable all discussions and Pull Requests. Once disabled, community and contribution features will no longer be available. This action can be reverted without losing any previous discussions or Pull Requests.

### Hub Rate limits

https://huggingface.co/docs/hub/rate-limits.md

# Hub Rate limits

To protect our platform's integrity and ensure availability to as many AI community members as possible, we enforce rate limits on all requests made to the Hugging Face Hub.

We define different rate limits for distinct classes of requests. We distinguish three main buckets:

- **Hub APIs** - e.g. model or dataset search, repo creation, user management, etc. All endpoints that belong to this bucket are documented in [Hub API Endpoints](./api).
- **Resolvers** - These are all the URLs that contain a `/resolve/` segment in their path, which serve user-generated content from the Hub. Concretely, those are the URLs that are constructed by open source libraries (transformers, datasets, vLLM, llama.cpp, …) or AI applications (LM Studio, Jan, ollama, …) to download model/dataset files from HF.
  - Specifically, this is the ["Resolve a file" endpoint](https://lnkd.in/eesDKirG) documented in our OpenAPI spec.
  - Resolve requests are heavily used by the community, and since we optimize our infrastructure to serve them with maximum efficiency, the rate limits for Resolvers are the highest.
- **Pages** - All the Web pages we host on huggingface.co. Usually Web browsing requests are made by humans, hence rate limits don't need to be as high as the above-mentioned programmatic endpoints.

> [!TIP]
> All values are defined over 5-minute windows, which allows for some level of "burstiness" from an application or developer's point of view.

If you, your organization, or your application need higher rate limits, we encourage you to upgrade your account to PRO, Team, or Enterprise. We prioritize support requests from PRO, Team, and Enterprise customers – see built-in limits in [Rate limit Tiers](#rate-limit-tiers).
## Billing dashboard At any point, you can check your rate limit status on your (or your org’s) Billing page: https://huggingface.co/settings/billing ![dashboard for rate limits](https://cdn-uploads.huggingface.co/production/uploads/5dd96eb166059660ed1ee413/0pzQQyuVG3c9tWjCqrX9Y.png) On the right side, you will see three gauges, one for each bucket of Requests. Each bucket presents the number of current (last 5 minutes) requests, and the number of allowed requests based on your user account or organization plan. Whenever you exceed the limit in the past 5 minutes (the view is updated in real-time), the bar will turn red. Note: You can use the context switcher to easily switch between your user account and your orgs. ## HTTP Headers Whenever you or your organization hits a rate limit, you will receive a **429** `Too Many Requests` HTTP error. We implement the mechanism described in the [IETF draft (Version 9)](https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/) titled “RateLimit HTTP header fields for HTTP” (also known as `draft-ietf-httpapi-ratelimit-headers`). The goal is to define standardized HTTP headers that servers can use to advertise quota / rate-limit policies and communicate current usage / limits to clients so that they can avoid being throttled. Precisely, we implement the following headers: | Header | Purpose / Meaning | | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | | **`RateLimit`** | The total allowed rate limit for the current window. “How many requests (of this type) you’re allowed to perform.” | | **`RateLimit-Policy`** | Carries the rate limit policy itself (e.g. “100 requests per 5 minutes”). It’s informative; shows what policy the client is subject to. | A set of examples is as follows: | Header | Example | | ---------------------- | ----------------------------------------------------------------------------------------------------- | | **`RateLimit`** | `"api\|pages\|resolvers";r=[remaining];t=[seconds remaining until reset]` | | **`RateLimit-Policy`** | `"fixed window";"api\|\pages\|resolvers";q=[total allowed for window];w=[window duration in seconds]` | ## Rate limit Tiers Here are the current rate limits (in September '25) based on your plan: | Plan | API | Resolvers | Pages | | ------------------------------------------------------------------------- | -------- | --------- | ------ | | Anonymous user (per IP address) | 500 \* | 3,000 \* | 100 \* | | Free user | 1,000 \* | 5,000 \* | 200 \* | | PRO user | 2,500 | 12,000 | 400 | | Team organization | 3,000 | 20,000 | 400 | | Enterprise organization | 6,000 | 50,000 | 600 | | Enterprise Plus organization | 10,000 | 100,000 | 1,000 | | Enterprise Plus organization When Organization IP Ranges are defined | 100,000 | 500,000 | 10,000 | | Academia Hub organization | 2,500 | 12,000 | 400 | \* Anonymous and Free users are subject to change over time depending on platform health 🤞 > [!NOTE] > All quotas are calculated over 5-minute fixed windows. Note: For organizations, rate limits are applied individually to each member, not shared among members. ## What if I get rate-limited First, make sure you always pass a `HF_TOKEN`, and it is passed downstream to all libraries or applications that download _stuff_ from the Hub. This is the number one reason users get rate limited and is a very easy fix. 
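As an illustration, here is a minimal sketch of two common ways to make sure your token is picked up with `huggingface_hub` (the repository and filename are just examples, and `hf_xxx` is a placeholder token):

```python
import os
from huggingface_hub import hf_hub_download

# Option 1: set the HF_TOKEN environment variable (usually done in your shell
# or deployment environment); libraries built on huggingface_hub will pick it up.
os.environ["HF_TOKEN"] = "hf_xxx"  # placeholder token

# Option 2: pass the token explicitly for a single download.
path = hf_hub_download(
    repo_id="google-bert/bert-base-cased",  # example repo
    filename="config.json",
    token=os.environ["HF_TOKEN"],
)
print(path)
```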
Despite passing `HF_TOKEN` if you are still rate limited, you can: - spread out your requests over longer periods of time - replace Hub API calls with Resolver calls, whenever possible (Resolver rate limits are much higher and much more optimized). - upgrade to PRO, Team, or Enterprise. ## Granular user action Rate limits In addition to those main classes of rate limits, we enforce limits on certain specific kinds of user actions, like: - repo creation - repo commits - discussions and comments - moderation actions - etc. We don't currently document the rate limits for those specific actions, given they tend to change over time more often. If you get quota errors, we encourage you to upgrade your account to PRO, Team, or Enterprise. Feel free to get in touch with us via the support team. ### Pickle Scanning https://huggingface.co/docs/hub/security-pickle.md # Pickle Scanning Pickle is a widely used serialization format in ML. Most notably, it is the default format for PyTorch model weights. There are dangerous arbitrary code execution attacks that can be perpetrated when you load a pickle file. We suggest loading models from users and organizations you trust, relying on signed commits, and/or loading models from TF or Jax formats with the `from_tf=True` auto-conversion mechanism. We also alleviate this issue by displaying/"vetting" the list of imports in any pickled file, directly on the Hub. Finally, we are experimenting with a new, simple serialization format for weights called [`safetensors`](https://github.com/huggingface/safetensors). ## What is a pickle? From the [official docs](https://docs.python.org/3/library/pickle.html) : > The `pickle` module implements binary protocols for serializing and de-serializing a Python object structure. What this means is that pickle is a serializing protocol, something you use to efficiently share data amongst parties. We call a pickle the binary file that was generated while pickling. At its core, the pickle is basically a stack of instructions or opcodes. As you probably have guessed, it’s not human readable. The opcodes are generated when pickling and read sequentially at unpickling. Based on the opcode, a given action is executed. Here’s a small example: ```python import pickle import pickletools var = "data I want to share with a friend" # store the pickle data in a file named 'payload.pkl' with open('payload.pkl', 'wb') as f: pickle.dump(var, f) # disassemble the pickle # and print the instructions to the command line with open('payload.pkl', 'rb') as f: pickletools.dis(f) ``` When you run this, it will create a pickle file and print the following instructions in your terminal: ```python 0: \x80 PROTO 4 2: \x95 FRAME 48 11: \x8c SHORT_BINUNICODE 'data I want to share with a friend' 57: \x94 MEMOIZE (as 0) 58: . STOP highest protocol among opcodes = 4 ``` Don’t worry too much about the instructions for now, just know that the [pickletools](https://docs.python.org/3/library/pickletools.html) module is very useful for analyzing pickles. It allows you to read the instructions in the file ***without*** executing any code. Pickle is not simply a serialization protocol, it allows more flexibility by giving the ability to users to run python code at de-serialization time. Doesn’t sound good, does it? ## Why is it dangerous? As we’ve stated above, de-serializing pickle means that code can be executed. 
But this comes with certain limitations: you can only reference functions and classes from the top level module; you cannot embed them in the pickle file itself. Back to the drawing board: ```python import pickle import pickletools class Data: def __init__(self, important_stuff: str): self.important_stuff = important_stuff d = Data("42") with open('payload.pkl', 'wb') as f: pickle.dump(d, f) ``` When we run this script we get the `payload.pkl` again. When we check the file’s contents: ```bash # cat payload.pkl __main__Data)}important_stuff42sb.% # hexyl payload.pkl ┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐ │00000000│ 80 04 95 33 00 00 00 00 ┊ 00 00 00 8c 08 5f 5f 6d │ו×30000┊000ו__m│ │00000010│ 61 69 6e 5f 5f 94 8c 04 ┊ 44 61 74 61 94 93 94 29 │ain__×ו┊Data×××)│ │00000020│ 81 94 7d 94 8c 0f 69 6d ┊ 70 6f 72 74 61 6e 74 5f │××}×וim┊portant_│ │00000030│ 73 74 75 66 66 94 8c 02 ┊ 34 32 94 73 62 2e │stuff×ו┊42×sb. │ └────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘ ``` We can see that there isn’t much in there, a few opcodes and the associated data. You might be thinking, so what’s the problem with pickle? Let’s try something else: ```python from fickling.pickle import Pickled import pickle # Create a malicious pickle data = "my friend needs to know this" pickle_bin = pickle.dumps(data) p = Pickled.load(pickle_bin) p.insert_python_exec('print("you\'ve been pwned !")') with open('payload.pkl', 'wb') as f: p.dump(f) # innocently unpickle and get your friend's data with open('payload.pkl', 'rb') as f: data = pickle.load(f) print(data) ``` Here we’re using the [fickling](https://github.com/trailofbits/fickling) library for simplicity. It allows us to add pickle instructions to execute code contained in a string via the `exec` function. This is how you circumvent the fact that you cannot define functions or classes in your pickles: you run exec on python code saved as a string. When you run this, it creates a `payload.pkl` and prints the following: ``` you've been pwned ! my friend needs to know this ``` If we check the contents of the pickle file, we get: ```bash # cat payload.pkl c__builtin__ exec (Vprint("you've been pwned !") tR my friend needs to know this.% # hexyl payload.pkl ┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐ │00000000│ 63 5f 5f 62 75 69 6c 74 ┊ 69 6e 5f 5f 0a 65 78 65 │c__built┊in___exe│ │00000010│ 63 0a 28 56 70 72 69 6e ┊ 74 28 22 79 6f 75 27 76 │c_(Vprin┊t("you'v│ │00000020│ 65 20 62 65 65 6e 20 70 ┊ 77 6e 65 64 20 21 22 29 │e been p┊wned !")│ │00000030│ 0a 74 52 80 04 95 20 00 ┊ 00 00 00 00 00 00 8c 1c │_tR×•× 0┊000000ו│ │00000040│ 6d 79 20 66 72 69 65 6e ┊ 64 20 6e 65 65 64 73 20 │my frien┊d needs │ │00000050│ 74 6f 20 6b 6e 6f 77 20 ┊ 74 68 69 73 94 2e │to know ┊this×. │ └────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘ ``` Basically, this is what’s happening when you unpickle: ```python # ... opcodes_stack = [exec_func, "malicious argument", "REDUCE"] opcode = stack.pop() if opcode == "REDUCE": arg = opcodes_stack.pop() callable = opcodes_stack.pop() opcodes_stack.append(callable(arg)) # ... ``` The instructions that pose a threat are `STACK_GLOBAL`, `GLOBAL` and `REDUCE`. `REDUCE` is what tells the unpickler to execute the function with the provided arguments and `*GLOBAL` instructions are telling the unpickler to `import` stuff. 
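As an illustration of what static analysis of those opcodes looks like, here is a simplified sketch (not the Hub's actual scanner) that uses `pickletools.genops` to list the imports a pickle references without executing it. The stack handling below is deliberately naive and only meant to show the idea:

```python
import pickletools

def list_pickle_imports(path):
    """List (module, name) pairs referenced by GLOBAL / STACK_GLOBAL opcodes,
    without ever executing the pickle."""
    imports = []
    string_stack = []  # naive tracking of string pushes, used for STACK_GLOBAL
    with open(path, "rb") as f:
        data = f.read()
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("SHORT_BINUNICODE", "BINUNICODE", "UNICODE"):
            string_stack.append(arg)
        elif opcode.name == "GLOBAL":
            module, name = arg.split(" ", 1)
            imports.append((module, name))
        elif opcode.name == "STACK_GLOBAL" and len(string_stack) >= 2:
            imports.append((string_stack[-2], string_stack[-1]))
    return imports

# For the malicious payload built above, this reports ('__builtin__', 'exec').
print(list_pickle_imports("payload.pkl"))
```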
To sum up, pickle is dangerous because: - when importing a python module, arbitrary code can be executed - you can import builtin functions like `eval` or `exec`, which can be used to execute arbitrary code - when instantiating an object, the constructor may be called This is why it is stated in most docs using pickle, do not unpickle data from untrusted sources. ## Mitigation Strategies ***Don’t use pickle*** Sound advice Luc, but pickle is used profusely and isn’t going anywhere soon: finding a new format everyone is happy with and initiating the change will take some time. So what can we do for now? ### Load files from users and organizations you trust On the Hub, you have the ability to [sign your commits with a GPG key](./security-gpg). This does **not** guarantee that your file is safe, but it does guarantee the origin of the file. If you know and trust user A and the commit that includes the file on the Hub is signed by user A’s GPG key, it’s pretty safe to assume that you can trust the file. ### Load model weights from TF or Flax TensorFlow and Flax checkpoints are not affected, and can be loaded within PyTorch architectures using the `from_tf` and `from_flax` kwargs for the `from_pretrained` method to circumvent this issue. E.g.: ```python from transformers import AutoModel model = AutoModel.from_pretrained("google-bert/bert-base-cased", from_flax=True) ``` ### Use your own serialization format - [MsgPack](https://msgpack.org/index.html) - [Protobuf](https://developers.google.com/protocol-buffers) - [Cap'n'proto](https://capnproto.org/) - [Avro](https://avro.apache.org/) - [safetensors](https://github.com/huggingface/safetensors) This last format, `safetensors`, is a simple serialization format that we are working on and experimenting with currently! Please help or contribute if you can 🔥. ### Improve `torch.load/save` There's an open discussion in progress at PyTorch on having a [Safe way of loading only weights from *.pt file by default](https://github.com/pytorch/pytorch/issues/52181) – please chime in there! ### Hub’s Security Scanner #### What we have now We have created a security scanner that scans every file pushed to the Hub and runs security checks. At the time of writing, it runs two types of scans: - ClamAV scans - Pickle Import scans For ClamAV scans, files are run through the open-source antivirus [ClamAV](https://www.clamav.net). While this covers a good amount of dangerous files, it doesn’t cover pickle exploits. We have implemented a Pickle Import scan, which extracts the list of imports referenced in a pickle file. Every time you upload a `pytorch_model.bin` or any other pickled file, this scan is run. On the hub the list of imports will be displayed next to each file containing imports. If any import looks suspicious, it will be highlighted. We get this data thanks to [`pickletools.genops`](https://docs.python.org/3/library/pickletools.html#pickletools.genops) which allows us to read the file without executing potentially dangerous code. Note that this is what allows to know if, when unpickling a file, it will `REDUCE` on a potentially dangerous function that was imported by `*GLOBAL`. ***Disclaimer***: this is not 100% foolproof. It is your responsibility as a user to check if something is safe or not. We are not actively auditing python packages for safety, the safe/unsafe imports lists we have are maintained in a best-effort manner. 
If you think something is not safe, please contact us by sending an email to website at huggingface.co, and we will flag it as such.

#### Potential solutions

One could think of creating a custom [Unpickler](https://docs.python.org/3/library/pickle.html#pickle.Unpickler) along the lines of [this one](https://github.com/facebookresearch/CrypTen/blob/main/crypten/common/serial.py). But as we can see in this [sophisticated exploit](https://ctftime.org/writeup/16723), this won't work. Thankfully, there is always a trace of the `eval` import, so reading the opcodes directly should allow us to catch malicious usage.

The current solution I propose is creating a file resembling a `.gitignore` but for imports. This file would be a whitelist of imports; a `pytorch_model.bin` file would be flagged as dangerous if it contains imports not included in the whitelist. One could imagine a regex-like format where you could, for instance, allow all numpy submodules via a simple line like `numpy.*`.

## Further Reading

[pickle - Python object serialization - Python 3.10.6 documentation](https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled)

[Dangerous Pickles - Malicious Python Serialization](https://intoli.com/blog/dangerous-pickles/)

[GitHub - trailofbits/fickling: A Python pickling decompiler and static analyzer](https://github.com/trailofbits/fickling)

[Exploiting Python pickles](https://davidhamann.de/2020/04/05/exploiting-python-pickle/)

[cpython/pickletools.py at 3.10 · python/cpython](https://github.com/python/cpython/blob/3.10/Lib/pickletools.py)

[cpython/pickle.py at 3.10 · python/cpython](https://github.com/python/cpython/blob/3.10/Lib/pickle.py)

[CrypTen/serial.py at main · facebookresearch/CrypTen](https://github.com/facebookresearch/CrypTen/blob/main/crypten/common/serial.py)

[CTFtime.org / Balsn CTF 2019 / pyshv1 / Writeup](https://ctftime.org/writeup/16723)

[Rehabilitating Python's pickle module](https://github.com/moreati/pickle-fuzz)

### Handling Spaces Dependencies in Gradio Spaces

https://huggingface.co/docs/hub/spaces-dependencies.md

# Handling Spaces Dependencies in Gradio Spaces

## Default dependencies

The default Gradio Spaces environment comes with several pre-installed dependencies:

* The [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/index) client library allows you to manage your repository and files on the Hub with Python and programmatically access the Inference API from your Space. If you choose to instantiate the model in your app with the Inference API, you can benefit from the built-in acceleration optimizations. This option also consumes less computing resources, which is always nice for the environment! 🌎 Refer to this [page](https://huggingface.co/docs/huggingface_hub/how-to-inference) for more information on how to programmatically access the Inference API.
* [`requests`](https://docs.python-requests.org/en/master/) is useful for calling third-party APIs from your app.
* [`datasets`](https://github.com/huggingface/datasets) allows you to fetch or display any dataset from the Hub inside your app.
* [`gradio`](https://github.com/gradio-app/gradio). You can optionally require a specific version using [`sdk_version` in the `README.md` file](spaces-config-reference).
* Common Debian packages, such as `ffmpeg`, `cmake`, `libsm6`, and a few others.

## Adding your own dependencies

If you need other Python packages to run your app, add them to a **requirements.txt** file at the root of the repository.
The Spaces runtime engine will create a custom environment on-the-fly. You can also add a **pre-requirements.txt** file describing dependencies that will be installed before your main dependencies. It can be useful if you need to update pip itself. Debian dependencies are also supported. Add a **packages.txt** file at the root of your repository, and list all your dependencies in it. Each dependency should be on a separate line, and each line will be read and installed by `apt-get install`. ### Hub API Endpoints https://huggingface.co/docs/hub/api.md # Hub API Endpoints We have open endpoints that you can use to retrieve information from the Hub as well as perform certain actions such as creating model, dataset or Space repos. We offer a wrapper Python client, [`huggingface_hub`](https://github.com/huggingface/huggingface_hub), and a JS client, [`huggingface.js`](https://github.com/huggingface/huggingface.js), that allow easy access to these endpoints. We also provide [webhooks](./webhooks) to receive real-time incremental info about repos. Enjoy! The base URL for those endpoints below is `https://huggingface.co`. For example, to construct the `/api/models` call below, one can call the URL [https://huggingface.co/api/models](https://huggingface.co/api/models) ## The Hub API Playground Want to try out our API? Try it out now on our OpenAPI-based [Playground](https://huggingface.co/spaces/huggingface/openapi)! You can also access the OpenAPI specification directly: [https://huggingface.co/.well-known/openapi.json](https://huggingface.co/.well-known/openapi.json) All API calls are subject to the HF-wide [Rate limits](./rate-limits). Upgrade your account if you need elevated, large-scale access. > [!NOTE] > The rest of this page is a partial list of some of our API endpoints. But note that the exhaustive reference is our [OpenAPI documentation](https://huggingface.co/spaces/huggingface/openapi). > It is much more complete and guaranteed to always be up-to-date. ## Repo listing API The following endpoints help get information about models, datasets, and Spaces stored on the Hub. > [!TIP] > When making API calls to retrieve information about repositories, the createdAt attribute indicates the time when the respective repository was created. It's important to note that there is a unique value, 2022-03-02T23:29:04.000Z assigned to all repositories that were created before we began storing creation dates. ### GET /api/models Get information from all models in the Hub. The response is paginated, use the [`Link` header](https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28#link-header) to get the next pages. You can specify additional parameters to have more specific results. - `search`: Filter based on substrings for repos and their usernames, such as `resnet` or `microsoft` - `author`: Filter models by an author or organization, such as `huggingface` or `microsoft` - `filter`: Filter based on tags, such as `text-classification` or `spacy`. - `sort`: Property to use when sorting, such as `downloads` or `author`. - `direction`: Direction in which to sort, such as `-1` for descending, and anything else for ascending. - `limit`: Limit the number of models fetched. - `full`: Whether to fetch most model data, such as all tags, the files, etc. - `config`: Whether to also fetch the repo config. 
Payload: ```js params = { "search":"search", "author":"author", "filter":"filter", "sort":"sort", "direction":"direction", "limit":"limit", "full":"full", "config":"config" } ``` This is equivalent to `huggingface_hub.list_models()`. ### GET /api/models/{repo_id} or /api/models/{repo_id}/revision/{revision} Get all information for a specific model. This is equivalent to `huggingface_hub.model_info(repo_id, revision)`. ### GET /api/models-tags-by-type Gets all the available model tags hosted in the Hub. This is equivalent to `huggingface_hub.get_model_tags()`. ### GET /api/datasets Get information from all datasets in the Hub. The response is paginated, use the [`Link` header](https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28#link-header) to get the next pages. You can specify additional parameters to have more specific results. - `search`: Filter based on substrings for repos and their usernames, such as `pets` or `microsoft` - `author`: Filter datasets by an author or organization, such as `huggingface` or `microsoft` - `filter`: Filter based on tags, such as `task_categories:text-classification` or `languages:en`. - `sort`: Property to use when sorting, such as `downloads` or `author`. - `direction`: Direction in which to sort, such as `-1` for descending, and anything else for ascending. - `limit`: Limit the number of datasets fetched. - `full`: Whether to fetch most dataset data, such as all tags, the files, etc. Payload: ```js params = { "search":"search", "author":"author", "filter":"filter", "sort":"sort", "direction":"direction", "limit":"limit", "full":"full", "config":"config" } ``` This is equivalent to `huggingface_hub.list_datasets()`. ### GET /api/datasets/{repo_id} or /api/datasets/{repo_id}/revision/{revision} Get all information for a specific dataset. - `full`: Whether to fetch most dataset data, such as all tags, the files, etc. Payload: ```js params = {"full": "full"} ``` This is equivalent to `huggingface_hub.dataset_info(repo_id, revision)`. ### GET /api/datasets/{repo_id}/parquet Get the list of auto-converted parquet files. Append the subset and the split to the URL to get the list of files for a specific subset and split: - `GET /api/datasets/{repo_id}/parquet/{subset}` - `GET /api/datasets/{repo_id}/parquet/{subset}/{split}` ### GET /api/datasets/{repo_id}/parquet/{subset}/{split}/{n}.parquet Get the nth shard of the auto-converted parquet files, for a specific subset (also called "config") and split. ### GET /api/datasets/{repo_id}/croissant Get the Croissant metadata. More details at https://huggingface.co/docs/datasets-server/croissant. ### GET /api/datasets-tags-by-type Gets all the available dataset tags hosted in the Hub. This is equivalent to `huggingface_hub.get_dataset_tags()`. ### GET /api/spaces Get information from all Spaces in the Hub. The response is paginated, use the [`Link` header](https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28#link-header) to get the next pages. You can specify additional parameters to have more specific results. - `search`: Filter based on substrings for repos and their usernames, such as `resnet` or `microsoft` - `author`: Filter models by an author or organization, such as `huggingface` or `microsoft` - `filter`: Filter based on tags, such as `text-classification` or `spacy`. - `sort`: Property to use when sorting, such as `downloads` or `author`. 
- `direction`: Direction in which to sort, such as `-1` for descending, and anything else for ascending.
- `limit`: Limit the number of Spaces fetched.
- `full`: Whether to fetch most Space data, such as all tags, the files, etc.

Payload:

```js
params = {
    "search":"search",
    "author":"author",
    "filter":"filter",
    "sort":"sort",
    "direction":"direction",
    "limit":"limit",
    "full":"full",
    "config":"config"
}
```

This is equivalent to `huggingface_hub.list_spaces()`.

### GET /api/spaces/{repo_id} or /api/spaces/{repo_id}/revision/{revision}

Get all information for a specific Space.

This is equivalent to `huggingface_hub.space_info(repo_id, revision)`.

## Repo API

The following endpoints manage repository settings like creating and deleting a repository.

### POST /api/repos/create

Create a repository. It's a model repo by default.

Parameters:
- `type`: Type of repo (dataset or space; model by default).
- `name`: Name of repo.
- `organization`: Name of organization (optional).
- `private`: Whether the repo is private.
- `sdk`: When the type is `space` (gradio, docker or static).

Payload:

```js
payload = {
    "type":"model",
    "name":"name",
    "organization": "organization",
    "private":"private",
    "sdk": "sdk"
}
```

This is equivalent to `huggingface_hub.create_repo()`.

### DELETE /api/repos/delete

Delete a repository. It's a model repo by default.

Parameters:
- `type`: Type of repo (dataset or space; model by default).
- `name`: Name of repo.
- `organization`: Name of organization (optional).

Payload:

```js
payload = {
    "type": "model",
    "name": "name",
    "organization": "organization",
}
```

This is equivalent to `huggingface_hub.delete_repo()`.

### PUT /api/repos/{repo_type}/{repo_id}/settings

Update repo visibility.

Payload:

```js
payload = {
    "private": "private",
}
```

This is equivalent to `huggingface_hub.update_repo_settings()`.

### POST /api/repos/move

Move a repository (rename within the same namespace or transfer from user to organization).

Parameters:
- `fromRepo`: repo to rename.
- `toRepo`: new name of the repo.
- `type`: Type of repo (dataset or space; model by default).

Payload:

```js
payload = {
    "fromRepo" : "namespace/repo_name",
    "toRepo" : "namespace2/repo_name2",
    "type": "model",
}
```

This is equivalent to `huggingface_hub.move_repo()`.

## User API

The following endpoint gets information about a user.

### GET /api/whoami-v2

Get username and organizations the user belongs to.

Payload:

```js
headers = { "authorization" : "Bearer $token" }
```

This is equivalent to `huggingface_hub.whoami()`.

## Organization API

The following endpoints handle organizations, like getting an overview of an organization and listing its members and followers.

### GET /api/organizations/{organization_name}/overview

Get the organization overview.

Payload:

```js
headers = { "authorization" : "Bearer $token" }
```

This is equivalent to `huggingface_hub.get_organization_overview()`.

### GET /api/organizations/{organization_name}/members

Get the organization members.

Payload:

```js
headers = { "authorization" : "Bearer $token" }
```

This is equivalent to `huggingface_hub.list_organization_members()`.

### GET /api/organizations/{organization_name}/followers

Get the organization followers.

Payload:

```js
headers = { "authorization" : "Bearer $token" }
```

This is equivalent to `huggingface_hub.list_organization_followers()`.

## Resource Groups API

The following endpoints manage resource groups. Resource groups are an Enterprise feature.
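These endpoints follow the same conventions as the rest of the Hub API. For instance, here is a minimal, illustrative sketch using Python's `requests` to list an organization's resource groups; the organization name and token are placeholders.

```python
import requests

ORG = "my-org"    # placeholder organization name
TOKEN = "hf_xxx"  # placeholder token with sufficient rights

response = requests.get(
    f"https://huggingface.co/api/organizations/{ORG}/resource-groups",
    headers={"authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()
print(response.json())  # see the OpenAPI spec for the exact response schema
```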
### GET /api/organizations/{name}/resource-groups Get all resource groups in an organization that the authenticated user has access to view. ### GET /api/organizations/{name}/resource-groups/{resourceGroupId} Get detailed information about a specific resource group. ### POST /api/organizations/{name}/resource-groups Create a new resource group in the organization. Parameters: - `name`: Name of the resource group (required) - `description`: Description of the resource group (optional) - `users`: List of users and their roles in the resource group (optional) - `repos`: List of repositories (optional) - `autoJoin`: Settings for automatic user joining (optional) Payload: ```js payload = { "name": "name", "description": "description", "users": [ { "user": "username", "role": "admin" // or "write" or "read" } ], "repos": [ { "type": "dataset", "name": "huggingface/repo" } ] } ``` ### PATCH /api/organizations/{name}/resource-groups/{resourceGroupId} Update a resource group's metadata. Parameters: - `name`: New name for the resource group (optional) - `description`: New description for the resource group (optional) Payload: ```js payload = { "name": "name", "description": "description" } ``` ### POST /api/organizations/{name}/resource-groups/{resourceGroupId}/settings Update a resource group's settings. Payload: ```js payload = { "autoJoin": { "enabled": true, "role": "read" // or "write" or "admin" } } ``` ### DELETE /api/organizations/{name}/resource-groups/{resourceGroupId} Delete a resource group. ### POST /api/organizations/{name}/resource-groups/{resourceGroupId}/users Add users to a resource group. Payload: ```js payload = { "users": [ { "user": "username", "role": "admin" // or "write" or "read" } ] } ``` ### DELETE /api/organizations/{name}/resource-groups/{resourceGroupId}/users/{username} Remove a user from a resource group. ### PATCH /api/organizations/{name}/resource-groups/{resourceGroupId}/users/{username} Update a user's role in a resource group. Payload: ```js payload = { "role": "admin" // or "write" or "read" } ``` ### POST /api/(models|spaces|datasets)/{namespace}/{repo}/resource-group Update resource group's repository. Payload: ```js payload = { "resourceGroupId": "6771d4700000000000000000" // (allow `null` for removing the repo's resource group) } ``` ### GET /api/(models|spaces|datasets)/{namespace}/{repo}/resource-group Get detailed repository's resource group ## Paper Pages API The following endpoint gets information about a paper. ### GET /api/papers/{arxiv_id} Get the API equivalent of the Paper page, i.e., metadata like authors, summary, and discussion comments. ### GET /api/arxiv/{arxiv_id}/repos Get all the models, datasets, and Spaces that refer to a paper. ### GET /api/daily_papers Get the daily papers curated by AK and the community. It's the equivalent of [https://huggingface.co/papers](https://huggingface.co/papers). To filter on a particular date, simply pass the date like so: https://huggingface.co/api/daily_papers?date=2025-03-31. ## Collections API Use Collections to group repositories from the Hub (Models, Datasets, Spaces and Papers) on a dedicated page. You can learn more about it in the Collections [guide](./collections). Collections can also be managed using the Python client (see [guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/collections)). ### POST /api/collections Create a new collection on the Hub with a title, a description (optional) and a first item (optional). 
An item is defined by a type (`model`, `dataset`, `space` or `paper`) and an id (repo_id or paper_id on the Hub).

Payload:

```js
payload = {
    "title": "My cool models",
    "namespace": "username_or_org",
    "description": "Here is a shortlist of models I've trained.",
    "item" : {
        "type": "model",
        "id": "username/cool-model",
    },
    "private": false,
}
```

This is equivalent to `huggingface_hub.create_collection()`.

### GET /api/collections/{namespace}/{slug}-{id}

Return information about a collection.

This is equivalent to `huggingface_hub.get_collection()`.

### GET /api/collections

List collections from the Hub, based on some criteria. The supported parameters are:

- `owner` (string): filter collections created by a specific user or organization.
- `item` (string): filter collections containing a specific item. Value must be the item_type and item_id concatenated. Example: `"models/teknium/OpenHermes-2.5-Mistral-7B"`, `"datasets/rajpurkar/squad"` or `"papers/2311.12983"`.
- `sort` (string): sort the returned collections. Supported values are `"lastModified"`, `"trending"` (default) and `"upvotes"`.
- `limit` (int): maximum number of collections per page (100 at most).
- `q` (string): filter based on substrings for titles & descriptions.

If no parameter is set, all collections are returned.

The response is paginated. To get all collections, you must follow the [`Link` header](https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28#link-header).

> [!WARNING]
> When listing collections, the item list per collection is truncated to 4 items maximum. To retrieve all items from a collection, you need to make an additional call using its collection slug.

Payload:

```js
params = {
    "owner": "TheBloke",
    "item": "models/teknium/OpenHermes-2.5-Mistral-7B",
    "sort": "lastModified",
    "limit" : 1,
}
```

This is equivalent to `huggingface_hub.list_collections()`.

### PATCH /api/collections/{namespace}/{slug}-{id}

Update the metadata of a collection on the Hub. You can't add or modify the items of the collection with this method. All fields of the payload are optional.

Payload:

```js
payload = {
    "title": "My cool models",
    "description": "Here is a shortlist of models I've trained.",
    "private": false,
    "position": 0, // position of the collection on your profile
    "theme": "green",
}
```

This is equivalent to `huggingface_hub.update_collection_metadata()`.

### DELETE /api/collections/{namespace}/{slug}-{id}

Delete a collection. This is a non-revertible operation. A deleted collection cannot be restored.

This is equivalent to `huggingface_hub.delete_collection()`.

### POST /api/collections/{namespace}/{slug}-{id}/item

Add an item to a collection.

An item is defined by a type (`model`, `dataset`, `space` or `paper`) and an id (repo_id or paper_id on the Hub). A note can also be attached to the item (optional).

Payload:

```js
payload = {
    "item" : {
        "type": "model",
        "id": "username/cool-model",
    },
    "note": "Here is the model I trained on ...",
}
```

This is equivalent to `huggingface_hub.add_collection_item()`.

### PATCH /api/collections/{namespace}/{slug}-{id}/items/{item_id}

Update an item in a collection. You must know the item object id, which is different from the repo_id/paper_id provided when adding the item to the collection. The `item_id` can be retrieved by fetching the collection.

You can update the note attached to the item or the position of the item in the collection. Both fields are optional.
```js payload = { "position": 0, "note": "Here is the model I trained on ...", } ``` This is equivalent to `huggingface_hub.update_collection_item()`. ### DELETE /api/collections/{namespace}/{slug}-{id}/items/{item_id} Remove an item from a collection. You must know the item object id which is different from the repo_id/paper_id provided when adding the item to the collection. The `item_id` can be retrieved by fetching the collection. This is equivalent to `huggingface_hub.delete_collection_item()`. ### Storage limits https://huggingface.co/docs/hub/storage-limits.md # Storage limits At Hugging Face we aim to provide the AI community with significant volumes of **free storage space for public repositories**. We bill for storage space for **private repositories**, above a free tier (see table below). > [!TIP] > Storage limits and policies apply to both model and dataset repositories on the Hub. We [optimize our infrastructure](https://huggingface.co/blog/xethub-joins-hf) continuously to [scale our storage](https://x.com/julien_c/status/1821540661973160339) for the coming years of growth in AI and Machine learning. We do have mitigations in place to prevent abuse of free public storage, and in general we ask users and organizations to make sure any uploaded large model or dataset is **as useful to the community as possible** (as represented by numbers of likes or downloads, for instance). Finally, upgrade to a paid Organization or User (PRO) account to unlock higher limits. ## Storage plans | Type of account | Public storage | Private storage | | ------------------------ | ------------------------------------------------------------------ | ---------------------------- | | Free user or org | Best-effort\* 🙏 usually up to 5TB for impactful work | 100GB | | PRO | Up to 10TB included\* ✅ grants available for impactful work† | 1TB + pay-as-you-go | | Team Organizations | 12TB base + 1TB per seat ✅ | 1TB per seat + pay-as-you-go | | Enterprise Organizations | 500TB base + 1TB per seat 🏆 | 1TB per seat + pay-as-you-go | 💡 [Team or Enterprise Organizations](https://huggingface.co/enterprise) include 1TB of private storage per seat in the subscription: for example, if your organization has 40 members, then you have 40TB of included private storage. \* We aim to continue providing the AI community with generous free storage space for public repositories. Beyond the first few gigabytes, please use this resource responsibly by uploading content that offers genuine value to other users. If you need substantial storage space, you will need to upgrade to [PRO, Team or Enterprise](https://huggingface.co/pricing). † We work with impactful community members to ensure it is as easy as possible for them to unlock large storage limits. If your models or datasets consistently get many likes and downloads and you hit limits, get in touch. ### Pay-as-you-go price Above the included 1TB (or 1TB per seat) of private storage in [PRO](https://huggingface.co/subscribe/pro) and [Team or Enterprise Organizations](https://huggingface.co/enterprise), private storage is invoiced at **$25/TB/month**, in 1TB increments. See our [billing doc](./billing) for more details. ## Repository limitations and recommendations In parallel to storage limits at the account (user or organization) level, there are some limitations to be aware of when dealing with a large amount of data in a specific repo. 
Given the time it takes to stream the data, getting an upload/push to fail at the end of the process or encountering a degraded experience, be it on hf.co or when working locally, can be very annoying. In the following section, we describe our recommendations on how to best structure your large repos. ### Recommendations We gathered a list of tips and recommendations for structuring your repo. If you are looking for more practical tips, check out [this guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#tips-and-tricks-for-large-uploads) on how to upload large amount of data using the Python library. | Characteristic | Recommended | Tips | | ---------------- | ------------------ | ------------------------------------------------------ | | Repo size | - | contact us for large repos (TBs of data) | | Files per repo | ``` For example: ```bash git log --all -p -S 68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3 commit 5af368743e3f1d81c2a846f7c8d4a028ad9fb021 Date: Sun Apr 28 02:01:18 2024 +0200 Update LayerNorm tensor names to weight and bias diff --git a/model.safetensors b/model.safetensors index a090ee7..e79c80e 100644 --- a/model.safetensors +++ b/model.safetensors @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3 +oid sha256:0bb7a1683251b832d6f4644e523b325adcf485b7193379f5515e6083b5ed174b size 440449768 commit 0a6aa9128b6194f4f3c4db429b6cb4891cdb421b (origin/pr/28) Date: Wed Nov 16 15:15:39 2022 +0000 Adding `safetensors` variant of this model (#15) - Adding `safetensors` variant of this model (18c87780b5e54825a2454d5855a354ad46c5b87e) Co-authored-by: Nicolas Patry diff --git a/model.safetensors b/model.safetensors new file mode 100644 index 0000000..a090ee7 --- /dev/null +++ b/model.safetensors @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3 +size 440449768 commit 18c87780b5e54825a2454d5855a354ad46c5b87e (origin/pr/15) Date: Thu Nov 10 09:35:55 2022 +0000 Adding `safetensors` variant of this model diff --git a/model.safetensors b/model.safetensors new file mode 100644 index 0000000..a090ee7 --- /dev/null +++ b/model.safetensors @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3 +size 440449768 ``` ### Model Card Guidebook https://huggingface.co/docs/hub/model-card-guidebook.md # Model Card Guidebook Model cards are an important documentation and transparency framework for machine learning models. We believe that model cards have the potential to serve as *boundary objects*, a single artefact that is accessible to users who have different backgrounds and goals when interacting with model cards – including developers, students, policymakers, ethicists, those impacted by machine learning models, and other stakeholders. We recognize that developing a single artefact to serve such multifaceted purposes is difficult and requires careful consideration of potential users and use cases. Our goal as part of the Hugging Face science team over the last several months has been to help operationalize model cards towards that vision, taking into account these challenges, both at Hugging Face and in the broader ML community. 
To work towards that goal, it is important to recognize the thoughtful, dedicated efforts that have helped model cards grow into what they are today, from the adoption of model cards as a standard practice at many large organisations to the development of sophisticated tools for hosting and generating model cards. Since model cards were proposed by Mitchell et al. (2018), the landscape of machine learning documentation has expanded and evolved. A plethora of documentation tools and templates for data, models, and ML systems have been proposed and have developed – reflecting the incredible work of hundreds of researchers, impacted community members, advocates, and other stakeholders. Important discussions about the relationship between ML documentation and theories of change in responsible AI have created continued important discussions, and at times, divergence. We also recognize the challenges facing model cards, which in some ways mirror the challenges facing machine learning documentation and responsible AI efforts more generally, and we see opportunities ahead to help shape both model cards and the ecosystems in which they function positively in the months and years ahead. Our work presents a view of where we think model cards stand right now and where they could go in the future, at Hugging Face and beyond. This work is a “snapshot” of the current state of model cards, informed by a landscape analysis of the many ways ML documentation artefacts have been instantiated. It represents one perspective amongst multiple about both the current state and more aspirational visions of model cards. In this blog post, we summarise our work, including a discussion of the broader, growing landscape of ML documentation tools, the diverse audiences for and opinions about model cards, and potential new templates for model card content. We also explore and develop model cards for machine learning models in the context of the Hugging Face Hub, using the Hub’s features to collaboratively create, discuss, and disseminate model cards for ML models. With the launch of this Guidebook, we introduce several new resources and connect together previous work on Model Cards: 1) An updated Model Card template, released in the `huggingface_hub` library [modelcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md), drawing together Model Card work in academia and throughout the industry. 2) An [Annotated Model Card Template](./model-card-annotated), which details how to fill the card out. 3) A [Model Card Creator Tool](https://huggingface.co/spaces/huggingface/Model_Cards_Writing_Tool), to ease card creation without needing to program, and to help teams share the work of different sections. 4) A [User Study](./model-cards-user-studies) on Model Card usage at Hugging Face 5) A [Landscape Analysis and Literature Review](./model-card-landscape-analysis) of the state of the art in model documentation. We also include an [Appendix](./model-card-appendix) with further details from this work. --- **Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### Disk usage on Spaces https://huggingface.co/docs/hub/spaces-storage.md # Disk usage on Spaces Every Space comes with a small amount of disk storage. This disk space is ephemeral, meaning its content will be lost if your Space restarts or is stopped. 
If you need to persist data with a longer lifetime than the Space itself, you can:

- [Subscribe to a persistent storage upgrade](#persistent-storage)
- [Use a dataset as a data store](#dataset-storage)

## Persistent storage

You can upgrade your Space to have access to persistent disk space from the **Settings** tab. Choose the storage tier that suits you to get disk space that persists across restarts of your Space.

Persistent storage acts like traditional disk storage mounted on `/data`. That means you can read from and write to this storage from your Space as you would with a traditional hard drive or SSD.

Persistent disk space can be upgraded to a larger tier at will, though it cannot be downgraded to a smaller tier. If you wish to use a smaller persistent storage tier, you must delete your current (larger) storage first.

If you are using Hugging Face open source libraries, you can make your Space restart faster by setting the environment variable `HF_HOME` to `/data/.huggingface`. Libraries like `transformers`, `diffusers`, `datasets` and others use that environment variable to cache any assets downloaded from the Hugging Face Hub. Setting this variable to the persistent storage path will make sure that cached resources do not need to be re-downloaded when the Space is restarted.

> [!WARNING]
> All data stored in the storage is lost when you delete it.

### Persistent storage specs

Here are the specifications for each of the different upgrade options:

| **Tier** | **Disk space** | **Persistent** | **Monthly Price** |
|------------------|------------------|------------------|----------------------|
| Free tier | 50GB | No (ephemeral) | Free! |
| Small | 20GB | Yes | $5 |
| Medium | 150GB | Yes | $25 |
| Large | 1TB | Yes | $100 |

### Billing

Billing of Spaces is based on hardware usage and is computed by the minute: you get charged for every minute the Space runs on the requested hardware, regardless of whether the Space is used. Persistent storage upgrades are billed until deleted, even when the Space is not running and regardless of Space status or running state.

Additional information about billing can be found in the [dedicated Hub-wide section](./billing).

## Dataset storage

If you need to persist data that lives longer than your Space, you could use a [dataset repo](./datasets).

You can find an example of persistence [here](https://huggingface.co/spaces/Wauplin/space_to_dataset_saver), which uses the [`huggingface_hub` library](https://huggingface.co/docs/huggingface_hub/index) for programmatically uploading files to a dataset repository. This Space example, along with [this guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#scheduled-uploads), will help you decide which solution best fits your data type.

Visit the [`datasets` library](https://huggingface.co/docs/datasets/index) documentation and the [`huggingface_hub` client library](https://huggingface.co/docs/huggingface_hub/index) documentation for more information on how to programmatically interact with dataset repos.

### Audio Dataset

https://huggingface.co/docs/hub/datasets-audio.md

# Audio Dataset

This guide will show you how to configure your dataset repository with audio files. You can find accompanying examples of repositories in this [Audio datasets examples collection](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607).
A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub.

Additional information about your audio files - such as transcriptions - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`/`metadata.parquet`).

Alternatively, audio files can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format.

## Only audio files

If your dataset only consists of one column with audio, you can simply store your audio files at the root:

```plaintext
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
└── 4.wav
```

or in a subdirectory:

```plaintext
my_dataset_repository/
└── audio
    ├── 1.wav
    ├── 2.wav
    ├── 3.wav
    └── 4.wav
```

Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including AIFF, FLAC, MP3, OGG and WAV.

```plaintext
my_dataset_repository/
└── audio
    ├── 1.aiff
    ├── 2.ogg
    ├── 3.mp3
    └── 4.flac
```

If you have several splits, you can put your audio files into directories named accordingly:

```plaintext
my_dataset_repository/
├── train
│   ├── 1.wav
│   └── 2.wav
└── test
    ├── 3.wav
    └── 4.wav
```

See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits.

## Additional columns

If there is additional information you'd like to include about your dataset, like the transcription, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different audio tasks like [text-to-speech](https://huggingface.co/tasks/text-to-speech) or [automatic speech recognition](https://huggingface.co/tasks/automatic-speech-recognition).

```plaintext
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── metadata.csv
```

Your `metadata.csv` file must have a `file_name` column which links audio files with their metadata:

```csv
file_name,animal
1.wav,cat
2.wav,cat
3.wav,dog
4.wav,dog
```

You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`:

```jsonl
{"file_name": "1.wav","animal": "cat"}
{"file_name": "2.wav","animal": "cat"}
{"file_name": "3.wav","animal": "dog"}
{"file_name": "4.wav","animal": "dog"}
```

And for bigger datasets or if you are interested in advanced data retrieval features, you can use a [Parquet](https://parquet.apache.org/) file `metadata.parquet`.

## Relative paths

The metadata file must be located either in the same directory as the audio files it is linked to, or in any parent directory, as in this example:

```plaintext
my_dataset_repository/
└── test
    ├── audio
    │   ├── 1.wav
    │   ├── 2.wav
    │   ├── 3.wav
    │   └── 4.wav
    └── metadata.csv
```

In this case, the `file_name` column must be a full relative path to the audio files, not just the filename:

```csv
file_name,animal
audio/1.wav,cat
audio/2.wav,cat
audio/3.wav,dog
audio/4.wav,dog
```

Metadata files cannot be put in subdirectories of a directory with the audio files. More generally, any column named `file_name` or `*_file_name` should contain the full relative path to the audio files.

In this example, the `test` directory is used to set up the name of the `test` split. See [File names and splits](./datasets-file-names-and-splits) for more information.

## Audio classification

For audio classification datasets, you can also use a simple setup: use directories to name the audio classes.
Store your audio files in a directory structure like:

```plaintext
my_dataset_repository/
├── cat
│   ├── 1.wav
│   └── 2.wav
└── dog
    ├── 3.wav
    └── 4.wav
```

The dataset created with this structure contains two columns: `audio` and `label` (with values `cat` and `dog`).

You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information):

```plaintext
my_dataset_repository/
├── test
│   ├── cat
│   │   └── 2.wav
│   └── dog
│       └── 4.wav
└── train
    ├── cat
    │   └── 1.wav
    └── dog
        └── 3.wav
```

You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header:

```yaml
configs:
  - config_name: default # Name of the dataset subset, if applicable.
    drop_labels: true
```

## Large scale datasets

### WebDataset format

The [WebDataset](./datasets-webdataset) format is well suited for large scale audio datasets (see [AlienKevin/sbs_cantonese](https://huggingface.co/datasets/AlienKevin/sbs_cantonese) for example). It consists of TAR archives containing audio files and their metadata and is optimized for streaming. It is useful if you have a large number of audio files and want streaming data loaders for large scale training.

```plaintext
my_dataset_repository/
├── train-0000.tar
├── train-0001.tar
├── ...
└── train-1023.tar
```

To make a WebDataset TAR archive, create a directory containing the audio files and metadata files to be archived and create the TAR archive using e.g. the `tar` command. Archives are generally around 1GB each. Make sure each audio file and metadata pair share the same file prefix, for example:

```plaintext
train-0000/
├── 000.flac
├── 000.json
├── 001.flac
├── 001.json
├── ...
├── 999.flac
└── 999.json
```

Note that for user convenience and to enable the [Dataset Viewer](./data-studio), every dataset hosted on the Hub is automatically converted to Parquet format, up to 5GB. Read more about it in the [Parquet format](./data-studio#access-the-parquet-files) documentation.

### Parquet format

Instead of uploading the audio files and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file. This is useful if you have a large number of audio files, if you want to embed multiple audio columns, or if you want to store additional information about the audio in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

```plaintext
my_dataset_repository/
└── train.parquet
```

Parquet files with audio data can be created using `pandas` or the `datasets` library. To create Parquet files with audio data in `pandas`, you can use [pandas-audio-methods](https://github.com/lhoestq/pandas-audio-methods) and `df.to_parquet()`. In `datasets`, you can set the column type to `Audio()` and use the `ds.to_parquet(...)` method or `ds.push_to_hub(...)`. You can find a guide on loading audio datasets in `datasets` [here](/docs/datasets/audio_load).

Alternatively, you can manually set the audio type of Parquet files created using other tools. First, make sure your audio columns are of type _struct_, with a binary field `"bytes"` for the audio data and a string field `"path"` for the audio file name or path.
Then you should specify the feature types of the columns directly in YAML in the README header, for example: ```yaml dataset_info: features: - name: audio dtype: audio - name: caption dtype: string ``` Note that Parquet is recommended for small audio files (<1MB per audio file) and small row groups (100 rows per row group, which is what `datasets` uses for audio). For larger audio files it is recommended to use the WebDataset format, or to share the original audio files (optionally with metadata files). ### Use AI Models Locally https://huggingface.co/docs/hub/local-apps.md # Use AI Models Locally You can run AI models from the Hub locally on your machine. This means that you can benefit from these advantages: - **Privacy**: You won't be sending your data to a remote server. - **Speed**: Your hardware is the limiting factor, not the server or connection speed. - **Control**: You can configure models to your liking. - **Cost**: You can run models locally without paying for an API provider. ## How to Use Local Apps Local apps are applications that can run Hugging Face models directly on your machine. To get started: 1. **Enable local apps** in your [Local Apps settings](https://huggingface.co/settings/local-apps). ![Local Apps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local-apps/settings.png) 1. **Choose a supported model** from the Hub by searching for it. You can filter by `app` in the `Other` section of the navigation bar: ![Local Apps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local-apps/search_llamacpp.png) 3. **Select the local app** from the "Use this model" dropdown on the model page. ![Local Apps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local-apps/button.png) 4. **Copy and run** the provided command in your terminal. ![Local Apps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local-apps/command.png) ## Supported Local Apps The best way to check if a local app is supported is to go to the Local Apps settings and see if the app is listed. Here is a quick overview of some of the most popular local apps: > [!TIP] > 👨‍💻 To use these local apps, copy the snippets from the model card as above. > > 👷 If you're building a local app, you can learn about integrating with the Hub in [this guide](https://huggingface.co/docs/hub/en/models-adding-libraries). ### Llama.cpp Llama.cpp is a high-performance C/C++ library for running LLMs locally with optimized inference across lots of different hardware, including CPUs, CUDA and Metal. **Advantages:** - Extremely fast performance for CPU-based models on multiple CPU families - Low resource usage - Multiple interface options (CLI, server, Python library) - Hardware-optimized for CPUs and GPUs To use Llama.cpp, navigate to the model card and click "Use this model" and copy the command. ```sh # Load and run the model: ./llama-server -hf unsloth/gpt-oss-20b-GGUF:Q4_K_M ``` ### Ollama Ollama is an application that lets you run large language models locally on your computer with a simple command-line interface. **Advantages:** - Easy installation and setup - Direct integration with Hugging Face Hub To use Ollama, navigate to the model card and click "Use this model" and copy the command. ```sh ollama run hf.co/unsloth/gpt-oss-20b-GGUF:Q4_K_M ``` ### Jan Jan is an open-source ChatGPT alternative that runs entirely offline with a user-friendly interface. 
**Advantages:** - User-friendly GUI - Chat with documents and files - OpenAI-compatible API server, so you can run models and use them from other apps To use Jan, navigate to the model card and click "Use this model". Jan will open and you can start chatting through the interface. ### LM Studio LM Studio is a desktop application that provides an easy way to download, run, and experiment with local LLMs. **Advantages:** - Intuitive graphical interface - Built-in model browser - Developer tools and APIs - Free for personal and commercial use Navigate to the model card and click "Use this model". LM Studio will open and you can start chatting through the interface. ### Gradio Spaces https://huggingface.co/docs/hub/spaces-sdks-gradio.md # Gradio Spaces **Gradio** provides an easy and intuitive interface for running a model from a list of inputs and displaying the outputs in formats such as images, audio, 3D objects, and more. Gradio now even has a [Plot output component](https://gradio.app/docs/#o_plot) for creating data visualizations with Matplotlib, Bokeh, and Plotly! For more details, take a look at the [Getting started](https://gradio.app/getting_started/) guide from the Gradio team. Selecting **Gradio** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Space with the latest version of Gradio by setting the `sdk` property to `gradio` in your `README.md` file's YAML block. If you'd like to change the Gradio version, you can edit the `sdk_version` property. Visit the [Gradio documentation](https://gradio.app/docs/) to learn all about its features and check out the [Gradio Guides](https://gradio.app/guides/) for some handy tutorials to help you get started! ## Your First Gradio Space: Hot Dog Classifier In the following sections, you'll learn the basics of creating a Space, configuring it, and deploying your code to it. We'll create a **Hot Dog Classifier** Space with Gradio that'll be used to demo the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which can detect whether a given picture contains a hot dog 🌭 You can find a completed version of this hosted at [NimaBoscarino/hotdog-gradio](https://huggingface.co/spaces/NimaBoscarino/hotdog-gradio). ## Create a new Gradio Space We'll start by [creating a brand new Space](https://huggingface.co/new-space) and choosing **Gradio** as our SDK. Hugging Face Spaces are Git repositories, meaning that you can work on your Space incrementally (and collaboratively) by pushing commits. Take a look at the [Getting Started with Repositories](./repositories-getting-started) guide to learn about how you can create and edit files before continuing. ## Add the dependencies For the **Hot Dog Classifier** we'll be using a [🤗 Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to use the model, so we need to start by installing a few dependencies. This can be done by creating a **requirements.txt** file in our repository, and adding the following dependencies to it: ``` transformers torch ``` The Spaces runtime will handle installing the dependencies! 
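If you'd like to sanity-check the model before building the interface, a minimal local test could look like the sketch below. It assumes the same `transformers` pipeline the app in the next section uses; `my_photo.jpg` is a placeholder path for any local image.

```python
from transformers import pipeline

# Same pipeline the Space will use; the model is downloaded on first run.
classifier = pipeline(task="image-classification", model="julien-c/hotdog-not-hotdog")

# "my_photo.jpg" is a placeholder path to any local image you want to test with.
print(classifier("my_photo.jpg"))
```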
## Create the Gradio interface

To create the Gradio app, make a new file in the repository called **app.py**, and add the following code:

```python
import gradio as gr
from transformers import pipeline

pipeline = pipeline(task="image-classification", model="julien-c/hotdog-not-hotdog")

def predict(input_img):
    predictions = pipeline(input_img)
    return input_img, {p["label"]: p["score"] for p in predictions}

gradio_app = gr.Interface(
    predict,
    inputs=gr.Image(label="Select hot dog candidate", sources=['upload', 'webcam'], type="pil"),
    outputs=[gr.Image(label="Processed Image"), gr.Label(label="Result", num_top_classes=2)],
    title="Hot Dog? Or Not?",
)

if __name__ == "__main__":
    gradio_app.launch()
```

This Python script uses a [🤗 Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to load the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which is used by the Gradio interface. The Gradio app will expect you to upload an image, which it'll then classify as *hot dog* or *not hot dog*. Once you've saved the code to the **app.py** file, visit the **App** tab to see your app in action!

## Embed Gradio Spaces on other webpages

You can embed a Gradio Space on other webpages by using either Web Components or the HTML `<iframe>` tag. Check out [our documentation](./spaces-embed) or the [Gradio documentation](https://gradio.app/sharing_your_app/#embedding-hosted-spaces) for more details.

### Hugging Face Dataset Upload Decision Guide

https://huggingface.co/docs/hub/datasets-upload-guide-llm.md

# Hugging Face Dataset Upload Decision Guide

> [!TIP]
> This guide is primarily designed for LLMs to help users upload datasets to the Hugging Face Hub in the most compatible format. Users can also reference this guide to understand the upload process and best practices.
>
> Decision guide for uploading datasets to Hugging Face Hub. Optimized for Dataset Viewer compatibility and integration with the Hugging Face ecosystem.

## Overview

Your goal is to help a user upload a dataset to the Hugging Face Hub. Ideally, the dataset should be compatible with the Dataset Viewer (and thus the `load_dataset` function) to ensure easy access and usability. You should aim to meet the following criteria:

| **Criteria** | Description | Priority |
| ---------------------------------- | -------------------------------------------------------------- | -------- |
| **Respect repository limits** | Ensure the dataset adheres to Hugging Face's storage limits for file sizes, repository sizes, and file counts. See the Critical Constraints section below for specific limits. | Required |
| **Use hub-compatible formats** | Use Parquet format when possible (best compression, rich typing, large dataset support).
For smaller datasets (10k)** | Use upload_large_folder to avoid Git limitations | `api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset")` | | **Streaming large media** | WebDataset format for efficient streaming | Create .tar shards, then `upload_large_folder()` | | **Scientific data (HDF5, NetCDF)** | Convert to Parquet with Array features | See [Scientific Data](#scientific-data) section | | **Custom/proprietary formats** | Document thoroughly if conversion impossible | `upload_large_folder()` with comprehensive README | ## Upload Workflow 0. ✓ **Gather dataset information** (if needed): - What type of data? (images, text, audio, CSV, etc.) - How is it organized? (folder structure, single file, multiple files) - What's the approximate size? - What format are the files in? - Any special requirements? (e.g., streaming, private access) - Check for existing README or documentation files that describe the dataset 1. ✓ **Authenticate**: - CLI: `hf auth login` - Or use token: `HfApi(token="hf_...")` or set `HF_TOKEN` environment variable 2. ✓ **Identify your data type**: Check the [Quick Reference](#quick-reference-by-data-type) table above 3. ✓ **Choose upload method**: - **Small files (100GB or >10k files - **Custom formats**: Convert to hub-compatible format if possible, otherwise document thoroughly 4. ✓ **Test locally** (if using built-in loader): ```python # Validate your dataset loads correctly before uploading dataset = load_dataset("loader_name", data_dir="./your_data") print(dataset) ``` 5. ✓ **Upload to Hub**: ```python # Basic upload dataset.push_to_hub("username/dataset-name") # With options for large datasets dataset.push_to_hub( "username/dataset-name", max_shard_size="5GB", # Control memory usage private=True # For private datasets ) ``` 6. ✓ **Verify your upload**: - Check Dataset Viewer: `https://huggingface.co/datasets/username/dataset-name` - Test loading: `load_dataset("username/dataset-name")` - If viewer shows errors, check the [Troubleshooting](#common-issues--solutions) section ## Common Conversion Patterns When built-in loaders don't match your data structure, use the datasets library as a compatibility layer. Convert your data to a Dataset object, then use `push_to_hub()` for maximum flexibility and Dataset Viewer compatibility. ### From DataFrames If you already have your data working in pandas, polars, or other dataframe libraries, you can convert directly: ```python # From pandas DataFrame import pandas as pd from datasets import Dataset df = pd.read_csv("your_data.csv") dataset = Dataset.from_pandas(df) dataset.push_to_hub("username/dataset-name") # From polars DataFrame (direct method) import polars as pl from datasets import Dataset df = pl.read_csv("your_data.csv") dataset = Dataset.from_polars(df) # Direct conversion dataset.push_to_hub("username/dataset-name") # From PyArrow Table (useful for scientific data) import pyarrow as pa from datasets import Dataset # If you have a PyArrow table table = pa.table({'data': [1, 2, 3], 'labels': ['a', 'b', 'c']}) dataset = Dataset(table) dataset.push_to_hub("username/dataset-name") # For Spark/Dask dataframes, see https://huggingface.co/docs/hub/datasets-libraries ``` ## Custom Format Conversion When built-in loaders don't match your data format, convert to Dataset objects following these principles: ### Design Principles **1. 
Prefer wide/flat structures over joins** - Denormalize relational data into single rows for better usability - Include all relevant information in each example - Lean towards bigger but more usable data - Hugging Face's infrastructure uses advanced deduplication (XetHub) and Parquet optimizations to handle redundancy efficiently **2. Use configs for logical dataset variations** - Beyond train/test/val splits, use configs for different subsets or views of your data - Each config can have different features or data organization - Example: language-specific configs, task-specific views, or data modalities ### Conversion Methods **Small datasets (fits in memory) - use `Dataset.from_dict()`**: ```python # Parse your custom format into a dictionary data_dict = { "text": ["example1", "example2"], "label": ["positive", "negative"], "score": [0.9, 0.2] } # Create dataset with appropriate features from datasets import Dataset, Features, Value, ClassLabel features = Features({ 'text': Value('string'), 'label': ClassLabel(names=['negative', 'positive']), 'score': Value('float32') }) dataset = Dataset.from_dict(data_dict, features=features) dataset.push_to_hub("username/dataset") ``` **Large datasets (memory-efficient) - use `Dataset.from_generator()`**: ```python def data_generator(): # Parse your custom format progressively for item in parse_large_file("data.custom"): yield { "text": item["content"], "label": item["category"], "embedding": item["vector"] } # Specify features for Dataset Viewer compatibility from datasets import Features, Value, ClassLabel, List features = Features({ 'text': Value('string'), 'label': ClassLabel(names=['cat1', 'cat2', 'cat3']), 'embedding': List(feature=Value('float32'), length=768) }) dataset = Dataset.from_generator(data_generator, features=features) dataset.push_to_hub("username/dataset", max_shard_size="1GB") ``` **Tip**: For large datasets, test with a subset first by adding a limit to your generator or using `.select(range(100))` after creation. ### Using Configs for Dataset Variations ```python # Push different configurations of your dataset dataset_en = Dataset.from_dict(english_data, features=features) dataset_en.push_to_hub("username/multilingual-dataset", config_name="english") dataset_fr = Dataset.from_dict(french_data, features=features) dataset_fr.push_to_hub("username/multilingual-dataset", config_name="french") # Users can then load specific configs dataset = load_dataset("username/multilingual-dataset", "english") ``` ### Multi-modal Examples **Text + Audio (speech recognition)**: ```python def speech_generator(): for audio_file in Path("audio/").glob("*.wav"): transcript_file = audio_file.with_suffix(".txt") yield { "audio": str(audio_file), "text": transcript_file.read_text().strip(), "speaker_id": audio_file.stem.split("_")[0] } features = Features({ 'audio': Audio(sampling_rate=16000), 'text': Value('string'), 'speaker_id': Value('string') }) dataset = Dataset.from_generator(speech_generator, features=features) dataset.push_to_hub("username/speech-dataset") ``` **Multiple images per example**: ```python # Before/after images, medical imaging, etc. 
data = { "image_before": ["img1_before.jpg", "img2_before.jpg"], "image_after": ["img1_after.jpg", "img2_after.jpg"], "treatment": ["method_A", "method_B"] } features = Features({ 'image_before': Image(), 'image_after': Image(), 'treatment': ClassLabel(names=['method_A', 'method_B']) }) dataset = Dataset.from_dict(data, features=features) dataset.push_to_hub("username/before-after-images") ``` **Note**: For text + images, consider using ImageFolder with metadata.csv which handles this automatically. ## Essential Features Features define the schema and data types for your dataset columns. Specifying correct features ensures: - Proper data handling and type conversion - Dataset Viewer functionality (e.g., image/audio previews) - Efficient storage and loading - Clear documentation of your data structure For complete feature documentation, see: [Dataset Features](https://huggingface.co/docs/datasets/about_dataset_features) ### Feature Types Overview **Basic Types**: - `Value`: Scalar values - `string`, `int64`, `float32`, `bool`, `binary`, and other numeric types - `ClassLabel`: Categorical data with named classes - `Sequence`: Lists of any feature type - `LargeList`: For very large lists **Media Types** (enable Dataset Viewer previews): - `Image()`: Handles various image formats, returns PIL Image objects - `Audio(sampling_rate=16000)`: Audio with array data and optional sampling rate - `Video()`: Video files - `Pdf()`: PDF documents with text extraction **Array Types** (for tensors/scientific data): - `Array2D`, `Array3D`, `Array4D`, `Array5D`: Fixed or variable-length arrays - Example: `Array2D(shape=(224, 224), dtype='float32')` - First dimension can be `None` for variable length **Translation Types**: - `Translation`: For translation pairs with fixed languages - `TranslationVariableLanguages`: For translations with varying language pairs **Note**: New feature types are added regularly. Check the documentation for the latest additions. ## Upload Methods **Dataset objects (use push_to_hub)**: Use when you've loaded/converted data using the datasets library ```python dataset.push_to_hub("username/dataset", max_shard_size="5GB") ``` **Pre-existing files (use upload_large_folder)**: Use when you have hub-compatible files (e.g., Parquet files) already prepared and organized ```python from huggingface_hub import HfApi api = HfApi() api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset", num_workers=16) ``` **Important**: Before using `upload_large_folder`, verify the files meet repository limits: - Check folder structure if you have file access: ensure no folder contains >10k files - Ask the user to confirm: "Are your files in a hub-compatible format (Parquet/CSV/JSON) and organized appropriately?" 
- For non-standard formats, consider converting to Dataset objects first to ensure compatibility ## Validation **Consider small reformatting**: If data is close to a built-in loader format, suggest minor changes: - Rename columns (e.g., 'filename' → 'file_name' for ImageFolder) - Reorganize folders (e.g., move images into class subfolders) - Rename files to match expected patterns (e.g., 'data.csv' → 'train.csv') **Pre-upload**: - Test locally: `load_dataset("imagefolder", data_dir="./data")` - Verify features work correctly: ```python # Test first example print(dataset[0]) # For images: verify they load if 'image' in dataset.features: dataset[0]['image'] # Should return PIL Image # Check dataset size before upload print(f"Size: {len(dataset)} examples") ``` - Check metadata.csv has 'file_name' column - Verify relative paths, no leading slashes - Ensure no folder >10k files **Post-upload**: - Check viewer: `https://huggingface.co/datasets/username/dataset` - Test loading: `load_dataset("username/dataset")` - Verify features preserved: `print(dataset.features)` ## Common Issues → Solutions | Issue | Solution | | -------------------------- | ------------------------------------ | | "Repository not found" | Run `hf auth login` | | Memory errors | Use `max_shard_size="500MB"` | | Dataset viewer not working | Wait 5-10min, check README.md config | | Timeout errors | Use `multi_commits=True` | | Files >50GB | Split into smaller files | | "File not found" | Use relative paths in metadata | ## Dataset Viewer Configuration **Note**: This section is primarily for datasets uploaded directly to the Hub (via UI or `upload_large_folder`). Datasets uploaded with `push_to_hub()` typically configure the viewer automatically. ### When automatic detection works The Dataset Viewer automatically detects standard structures: - Files named: `train.csv`, `test.json`, `validation.parquet` - Directories named: `train/`, `test/`, `validation/` - Split names with delimiters: `test-data.csv` ✓ (not `testdata.csv` ✗) ### Manual configuration For custom structures, add YAML to your README.md: ```yaml --- configs: - config_name: default # Required even for single config! data_files: - split: train path: "data/train/*.parquet" - split: test path: "data/test/*.parquet" --- ``` Multiple configurations example: ```yaml --- configs: - config_name: english data_files: "en/*.parquet" - config_name: french data_files: "fr/*.parquet" --- ``` ### Common viewer issues - **No viewer after upload**: Wait 5-10 minutes for processing - **"Config names error"**: Add `config_name` field (required!) 
- **Files not detected**: Check naming patterns (needs delimiters) - **Viewer disabled**: Remove `viewer: false` from README YAML ## Quick Templates ```python # ImageFolder with metadata dataset = load_dataset("imagefolder", data_dir="./images") dataset.push_to_hub("username/dataset") # Memory-efficient upload dataset.push_to_hub("username/dataset", max_shard_size="500MB") # Multiple CSV files dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'}) dataset.push_to_hub("username/dataset") ``` ## Documentation **Core docs**: [Adding datasets](https://huggingface.co/docs/hub/datasets-adding) | [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer) | [Storage limits](https://huggingface.co/docs/hub/storage-limits) | [Upload guide](https://huggingface.co/docs/datasets/upload_dataset) ## Dataset Cards Remind users to add a dataset card (README.md) with: - Dataset description and usage - License information - Citation details See [Dataset Cards guide](https://huggingface.co/docs/hub/datasets-cards) for details. --- ## Appendix: Special Cases ### WebDataset Structure For streaming large media datasets: - Create 1-5GB tar shards - Consistent internal structure - Upload with `upload_large_folder` ### Scientific Data - HDF5/NetCDF → Convert to Parquet with Array features - Time series → Array2D(shape=(None, n)) - Complex metadata → Store as JSON strings ### Community Resources For very specialized or bespoke formats: - Search the Hub for similar datasets: `https://huggingface.co/datasets` - Ask for advice on the [Hugging Face Forums](https://discuss.huggingface.co/c/datasets/10) - Join the [Hugging Face Discord](https://hf.co/join/discord) for real-time help - Many domain-specific formats already have examples on the Hub ### Use Ollama with any GGUF Model on Hugging Face Hub https://huggingface.co/docs/hub/ollama.md # Use Ollama with any GGUF Model on Hugging Face Hub ![cover](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ollama/cover.png) 🆕 You can now also run private GGUFs from the Hugging Face Hub. Ollama is an application based on llama.cpp to interact with LLMs directly through your computer. You can use any GGUF quants created by the community ([bartowski](https://huggingface.co/bartowski), [MaziyarPanahi](https://huggingface.co/MaziyarPanahi) and [many more](https://huggingface.co/models?pipeline_tag=text-generation&library=gguf&sort=trending)) on Hugging Face directly with Ollama, without creating a new `Modelfile`. At the time of writing there are 45K public GGUF checkpoints on the Hub, you can run any of them with a single `ollama run` command. We also provide customisations like choosing quantization type, system prompt and more to improve your overall experience. Getting started is as simple as: 1. Enable `ollama` under your [Local Apps settings](https://huggingface.co/settings/local-apps). 2. On a model page, choose `ollama` from `Use this model` dropdown. For example: [bartowski/Llama-3.2-1B-Instruct-GGUF](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF). The snippet would be in format: ```sh ollama run hf.co/{username}/{repository} ``` Please note that you can use both `hf.co` and `huggingface.co` as the domain name. 
Here are some models you can try: ```sh ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF ollama run hf.co/arcee-ai/SuperNova-Medius-GGUF ollama run hf.co/bartowski/Humanish-LLama3-8B-Instruct-GGUF ``` ## Custom Quantization By default, the `Q4_K_M` quantization scheme is used, when it's present inside the model repo. If not, we default to picking one reasonable quant type present inside the repo. To select a different scheme, simply: 1. From `Files and versions` tab on a model page, open GGUF viewer on a particular GGUF file. 2. Choose `ollama` from `Use this model` dropdown. The snippet would be in format (quantization tag added): ```sh ollama run hf.co/{username}/{repository}:{quantization} ``` For example: ```sh ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 # the quantization name is case-insensitive, this will also work ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:iq3_m # you can also directly use the full filename as a tag ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-IQ3_M.gguf ``` ## Custom Chat Template and Parameters By default, a template will be selected automatically from a list of commonly used templates. It will be selected based on the built-in `tokenizer.chat_template` metadata stored inside the GGUF file. If your GGUF file doesn't have a built-in template or if you want to customize your chat template, you can create a new file called `template` in the repository. The template must be a Go template, not a Jinja template. Here's an example: ``` {{ if .System }} {{ .System }} {{ end }}{{ if .Prompt }} {{ .Prompt }} {{ end }} {{ .Response }} ``` To know more about the Go template format, please refer to [this documentation](https://github.com/ollama/ollama/blob/main/docs/template.md) You can optionally configure a system prompt by putting it into a new file named `system` in the repository. To change sampling parameters, create a file named `params` in the repository. The file must be in JSON format. For the list of all available parameters, please refer to [this documentation](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter). ## Run Private GGUFs from the Hugging Face Hub You can run private GGUFs from your personal account or from an associated organisation account in two simple steps: 1. Copy your Ollama SSH key, you can do so via: `cat ~/.ollama/id_ed25519.pub | pbcopy` 2. Add the corresponding key to your Hugging Face account by going to [your account settings](https://huggingface.co/settings/keys) and clicking on `Add new SSH key`. 3. That's it! You can now run private GGUFs from the Hugging Face Hub: `ollama run hf.co/{username}/{repository}`. ## References - https://github.com/ollama/ollama/blob/main/docs/README.md - https://huggingface.co/docs/hub/en/gguf ### User access tokens https://huggingface.co/docs/hub/security-tokens.md # User access tokens ## What are User Access Tokens? User Access Tokens are the preferred way to authenticate an application or notebook to Hugging Face services. You can manage your access tokens in your [settings](https://huggingface.co/settings/tokens). 
Access tokens allow applications and notebooks to perform specific actions specified by the scope of the roles shown in the following: - `fine-grained`: tokens with this role can be used to provide fine-grained access to specific resources, such as a specific model or models in a specific organization. This type of token is useful in production environments, as you can use your own token without sharing access to all your resources. - `read`: tokens with this role can only be used to provide read access to repositories you could read. That includes public and private repositories that you, or an organization you're a member of, own. Use this role if you only need to read content from the Hugging Face Hub (e.g. when downloading private models or doing inference). - `write`: tokens with this role additionally grant write access to the repositories you have write access to. Use this token if you need to create or push content to a repository (e.g., when training a model or modifying a model card). Note that Organization API Tokens have been deprecated: If you are a member of an organization with read/write/admin role, then your User Access Tokens will be able to read/write the resources according to the token permission (read/write) and organization membership (read/write/admin). ## How to manage User Access Tokens? To create an access token, go to your settings, then click on the [Access Tokens tab](https://huggingface.co/settings/tokens). Click on the **New token** button to create a new User Access Token. Select a role and a name for your token and voilà - you're ready to go! You can delete and refresh User Access Tokens by clicking on the **Manage** button. ## How to use User Access Tokens? There are plenty of ways to use a User Access Token to access the Hugging Face Hub, granting you the flexibility you need to build awesome apps on top of it. User Access Tokens can be: - used **in place of a password** to access the Hugging Face Hub with git or with basic authentication. - passed as a **bearer token** when calling [Inference Providers](https://huggingface.co/docs/inference-providers). - used in the Hugging Face Python libraries, such as `transformers` or `datasets`: ```python from transformers import AutoModel access_token = "hf_..." model = AutoModel.from_pretrained("private/model", token=access_token) ``` > [!WARNING] > Try not to leak your token! Though you can always rotate it, anyone will be able to read or write your private repos in the meantime which is 💩 ### Best practices We recommend you create one access token per app or usage. For instance, you could have a separate token for: * A local machine. * A Colab notebook. * An awesome custom inference server. This way, you can invalidate one token without impacting your other usages. We also recommend using only fine-grained tokens for production usage. The impact, if leaked, will be reduced, and they can be shared among your organization without impacting your account. For example, if your production application needs read access to a gated model, a member of your organization can request access to the model and then create a fine-grained token with read access to that model. This token can then be used in your production application without giving it access to all your private models. 
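As a minimal sketch of that best practice, assuming the fine-grained token is exposed to your application as the `HF_TOKEN` environment variable and that `your-org/your-gated-model` is a placeholder for the gated repository the token is scoped to:

```python
import os

from huggingface_hub import snapshot_download

# Read the fine-grained token from the environment instead of hard-coding it in the source.
token = os.environ["HF_TOKEN"]

# "your-org/your-gated-model" is a placeholder; use the repository your token is scoped to.
local_path = snapshot_download(repo_id="your-org/your-gated-model", token=token)
print(local_path)
```

Rotating or revoking that token then only affects this one application, not your other usages.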
### Using SpanMarker at Hugging Face https://huggingface.co/docs/hub/span_marker.md # Using SpanMarker at Hugging Face [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) is a framework for training powerful Named Entity Recognition models using familiar encoders such as BERT, RoBERTa and DeBERTa. Tightly implemented on top of the 🤗 Transformers library, SpanMarker can take good advantage of it. As a result, SpanMarker will be intuitive to use for anyone familiar with Transformers. ## Exploring SpanMarker in the Hub You can find `span_marker` models by filtering at the left of the [models page](https://huggingface.co/models?library=span-marker). All models on the Hub come with these useful features: 1. An automatically generated model card with a brief description. 2. An interactive widget you can use to play with the model directly in the browser. 3. An Inference API that allows you to make inference requests. ## Installation To get started, you can follow the [SpanMarker installation guide](https://tomaarsen.github.io/SpanMarkerNER/install.html). You can also use the following one-line install through pip: ``` pip install -U span_marker ``` ## Using existing models All `span_marker` models can easily be loaded from the Hub. ```py from span_marker import SpanMarkerModel model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super") ``` Once loaded, you can use [`SpanMarkerModel.predict`](https://tomaarsen.github.io/SpanMarkerNER/api/span_marker.modeling.html#span_marker.modeling.SpanMarkerModel.predict) to perform inference. ```py model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.") ``` ```json [ {"span": "Amelia Earhart", "label": "person-other", "score": 0.7629689574241638, "char_start_index": 0, "char_end_index": 14}, {"span": "Lockheed Vega 5B", "label": "product-airplane", "score": 0.9833564758300781, "char_start_index": 38, "char_end_index": 54}, {"span": "Atlantic", "label": "location-bodiesofwater", "score": 0.7621214389801025, "char_start_index": 66, "char_end_index": 74}, {"span": "Paris", "label": "location-GPE", "score": 0.9807717204093933, "char_start_index": 78, "char_end_index": 83} ] ``` If you want to load a specific SpanMarker model, you can click `Use in SpanMarker` and you will be given a working snippet! ## Additional resources * SpanMarker [repository](https://github.com/tomaarsen/SpanMarkerNER) * SpanMarker [docs](https://tomaarsen.github.io/SpanMarkerNER) ### Spaces Changelog https://huggingface.co/docs/hub/spaces-changelog.md # Spaces Changelog ## [2025-04-30] - Deprecate Streamlit SDK - Streamlit is no longer provided as a default built-in SDK option. Streamlit applications are now created using the Docker template. ## [2023-07-28] - Upstream Streamlit frontend for `>=1.23.0` - Streamlit SDK uses the upstream packages published on PyPI for `>=1.23.0`, so the newly released versions are available from the day of release. ## [2023-05-30] - Add support for Streamlit 1.23.x and 1.24.0 - Added support for Streamlit `1.23.0`, `1.23.1`, and `1.24.0`. - Since `1.23.0`, the Streamlit frontend has been changed to the upstream version from the HF-customized one. ## [2023-05-30] - Add support for Streamlit 1.22.0 - Added support for Streamlit `1.22.0`. ## [2023-05-15] - The default Streamlit version - The default Streamlit version is set as `1.21.0`. 
## [2023-04-12] - Add support for Streamlit up to 1.19.0

- Support for `1.16.0`, `1.17.0`, `1.18.1`, and `1.19.0` is added and the default SDK version is set as `1.19.0`.

## [2023-03-28] - Bug fix

- Fixed a bug causing inability to scroll on iframe-embedded or directly accessed Streamlit apps, which was reported at https://discuss.huggingface.co/t/how-to-add-scroll-bars-to-a-streamlit-app-using-space-direct-embed-url/34101. The patch has been applied to Streamlit>=1.18.1.

## [2022-12-15] - Spaces supports Docker Containers

- Read more: [Docker Spaces](./spaces-sdks-docker)

## [2022-12-14] - Ability to set a custom `sleep` time

- Read more: [Spaces sleep time](./spaces-gpus#sleep-time)

## [2022-12-07] - Add support for Streamlit 1.15

- Announcement: https://twitter.com/osanseviero/status/1600881584214638592.

## [2022-06-07] - Add support for Streamlit 1.10.0

- The new multipage apps feature is working out-of-the-box on Spaces.
- Streamlit blog post: https://blog.streamlit.io/introducing-multipage-apps.

## [2022-05-23] - Spaces speedup and reactive system theme

- All Spaces using Gradio 3+ and Streamlit 1.x.x have a significant speedup in loading.
- System theme is now reactive inside the app. If the user changes to dark mode, it automatically changes.

## [2022-05-21] - Default Debian packages and Factory Reboot

- Spaces environments now come with pre-installed popular packages (`ffmpeg`, `libsndfile1`, etc.).
- This way, most of the time, you don't need to specify any additional package for your Space to work properly.
- The `packages.txt` file can still be used if needed.
- Added a factory reboot button to Spaces, which allows users to do a full restart, avoiding cached requirements and freeing GPU memory.

## [2022-05-17] - Add support for Streamlit 1.9.0

- All `1.x.0` versions are now supported (up to `1.9.0`).

## [2022-05-16] - Gradio 3 is out!

- This is the default version when creating a new Space; don't hesitate to [check it out](https://huggingface.co/blog/gradio-blocks).

## [2022-03-04] - SDK version lock

- The `sdk_version` field is now automatically pre-filled at Space creation time.
- It ensures that your Space stays on the same SDK version after an update.

## [2022-03-02] - Gradio version pinning

- The `sdk_version` configuration field now works with the Gradio SDK.

## [2022-02-21] - Python versions

- You can specify the version of Python that you want your Space to run on.
- Only Python 3 versions are supported.

## [2022-01-24] - Automatic model and dataset linking from Spaces

- We attempt to automatically extract model and dataset repo ids used in your code.
- You can always manually define them with `models` and `datasets` in your YAML.
## [2021-10-20] - Add support for Streamlit 1.0 - We now support all versions between 0.79.0 and 1.0.0 ## [2021-09-07] - Streamlit version pinning - You can now choose which version of Streamlit will be installed within your Space ## [2021-09-06] - Upgrade Streamlit to `0.84.2` - Supporting Session State API - [Streamlit changelog](https://github.com/streamlit/streamlit/releases/tag/0.84.0) ## [2021-08-10] - Upgrade Streamlit to `0.83.0` - [Streamlit changelog](https://github.com/streamlit/streamlit/releases/tag/0.83.0) ## [2021-08-04] - Debian packages - You can now add your `apt-get` dependencies into a `packages.txt` file ## [2021-08-03] - Streamlit components - Add support for [Streamlit components](https://streamlit.io/components) ## [2021-08-03] - Flax/Jax GPU improvements - For GPU-activated Spaces, make sure Flax / Jax runs smoothly on GPU ## [2021-08-02] - Upgrade Streamlit to `0.82.0` - [Streamlit changelog](https://github.com/streamlit/streamlit/releases/tag/0.82.0) ## [2021-08-01] - Raw logs available - Add link to raw logs (build and container) from the space repository (viewable by users with write access to a Space) ### Tokens Management https://huggingface.co/docs/hub/enterprise-hub-tokens-management.md # Tokens Management > [!WARNING] > This feature is part of the Team & Enterprise plans. Tokens Management enables organization administrators to oversee access tokens within their organization, ensuring secure access to organization resources. ## Viewing and Managing Access Tokens The token listing feature displays all access tokens within your organization. Administrators can: - Monitor token usage and identify or prevent potential security risks: - Unauthorized access to private resources ("leaks") - Overly broad access scopes - Suboptimal token hygiene (e.g., tokens that have not been rotated in a long time) - Identify and revoke inactive or unused tokens Fine-grained tokens display their specific permissions: ## Token Policy Enterprise organization administrators can enforce the following policies: | **Policy** | **Unscoped (Read/Write) Access Tokens** | **Fine-Grained Tokens** | | ------------------------------------------------- | --------------------------------------- | ----------------------------------------------------------- | | **Allow access via User Access Tokens (default)** | Authorized | Authorized | | **Only access via fine-grained tokens** | Unauthorized | Authorized | | **Do not require administrator approval** | Unauthorized | Authorized | | **Require administrator approval** | Unauthorized | Unauthorized without an approval (except for admin-created) | ## Reviewing Token Authorization When token policy is set to "Require administrator approval", organization administrators can review details of all fine-grained tokens accessing organization-owned resources and revoke access if needed. Administrators receive email notifications for token authorization requests. When a token is revoked or denied, the user who created the token receives an email notification. ### Inference Providers https://huggingface.co/docs/hub/models-inference.md # Inference Providers Hugging Face's model pages have pay-as-you-go inference for thousands of models, so you can try them all out right in the browser. Service is powered by Inference Providers and includes a free-tier. Inference Providers give developers streamlined, unified access to hundreds of machine learning models, powered by the best serverless inference partners. 
👉 **For complete documentation, visit the [Inference Providers Documentation](https://huggingface.co/docs/inference-providers)**. ## Inference Providers on the Hub Inference Providers is deeply integrated with the Hugging Face Hub, and you can use it in a few different ways: - **Interactive Widgets** - Test models directly on model pages with interactive widgets that use Inference Providers under the hood. Check out the [DeepSeek-R1-0528 model page](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) for an example. - **Inference Playground** - Easily test and compare chat completion models with your prompts. Check out the [Inference Playground](https://huggingface.co/playground) to get started. - **Search** - Filter models by inference provider on the [models page](https://huggingface.co/models?inference_provider=all) to find models available through specific providers. - **Data Studio** - Use AI to explore datasets on the Hub. Check out [Data Studio](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/viewer?views%5B%5D=train) on your favorite dataset. ## Build with Inference Providers You can integrate Inference Providers into your own applications using our SDKs or HTTP clients. Here's a quick start with Python and JavaScript, for more details, check out the [Inference Providers Documentation](https://huggingface.co/docs/inference-providers). You can use our Python SDK to interact with Inference Providers. ```python from huggingface_hub import InferenceClient import os client = InferenceClient( api_key=os.environ["HF_TOKEN"], provider="auto", # Automatically selects best provider ) # Chat completion completion = client.chat.completions.create( model="deepseek-ai/DeepSeek-V3-0324", messages=[{"role": "user", "content": "A story about hiking in the mountains"}] ) # Image generation image = client.text_to_image( prompt="A serene lake surrounded by mountains at sunset, photorealistic style", model="black-forest-labs/FLUX.1-dev" ) ``` Or, you can just use the OpenAI API compatible client. ```python import os from openai import OpenAI client = OpenAI( base_url="https://router.huggingface.co/v1", api_key=os.environ["HF_TOKEN"], ) completion = client.chat.completions.create( model="deepseek-ai/DeepSeek-V3-0324", messages=[ { "role": "user", "content": "A story about hiking in the mountains" } ], ) ``` > [!WARNING] > The OpenAI API compatible client is not supported for image generation. You can use our JavaScript SDK to interact with Inference Providers. ```javascript import { InferenceClient } from "@huggingface/inference"; const client = new InferenceClient(process.env.HF_TOKEN); const chatCompletion = await client.chatCompletion({ provider: "auto", // Automatically selects best provider model: "deepseek-ai/DeepSeek-V3-0324", messages: [{ role: "user", content: "Hello!" }] }); const imageBlob = await client.textToImage({ model: "black-forest-labs/FLUX.1-dev", inputs: "A serene lake surrounded by mountains at sunset, photorealistic style", }); ``` Or, you can just use the OpenAI API compatible client. ```javascript import { OpenAI } from "openai"; const client = new OpenAI({ baseURL: "https://router.huggingface.co/v1", apiKey: process.env.HF_TOKEN, }); const completion = await client.chat.completions.create({ model: "meta-llama/Llama-3.1-8B-Instruct", messages: [{ role: "user", content: "A story about hiking in the mountains" }], }); ``` > [!WARNING] > The OpenAI API compatible client is not supported for image generation. 
You'll need a Hugging Face token with inference permissions. Create one at [Settings > Tokens](https://huggingface.co/settings/tokens/new?ownUserPermissions=inference.serverless.write&tokenType=fineGrained). ### How Inference Providers works To dive deeper into Inference Providers, check out the [Inference Providers Documentation](https://huggingface.co/docs/inference-providers). Here are some key resources: - **[Quick Start](https://huggingface.co/docs/inference-providers)** - **[Pricing & Billing Guide](https://huggingface.co/docs/inference-providers/pricing)** - **[Hub Integration Details](https://huggingface.co/docs/inference-providers/hub-integration)** ### What was the HF-Inference API? HF-Inference API is one of the providers available through Inference Providers. It was previously called "Inference API (serverless)" and is powered by [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) under the hood. For more details about the HF-Inference provider specifically, check out its [dedicated page](https://huggingface.co/docs/inference-providers/providers/hf-inference). ### Hugging Face MCP Server https://huggingface.co/docs/hub/hf-mcp-server.md # Hugging Face MCP Server The Hugging Face MCP (Model Context Protocol) Server connects your MCP‑compatible AI assistant (for example Codex, Cursor, VS Code extensions, Zed, ChatGPT or Claude Desktop) directly to the Hugging Face Hub. Once connected, your assistant can search and explore Hub resources and use community tools, all from within your editor, chat or CLI. ## What you can do - Search and explore Hub resources: models, datasets, Spaces, and papers. - Run community tools via MCP‑compatible Gradio apps hosted on [Spaces](https://hf.co/spaces). - Bring results back into your assistant with metadata, links, and context. ## Get started 1. Open your MCP settings: visit https://huggingface.co/settings/mcp while logged in. 2. Pick your client: select your MCP‑compatible client (for example Cursor, VS Code, Zed, Claude Desktop). The page shows client‑specific instructions and a ready‑to‑copy configuration snippet. 3. Paste and restart: copy the snippet into your client’s MCP configuration, save, and restart/reload the client. You should see “Hugging Face” (or similar) listed as a connected MCP server in your client. > [!TIP] > The settings page generates the exact configuration your client expects. Use it rather than writing config by hand. ![MCP Settings Example](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hf-mcp-settings.png) ## Using the server After connecting, ask your assistant to use the Hugging Face tools. Example prompts: - “Search Hugging Face models for Qwen 3 Quantizations.” - “Find a Space that can transcribe audio files.” - “Show datasets about weather time‑series.” - “Create a 1024 x 1024 image of a cat ghibli style.” Your assistant will call MCP tools exposed by the Hugging Face MCP Server (including Spaces) and return results (titles, owners, downloads, links, and so on). You can then open the resource on the Hub or continue iterating in the same chat. ![HF MCP with Spaces in VS Code](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hf-mcp-vscode.png) ## Add community tools (Spaces) You can extend your setup with MCP‑compatible Gradio spaces built by the community: - Explore Spaces with MCP support [here](https://huggingface.co/spaces?filter=mcp-server). 
- Add the relevant space in your MCP settings on Hugging Face [here](https://huggingface.co/settings/mcp). Gradio MCP apps expose their functions as tools (with arguments and descriptions) so your assistant can call them directly. Please restart or refresh your client so it picks up new tools you add. ![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/ex9KRpvamn84ZaOlSp_Bj.png) Check out our dedicated guide for Spaces as MCP servers [here](https://huggingface.co/docs/hub/spaces-mcp-servers#add-an-existing-space-to-your-mcp-tools). ## Learn more - Settings and client setup: https://huggingface.co/settings/mcp - Changelog announcement: https://huggingface.co/changelog/hf-mcp-server - Hugging Face MCP Server: https://huggingface.co/mcp - Build your own MCP Server with Gradio Spaces: https://www.gradio.app/guides/building-mcp-server-with-gradio ### Authentication for private and gated datasets https://huggingface.co/docs/hub/datasets-duckdb-auth.md # Authentication for private and gated datasets To access private or gated datasets, you need to configure your Hugging Face Token in the DuckDB Secrets Manager. Visit [Hugging Face Settings - Tokens](https://huggingface.co/settings/tokens) to obtain your access token. DuckDB supports two providers for managing secrets: - `CONFIG`: Requires the user to pass all configuration information into the CREATE SECRET statement. - `CREDENTIAL_CHAIN`: Automatically tries to fetch credentials. For the Hugging Face token, it will try to get it from `~/.cache/huggingface/token`. For more information about DuckDB Secrets visit the [Secrets Manager](https://duckdb.org/docs/configuration/secrets_manager.html) guide. ## Creating a secret with `CONFIG` provider To create a secret using the CONFIG provider, use the following command: ```sql CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN 'your_hf_token'); ``` Replace `your_hf_token` with your actual Hugging Face token. ## Creating a secret with `CREDENTIAL_CHAIN` provider To create a secret using the CREDENTIAL_CHAIN provider, use the following command: ```sql CREATE SECRET hf_token (TYPE HUGGINGFACE, PROVIDER credential_chain); ``` This command automatically retrieves the stored token from `~/.cache/huggingface/token`. First, you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using: ```bash hf auth login ``` Alternatively, you can set your Hugging Face token as an environment variable: ```bash export HF_TOKEN="hf_xxxxxxxxxxxxx" ``` For more information on authentication, see the [Hugging Face authentication](/docs/huggingface_hub/main/en/quick-start#authentication) documentation. ### Giskard on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-giskard.md # Giskard on Spaces **Giskard** is an AI model quality testing toolkit for LLMs, tabular, and NLP models. It consists of an open-source Python library for scanning and testing AI models and an AI Model Quality Testing app, which can now be deployed using Hugging Face's Docker Spaces. Extending the features of the open-source library, the AI Model Quality Testing app enables you to: - Debug tests to diagnose your issues - Create domain-specific tests thanks to automatic model insights - Compare models to decide which model to promote - Collect business feedback on your model results - Share your results with your colleagues for alignment - Store all your QA objects (tests, data slices, evaluation criteria, etc.)
in one place to work more efficiently Visit [Giskard's documentation](https://docs.giskard.ai/) and [Quickstart Guides](https://docs.giskard.ai/en/latest/getting_started/quickstart/index.html) to learn how to use the full range of tools provided by Giskard. In the next sections, you'll learn to deploy your own Giskard AI Model Quality Testing app and use it right from Hugging Face Spaces. This Giskard app is a **self-contained application completely hosted on Spaces using Docker**. ## Deploy Giskard on Spaces You can deploy Giskard on Spaces with just a few clicks: > [!WARNING] > IMPORTANT NOTE ABOUT DATA PERSISTENCE: > You can use the Giskard Space as is for initial exploration and experimentation. For **longer use in > small-scale projects, activate the [paid persistent storage option](https://huggingface.co/docs/hub/spaces-storage)**. This prevents data loss during Space restarts which > occur every 24 hours. You need to define the **Owner** (your personal account or an organization), a **Space name**, and the **Visibility**. If you don’t want to publicly share your models and quality tests, set your Space to **Private**. Once you have created the Space, you'll see the `Building` status. When it becomes `Running`, your Space is ready to go. If you don't see a change on the screen, refresh the page. ## Request a free license Once your Giskard Space is up and running, you'll need to request a free license to start using the app. You will then automatically receive an email with the license file. ## Create a new Giskard project Once inside the app, start by creating a new project from the welcome screen. ## Generate a Hugging Face Giskard Space Token and Giskard API key The Giskard API key is used to establish communication between the environment where your AI models are running and the Giskard app on Hugging Face Spaces. If you've set the **Visibility** of your Space to **Private**, you will need to provide a Hugging Face user access token to generate the Hugging Face Giskard Space Token and establish communication for access to your private Space. To do so, follow the instructions displayed in the settings page of the Giskard app. ## Start the ML worker Giskard executes your model using a worker that runs the model directly in your Python environment, with all the dependencies required by your model. You can execute the ML worker: - From your local notebook within the kernel that contains all the dependencies of your model - From Google Colab within the kernel that contains all the dependencies of your model - Or from your terminal within the Python environment that contains all the dependencies of your model Simply run the following command within the Python environment that contains all the dependencies of your model: ```bash giskard worker start -d -k GISKARD-API-KEY -u https://XXX.hf.space --hf-token GISKARD-SPACE-TOKEN ``` ## Upload your test suite, models and datasets In order to start building quality tests for a project, you will need to upload model and dataset objects, and either create or upload a test suite from the Giskard Python library. > [!TIP] > For more information on how to create test suites from Giskard's Python library's automated model scanning tool, head > over to Giskard's [Quickstart Guides](https://docs.giskard.ai/en/latest/getting_started/quickstart/index.html). These actions will all require a connection between your Python environment and the Giskard Space.
Achieve this by initializing a Giskard Client: simply copy the “Create a Giskard Client” snippet from the settings page of the Giskard app and run it within your Python environment. It will look something like this: ```python from giskard import GiskardClient url = "https://user_name-space_name.hf.space" api_key = "gsk-xxx" hf_token = "xxx" # Create a giskard client to communicate with Giskard client = GiskardClient(url, api_key, hf_token) ``` If you run into issues, head over to Giskard's [upload object documentation page](https://docs.giskard.ai/en/latest/giskard_hub/upload/index.html). ## Feedback and support If you have suggestions or need specific support, please join [Giskard's Discord community](https://discord.com/invite/ABvfpbu69R) or reach out on [Giskard's GitHub repository](https://github.com/Giskard-AI/giskard). ### Optimizations https://huggingface.co/docs/hub/datasets-polars-optimizations.md # Optimizations We briefly touched upon the difference between lazy and eager evaluation. On this page we will show how the lazy API can be used to get huge performance benefits. ## Lazy vs Eager Polars supports two modes of operation: lazy and eager. In the eager API the query is executed immediately, while in the lazy API the query is only evaluated once it's 'needed'. Deferring the execution to the last minute can have significant performance advantages and is why the lazy API is preferred in most non-interactive cases. ## Example We will be using the example from the previous page to show the performance benefits of using the lazy API. The code below computes, for each top-level domain (TLD) in the Common Crawl statistics dataset, the number of scrapes, the average number of domains, and the average page growth rate, keeping the top 10 TLDs by average number of domains. ### Eager ```python import polars as pl import datetime df = pl.read_csv("hf://datasets/commoncrawl/statistics/tlds.csv", try_parse_dates=True) df = df.select("suffix", "crawl", "date", "tld", "pages", "domains") df = df.filter( (pl.col("date") >= datetime.date(2020, 1, 1)) | pl.col("crawl").str.contains("CC") ) df = df.with_columns( (pl.col("pages") / pl.col("domains")).alias("pages_per_domain") ) df = df.group_by("tld", "date").agg( pl.col("pages").sum(), pl.col("domains").sum(), ) df = df.group_by("tld").agg( pl.col("date").unique().count().alias("number_of_scrapes"), pl.col("domains").mean().alias("avg_number_of_domains"), pl.col("pages").sort_by("date").pct_change().mean().alias("avg_page_growth_rate"), ).sort("avg_number_of_domains", descending=True).head(10) ``` ### Lazy ```python import polars as pl import datetime lf = ( pl.scan_csv("hf://datasets/commoncrawl/statistics/tlds.csv", try_parse_dates=True) .filter( (pl.col("date") >= datetime.date(2020, 1, 1)) | pl.col("crawl").str.contains("CC") ).with_columns( (pl.col("pages") / pl.col("domains")).alias("pages_per_domain") ).group_by("tld", "date").agg( pl.col("pages").sum(), pl.col("domains").sum(), ).group_by("tld").agg( pl.col("date").unique().count().alias("number_of_scrapes"), pl.col("domains").mean().alias("avg_number_of_domains"), pl.col("pages").sort_by("date").pct_change().mean().alias("avg_page_growth_rate"), ).sort("avg_number_of_domains", descending=True).head(10) ) df = lf.collect() ``` ### Timings Running both queries leads to the following run times on a regular laptop with a household internet connection: - Eager: `1.96` seconds - Lazy: `410` milliseconds The lazy query is ~5 times faster than the eager one.
The reason for this is the query optimizer: if we delay `collect`-ing our dataset until the end, Polars will be able to reason about which columns and rows are required and apply filters as early as possible when reading the data. For file formats such as Parquet that contain metadata (e.g. min, max in a certain group of rows) the difference can be even bigger, as Polars can skip entire row groups based on the filters and the metadata without sending the data over the wire. ### Paper Pages https://huggingface.co/docs/hub/paper-pages.md # Paper Pages Paper pages allow people to find artifacts related to a paper such as models, datasets and apps/demos (Spaces). Paper pages also enable the community to discuss the paper. ## Linking a Paper to a model, dataset or Space If the repository card (`README.md`) includes a link to a Paper page (either on HF or an Arxiv abstract/PDF), the Hugging Face Hub will extract the arXiv ID and include it in the repository's tags. Clicking on the arxiv tag will let you: * Visit the Paper page. * Filter for other models or datasets on the Hub that cite the same paper. ## Claiming authorship to a Paper The Hub will attempt to automatically match papers to users based on their email. If your paper is not linked to your account, you can click on your name on the corresponding Paper page and click "claim authorship". This will automatically redirect you to your paper settings where you can confirm the request. The admin team will validate your request soon. Once confirmed, the Paper page will show as verified. If you don't have any papers on Hugging Face yet, you can index your first one as explained [here](#can-i-have-a-paper-page-even-if-i-have-no-modeldatasetspace). Once available, you can claim authorship. ## Frequently Asked Questions ### Can I control which Paper pages show in my profile? Yes! You can visit your Papers in [settings](https://huggingface.co/settings/papers), where you will see a list of verified papers. There, you can click the "Show on profile" checkbox to hide/show each paper on your profile. ### Do you support ACL anthology? We're starting with Arxiv as it accounts for 95% of the paper URLs Hugging Face users have linked in their repos organically. We'll check how this evolves and potentially extend to other paper hosts in the future. ### Can I have a Paper page even if I have no model/dataset/Space? Yes. You can go to [the main Papers page](https://huggingface.co/papers), click search, and type the name of the paper or the full arXiv ID. If the paper does not exist, you will get an option to index it. You can also just visit the page `hf.co/papers/xxxx.yyyyy`, replacing `xxxx.yyyyy` with the arXiv ID of the paper you wish to index. ### Using SetFit with Hugging Face https://huggingface.co/docs/hub/setfit.md # Using SetFit with Hugging Face SetFit is an efficient and prompt-free framework for few-shot fine-tuning of [Sentence Transformers](https://sbert.net/). It achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples 🤯! Compared to other few-shot learning methods, SetFit has several unique features: * 🗣 **No prompts or verbalizers:** Current techniques for few-shot fine-tuning require handcrafted prompts or verbalizers to convert examples into a format suitable for the underlying language model.
SetFit dispenses with prompts altogether by generating rich embeddings directly from text examples. * 🏎 **Fast to train:** SetFit doesn't require large-scale models like [T0](https://huggingface.co/bigscience/T0) or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with. * 🌎 **Multilingual support**: SetFit can be used with any [Sentence Transformer](https://huggingface.co/models?library=sentence-transformers&sort=downloads) on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint. ## Exploring SetFit on the Hub You can find SetFit models by filtering at the left of the [models page](https://huggingface.co/models?library=setfit). All models on the Hub come with these useful features: 1. An automatically generated model card with a brief description. 2. An interactive widget you can use to play with the model directly in the browser. 3. An Inference API that allows you to make inference requests. ## Installation To get started, you can follow the [SetFit installation guide](https://huggingface.co/docs/setfit/installation). You can also use the following one-line install through pip: ``` pip install -U setfit ``` ## Using existing models All `setfit` models can easily be loaded from the Hub. ```py from setfit import SetFitModel model = SetFitModel.from_pretrained("tomaarsen/setfit-paraphrase-mpnet-base-v2-sst2-8-shot") ``` Once loaded, you can use [`SetFitModel.predict`](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitModel.predict) to perform inference. ```py model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.") ``` ```bash ['positive', 'negative'] ``` If you want to load a specific SetFit model, you can click `Use in SetFit` and you will be given a working snippet! ## Additional resources * [All SetFit models available on the Hub](https://huggingface.co/models?library=setfit) * SetFit [repository](https://github.com/huggingface/setfit) * SetFit [docs](https://huggingface.co/docs/setfit) * SetFit [paper](https://arxiv.org/abs/2209.11055) ### Jupyter Notebooks on the Hugging Face Hub https://huggingface.co/docs/hub/notebooks.md # Jupyter Notebooks on the Hugging Face Hub [Jupyter notebooks](https://jupyter.org/) are a very popular format for sharing code and data analysis for machine learning and data science. They are interactive documents that can contain code, visualizations, and text. ## Open models in Google Colab and Kaggle When you visit a model page on the Hugging Face Hub, you’ll see a new “Google Colab”/ "Kaggle" button in the “Use this model” drop down. Clicking this will generate a ready-to-run notebook with basic code to load and test the model. This is perfect for quick prototyping, inference testing, or fine-tuning experiments — all without leaving your browser. ![Google Colab and Kaggle option for models on the Hub](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/hf-google-colab/gemma3-4b-it-dark.png) Users can also access a ready-to-run notebook by appending /colab to the model card’s URL. 
As an example, for the latest Gemma 3 4B IT model, the corresponding Colab notebook can be reached by taking the model card URL: https://huggingface.co/google/gemma-3-4b-it And then appending `/colab` to it: https://huggingface.co/google/gemma-3-4b-it/colab And similarly for Kaggle: https://huggingface.co/google/gemma-3-4b-it/kaggle If a model repository includes a file called `notebook.ipynb`, we will use it for Colab and Kaggle instead of the auto-generated notebook content. Model authors can provide tailored examples, detailed walkthroughs, or advanced use cases while still benefiting from one-click Colab integration. [NousResearch/Genstruct-7B](https://huggingface.co/NousResearch/Genstruct-7B) is one such example. ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/hf-google-colab/genstruct-notebook-dark.png) ## Rendering .ipynb Jupyter notebooks on the Hub Under the hood, Jupyter Notebook files (usually shared with a `.ipynb` extension) are JSON files. While viewing these files directly is possible, it's not a format intended to be read by humans. The Hub has rendering support for notebooks hosted on the Hub. This means that notebooks are displayed in a human-readable format. ![Before and after notebook rendering](https://huggingface.co/blog/assets/135_notebooks-hub/before_after_notebook_rendering.png) Notebooks will be rendered when included in any type of repository on the Hub. This includes models, datasets, and Spaces. ### Launch in Google Colab [Google Colab](https://colab.google/) is a free Jupyter Notebook environment that requires no setup and runs entirely in the cloud. It's a great way to run Jupyter Notebooks without having to install anything on your local machine. All .ipynb files hosted on the Hub are automatically given an "Open in Colab" button. This allows you to open the notebook in Colab with a single click. ### Libraries https://huggingface.co/docs/hub/datasets-libraries.md # Libraries The Datasets Hub has support for several libraries in the Open Source ecosystem. Thanks to the [huggingface_hub Python library](/docs/huggingface_hub), it's easy to enable sharing your datasets on the Hub. We're happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward. The table below summarizes the supported libraries and their level of integration. | Library | Description | Download from Hub | Push to Hub | | ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------- | ----------- | | [Argilla](./datasets-argilla) | Collaboration tool for AI engineers and domain experts that value high quality data. | ✅ | ✅ | | [Daft](./datasets-daft) | Data engine for large scale, multimodal data processing with a Python-native interface. | ✅ | ✅ | | [Dask](./datasets-dask) | Parallel and distributed computing library that scales the existing Python and PyData ecosystem. | ✅ | ✅ | | [Datasets](./datasets-usage) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). | ✅ | ✅ | | [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. | ✅ | ✅ | | [DuckDB](./datasets-duckdb) | In-process SQL OLAP database management system. | ✅ | ✅ | | [Embedding Atlas](./datasets-embedding-atlas) | Interactive visualization and exploration tool for large embeddings.
| ✅ | ❌ | | [fenic](./datasets-fenic) | PySpark-inspired DataFrame framework for building production AI and agentic applications. | ✅ | ❌ | | [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. | ✅ | ✅ | | [Pandas](./datasets-pandas) | Python data analysis toolkit. | ✅ | ✅ | | [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. | ✅ | ✅ | | [PyArrow](./datasets-pyarrow) | Apache Arrow is a columnar format and a toolbox for fast data interchange and in-memory analytics. | ✅ | ✅ | | [Spark](./datasets-spark) | Real-time, large-scale data processing tool in a distributed environment. | ✅ | ✅ | | [WebDataset](./datasets-webdataset) | Library to write I/O pipelines for large datasets. | ✅ | ❌ | ## Integrating data libraries and tools with the Hub This guide is designed for developers and maintainers of data libraries and tools who want to integrate with the Hugging Face Hub. Whether you're building a data processing library, analysis tool, or any software that needs to interact with datasets, this documentation will help you implement a Hub integration. The guide covers: - Possible approaches to loading data from the Hub into your library/tool - Possible approaches to uploading data from your library/tool to the Hub ### Loading data from the Hub If you have a library for working with data, it can be helpful for your users to load data from the Hub. In general, we suggest relying on an existing library like `datasets`, `pandas` or `polars` to do this unless you have a specific reason to implement your own. If you require more control over the loading process, you can use the `huggingface_hub` library, which will allow you, for example, to download a specific subset of files from a repository. You can find more information about loading data from the Hub [here](https://huggingface.co/docs/hub/datasets-downloading). #### Integrating via the Dataset Viewer and Parquet Files The Hub's dataset viewer and Parquet conversion system provide a standardized way to integrate with datasets, regardless of their original format. This infrastructure is a reliable integration layer between the Hub and external libraries. If the dataset is not already in Parquet, the Hub automatically converts the first 5GB of every dataset to Parquet format to power the dataset viewer and provide consistent access patterns. This standardization offers several benefits for library integrations: - Consistent data access patterns regardless of original format - Built-in dataset preview and exploration through the Hub's dataset viewer. The dataset viewer can also be embedded as an iframe in your applications, making it easy to provide rich dataset previews. For more information about embedding the viewer, see the [dataset viewer embedding documentation](https://huggingface.co/docs/hub/en/datasets-viewer-embed). - Efficient columnar storage optimized for querying. For example, you could use a tool like [DuckDB](https://duckdb.org/) to query or filter for a specific subset of data. - Parquet is well supported across the machine learning and data science ecosystem. For more details on working with the Dataset Viewer API, see the [Dataset Viewer API documentation](https://huggingface.co/docs/dataset-viewer/index) ### Uploading data to the Hub This section covers possible approaches for adding the ability to upload data to the Hub in your library, i.e. how to implement a `push_to_hub` method. 
This guide will cover four primary ways to upload data to the Hub: - using the `datasets` library and the `push_to_hub` method - using `pandas` to write to the Hub - using the `huggingface_hub` library and the `upload_file` / `upload_folder` methods - directly using the API or Git with git-xet #### Use the `datasets` library The most straightforward approach to pushing data to the Hub is to rely on the existing [`push_to_hub`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.push_to_hub) method from the `datasets` library. The `push_to_hub` method will automatically handle: - the creation of the repository - the conversion of the dataset to Parquet - chunking the dataset into suitable parts - uploading the data For example, if you have a synthetic data generation library that returns a list of dictionaries, you could simply do the following: ```python from datasets import Dataset data = [{"prompt": "Write a cake recipe", "response": "Measure 1 cup ..."}] ds = Dataset.from_list(data) ds.push_to_hub("USERNAME_OR_ORG/repo_ID") ``` Examples of this kind of integration: - [Distilabel](https://github.com/argilla-io/distilabel/blob/8ad48387dfa4d7bd5639065661f1975dcb44c16a/src/distilabel/distiset.py#L77) #### Rely on an existing library's integration with the Hub Polars, Pandas, Dask, Spark, DuckDB, and Daft can all write to a Hugging Face Hub repository. See [datasets libraries](https://huggingface.co/docs/hub/datasets-libraries) for more details. If you are already using one of these libraries in your code, adding the ability to push to the Hub is straightforward. For example, if you have a synthetic data generation library that can return a Pandas DataFrame, here is the code you would need to write it to the Hub: ```python import os from huggingface_hub import HfApi # Initialize the Hub API hf_api = HfApi(token=os.getenv("HF_TOKEN")) # Create a repository (if it doesn't exist) hf_api.create_repo(repo_id="username/my-dataset", repo_type="dataset", exist_ok=True) # Convert your data to a DataFrame and save directly to the Hub df.to_parquet("hf://datasets/username/my-dataset/data.parquet") ``` #### Using the huggingface_hub Python library The `huggingface_hub` Python library offers a more flexible approach to uploading data to the Hub. The library allows you to upload specific files or subsets of files to a repository. This is useful if you have a large dataset that you don't want to convert to Parquet, want to upload a specific subset of files, or want more control over the repo structure. Depending on your use case, you can upload a file or folder at a specific point in your code, i.e., export annotations from a tool to the Hub when a user clicks "push to Hub". For example: ```python from huggingface_hub import HfApi api = HfApi(token=HF_TOKEN) api.upload_folder( folder_path="/my-cool-library/data-folder", repo_id="username/my-cool-space", repo_type="dataset", commit_message="Push annotations to Hub", allow_patterns="*.jsonl", ) ``` You can find more information about ways to upload data to the Hub [here](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload). Alternatively, there are situations where you may want to upload data in the background, for example, synthetic data being generated every 10 minutes. In this case you can use the scheduled uploads feature of the `huggingface_hub` library. For more details, see the [scheduled uploads documentation](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#scheduled-uploads).
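As a rough illustration — a minimal sketch, assuming your library writes JSONL files into a hypothetical local folder — scheduled uploads are handled by `huggingface_hub`'s `CommitScheduler`, which commits the folder's contents to a Hub repo at a fixed interval in a background thread:

```python
from pathlib import Path
from huggingface_hub import CommitScheduler

# Hypothetical local folder that your code appends JSONL records to.
data_dir = Path("generated-data")
data_dir.mkdir(exist_ok=True)

# Push the folder's contents to a dataset repo every 10 minutes.
scheduler = CommitScheduler(
    repo_id="username/my-synthetic-dataset",  # assumed repo name for illustration
    repo_type="dataset",
    folder_path=data_dir,
    every=10,  # minutes between commits
)

# Your generation loop keeps writing files; the scheduler commits them periodically.
with (data_dir / "batch-000.jsonl").open("a") as f:
    f.write('{"prompt": "Write a cake recipe", "response": "Measure 1 cup ..."}\n')
```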
You can see examples of using this approach to upload data to the Hub in: - The [fastdata](https://github.com/AnswerDotAI/fastdata/blob/main/nbs/00_core.ipynb) library - This [magpie](https://huggingface.co/spaces/davanstrien/magpie/blob/fc79672c740b8d3d098378dca37c0f191c208de0/app.py#L67) Demo Space ## More support For technical questions about integration, feel free to contact the datasets team at datasets@huggingface.co. ### Models Frequently Asked Questions https://huggingface.co/docs/hub/models-faq.md # Models Frequently Asked Questions ## How can I see what dataset was used to train the model? It's up to the person who uploaded the model to include the training information! A user can [specify](./model-cards#specifying-a-dataset) the dataset used for training a model. If the datasets used for the model are on the Hub, the uploader may have included them in the [model card's metadata](https://huggingface.co/Jiva/xlm-roberta-large-it-mnli/blob/main/README.md#L7-L9). In that case, the datasets would be linked with a handy card on the right side of the model page: ## How can I see an example of the model in action? Models can have inference widgets that let you try out the model in the browser! Inference widgets are easy to configure, and there are many different options at your disposal. Visit the [Widgets documentation](./models-widgets) to learn more. The Hugging Face Hub is also home to Spaces, which are interactive demos used to showcase models. If a model has any Spaces associated with it, you'll find them linked on the model page like so: Spaces are a great way to show off a model you've made or explore new ways to use existing models! Visit the [Spaces documentation](./spaces) to learn how to make your own. ## How do I upload an update / new version of the model? Releasing an update to a model that you've already published can be done by pushing a new commit to your model's repo. To do this, go through the same process that you followed to upload your initial model. Your previous model versions will remain in the repository's commit history, so you can still download previous model versions from a specific git commit or tag, or revert to previous versions if needed. ## What if I have a different checkpoint of the model trained on a different dataset? By convention, each model repo should contain a single checkpoint. You should upload any new checkpoints trained on different datasets to the Hub in a new model repo. You can link the models together by using a tag specified in the `tags` key in your [model card's metadata](./model-cards), by using [Collections](./collections) to group distinct related repositories together or by linking to them in the model cards. The [akiyamasho/AnimeBackgroundGAN-Shinkai](https://huggingface.co/akiyamasho/AnimeBackgroundGAN-Shinkai#other-pre-trained-model-versions) model, for example, references other checkpoints in the model card under *"Other pre-trained model versions"*. ## Can I link my model to a paper on arXiv? If the model card includes a link to a paper on arXiv, the Hugging Face Hub will extract the arXiv ID and include it in the model tags with the format `arxiv:<PAPER ID>`. Clicking on the tag will let you: * Visit the paper page * Filter for other models on the Hub that cite the same paper. Read more about paper pages [here](./paper-pages). ### Spaces Custom Domain https://huggingface.co/docs/hub/spaces-custom-domain.md # Spaces Custom Domain > [!WARNING] > The Spaces Custom Domain feature is part of PRO and Team or Enterprise subscriptions.
## Getting started with a Custom Domain Spaces Custom Domain is a feature that allows you to host your Space on a custom domain of your choosing: `yourdomain.example.com` 🚀 The custom domain must be a valid DNS name. ## Using a Custom Domain You can submit a custom domain to host your Space in the settings of your Space, under "Custom Domain". You'll need to add the CNAME Record Type: The request will move to 'pending' status after submission as seen below. Please make sure to point the domain to `hf.space`. Once set up, you'll see a 'ready' status to know the custom domain is active for your Space 🔥 If you've completed all the steps but aren't seeing a 'ready' status, you can enter your domain [here](https://toolbox.googleapps.com/apps/dig/#CNAME/) to verify it points to `hf.space`. If it doesn't, please check your domain host to ensure the CNAME record was added correctly. ## Removing a Custom Domain Simply remove a custom domain by using the delete button to the right of “Custom Domain” in the settings of your Space. You can delete it while the custom domain is in the pending or ready state. ### How to Add a Space to ArXiv https://huggingface.co/docs/hub/spaces-add-to-arxiv.md # How to Add a Space to ArXiv Demos on Hugging Face Spaces allow a wide audience to try out state-of-the-art machine learning research without writing any code. [Hugging Face and ArXiv have collaborated](https://huggingface.co/blog/arxiv) to embed these demos directly alongside papers on ArXiv! Thanks to this integration, users can now find the most popular demos for a paper on its arXiv abstract page. For example, if you want to try out demos of the LayoutLM document classification model, you can go to [the LayoutLM paper's arXiv page](https://arxiv.org/abs/1912.13318), and navigate to the demo tab. You will see open-source demos built by the machine learning community for this model, which you can try out immediately in your browser: ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/layout-lm-space-arxiv.gif) We'll cover two different ways to add your Space to ArXiv and have it show up in the Demos tab. **Prerequisites** * There's an existing paper on ArXiv that you'd like to create a demo for * You have built (or can build) a demo for the model on Spaces **Method 1 (Recommended): Linking from the Space README** The simplest way to add a Space to an ArXiv paper is to include the link to the paper in the Space README file (`README.md`). It's good practice to include a full citation as well. You can see an example of a link and a citation on this [Echocardiogram Segmentation Space README](https://huggingface.co/spaces/abidlabs/echocardiogram-arxiv/blob/main/README.md). And that's it! Your Space should appear in the Demo tab next to the paper on ArXiv in a few minutes 🤗 **Method 2: Linking a Related Model** An alternative approach to linking Spaces to papers is to link an intermediate model to the Space. This requires that the paper is **associated with a model** that is on the Hugging Face Hub (or can be uploaded there). 1. First, upload the model associated with the ArXiv paper onto the Hugging Face Hub if it is not already there. ([Detailed instructions are here](./models-uploading)) 2. When writing the model card (README.md) for the model, include a link to the ArXiv paper. It's good practice to include a full citation as well.
You can see an example of a link and a citation on the [LayoutLM model card](https://huggingface.co/microsoft/layoutlm-base-uncased). *Note*: you can verify this step has been carried out successfully by seeing if an ArXiv button appears above the model card. In the case of LayoutLM, the button says: "arxiv:1912.13318" and links to the LayoutLM paper on ArXiv. ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/arxiv-button.png) 3. Then, create a demo on Spaces that loads this model. Somewhere within the code, the model name must be included in order for Hugging Face to detect that a Space is associated with it. For example, the [docformer_for_document_classification](https://huggingface.co/spaces/iakarshu/docformer_for_document_classification) Space loads the LayoutLM model [like this](https://huggingface.co/spaces/iakarshu/docformer_for_document_classification/blob/main/modeling.py#L484) and includes the string `"microsoft/layoutlm-base-uncased"`: ```py from transformers import LayoutLMForTokenClassification layoutlm_dummy = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased", num_labels=1) ``` *Note*: Here's an [overview on building demos on Hugging Face Spaces](./spaces-overview) and here are more specific instructions for [Gradio](./spaces-sdks-gradio) and [Streamlit](./spaces-sdks-streamlit). 4. As soon as your Space is built, Hugging Face will detect that it is associated with the model. A "Linked Models" button should appear in the top right corner of the Space, as shown here: ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/linked-models.png) *Note*: You can also add linked models manually by explicitly updating them in the [README metadata for the Space, as described here](https://huggingface.co/docs/hub/spaces-config-reference). Your Space should appear in the Demo tab next to the paper on ArXiv in a few minutes 🤗 ### File names and splits https://huggingface.co/docs/hub/datasets-file-names-and-splits.md # File names and splits To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. This guide will show you how to name your files and directories in your dataset repository when you upload it, enabling all the Datasets Hub features like the Dataset Viewer. Look at the [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135) for more details. A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a dataset viewer on its page on the Hub. Note that if none of the structures below suits your case, you can have more control over how you define splits and subsets with the [Manual Configuration](./datasets-manual-configuration). ## Basic use-case If your dataset isn't split into [train/validation/test splits](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets), the simplest dataset structure is to have one file: `data.csv` (this works with any [supported file format](./datasets-adding#file-formats) and any file name). Your repository will also contain a `README.md` file, the [dataset card](./datasets-cards) displayed on your dataset page. ``` my_dataset_repository/ ├── README.md └── data.csv ``` ## Splits Some patterns in the dataset repository can be used to assign certain files to train/validation/test splits.
### File name You can name your data files after the `train`, `test`, and `validation` splits: ``` my_dataset_repository/ ├── README.md ├── train.csv ├── test.csv └── validation.csv ``` If you don't have any non-traditional splits, then you can place the split name anywhere in the data file name. The only rule is that the split name must be delimited by non-word characters, like `test-file.csv` for example instead of `testfile.csv`. Supported delimiters include underscores, dashes, spaces, dots, and numbers. For example, the following file names are all acceptable: - train split: `train.csv`, `my_train_file.csv`, `train1.csv` - validation split: `validation.csv`, `my_validation_file.csv`, `validation1.csv` - test split: `test.csv`, `my_test_file.csv`, `test1.csv` ### Directory name You can place your data files into different directories named `train`, `test`, and `validation` where each directory contains the data files for that split: ``` my_dataset_repository/ ├── README.md └── data/ ├── train/ │ └── data.csv ├── test/ │ └── more_data.csv └── validation/ └── even_more_data.csv ``` ### Keywords There are several ways to refer to train/validation/test splits. Validation splits are sometimes called "dev", and test splits may be referred to as "eval". These other split names are also supported, and the following keywords are equivalent: - train, training - validation, valid, val, dev - test, testing, eval, evaluation Therefore, the structure below is a valid repository: ``` my_dataset_repository/ ├── README.md └── data/ ├── training.csv ├── eval.csv └── valid.csv ``` ### Multiple files per split Splits can span several files, for example: ``` my_dataset_repository/ ├── README.md ├── train_0.csv ├── train_1.csv ├── train_2.csv ├── train_3.csv ├── test_0.csv └── test_1.csv ``` Make sure all the files of your `train` set have *train* in their names (same for test and validation). You can even add a prefix or suffix to `train` in the file name (like `my_train_file_00001.csv` for example). For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name. ``` my_dataset_repository/ ├── README.md └── data/ ├── train/ │ ├── shard_0.csv │ ├── shard_1.csv │ ├── shard_2.csv │ └── shard_3.csv └── test/ ├── shard_0.csv └── shard_1.csv ``` ### Custom split name If your dataset splits have custom names that aren't `train`, `test`, or `validation`, then you can name your data files like `data/<split_name>-xxxxx-of-xxxxx.csv`. Here is an example with three splits, `train`, `test`, and `random`: ``` my_dataset_repository/ ├── README.md └── data/ ├── train-00000-of-00003.csv ├── train-00001-of-00003.csv ├── train-00002-of-00003.csv ├── test-00000-of-00001.csv ├── random-00000-of-00003.csv ├── random-00001-of-00003.csv └── random-00002-of-00003.csv ``` ### More ways to create Spaces https://huggingface.co/docs/hub/spaces-more-ways-to-create.md # More ways to create Spaces ## Duplicating a Space You can duplicate a Space by clicking the three dots at the top right and selecting **Duplicate this Space**. Learn more about it [here](./spaces-overview#duplicating-a-space). ## Creating a Space from a model New! You can now create a Gradio demo directly from most model pages, using the "Deploy -> Spaces" button.
As another example of how to create a Space from a set of models, the [Model Comparator Space Builder](https://huggingface.co/spaces/farukozderim/Model-Comparator-Space-Builder) from [@farukozderim](https://huggingface.co/farukozderim) can be used to create a Space directly from any model hosted on the Hub. ### PyArrow https://huggingface.co/docs/hub/datasets-pyarrow.md # PyArrow [Arrow](https://github.com/apache/arrow) is a columnar format and a toolbox for fast data interchange and in-memory analytics. Since PyArrow supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub. It is especially useful for [Parquet](https://parquet.apache.org/) data, since Parquet is the most common file format on Hugging Face. Indeed, Parquet is particularly efficient thanks to its structure, typing, metadata and compression. ## Load a Table You can load data from local files or from remote storage like Hugging Face Datasets. PyArrow supports many formats including CSV, JSON and more importantly Parquet: ```python >>> import pyarrow.parquet as pq >>> table = pq.read_table("path/to/data.parquet") ``` To load a file from Hugging Face, the path needs to start with `hf://`. For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet` as a pyarrow Table (it requires `pyarrow>=21.0`): ```python >>> import pyarrow.parquet as pq >>> table = pq.read_table("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet") >>> table pyarrow.Table text: string label: int64 ---- text: [["I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it (... 1542 chars omitted)", ...],...,[..., "The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritan (... 221 chars omitted)"]] label: [[0,0,0,0,0,...,0,0,0,0,0],...,[1,1,1,1,1,...,1,1,1,1,1]] ``` If you don't want to load the full Parquet data, you can get the Parquet metadata or load row group by row group instead: ```python >>> import pyarrow.parquet as pq >>> pf = pq.ParquetFile("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet") >>> pf.metadata created_by: parquet-cpp-arrow version 12.0.0 num_columns: 2 num_rows: 25000 num_row_groups: 25 format_version: 2.6 serialized_size: 62036 >>> for i in range(pf.num_row_groups): ... table = pf.read_row_group(i) ... ... ``` For more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system). ## Save a Table You can save a pyarrow Table using `pyarrow.parquet.write_table` to a local file or to Hugging Face directly.
To save the Table on Hugging Face, you first need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using: ```bash hf auth login ``` Then you can [create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using: ```python from huggingface_hub import HfApi HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") ``` Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in PyArrow: ```python import pyarrow.parquet as pq pq.write_table(table, "hf://datasets/username/my_dataset/imdb.parquet", use_content_defined_chunking=True) # or write in separate files if the dataset has train/validation/test splits pq.write_table(table_train, "hf://datasets/username/my_dataset/train.parquet", use_content_defined_chunking=True) pq.write_table(table_valid, "hf://datasets/username/my_dataset/validation.parquet", use_content_defined_chunking=True) pq.write_table(table_test, "hf://datasets/username/my_dataset/test.parquet", use_content_defined_chunking=True) ``` We use `use_content_defined_chunking=True` to enable faster uploads and downloads from Hugging Face thanks to Xet deduplication (it requires `pyarrow>=21.0`). > [!TIP] > Content defined chunking (CDC) makes the Parquet writer chunk the data pages in a way that makes duplicate data chunked and compressed identically. > Without CDC, the pages are arbitrarily chunked and therefore duplicate data are impossible to detect because of compression. > Thanks to CDC, Parquet uploads and downloads from Hugging Face are faster, since duplicate data are uploaded or downloaded only once. Find more information about Xet [here](https://huggingface.co/join/xet). ## Use Images You can load a folder with a metadata file containing a field for the names or paths to the images, structured like this: ``` Example 1: Example 2: folder/ folder/ ├── metadata.parquet ├── metadata.parquet ├── img000.png └── images ├── img001.png ├── img000.png ... ... └── imgNNN.png └── imgNNN.png ``` You can iterate over the image paths like this: ```python from pathlib import Path import pyarrow.parquet as pq folder_path = Path("path/to/folder") table = pq.read_table(folder_path / "metadata.parquet") for file_name in table["file_name"].to_pylist(): image_path = folder_path / file_name ... ``` Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save this dataset to Hugging Face, and the Dataset Viewer will show both the metadata and the images. ```python from huggingface_hub import HfApi api = HfApi() api.upload_folder( folder_path=folder_path, repo_id="username/my_image_dataset", repo_type="dataset", ) ``` ### Embed Images inside Parquet PyArrow has a binary type which allows embedding the image bytes in Arrow tables.
Therefore it enables saving the dataset as one single Parquet file containing both the images (bytes and path) and the sample metadata: ```python import pyarrow as pa import pyarrow.parquet as pq # Embed the image bytes in Arrow image_array = pa.array([ { "bytes": (folder_path / file_name).read_bytes(), "path": file_name, } for file_name in table["file_name"].to_pylist() ]) # append_column returns a new Table table = table.append_column("image", image_array) # (Optional) Set the HF Image type for the Dataset Viewer and the `datasets` library features = {"image": {"_type": "Image"}} # or using datasets.Features(...).to_dict() schema_metadata = {"huggingface": {"dataset_info": {"features": features}}} table = table.replace_schema_metadata(schema_metadata) # Save to Parquet # (Optional) with use_content_defined_chunking for faster uploads and downloads pq.write_table(table, "data.parquet", use_content_defined_chunking=True) ``` Setting the Image type in the Arrow schema metadata allows other libraries and the Hugging Face Dataset Viewer to know that "image" contains images and not just binary data. ## Use Audios You can load a folder with a metadata file containing a field for the names or paths to the audio files, structured like this: ``` Example 1: Example 2: folder/ folder/ ├── metadata.parquet ├── metadata.parquet ├── rec000.wav └── audios ├── rec001.wav ├── rec000.wav ... ... └── recNNN.wav └── recNNN.wav ``` You can iterate over the audio paths like this: ```python from pathlib import Path import pyarrow.parquet as pq folder_path = Path("path/to/folder") table = pq.read_table(folder_path / "metadata.parquet") for file_name in table["file_name"].to_pylist(): audio_path = folder_path / file_name ... ``` Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-audio#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and audio. ```python from huggingface_hub import HfApi api = HfApi() api.upload_folder( folder_path=folder_path, repo_id="username/my_audio_dataset", repo_type="dataset", ) ``` ### Embed Audio inside Parquet PyArrow has a binary type which allows embedding the audio bytes in Arrow tables. Therefore, it enables saving the dataset as one single Parquet file containing both the audio (bytes and path) and the sample metadata: ```python import pyarrow as pa import pyarrow.parquet as pq # Embed the audio bytes in Arrow audio_array = pa.array([ { "bytes": (folder_path / file_name).read_bytes(), "path": file_name, } for file_name in table["file_name"].to_pylist() ]) # append_column returns a new Table table = table.append_column("audio", audio_array) # (Optional) Set the HF Audio type for the Dataset Viewer and the `datasets` library features = {"audio": {"_type": "Audio"}} # or using datasets.Features(...).to_dict() schema_metadata = {"huggingface": {"dataset_info": {"features": features}}} table = table.replace_schema_metadata(schema_metadata) # Save to Parquet # (Optional) with use_content_defined_chunking for faster uploads and downloads pq.write_table(table, "data.parquet", use_content_defined_chunking=True) ``` Setting the Audio type in the Arrow schema metadata enables other libraries and the Hugging Face Dataset Viewer to recognise that "audio" contains audio data, not just binary data.
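To sanity-check the result — a minimal sketch, assuming the `data.parquet` file with an embedded "image" column written above — you can read the struct column back and decode the bytes, for example with Pillow:

```python
import io

import pyarrow.parquet as pq
from PIL import Image

# Read back the Parquet file written above (hypothetical local path).
table = pq.read_table("data.parquet")

# Each "image" entry is a struct with "bytes" and "path" fields.
for item in table["image"].to_pylist():
    image = Image.open(io.BytesIO(item["bytes"]))
    print(item["path"], image.size)
```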
### Using `Transformers.js` at Hugging Face https://huggingface.co/docs/hub/transformers-js.md # Using `Transformers.js` at Hugging Face Transformers.js is a JavaScript library for running 🤗 Transformers directly in your browser, with no need for a server! It is designed to be functionally equivalent to the original [Python library](https://github.com/huggingface/transformers), meaning you can run the same pretrained models using a very similar API. ## Exploring `transformers.js` in the Hub You can find `transformers.js` models by filtering by library in the [models page](https://huggingface.co/models?library=transformers.js). ## Quick tour It's super simple to translate from existing code! Just like the Python library, we support the `pipeline` API. Pipelines group together a pretrained model with preprocessing of inputs and postprocessing of outputs, making it the easiest way to run models with the library. Python (original) Javascript (ours) ```python from transformers import pipeline # Allocate a pipeline for sentiment-analysis pipe = pipeline('sentiment-analysis') out = pipe('I love transformers!') # [{'label': 'POSITIVE', 'score': 0.999806941}] ``` ```javascript import { pipeline } from '@huggingface/transformers'; // Allocate a pipeline for sentiment-analysis let pipe = await pipeline('sentiment-analysis'); let out = await pipe('I love transformers!'); // [{'label': 'POSITIVE', 'score': 0.999817686}] ``` You can also use a different model by specifying the model id or path as the second argument to the `pipeline` function. For example: ```javascript // Use a different model for sentiment-analysis let pipe = await pipeline('sentiment-analysis', 'nlptown/bert-base-multilingual-uncased-sentiment'); ``` Refer to the [documentation](https://huggingface.co/docs/transformers.js) for the full list of supported tasks and models. ## Installation To install via [NPM](https://www.npmjs.com/package/@huggingface/transformers), run: ```bash npm i @huggingface/transformers ``` For more information, including how to use it in vanilla JS (without any bundler) via a CDN or static hosting, refer to the [README](https://github.com/huggingface/transformers.js/blob/main/README.md#installation). ## Additional resources * Transformers.js [repository](https://github.com/huggingface/transformers.js) * Transformers.js [docs](https://huggingface.co/docs/transformers.js) * Transformers.js [demo](https://huggingface.github.io/transformers.js/) ### Configure the Dataset Viewer https://huggingface.co/docs/hub/datasets-viewer-configure.md # Configure the Dataset Viewer The Dataset Viewer supports many [data files formats](./datasets-adding#file-formats), from text to tabular and from image to audio formats. It also separates the train/validation/test splits based on file and folder names. To configure the Dataset Viewer for your dataset, first make sure your dataset is in a [supported data format](./datasets-adding#file-formats). ## Configure dropdowns for splits or subsets In the Dataset Viewer you can view the [train/validation/test](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets) splits of datasets, and sometimes additionally choose between multiple subsets (e.g. one per language). To define those dropdowns, you can name the data files or their folder after their split names (train/validation/test). It is also possible to customize your splits manually using YAML. 
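For instance — a minimal sketch, assuming hypothetical CSV shards stored under `data/train/` and `data/test/` in your repository — a manual split configuration goes in the YAML front matter of the dataset's `README.md`:

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train/*.csv
  - split: test
    path: data/test/*.csv
---
```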
For more information, feel free to check out the documentation on [Data files Configuration](./datasets-data-files-configuration) and the [collections of example datasets](https://huggingface.co/datasets-examples). The [Image Dataset doc page](./datasets-image) proposes various methods to structure a dataset with images. ## Disable the viewer The dataset viewer can be disabled. To do this, add a YAML section to the dataset's `README.md` file (create one if it does not already exist) and add a `viewer` property with the value `false`. ```yaml --- viewer: false --- ``` ## Private datasets For **private** datasets, the Dataset Viewer is enabled for [PRO users](https://huggingface.co/pricing) and [Team or Enterprise organizations](https://huggingface.co/enterprise). ### Third-party scanner: Protect AI https://huggingface.co/docs/hub/security-protectai.md # Third-party scanner: Protect AI > [!TIP] > Interested in joining our security partnership / providing scanning information on the Hub? Please get in touch with us over at security@huggingface.co.* [Protect AI](https://protectai.com/)'s [Guardian](https://protectai.com/guardian) catches pickle, Keras, and other exploits as detailed on their [Knowledge Base page](https://protectai.com/insights/knowledge-base/). Guardian also benefits from reports sent in by their community of bounty [Huntr](https://huntr.com/)s. ![Protect AI report for the danger.dat file contained in mcpotato/42-eicar-street](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/protect-ai-report.png) *Example of a report for [danger.dat](https://huggingface.co/mcpotato/42-eicar-street/blob/main/danger.dat)* We partnered with Protect AI to provide scanning in order to make the Hub safer. The same way files are scanned by our internal scanning system, public repositories' files are scanned by Guardian. Our frontend has been redesigned specifically for this purpose, in order to accommodate for new scanners: Here is an example repository you can check out to see the feature in action: [mcpotato/42-eicar-street](https://huggingface.co/mcpotato/42-eicar-street). ## Model security refresher To share models, we serialize the data structures we use to interact with the models, in order to facilitate storage and transport. Some serialization formats are vulnerable to nasty exploits, such as arbitrary code execution (looking at you pickle), making sharing models potentially dangerous. As Hugging Face has become a popular platform for model sharing, we’d like to protect the community from this, hence why we have developed tools like [picklescan](https://github.com/mmaitre314/picklescan) and why we integrate third party scanners. Pickle is not the only exploitable format out there, [see for reference](https://github.com/Azure/counterfit/wiki/Abusing-ML-model-file-formats-to-create-malware-on-AI-systems:-A-proof-of-concept) how one can exploit Keras Lambda layers to achieve arbitrary code execution. ### User Provisioning (SCIM) https://huggingface.co/docs/hub/enterprise-hub-scim.md # User Provisioning (SCIM) > [!WARNING] > This feature is part of the Enterprise Plus plan. SCIM, or System for Cross-domain Identity Management, is a standard for automating user provisioning. It allows you to connect your Identity Provider (IdP) to Hugging Face to automatically manage your organization's members. With SCIM, you can: - **Provision users**: Automatically create user accounts in your Hugging Face organization when they are assigned the application in your IdP. 
- **Update user attributes**: Changes made to user profiles in your IdP (like name or email) are automatically synced to Hugging Face. - **Provision groups**: Create groups in your Hugging Face organization based on groups in your IdP. - **Deprovision users**: Automatically deactivate user accounts in your Hugging Face organization when they are unassigned from the application or deactivated in your IdP. This ensures that your Hugging Face organization's member list is always in sync with your IdP, streamlining user lifecycle management and improving security. ## How to enable SCIM To enable SCIM, go to your organization's settings, navigate to the **SSO** tab, and then select the **SCIM** sub-tab. You will find the **SCIM Tenant URL** and a button to generate an **access token**. You will need both of these to configure your IdP. The SCIM token is a secret and should be stored securely in your IdP's configuration. Once SCIM is enabled in your IdP, provisioned users and groups will appear in the "Users Management" and "SCIM" tabs, respectively. ## Supported Identity Providers We support SCIM with any IdP that implements the SCIM 2.0 protocol. We have specific guides for some of the most popular providers: - [How to configure SCIM with Microsoft Entra ID](./security-sso-entra-id-scim) - [How to configure SCIM with Okta](./security-sso-okta-scim) ### Webhook guide: Set up an automatic system to re-train a model when a dataset changes https://huggingface.co/docs/hub/webhooks-guide-auto-retrain.md # Webhook guide: Set up an automatic system to re-train a model when a dataset changes > [!TIP] > Webhooks are now publicly available! This guide walks you through the setup of an automatic training pipeline on the Hugging Face platform using HF Datasets, Webhooks, Spaces, and AutoTrain. We will build a Webhook that listens to changes on an image classification dataset and triggers a fine-tuning of [microsoft/resnet-50](https://huggingface.co/microsoft/resnet-50) using [AutoTrain](https://huggingface.co/autotrain). ## Prerequisite: Upload your dataset to the Hub We will use a [simple image classification dataset](https://huggingface.co/datasets/huggingface-projects/auto-retrain-input-dataset) for the sake of the example. Learn more about uploading your data to the Hub [here](https://huggingface.co/docs/datasets/upload_dataset). ![dataset](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/dataset.png) ## Create a Webhook to react to the dataset's changes First, let's create a Webhook from your [settings](https://huggingface.co/settings/webhooks). - Select your dataset as the target repository. We will target [huggingface-projects/input-dataset](https://huggingface.co/datasets/huggingface-projects/input-dataset) in this example. - You can put a dummy Webhook URL for now. Defining your Webhook will let you look at the events that will be sent to it. You can also replay them, which will be useful for debugging! - Input a secret to make it more secure. - Subscribe to "Repo update" events as we want to react to data changes. Your Webhook will look like this: ![webhook-creation](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/webhook-creation.png) ## Create a Space to react to your Webhook We now need a way to react to your Webhook events. An easy way to do this is to use a [Space](https://huggingface.co/docs/hub/spaces-overview)!
You can find an example Space [here](https://huggingface.co/spaces/huggingface-projects/auto-retrain/tree/main). This Space uses Docker, Python, [FastAPI](https://fastapi.tiangolo.com/), and [uvicorn](https://www.uvicorn.org) to run a simple HTTP server. Read more about Docker Spaces [here](https://huggingface.co/docs/hub/spaces-sdks-docker). The entry point is [src/main.py](https://huggingface.co/spaces/huggingface-projects/auto-retrain/blob/main/src/main.py). Let's walk through this file and detail what it does: 1. It spawns a FastAPI app that will listen to HTTP `POST` requests on `/webhook`: ```python from fastapi import FastAPI # [...] @app.post("/webhook") async def post_webhook( # ... ): # ... ``` 2. This route checks that the `X-Webhook-Secret` header is present and that its value is the same as the one you set in your Webhook's settings. The `WEBHOOK_SECRET` secret must be set in the Space's settings and be the same as the secret set in your Webhook. ```python # [...] WEBHOOK_SECRET = os.getenv("WEBHOOK_SECRET") # [...] @app.post("/webhook") async def post_webhook( # [...] x_webhook_secret: Optional[str] = Header(default=None), # ^ checks for the X-Webhook-Secret HTTP header ): if x_webhook_secret is None: raise HTTPException(401) if x_webhook_secret != WEBHOOK_SECRET: raise HTTPException(403) # [...] ``` 3. The event's payload is encoded as JSON. Here, we'll be using pydantic models to parse the event payload. We also specify that we will run our Webhook only when: - the event concerns the input dataset - the event is an update on the repo's content, i.e., there has been a new commit ```python # defined in src/models.py class WebhookPayloadEvent(BaseModel): action: Literal["create", "update", "delete"] scope: str class WebhookPayloadRepo(BaseModel): type: Literal["dataset", "model", "space"] name: str id: str private: bool headSha: str class WebhookPayload(BaseModel): event: WebhookPayloadEvent repo: WebhookPayloadRepo # [...] @app.post("/webhook") async def post_webhook( # [...] payload: WebhookPayload, # ^ Pydantic model defining the payload format ): # [...] if not ( payload.event.action == "update" and payload.event.scope.startswith("repo.content") and payload.repo.name == config.input_dataset and payload.repo.type == "dataset" ): # no-op if the payload does not match our expectations return {"processed": False} # [...] ``` 4. If the payload is valid, the next step is to create a project on AutoTrain, schedule a fine-tuning of the input model (`microsoft/resnet-50` in our example) on the input dataset, and create a discussion on the dataset when it's done! ```python def schedule_retrain(payload: WebhookPayload): # Create the autotrain project try: project = AutoTrain.create_project(payload) AutoTrain.add_data(project_id=project["id"]) AutoTrain.start_processing(project_id=project["id"]) except requests.HTTPError as err: print("ERROR while requesting AutoTrain API:") print(f" code: {err.response.status_code}") print(f" {err.response.json()}") raise # Notify in the community tab notify_success(project["id"]) ``` Visit the link inside the comment to review the training cost estimate, and start fine-tuning the model! ![community tab notification](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/notification.png) In this example, we used Hugging Face AutoTrain to fine-tune our model quickly, but you can of course plug in your own training infrastructure!
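For reference, here is a minimal sketch of what a `notify_success`-style helper (used in step 4 above) could look like, using `huggingface_hub`'s `create_discussion` to post on the dataset's Community tab. The repository id and message below are placeholders, and the example Space's actual implementation may differ:

```python
import os

from huggingface_hub import HfApi


def notify_success(project_id: int):
    """Open a discussion on the input dataset once the AutoTrain project is created."""
    api = HfApi(token=os.getenv("HF_ACCESS_TOKEN"))
    api.create_discussion(
        repo_id="huggingface-projects/input-dataset",  # placeholder: the input dataset from this guide
        repo_type="dataset",
        title="✨ Retraining started!",
        description=f"AutoTrain project #{project_id} was created from the latest commit to this dataset.",
    )
```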
Feel free to duplicate the Space to your personal namespace and play with it. You will need to provide two secrets: - `WEBHOOK_SECRET`: the secret from your Webhook. - `HF_ACCESS_TOKEN`: a User Access Token with `write` rights. You can create one [from your settings](https://huggingface.co/settings/tokens). You will also need to tweak the [`config.json` file](https://huggingface.co/spaces/huggingface-projects/auto-retrain/blob/main/config.json) to use the dataset and model of your choice: ```json { "target_namespace": "the namespace where the trained model should end up", "input_dataset": "the dataset on which the model will be trained", "input_model": "the base model to re-train", "autotrain_project_prefix": "A prefix for the AutoTrain project" } ``` ## Configure your Webhook to send events to your Space Last but not least, you'll need to configure your Webhook to send POST requests to your Space. Let's first grab our Space's "direct URL" from the contextual menu. Click on "Embed this Space" and copy the "Direct URL". ![embed this Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/duplicate-space.png) ![direct URL](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/direct-url.png) Update your Webhook to send requests to that URL: ![webhook settings](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/update-webhook.png) And that's it! Now every commit to the input dataset will trigger a fine-tuning of ResNet-50 with AutoTrain 🎉 ### Git over SSH https://huggingface.co/docs/hub/security-git-ssh.md # Git over SSH You can access and write data in repositories on huggingface.co using SSH (Secure Shell Protocol). When you connect via SSH, you authenticate using a private key file on your local machine. Some actions, such as pushing changes or cloning private repositories, will require you to upload your SSH public key to your account on huggingface.co. You can use a pre-existing SSH key, or generate a new one specifically for huggingface.co. ## Checking for existing SSH keys If you have an existing SSH key, you can use that key to authenticate Git operations over SSH. SSH keys are usually located under `~/.ssh` on Mac & Linux, and under `C:\Users\<username>\.ssh` on Windows. List files under that directory and look for files of the form: - id_rsa.pub - id_ecdsa.pub - id_ed25519.pub Those files contain your SSH public key. If you don't have such a file under `~/.ssh`, you will have to [generate a new key](#generating-a-new-ssh-keypair). Otherwise, you can [add your existing SSH public key(s) to your huggingface.co account](#add-a-ssh-key-to-your-account). ## Generating a new SSH keypair If you don't have any SSH keys on your machine, you can use `ssh-keygen` to generate a new SSH key pair (public + private keys): ``` $ ssh-keygen -t ed25519 -C "your.email@example.co" ``` We recommend entering a passphrase when you are prompted to. A passphrase is an extra layer of security: it is a password you will be prompted for whenever you use your SSH key. Once your new key is generated, add it to your SSH agent with `ssh-add`: ``` $ ssh-add ~/.ssh/id_ed25519 ``` If you chose a different location than the default to store your SSH key, you would have to replace `~/.ssh/id_ed25519` with the file location you used.
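If you manage several SSH keys, you can optionally tell SSH which key to use for the Hub by adding an entry to your `~/.ssh/config` file. This is a generic SSH configuration sketch, not something specific to huggingface.co; adjust the `IdentityFile` path to wherever your key is stored:

```
Host hf.co
  User git
  IdentityFile ~/.ssh/id_ed25519
  IdentitiesOnly yes
```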
## Add a SSH key to your account To access private repositories with SSH, or to push changes via SSH, you will need to add your SSH public key to your huggingface.co account. You can manage your SSH keys [in your user settings](https://huggingface.co/settings/keys). To add a SSH key to your account, click on the "Add SSH key" button. Then, enter a name for this key (for example, "Personal computer"), and copy and paste the content of your **public** SSH key in the area below. The public key is located in the `~/.ssh/id_XXXX.pub` file you found or generated in the previous steps. Click on "Add key", and voilà! You have added a SSH key to your huggingface.co account. ## Testing your SSH authentication Once you have added your SSH key to your huggingface.co account, you can test that the connection works as expected. In a terminal, run: ``` $ ssh -T git@hf.co ``` If you see a message with your username, congrats! Everything went well, you are ready to use git over SSH. Otherwise, if the message states something like the following, make sure your SSH key is actually used by your SSH agent. ``` Hi anonymous, welcome to Hugging Face. ``` ## HuggingFace's SSH key fingerprints Public key fingerprints can be used to validate a connection to a remote server. These are HuggingFace's public key fingerprints: > SHA256:aBG5R7IomF4BSsx/h6tNAUVLhEkkaNGB8Sluyh/Q/qY (ECDSA) > SHA256:skgQjK2+RuzvdmHr24IIAJ6uLWQs0TGtEUt3FtzqirQ (DSA - deprecated) > SHA256:dVjzGIdV7d6cwKIeZiCoRMa2gMvSKfGZAvHf4gMiMao (ED25519) > SHA256:uqjYymysBGCXXiMVebB8L8RIuWbPSKGBxQQNhcT5a3Q (RSA) You can add the following ssh key entries to your ~/.ssh/known_hosts file to avoid manually verifying HuggingFace hosts: ``` hf.co ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDtPB+snz63eZvTrbMY2Qt39a6HYile89JOum55z3lhIqAqUHxLtXFd+q+ED8izQvyORFPSmFIaPw05rtXo37bm+ixL6wDmvWrHN74oUUWmtrv2MNCLHE5VDb3+Q6MJjjDVIoK5QZIuTStlq0cUbGGxQk7vFZZ2VXdTPqgPjw4hMV7MGp3RFY/+Wy8rIMRv+kRCIwSAOeuaLPT7FzL0zUMDwj/VRjlzC08+srTQHqfoh0RguZiXZQneZKmM75AFhoMbP5x4AW2bVoZam864DSGiEwL8R2jMiyXxL3OuicZteZqll0qfRlNopKnzoxS29eBbXTr++ILqYz1QFqaruUgqSi3MIC9sDYEqh2Q8UxP5+Hh97AnlgWDZC0IhojVmEPNAc7Y2d+ctQl4Bt91Ik4hVf9bU+tqMXgaTrTMXeTURSXRxJEm2zfKQVkqn3vS/zGVnkDS+2b2qlVtrgbGdU/we8Fux5uOAn/dq5GygW/DUlHFw412GtKYDFdWjt3nJCY8= hf.co ssh-dss AAAAB3NzaC1kc3MAAACBAORXmoE8fn/UTweWy7tCYXZxigmODg71CIvs/haZQN6GYqg0scv8OFgeIQvBmIYMnKNJ7eoo5ZK+fk1yPv8aa9+8jfKXNJmMnObQVyObxFVzB51x8yvtHSSrL4J3z9EAGX9l9b+Fr2+VmVFZ7a90j2kYC+8WzQ9HaCYOlrALzz2VAAAAFQC0RGD5dE5Du2vKoyGsTaG/mO2E5QAAAIAHXRCMYdZij+BYGC9cYn5Oa6ZGW9rmGk98p1Xc4oW+O9E/kvu4pCimS9zZordLAwHHWwOUH6BBtPfdxZamYsBgO8KsXOWugqyXeFcFkEm3c1HK/ysllZ5kM36wI9CUWLedc2vj5JC+xb5CUzhVlGp+Xjn59rGSFiYzIGQC6pVkHgAAAIBve2DugKh3x8qq56sdOH4pVlEDe997ovEg3TUxPPIDMSCROSxSR85fa0aMpxqTndFMNPM81U/+ye4qQC/mr0dpFLBzGuum4u2dEpjQ7B2UyJL9qhs1Ubby5hJ8Z3bmHfOK9/hV8nhyN8gf5uGdrJw6yL0IXCOPr/VDWSUbFrsdeQ== hf.co ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBL0wtM52yIjm8gRecBy2wRyEMqr8ulG0uewT/IQOGz5K0ZPTIy6GIGHsTi8UXBiEzEIznV3asIz2sS7SiQ311tU= hf.co ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINJjhgtT9FOQrsVSarIoPVI1jFMh3VSHdKfdqp/O776s ``` ### Using AllenNLP at Hugging Face https://huggingface.co/docs/hub/allennlp.md # Using AllenNLP at Hugging Face `allennlp` is a NLP library for developing state-of-the-art models on different linguistic tasks. It provides high-level abstractions and APIs for common components and models in modern NLP. It also provides an extensible framework that makes it easy to run and manage NLP experiments. 
## Exploring allennlp in the Hub You can find `allennlp` models on the Hub by filtering at the left of the [models page](https://huggingface.co/models?library=allennlp). All models on the Hub come with useful features: 1. A training metrics tab with automatically hosted TensorBoard traces. 2. Metadata tags that help with discoverability. 3. An interactive widget you can use to play with the model directly in the browser. 4. An Inference API that lets you make inference requests. ## Using existing models You can use the `Predictor` class to load existing models on the Hub. To achieve this, use the `from_path` method with the `"hf://"` prefix followed by the repository id. Here is an end-to-end example. ```py import allennlp_models from allennlp.predictors.predictor import Predictor predictor = Predictor.from_path("hf://allenai/bidaf-elmo") predictor_input = { "passage": "My name is Wolfgang and I live in Berlin", "question": "Where do I live?" } predictions = predictor.predict_json(predictor_input) ``` To get a snippet such as this, you can click `Use in AllenNLP` at the top right. ## Sharing your models The first step is to save the model locally. For example, you can use the [`archive_model`](https://docs.allennlp.org/main/api/models/archival/#archive_model) method to save the model as a `model.tar.gz` file. You can then push the zipped model to the Hub. When you train a model with `allennlp`, the model is automatically serialized, so you can use that output directly. ### Using the AllenNLP CLI To push with the CLI, you can use the `allennlp push_to_hf` command as seen below. ```bash allennlp push_to_hf --repo_name test_allennlp --archive_path model ``` | Argument | Type | Description | |----------------------------- |-------------- |------------------------------------------------------------------------------------------------------------------------------- | | `--repo_name`, `-n` | str / `Path` | Name of the repository on the Hub. | | `--organization`, `-o` | str | Optional name of organization to which the pipeline should be uploaded. | | `--serialization-dir`, `-s` | str / `Path` | Path to directory with the serialized model. | | `--archive-path`, `-a` | str / `Path` | If instead of a serialization path you're using a zipped model (e.g. model/model.tar.gz), you can use this flag. | | `--local-repo-path`, `-l` | str / `Path` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory. | | `--commit-message`, `-c` | str | Commit message to use for update. Defaults to `"update repository"`. | ### From a Python script The `push_to_hf` function has the same parameters as the bash script. ```py from allennlp.common.push_to_hf import push_to_hf serialization_dir = "path/to/serialization/directory" push_to_hf( repo_name="my_repo_name", serialization_dir=serialization_dir, local_repo_path="hub" ) ``` In just a minute, you can get your model on the Hub, try it out directly in the browser, and share it with the rest of the community. All the required metadata will be uploaded for you! ## Additional resources * AllenNLP [website](https://allenai.org/allennlp). * AllenNLP [repository](https://github.com/allenai/allennlp). ### Using RL-Baselines3-Zoo at Hugging Face https://huggingface.co/docs/hub/rl-baselines3-zoo.md # Using RL-Baselines3-Zoo at Hugging Face `rl-baselines3-zoo` is a training framework for Reinforcement Learning using Stable Baselines3.
## Exploring RL-Baselines3-Zoo in the Hub You can find RL-Baselines3-Zoo models by filtering at the left of the [models page](https://huggingface.co/models?library=stable-baselines3). The Stable-Baselines3 team is hosting a collection of over 150 trained Reinforcement Learning agents with tuned hyperparameters that you can find [here](https://huggingface.co/sb3). All models on the Hub come with useful features: 1. An automatically generated model card with a description, a training configuration, and more. 2. Metadata tags that help with discoverability. 3. Evaluation results to compare with other models. 4. A video widget where you can watch your agent performing. ## Using existing models You can simply download a model from the Hub using `load_from_hub`: ``` # Download the dqn SpaceInvadersNoFrameskip-v4 model and save it into the logs/ folder python -m rl_zoo3.load_from_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/ -orga sb3 python enjoy.py --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/ ``` The main parameters are: - `--algo`: The algorithm the agent was trained with. - `--env`: The environment id. - `-orga`: The Hugging Face username or organization hosting the model. - `-f`: The destination folder. ## Sharing your models You can easily upload your models with `push_to_hub`. That will save the model, evaluate it, generate a model card, and record a replay video of your agent before pushing the complete repo to the Hub. ``` python -m rl_zoo3.push_to_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 --repo-name dqn-SpaceInvadersNoFrameskip-v4 -orga ThomasSimonini -f logs/ ``` You can define three parameters: - `--repo-name`: The name of the repo. - `-orga`: Your Hugging Face username. - `-f`: The folder where the model is saved. ## Additional resources * RL-Baselines3-Zoo [official trained models](https://huggingface.co/sb3) * RL-Baselines3-Zoo [documentation](https://github.com/DLR-RM/rl-baselines3-zoo) ### ChatUI on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-chatui.md # ChatUI on Spaces **Hugging Chat** is an open-source interface enabling everyone to try open-source large language models such as Falcon, StarCoder, and BLOOM. Thanks to an official Docker template called ChatUI, you can deploy your own Hugging Chat based on a model of your choice with a few clicks using Hugging Face's infrastructure. ## Deploy your own Chat UI To get started, simply head [here](https://huggingface.co/new-space?template=huggingchat/chat-ui-template). In the backend of this application, [text-generation-inference](https://github.com/huggingface/text-generation-inference) is used for optimized model inference. Since these models can't run on CPUs, you can select the GPU depending on your choice of model. You should provide a MongoDB endpoint where your chats will be written. If you leave this section blank, your logs will be persisted to a database inside the Space. Note that Hugging Face does not have access to your chats. You can configure the name and the theme of the Space by providing the application name and application color parameters. Below this, you can select the Hugging Face Hub ID of the model you wish to serve. You can also change the generation hyperparameters in the dictionary below in JSON format. _Note_: If you'd like to deploy a model with gated access or a model in a private repository, you can simply provide `HF_TOKEN` in repository secrets. You need to set its value to an access token you can get from [here](https://huggingface.co/settings/tokens).
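As an illustration, a generation parameters dictionary might look something like the following. The exact keys that are honored depend on the model and the chat-ui version, so treat this as a sketch rather than a canonical configuration:

```json
{
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 50,
  "repetition_penalty": 1.2,
  "max_new_tokens": 1024,
  "truncate": 1000
}
```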
Once the creation is complete, you will see `Building` on your Space. Once built, you can try your own HuggingChat and start chatting! ## Read more - [HF Docker Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker) - [chat-ui GitHub Repository](https://github.com/huggingface/chat-ui) - [text-generation-inference GitHub repository](https://github.com/huggingface/text-generation-inference) ### Datasets https://huggingface.co/docs/hub/enterprise-hub-datasets.md # Datasets > [!WARNING] > This feature is part of the Team & Enterprise plans. Data Studio is enabled on private datasets under your Enterprise Hub organization. Data Studio helps teams understand their data and build better data processing and filtering for AI. This powerful viewer allows you to explore dataset content, inspect data distributions, filter by values, search for keywords, or even run SQL queries on your data without leaving your browser. See the [Data Studio](./datasets-viewer) documentation for more information. ### Xet History & Overview https://huggingface.co/docs/hub/xet/overview.md # Xet History & Overview [In August 2024, Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace Git LFS on the Hub. Like Git LFS, a Xet-backed repository utilizes S3 as the remote storage, with a `.gitattributes` file at the repository root helping identify which files should be stored remotely. A Git LFS pointer file provides metadata to locate the actual file contents in remote storage: - **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. - **Pointer size**: The size of the pointer file stored in the Git repository. - **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful both for verification purposes and for managing storage and transfer operations. A Xet pointer includes all of this information by design (see the section on [backwards compatibility with Git LFS](legacy-git-lfs#backward-compatibility-with-lfs)), with the addition of a `Xet backed hash` field for referencing the file in Xet storage. Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly saving on network transfers. For many workflows, like incremental updates to model checkpoints or appending/inserting new data into a dataset, this improves iteration speed for you and your collaborators. To learn more about deduplication in Xet storage, refer to [Deduplication](deduplication). ### Backward Compatibility with LFS https://huggingface.co/docs/hub/xet/legacy-git-lfs.md # Backward Compatibility with LFS Uploads from legacy / non-Xet-aware clients still follow the standard Git LFS path, even if the repo is already Xet-backed. Once the file is uploaded to LFS, a background process automatically migrates the file to Xet storage. The Xet architecture provides backwards compatibility for legacy clients downloading files from Xet-backed repos by offering a Git LFS bridge.
While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed file, a legacy client will get a single URL from the bridge, which does the work of reconstructing the requested file and returning a URL to the resource. This allows downloading files through a URL so that you can continue to use the Hub's web interface or `curl`. By having LFS file uploads automatically migrate and having older clients continue to download files from Xet-backed repositories, maintainers and the rest of the Hub can update their pipelines at their own pace. Xet storage provides a seamless transition for existing Hub repositories. It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format; the `Xet backed hash` field is only added to the web interface as a convenience. Practically, this means existing repos and newly created repos will not look any different if you do a bare clone of them. Each of the large files (or binary files) will continue to have a pointer file that matches the Git LFS pointer file specification. This symmetry allows non-Xet-aware clients (e.g., older versions of the `huggingface_hub`) to interact with Xet-backed repositories without concern. In fact, a mixture of Git LFS and Xet-backed files is supported within a repository. The Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services to request the proper URL(s) from S3, regardless of which storage system holds the content. ## Legacy Storage: Git LFS Git LFS, the legacy storage system on the Hub, utilizes many of the same conventions as Xet-backed repositories. The Hub's Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA256 hash to name the file for future access. This storage architecture is relatively simple and has allowed the Hub to store files for millions of model, dataset, and Space repositories. The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned, incurring significant overheads in file transfers: the entire file is uploaded (when committing to a repository) or downloaded (when pulling the latest version to your machine). This leads to a worse developer experience and a proliferation of redundant storage. ### Xet: our Storage Backend https://huggingface.co/docs/hub/xet/index.md # Xet: our Storage Backend Repositories on the Hugging Face Hub are different from those on software development platforms. They contain files that are: - Large - model or dataset files are in the range of GB and above. We have a few TB-scale files! - Binary - not in a human-readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) While the Hub leverages modern version control with the support of Git, these differences make [Model](https://huggingface.co/docs/hub/models) and [Dataset](https://huggingface.co/docs/hub/datasets) repositories quite different from those that contain only source code. Storing these files directly in a pure Git repository is impractical.
Not only are the typical storage systems behind Git repositories unsuited for such files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historic data you may never need. Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like [Amazon S3](https://aws.amazon.com/s3/)). As a result, the repository stays small and typical Git workflows remain efficient. Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported (see [Backwards Compatibility & Legacy](./legacy-git-lfs)), the Hub has adopted Xet, a modern custom storage system built specifically for AI/ML development. It enables chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. ## Open Source Xet Protocol If you want to understand the underlying Xet protocol or are looking to build a new client library to access Xet Storage, check out the [Xet Protocol Specification](https://huggingface.co/docs/xet/index). These pages will help you get started with Xet Storage. ## Contents - [Xet History & Overview](./overview) - [Using Xet Storage](./using-xet-storage) - [Security](./security) - [Backwards Compatibility & Legacy](./legacy-git-lfs) - [Deduplication](./deduplication) ### Deduplication https://huggingface.co/docs/hub/xet/deduplication.md # Deduplication Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate at the level of bytes (~64KB of data, also referred to as a "chunk"). Chunk boundaries are determined by a rolling hash over the actual file contents, making the scheme resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository using a Xet-aware client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking; everything else is discarded. To avoid the overhead of communicating and managing at the level of chunks, new chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded. Each block is stored once in a content-addressed store (CAS), keyed by its hash. The Hub's [current recommendation](https://huggingface.co/docs/hub/storage-limits#recommendations) is to limit files to 20GB. At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times.
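To build intuition for how chunk-level deduplication works, here is a toy Python sketch of content-defined chunking. It is purely illustrative: the hash function, window size, and chunk-size parameters are not those used by `xet-core`:

```python
import hashlib

# Toy content-defined chunking (CDC). Purely illustrative: xet-core's real
# chunker, rolling hash, and parameters are different.
MIN_SIZE = 32 * 1024    # never cut a chunk shorter than this
MAX_SIZE = 128 * 1024   # always cut once a chunk reaches this size
MASK = (1 << 16) - 1    # ~1/65536 cut probability per byte -> ~64KB average chunks
WINDOW = 48             # bytes of local context that decide a boundary


def chunks(data: bytes):
    """Split `data` into chunks whose boundaries depend on local content, not offsets."""
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        window = data[max(0, i + 1 - WINDOW) : i + 1]
        fingerprint = int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")
        if (size >= MIN_SIZE and (fingerprint & MASK) == 0) or size >= MAX_SIZE or i == len(data) - 1:
            yield data[start : i + 1]
            start = i + 1


def chunks_to_upload(data: bytes, stored_hashes: set) -> list:
    """Only chunks whose hash is not already in storage need to be transferred."""
    new = []
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in stored_hashes:
            stored_hashes.add(digest)
            new.append(chunk)
    return new
```

Because boundaries are chosen from a small window of local content, an edit near the start of a file only perturbs the chunks around it; later boundaries, and therefore later chunk hashes, are unchanged and do not need to be uploaded again.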
For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face. ### Security Model https://huggingface.co/docs/hub/xet/security.md # Security Model Xet storage provides data deduplication over all chunks stored in Hugging Face. This is done via cryptographic hashing in a privacy-sensitive way. The contents of chunks are protected and are associated with repository permissions, i.e. you can only read chunks which are required to reproduce files you have access to, and no more. More information and details on how deduplication is done in a privacy-preserving way are described in the [Xet Protocol Specification](https://huggingface.co/docs/xet/deduplication). ### Using Xet Storage https://huggingface.co/docs/hub/xet/using-xet-storage.md # Using Xet Storage ## Python To access a Xet-aware version of `huggingface_hub`, simply install the latest version: ```bash pip install -U huggingface_hub ``` As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend. If you use the `transformers` or `datasets` libraries, they already use `huggingface_hub`. So long as the version of `huggingface_hub` is >= 0.32.0, no further action needs to be taken. Versions of `huggingface_hub` >= 0.30.0 also support Xet. Git users can access the benefits of Xet by downloading and installing the Git Xet extension. Once installed, simply use the [standard workflows for managing Hub repositories with Git](../repositories-getting-started) - no additional changes necessary. ### Prerequisites Install [Git](https://git-scm.com/) and [Git LFS](https://git-lfs.com/). ### Install on macOS or Linux (amd64 or aarch64) Install using an installation script with the following command in your terminal (requires `curl` and `unzip`): ``` curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/huggingface/xet-core/refs/heads/main/git_xet/install.sh | sh ``` Or, install using [Homebrew](https://brew.sh/), with the following [tap](https://docs.brew.sh/Taps) (direct `brew install` coming soon): ``` brew tap huggingface/tap brew install git-xet git xet install ``` To verify the installation, run: ``` git-xet --version ``` ### Windows (amd64) Using an installer: - Download `git-xet-windows-installer-x86_64.zip` ([available here](https://github.com/huggingface/xet-core/releases/download/git-xet-v0.1.0/git-xet-windows-installer-x86_64.zip)) and unzip. - Run the MSI installer file and follow the prompts. Manual installation: - Download `git-xet-windows-x86_64.zip` ([available here](https://github.com/huggingface/xet-core/releases/download/git-xet-v0.1.0/git-xet-windows-x86_64.zip)) and unzip. - Place the extracted `git-xet.exe` in a directory on your `PATH`. - Run `git xet install` in a terminal. To verify the installation, run: ``` git-xet --version ``` ### Using Git Xet Once installed on your platform, using Git Xet is as simple as following the Hub's standard Git workflows.
Make sure all [prerequisites are installed and configured](https://huggingface.co/docs/hub/repositories-getting-started#requirements), follow the [setup instructions for working with repositories on the Hub](https://huggingface.co/docs/hub/repositories-getting-started#set-up), then commit your changes, and `push` to the Hub: ``` # Create any files you like! Then... git add . git commit -m "Uploading new models" # You can choose any descriptive message git push ``` Under the hood, the [Xet protocol](https://huggingface.co/docs/xet/index) is invoked to upload large files directly to Xet storage, increasing upload speeds through the power of [chunk-level deduplication](./deduplication). ### Uninstall on macOS or Linux Using Homebrew: ```bash git-xet uninstall brew uninstall git-xet ``` If you used the installation script (for MacOS or Linux), run the following in your terminal: ```bash git-xet uninstall sudo rm $(which git-xet) ``` ### Uninstall on Windows If you used the installer: - Navigate to Settings -> Apps -> Installed apps - Find "Git-Xet". - Select the "Uninstall" option available in the context menu. If you manually installed: - Run `git-xet uninstall` in a terminal. - Delete the `git-xet.exe` file from the location where it was originally placed. ## Recommendations Xet integrates seamlessly with all of the Hub's workflows. However, there are a few steps you may consider to get the most benefits from Xet storage. When uploading or downloading with Python: - **Make sure `hf_xet` is installed**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files. - **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to optimize downloads and uploads. When uploading or downloading in Git or Python: - **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient. - **Be Specific in .gitattributes**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage. - **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need. ## Current Limitations While Xet brings fine-grained deduplication and enhanced performance to Git-based storage, some features and platform compatibilities are still in development. As a result, keep the following constraints in mind when working with a Xet-enabled repository: - **64-bit systems only**: Both `hf_xet` and Git Xet currently require a 64-bit architecture; 32-bit systems are not supported.