Deploying Models with Inference Endpoints

Deploy any Hugging Face model as a production API in minutes.

1. Go to the [Inference Endpoints console](https://ui.endpoints.huggingface.co/)

2. Select your model from the Hub

3. Choose instance type (GPU/CPU)

4. Select region (US, EU, Asia)

5. Deploy. The API is typically ready in 2-5 minutes.
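
The console steps above can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` Python library is installed and you are logged in with a token that may create endpoints. The helper names `build_endpoint_config` and `deploy` are hypothetical, and the default `vendor`, `region`, `instance_size`, and `instance_type` values are assumptions that should be checked against the options the console offers for your account.

```python
from typing import Any, Dict


def build_endpoint_config(
    name: str,
    repository: str,
    *,
    task: str = "text-generation",
    accelerator: str = "cpu",        # step 3: "cpu" or "gpu"
    vendor: str = "aws",             # assumption: cloud vendor
    region: str = "us-east-1",       # step 4: assumption, pick your own region
    instance_size: str = "x2",       # assumption: check the console's catalog
    instance_type: str = "intel-icl",  # assumption: check the console's catalog
    min_replica: int = 0,
    max_replica: int = 1,
) -> Dict[str, Any]:
    """Collect the choices from steps 2-4 into keyword arguments."""
    return {
        "name": name,
        "repository": repository,    # step 2: model ID on the Hub
        "framework": "pytorch",
        "task": task,
        "accelerator": accelerator,
        "vendor": vendor,
        "region": region,
        "instance_size": instance_size,
        "instance_type": instance_type,
        "min_replica": min_replica,
        "max_replica": max_replica,
        "type": "protected",         # requires a token to call the endpoint
    }


def deploy(config: Dict[str, Any]) -> str:
    """Create the endpoint (step 5), block until it is running, return its URL."""
    # Imported lazily so the config helper works without the library installed.
    from huggingface_hub import create_inference_endpoint

    endpoint = create_inference_endpoint(**config)
    endpoint.wait()  # polls until the endpoint is deployed
    return endpoint.url
```

Calling `deploy(build_endpoint_config("my-endpoint", "gpt2"))` would start a billable endpoint, so it is left out of the block itself.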

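```python
import requests

# Replace with the URL shown for your deployed endpoint.
API_URL = "https://your-endpoint.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer hf_..."}

response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": "What is machine learning?",
        "parameters": {"max_new_tokens": 200},
    },
    timeout=60,
)
response.raise_for_status()  # raise on 4xx/5xx instead of printing an error body
print(response.json())
```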
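
An endpoint that has scaled down to zero replicas needs time to start again, and requests sent during that window may fail. The following is a minimal sketch of retrying with exponential backoff. It assumes, without confirmation from this page, that an endpoint that is still starting answers with HTTP 502 or 503. The names `post_with_retry`, `send`, and `RETRIABLE_STATUSES` are hypothetical.

```python
import time
from typing import Any, Callable, Tuple

# Assumption: these status codes mean "endpoint not ready yet"; adjust to what you observe.
RETRIABLE_STATUSES = frozenset({502, 503})


def post_with_retry(
    send: Callable[[], Tuple[int, Any]],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call send() until it returns a non-retriable status or retries run out."""
    status, body = send()
    for attempt in range(retries):
        if status not in RETRIABLE_STATUSES:
            break
        sleep(base_delay * 2 ** attempt)  # delay doubles after each failed attempt
        status, body = send()
    return status, body
```

To use it with the request above, wrap the `requests.post(...)` call in a function that returns `(response.status_code, response.json())` and pass that function as `send`.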

| Instance     | GPU  | Cost/hr |
|--------------|------|---------|
| CPU (small)  | None | $0.06   |
| GPU (small)  | T4   | $0.60   |
| GPU (medium) | A10G | $1.30   |
| GPU (large)  | A100 | $6.50   |
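
The hourly rates in the table translate into a running cost once you fix the number of replicas and the hours they stay up. The following is a rough sketch, assuming cost scales linearly with replica-hours and ignoring any finer billing granularity. The names `HOURLY_RATE_USD` and `estimate_cost` are hypothetical, and the rates are copied from the table above.

```python
# Hourly rates in USD, copied from the pricing table.
HOURLY_RATE_USD = {
    "CPU (small)": 0.06,
    "GPU (small)": 0.60,
    "GPU (medium)": 1.30,
    "GPU (large)": 6.50,
}


def estimate_cost(instance: str, hours: float, replicas: int = 1) -> float:
    """Estimated cost in USD: hourly rate x hours x replicas, rounded to cents."""
    return round(HOURLY_RATE_USD[instance] * hours * replicas, 2)


# Example: one small GPU replica running for about a month (730 hours).
print(estimate_cost("GPU (small)", hours=730))
```

An endpoint that scales to zero when idle accrues fewer replica-hours, so passing only the hours it actually serves traffic gives a closer estimate.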

- Auto-scaling (0 to N replicas)

- Private networking (VPC)

- Custom Docker containers

- Monitoring and logging

- Blue/green deployments