Deploying Models with Inference Endpoints
Deploy any Hugging Face model as a production API in minutes.
Quick Setup
1. Go to the [Inference Endpoints console](https://ui.endpoints.huggingface.co/)
2. Select your model from the Hub
3. Choose instance type (GPU/CPU)
4. Select region (US, EU, Asia)
5. Deploy. The API is typically ready in 2-5 minutes
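
The same steps can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` Python library is installed and you are logged in; it uses that library's `create_inference_endpoint` function. The helper names `build_endpoint_config` and `deploy` are hypothetical, and the endpoint name, model, vendor, region, and instance values are placeholder assumptions to replace with the options your account actually offers.

```python
def build_endpoint_config(name, repository, accelerator="cpu", region="us-east-1"):
    """Collect the choices from the setup steps into keyword arguments.

    All default values are illustrative assumptions, not recommendations.
    """
    return {
        "name": name,                # endpoint name (your choice)
        "repository": repository,    # step 2: model on the Hub
        "framework": "pytorch",
        "task": "text-generation",
        "accelerator": accelerator,  # step 3: "cpu" or "gpu"
        "instance_size": "x2",       # assumption: depends on the provider catalog
        "instance_type": "intel-icl",
        "vendor": "aws",
        "region": region,            # step 4: deployment region
        "type": "protected",
    }


def deploy(config):
    """Step 5: create the endpoint and block until it is running.

    Calling this creates a billable resource.
    """
    # Imported lazily so the config helper works without the library installed.
    from huggingface_hub import create_inference_endpoint

    params = dict(config)
    name = params.pop("name")
    endpoint = create_inference_endpoint(name, **params)
    endpoint.wait()  # poll until the endpoint is ready
    return endpoint.url


if __name__ == "__main__":
    config = build_endpoint_config("my-endpoint-name", "gpt2")
    print(config)
    # Uncomment to actually deploy (this incurs cost):
    # print(deploy(config))
```

Keeping the configuration separate from the deploy call makes it easy to review or version the settings before anything billable is created.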
Using the API
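```python
import requests

# Placeholders: use your endpoint's URL and your own access token.
API_URL = "https://your-endpoint.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer hf_..."}

# "inputs" is the prompt; "parameters" carries generation options.
response = requests.post(API_URL, headers=headers, timeout=60, json={
    "inputs": "What is machine learning?",
    "parameters": {"max_new_tokens": 200},
})
response.raise_for_status()  # surface HTTP errors instead of parsing an error body
print(response.json())
```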
Pricing
| Instance | GPU | Cost/hr |
|----------|-----|--------|
| CPU (small) | None | $0.06 |
| GPU (small) | T4 | $0.60 |
| GPU (medium) | A10G | $1.30 |
| GPU (large) | A100 | $6.50 |
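
To turn the hourly rates above into a budget figure, a rough estimate can be computed as below. This is a sketch under simple assumptions: cost scales linearly with running hours and replica count, and the rates are the ones listed in the table (check current pricing before relying on them). The function name `estimate_cost` is hypothetical.

```python
# Hourly rates in USD, copied from the pricing table above.
HOURLY_RATES_USD = {
    "CPU (small)": 0.06,
    "GPU (small)": 0.60,
    "GPU (medium)": 1.30,
    "GPU (large)": 6.50,
}


def estimate_cost(instance, hours, replicas=1):
    """Estimate cost in USD, assuming linear billing per replica-hour."""
    return round(HOURLY_RATES_USD[instance] * hours * replicas, 2)


if __name__ == "__main__":
    # One small GPU replica running for a 730-hour month.
    print(estimate_cost("GPU (small)", hours=730))
    # Two medium GPU replicas running for one day.
    print(estimate_cost("GPU (medium)", hours=24, replicas=2))
```

An endpoint that scales to zero accrues fewer running hours, so the `hours` argument should reflect actual uptime rather than wall-clock time.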
Features
- Auto-scaling (0 to N replicas)
- Private networking (VPC)
- Custom Docker containers
- Monitoring and logging
- Blue/green deployments
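
Because auto-scaling can reduce an endpoint to zero replicas, the first request after an idle period may arrive before a replica is ready. A client can handle this by retrying with exponential backoff. The sketch below is illustrative: the helper name `call_with_retry` is hypothetical, and the set of retryable status codes (502 and 503) is an assumption to adjust to what your endpoint actually returns during a cold start. The `send` argument is any function returning a `(status_code, body)` pair, so the logic can be tried without a live endpoint.

```python
import time

# Assumption: status codes that indicate the endpoint is still starting up.
RETRYABLE_STATUS = {502, 503}


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    Returns the last (status_code, body) pair received.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUS:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body


if __name__ == "__main__":
    # Simulated endpoint: unavailable twice, then succeeds.
    replies = iter([(503, None), (503, None), (200, {"ok": True})])
    status, body = call_with_retry(lambda: next(replies), base_delay=0.01)
    print(status, body)
```

With `requests`, `send` could wrap the POST call and return `(response.status_code, response)`.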