If you’re using LangChain to load the deepseek-ai/DeepSeek-R1-Distill-Llama-70B model and getting a `RuntimeError: API error: 403`, the likely cause is that the model cannot be served automatically — a 403 from the Hugging Face API typically means the request was refused, for example because the model is gated, your token is missing or lacks permission, or the model is too large for the free Inference API. Here’s how to fix it:
Since DeepSeek-R1-Distill-Llama-70B weighs in at roughly 141GB, running it locally is rarely feasible. Instead, try one of the smaller R1 distills, such as:
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
You can change your langchain code like this:
```python
import torch
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # Use a smaller model

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model with model parallelism
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # Distributes layers across available GPUs
    torch_dtype=torch.float16,
)
# Build a text-generation pipeline and wrap it for LangChain
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
print(llm("What is LangChain?"))
```
To reduce memory usage, load the model in 4-bit or 8-bit mode using bitsandbytes:
```python
import torch
from langchain.llms import HuggingFacePipeline
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig,
)

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # Smaller variant

# Use quantization to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # Compute in fp16
)

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
print(llm("Tell me a joke"))
```
Instead of running the model locally, you can use an API-based approach with langchain.
Use the Hugging Face Inference API:
```python
from langchain.llms import HuggingFaceHub

llm = HuggingFaceHub(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
)
print(llm("Explain quantum computing"))
```
Make sure you get a token from Hugging Face. Note that the free serverless Inference API may refuse to serve very large models (which can itself surface as a 403), so the 70B model may require a dedicated Inference Endpoint.
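Rather than pasting the token into your code, you can store it locally or export it as an environment variable, which LangChain's Hugging Face integrations pick up automatically:

```shell
# huggingface-cli ships with the huggingface_hub package
huggingface-cli login    # interactive; stores the token locally

# Or export it for the current shell session:
export HUGGINGFACEHUB_API_TOKEN="hf_your_token_here"
```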
DeepSeek also provides an API to use their models without needing to host them yourself. Check their official documentation for details.
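As a minimal sketch of that route: DeepSeek's hosted API is OpenAI-compatible, so a chat completion is a plain JSON POST. The endpoint URL and model name below are assumptions — verify them against DeepSeek's official docs before relying on them.

```python
import json
import urllib.request

def ask_deepseek(prompt: str, api_key: str) -> str:
    """Send one chat message to DeepSeek's hosted API (OpenAI-compatible).

    Endpoint and model name are assumptions; check DeepSeek's docs.
    """
    payload = {
        "model": "deepseek-reasoner",  # hosted R1 model (assumed name)
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        "https://api.deepseek.com/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires a real key):
# print(ask_deepseek("What is LangChain?", "YOUR_DEEPSEEK_API_KEY"))
```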
If you have enough GPU memory, vLLM is an efficient way to run large models.
```python
from langchain.llms import VLLM

llm = VLLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    tensor_parallel_size=4,   # Shard across GPUs; adjust to your hardware
    trust_remote_code=True,
)
print(llm("What is the meaning of life?"))
```
This method speeds up inference using optimized tensor-parallel execution.
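You can also run vLLM as a standalone OpenAI-compatible server and point any OpenAI-style client at it — a sketch, assuming an 8-GPU node (adjust `--tensor-parallel-size` to your hardware):

```shell
# Launch vLLM's OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --tensor-parallel-size 8
```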
- If you must use DeepSeek-R1-Distill-Llama-70B, ensure you have enough GPU memory (on the order of 140GB of VRAM, sharded across GPUs).
- Using a smaller distill such as DeepSeek-R1-Distill-Llama-8B or DeepSeek-R1-Distill-Qwen-7B is more practical.
- API-based solutions are the easiest and do not require high-end GPUs.
- Quantization (4-bit/8-bit) can significantly reduce memory usage.
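To see why quantization matters, here is the back-of-envelope arithmetic for a 70B-parameter model's weights alone (activations and KV cache need extra memory on top):

```python
# Approximate weight memory for a 70B-parameter model at various precisions
params = 70e9

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gigabytes = params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{label}: ~{gigabytes:.0f} GB")
# fp16: ~140 GB, int8: ~70 GB, int4: ~35 GB
```

So 4-bit quantization cuts the 70B model's weights from roughly 140GB to about 35GB — still multi-GPU territory, which is why the smaller distills remain the practical local option.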
Would you like me to check if DeepSeek provides a cloud-based API for inference? 🚀