Yes, you can store both text and metadata in a CSV file, and it’s a common practice for structured storage before embedding. Here’s how you can structure your CSV:
| id | text | source | author | date |
|---|---|---|---|---|
| 1 | "This is the first document text." | doc1.pdf | John Doe | 2024-01-29 |
| 2 | "Another document with different info." | doc2.txt | Jane Doe | 2024-01-30 |
✅ Each row represents a separate document or chunk of text.
✅ Columns include text and metadata such as source, author, and date.
✅ You can add more metadata like category, tags, or embeddings later.
✅ Works well with pandas in Python (pd.read_csv() makes it easy to process).
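As a rough sketch of that layout, the snippet below writes the two sample rows to CSV and reads them back. It uses Python's built-in csv module and an in-memory buffer so it runs without extra dependencies or files; the `rows` and `documents` names are just illustrative, and with pandas the read step would be the pd.read_csv() call mentioned above.

```python
import csv
import io

# Sample rows mirroring the table above (illustrative values only).
fieldnames = ["id", "text", "source", "author", "date"]
rows = [
    {"id": 1, "text": "This is the first document text.",
     "source": "doc1.pdf", "author": "John Doe", "date": "2024-01-29"},
    {"id": 2, "text": "Another document with different info.",
     "source": "doc2.txt", "author": "Jane Doe", "date": "2024-01-30"},
]

# Write to an in-memory buffer; for a real file, use
# open("documents.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# Read the rows back as dictionaries keyed by column name.
buffer.seek(0)
documents = list(csv.DictReader(buffer))
for doc in documents:
    print(doc["id"], doc["source"], doc["text"])
```

Note that the csv module returns every value as a string (so id comes back as "1"), and it automatically quotes any text field that contains commas or line breaks.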
- Read CSV in Python:
import pandas as pd
df = pd.read_csv("documents.csv")
- Generate and Store Embeddings:
- Use an embedding model (e.g., OpenAI or Hugging Face) to generate vector embeddings for the text column.
- Store the embeddings separately in a vector store (like Pinecone, FAISS, or Weaviate), keyed by the same id so each vector can be matched back to its metadata.
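Here is a minimal sketch of how the two steps above could fit together. Everything in it is a stand-in: `toy_embed` is a hypothetical placeholder that hashes text into a small vector (a real setup would call an actual embedding model), and the `vector_store` dictionary only imitates what a vector database would do. The point is just to show embeddings and metadata living in separate places, linked by id.

```python
import hashlib
import math

def toy_embed(text, dim=8):
    """Hypothetical stand-in for a real embedding model.

    Hashes the text into a deterministic unit-length vector. It carries no
    semantic meaning; it only lets the rest of the pipeline run end to end.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [byte / 255.0 - 0.5 for byte in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Rows as they would come out of the CSV (illustrative values).
documents = [
    {"id": "1", "text": "This is the first document text.", "source": "doc1.pdf"},
    {"id": "2", "text": "Another document with different info.", "source": "doc2.txt"},
]

# Embeddings live in the "vector store"; metadata stays in its own lookup.
vector_store = {doc["id"]: toy_embed(doc["text"]) for doc in documents}
metadata_by_id = {doc["id"]: doc for doc in documents}

def search(query, top_k=1):
    """Rank stored vectors by cosine similarity to the query.

    All vectors are unit length, so the dot product equals the cosine.
    """
    q = toy_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), doc_id)
        for doc_id, vec in vector_store.items()
    ]
    scored.sort(reverse=True)
    return [
        (doc_id, score, metadata_by_id[doc_id]["source"])
        for score, doc_id in scored[:top_k]
    ]

print(search("This is the first document text."))
```

A real vector database replaces the brute-force loop in `search` with an index built for fast nearest-neighbor lookup, which is where the efficiency gain comes from.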
While you can store embeddings in a CSV, it’s not ideal: each embedding is a vector of hundreds or thousands of floats, so the file grows quickly and every value has to be parsed back from text. If necessary:
- Store them as a JSON string in a column:
"[0.12, 0.85, -0.33, 0.44, ...]"
- Use json.loads() to convert them back to a list in Python.
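The round trip described above might look like this. It is a small sketch using only the standard library, with a short made-up four-value vector standing in for a real embedding and an in-memory buffer instead of a file.

```python
import csv
import io
import json

# Illustrative four-value vector; real embeddings are much longer.
embedding = [0.12, 0.85, -0.33, 0.44]

# Serialize the vector to a JSON string and write it as one CSV column.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "text", "embedding"])
writer.writeheader()
writer.writerow({
    "id": 1,
    "text": "This is the first document text.",
    "embedding": json.dumps(embedding),
})

# Read the row back: the column is a plain string until json.loads() parses it.
buffer.seek(0)
row = next(csv.DictReader(buffer))
restored = json.loads(row["embedding"])
print(type(row["embedding"]).__name__, restored)
```

Because the JSON string contains commas, the csv writer wraps it in quotes so it still occupies a single column.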
🔹 Best approach: Store text and metadata in a CSV and keep embeddings in a vector database.
🔹 CSV alone works for basic use cases but is not efficient for large-scale vector searches.
Would you like help with a Python script to generate embeddings and store them efficiently? 🚀