Notion DB Parsing API

2025-01-29 12:15 UTC gpt-4o-mini

To parse your Notion database and extract the content and metadata from both the individual items and their linked source materials, we can follow a step-by-step approach:

Steps to Implement:

Access Notion API: First, you need to ensure that you have access to the Notion API to fetch the data. You’ll need to create an integration in your Notion workspace and get the appropriate token.
Fetch Data from Notion: Using the Notion API, we’ll fetch the database records.
Extract Metadata and Text Blocks: For each item, retrieve the text blocks and their metadata (such as properties, links, authors, etc.).
Handle Linked Source Material: Each item has a link to another database (e.g., “IONIQ_RAG_SOURCES”), and we need to fetch metadata from this source.
Flatten the Data: Convert all the data into a flat structure, as shown in your example (one item in the list per object).
Store in JSON Format: Once the data is organized, write it into a JSON file.
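
A note on step 3: the database query returns page properties, not page content. Text blocks come from a separate endpoint (GET /v1/blocks/{block_id}/children, with the page ID as the block ID). Here is a minimal sketch of turning that response into plain text. It assumes each block keeps its text in a rich_text list under its own type key, as in API version 2022-06-28 (older versions call the key text). The name blocks_to_text and the sample payload are made up for illustration.

```python
def blocks_to_text(children_response):
    """Collect plain text from a block-children response, one string per block."""
    lines = []
    for block in children_response.get("results", []):
        block_type = block.get("type")        # e.g. "paragraph", "heading_1"
        body = block.get(block_type) or {}    # type-specific payload
        # 2022-06-28 uses "rich_text"; fall back to the older "text" key
        fragments = body.get("rich_text") or body.get("text") or []
        text = "".join(f.get("plain_text", "") for f in fragments)
        if text:
            lines.append(text)
    return lines


# Hypothetical response, trimmed to the fields used above
sample = {
    "results": [
        {"type": "heading_1", "heading_1": {"rich_text": [{"plain_text": "Intro"}]}},
        {"type": "paragraph", "paragraph": {"rich_text": [
            {"plain_text": "First "}, {"plain_text": "paragraph."}]}},
        {"type": "divider", "divider": {}},  # blocks without text are skipped
    ]
}
print(blocks_to_text(sample))
```

Nested blocks (toggles, list items with children) would need a recursive call per child block; the sketch only reads the top level.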

Python Code Example

import requests
import json

# Define Notion API token and the database IDs
NOTION_API_TOKEN = 'your_notion_api_token_here'
DATABASE_ID_ITEMS = 'your_database_id_for_items_here'
DATABASE_ID_SOURCES = 'your_database_id_for_sources_here'

# Set up the Notion API headers
headers = {
    "Authorization": f"Bearer {NOTION_API_TOKEN}",
    "Content-Type": "application/json",
    "Notion-Version": "2021-05-13",  # You can update to the latest version if needed
}

# Helper function to make Notion API requests
def notion_request(url, params=None):
    response = requests.post(url, headers=headers, json=params)
    response.raise_for_status()  # Raise an exception for any failed request
    return response.json()

# Fetch items from the Items database
def fetch_items():
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID_ITEMS}/query"
    data = notion_request(url)
    items = []

    for result in data.get('results', []):
        props = result['properties']
        item = {
            "id": result['id'],
            # "or [{}]" also covers an empty list, where a bare [0] would raise IndexError
            "name": (props.get('Name', {}).get('title') or [{}])[0].get('text', {}).get('content'),
            "author": (props.get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content'),
            "source": (props.get('IONIQ_RAG_SOURCES', {}).get('relation') or [{}])[0].get('id'),
            "year": props.get('Year', {}).get('number'),
            # "select" is null when no option is chosen
            "type": (props.get('Type', {}).get('select') or {}).get('name')
        }
        items.append(item)
    return items

# Fetch the metadata for the source material
def fetch_source_metadata(source_id):
    url = f"https://api.notion.com/v1/pages/{source_id}"
    # Retrieving a page is a GET request; notion_request() always sends a POST
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    props = response.json()['properties']
    source_metadata = {
        "title": (props.get('Full_Title', {}).get('title') or [{}])[0].get('text', {}).get('content'),
        "author": (props.get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content'),
        "type": (props.get('Type', {}).get('select') or {}).get('name'),
        "year": props.get('Year', {}).get('number'),
        "url": props.get('URL', {}).get('url'),
    }
    return source_metadata

# Combine data and flatten it
def combine_data():
    items = fetch_items()
    all_data = []

    for item in items:
        # Skip the lookup when the item has no linked source
        source_metadata = fetch_source_metadata(item['source']) if item['source'] else None
        item_data = {
            "item_id": item['id'],
            "item_name": item['name'],
            "item_author": item['author'],
            "item_year": item['year'],
            "item_type": item['type'],
            "source": source_metadata
        }
        all_data.append(item_data)

    return all_data

# Write the flattened data into a JSON file
def save_to_json():
    all_data = combine_data()
    with open('notion_data.json', 'w') as outfile:
        json.dump(all_data, outfile, indent=4)

# Run the function to save the data
save_to_json()

What this code does:

Notion API Access: Uses your Notion API token and database IDs to fetch data.
Fetch Items: Retrieves items from the first database (your items list).
Fetch Source Metadata: For each item, fetches the corresponding source material metadata.
Flatten the Data: Combines the data in a flat structure (with item details and its source metadata).
Save JSON: Writes the resulting data into a notion_data.json file.
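
One caveat on Fetch Items: the query endpoint returns results in pages (at most 100 per response) and signals further data through has_more and next_cursor. A single call therefore only sees the first page of a larger database. Here is a rough sketch of the loop. The function name query_all_pages is my own, and the sender is passed in as a callable so the loop can be tried offline with a fake.

```python
def query_all_pages(post_json, url, payload=None):
    """Follow next_cursor until has_more is false and return every result.

    post_json is any callable (url, body) -> dict, for example a thin
    wrapper around requests.post(...).json().
    """
    body = dict(payload or {})
    results = []
    while True:
        data = post_json(url, body)
        results.extend(data.get("results", []))
        if not data.get("has_more"):
            return results
        body["start_cursor"] = data["next_cursor"]


# Offline stand-in for the API: two pages of fake results
def fake_post(url, body):
    if "start_cursor" not in body:
        return {"results": [{"id": "a"}, {"id": "b"}],
                "has_more": True, "next_cursor": "cursor-1"}
    return {"results": [{"id": "c"}], "has_more": False, "next_cursor": None}


all_rows = query_all_pages(fake_post, "https://api.notion.com/v1/databases/<id>/query")
print([row["id"] for row in all_rows])
```

Since notion_request(url, params) has the same (url, body) shape, it could be passed in place of fake_post.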

Prerequisites:

Replace the placeholders for NOTION_API_TOKEN, DATABASE_ID_ITEMS, and DATABASE_ID_SOURCES with your actual Notion API token and database IDs.
Install the required Python libraries by running:
```
pip install requests
```

Next Steps:

After running this script, you’ll have the flattened data in a JSON file, which can then be further processed or analyzed.
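
Note that combine_data() keeps the source metadata nested under a "source" key. If you want each record to be fully flat, one option is to merge the nested dict into the item under prefixed keys. This is only a sketch; the function name flatten_record and the source_ prefix are my own choices.

```python
def flatten_record(record, nested_key="source", prefix="source_"):
    """Return a copy of record with the nested dict merged in under prefixed keys."""
    flat = {k: v for k, v in record.items() if k != nested_key}
    nested = record.get(nested_key) or {}   # tolerate a missing or None source
    for key, value in nested.items():
        flat[prefix + key] = value
    return flat


example = {
    "item_id": "abc",
    "item_name": "Note 1",
    "source": {"title": "Some Book", "year": 2020},
}
print(flatten_record(example))
```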

Sure! In the case where the author might be None, we should handle it gracefully. We can check if the author field exists and if not, assign a default value (such as None or an empty string). Here’s how you can modify the code to handle the case where the author might be missing:

Updated Python Code:

import requests
import json

# Define Notion API token and the database IDs
NOTION_API_TOKEN = 'your_notion_api_token_here'
DATABASE_ID_ITEMS = 'your_database_id_for_items_here'
DATABASE_ID_SOURCES = 'your_database_id_for_sources_here'

# Set up the Notion API headers
headers = {
    "Authorization": f"Bearer {NOTION_API_TOKEN}",
    "Content-Type": "application/json",
    "Notion-Version": "2021-05-13",  # You can update to the latest version if needed
}

# Helper function to make Notion API requests
def notion_request(url, params=None):
    response = requests.post(url, headers=headers, json=params)
    response.raise_for_status()  # Raise an exception for any failed request
    return response.json()

# Fetch items from the Items database
def fetch_items():
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID_ITEMS}/query"
    data = notion_request(url)
    items = []

    for result in data.get('results', []):
        props = result['properties']
        item = {
            "id": result['id'],
            # "or [{}]" also covers an empty list, where a bare [0] would raise IndexError
            "name": (props.get('Name', {}).get('title') or [{}])[0].get('text', {}).get('content'),
            "author": (props.get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content'),  # None when the author is missing or empty
            "source": (props.get('IONIQ_RAG_SOURCES', {}).get('relation') or [{}])[0].get('id'),
            "year": props.get('Year', {}).get('number'),
            # "select" is null when no option is chosen
            "type": (props.get('Type', {}).get('select') or {}).get('name')
        }
        items.append(item)
    return items

# Fetch the metadata for the source material
def fetch_source_metadata(source_id):
    url = f"https://api.notion.com/v1/pages/{source_id}"
    # Retrieving a page is a GET request; notion_request() always sends a POST
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    props = response.json()['properties']
    source_metadata = {
        "title": (props.get('Full_Title', {}).get('title') or [{}])[0].get('text', {}).get('content'),
        "author": (props.get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content'),
        "type": (props.get('Type', {}).get('select') or {}).get('name'),
        "year": props.get('Year', {}).get('number'),
        "url": props.get('URL', {}).get('url'),
    }
    return source_metadata

# Combine data and flatten it
def combine_data():
    items = fetch_items()
    all_data = []

    for item in items:
        # Skip the lookup when the item has no linked source
        source_metadata = fetch_source_metadata(item['source']) if item['source'] else None
        item_data = {
            "item_id": item['id'],
            "item_name": item['name'],
            "item_author": item['author'] if item['author'] else "Unknown",  # Handle None or missing author
            "item_year": item['year'],
            "item_type": item['type'],
            "source": source_metadata
        }
        all_data.append(item_data)

    return all_data

# Write the flattened data into a JSON file
def save_to_json():
    all_data = combine_data()
    with open('notion_data.json', 'w') as outfile:
        json.dump(all_data, outfile, indent=4)

# Run the function to save the data
save_to_json()

Changes Made:

Handle Missing Author:
- In the fetch_items() function, when extracting the author, I added a fallback value of None if the author property doesn’t exist.
- When combining the data (combine_data()), I check if the author is None. If it is, I assign a default value like "Unknown" or you can leave it as None based on your preference.

This approach ensures that if the author field is missing for any item, the script won’t fail, and the output will include a default value like "Unknown" or None.
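
One thing to watch: a chained lookup of the form .get('rich_text', [{}])[0] only guards against a missing key. Notion returns an empty list for a blank text property, and indexing [0] on an empty list raises IndexError. A small helper can cover the missing, null, and empty cases in one place. This is a sketch; the name first_text is my own.

```python
def first_text(prop, kind, default=None):
    """Return the text of the first fragment in a title or rich_text property."""
    fragments = (prop or {}).get(kind) or []   # missing key, None and [] all become []
    if not fragments:
        return default
    first = fragments[0]
    # Rich text objects carry plain_text; fall back to text.content
    return first.get("plain_text") or (first.get("text") or {}).get("content", default)


print(first_text({"rich_text": [{"text": {"content": "Jane"}}]}, "rich_text"))
print(first_text({"rich_text": []}, "rich_text", "Unknown"))
print(first_text(None, "rich_text", "Unknown"))
```

With this in place, the author lookup could read first_text(result['properties'].get('Author'), 'rich_text', 'Unknown').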

Let me know if you’d like further adjustments!

The error 400 Client Error: Bad Request indicates that there is an issue with how the API request is being made. Specifically, the /v1/pages/{page_id} endpoint is used for retrieving a single page by its ID. However, if you’re working with a database and need to query for records (or retrieve related records from another database), you’ll need to use the appropriate API endpoints.

The issue seems to be that you’re trying to retrieve data from the source database as if it’s a page, but since the source is another database, you might need to query it differently.

Correcting the Request

Instead of calling the /v1/pages/{page_id} endpoint, you should call the /v1/databases/{database_id}/query endpoint to retrieve all items or specific pages from the IONIQ_RAG_SOURCES database.

Here’s how you can modify the code to correctly fetch data from the source database:

Updated Code to Query the Source Database:

# Fetch the metadata for the source material by querying the database
def fetch_source_metadata(source_id):
    # Query the source database
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID_SOURCES}/query"

    # Construct the query filter to retrieve the page using the source_id
    query_filter = {
        "filter": {
            "property": "ID",  # Assuming the field name for the page ID is "ID" (change this if necessary)
            "rich_text": {
                "equals": source_id  # Match the source ID in the source database
            }
        }
    }

    # Make the request to fetch the source metadata
    data = notion_request(url, query_filter)

    # Check if any result was found
    if not data.get("results"):
        raise ValueError(f"No source found for ID {source_id}")

    source_data = data["results"][0]  # Assuming the ID is unique, we take the first result

    # Extract and return source metadata
    # "or [{}]" / "or {}" also cover empty lists and null selects, which would otherwise raise
    source_metadata = {
        "title": (source_data['properties'].get('Full_Title', {}).get('title') or [{}])[0].get('text', {}).get('content'),
        "author": (source_data['properties'].get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content'),
        "type": (source_data['properties'].get('Type', {}).get('select') or {}).get('name'),
        "year": source_data['properties'].get('Year', {}).get('number'),
        "url": source_data['properties'].get('URL', {}).get('url'),
    }
    return source_metadata

Key Changes:

Querying the Source Database: Instead of calling the /v1/pages/{page_id} endpoint, we use /v1/databases/{database_id}/query to query the IONIQ_RAG_SOURCES database.
Using a Filter: The filter is used to narrow down the search to the specific source based on its ID. Make sure to adjust the property name ("ID") and value (source_id) to match the actual fields in your source database.

Why This Works:

The /v1/databases/{database_id}/query endpoint allows you to query a database and retrieve records. You can filter these records based on a specific property (like ID in this case).
Since the source material is stored in another database, you query that database and match the source_id using the filter.

Troubleshooting:

If you continue to encounter errors, check the exact structure of your source database to ensure you’re filtering on the correct property name (e.g., ID, Full_Title, etc.). You can explore your source database and adjust the code accordingly.

Testing:

Once you update the fetch_source_metadata function, you can test it again to retrieve the source metadata from the database, and it should work as expected.

Ah, I see! If you’re looking up the pages by their actual database ID (which is the ID of a record in the source database, not a custom property), then the request needs to be made differently.

When you are dealing with linked database records (via relations), you can use the page ID directly without needing to query by properties. The issue here is how to properly fetch the page’s metadata.

For the pages in the source database, you don’t need to query them via the database query endpoint if you already have the IDs of the individual pages that are referenced in your IONIQ_RAG_ITEMS database. You can directly use the /v1/pages/{page_id} endpoint to retrieve the metadata of a specific page, using the actual page ID (the ID of the record in the source database).

Here’s how you should modify your code to fetch metadata from the source database by page ID:

# Fetch the metadata for the source material by directly using the source page ID
def fetch_source_metadata(source_id):
    url = f"https://api.notion.com/v1/pages/{source_id}"

    # Make the request to fetch the source metadata.
    # Retrieving a page is a GET request; notion_request() always sends a POST.
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    data = response.json()

    # Extract and return the source metadata
    # ("or [{}]" / "or {}" also cover empty lists and null selects)
    source_metadata = {
        "title": (data['properties'].get('Full_Title', {}).get('title') or [{}])[0].get('text', {}).get('content'),
        "author": (data['properties'].get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content'),
        "type": (data['properties'].get('Type', {}).get('select') or {}).get('name'),
        "year": data['properties'].get('Year', {}).get('number'),
        "url": data['properties'].get('URL', {}).get('url'),
    }
    return source_metadata

Explanation:

/v1/pages/{page_id}: This endpoint is used to retrieve metadata of a specific page in a database using the actual page ID.
The source_id you’re passing here should be the actual ID of the page from the IONIQ_RAG_SOURCES database (which you’ve already gotten from the linked relation in IONIQ_RAG_ITEMS).

This way, you don’t need to query the entire database; you just fetch each page by its unique ID.

Example Flow:

You query IONIQ_RAG_ITEMS to get each item’s associated source_id (the actual ID of the page in the IONIQ_RAG_SOURCES database).
For each item, you use this source_id directly in the /v1/pages/{page_id} endpoint to fetch the corresponding metadata from the source.
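
Keep in mind that a relation property can hold more than one linked page, so taking only the first entry may drop sources. Here is a small sketch that collects every ID from a relation property. The helper name related_ids is hypothetical; the property name comes from your database.

```python
def related_ids(properties, relation_name):
    """Return every page ID stored in a relation property (possibly empty)."""
    relation = (properties.get(relation_name) or {}).get("relation") or []
    return [entry["id"] for entry in relation if "id" in entry]


props = {"IONIQ_RAG_SOURCES": {"relation": [{"id": "id-1"}, {"id": "id-2"}]}}
print(related_ids(props, "IONIQ_RAG_SOURCES"))
print(related_ids({}, "IONIQ_RAG_SOURCES"))
```

Each returned ID can then be passed to fetch_source_metadata() in turn.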

Testing:

With this approach, you should be able to fetch the source metadata correctly by passing the page ID directly into the request. This should work unless there are other issues with your page IDs or Notion API access.

Let me know if this resolves your issue!

I see! If you’re getting an error even with the /v1/pages/{page_id} endpoint, there may be a few potential issues causing it. Let’s go through some troubleshooting steps to resolve the issue:

Common Issues and Fixes:

Permissions Issue:
- Ensure that your Notion API token has access to both the IONIQ_RAG_ITEMS and IONIQ_RAG_SOURCES databases. If the token doesn’t have the correct permissions, you might not be able to access the pages.
Wrong Page ID Format:
- Check that the page IDs are in the correct format. Notion page IDs are UUIDs: 32 hexadecimal characters, with or without hyphens (8-4-4-4-12). Make sure the ID you pass is the page's own ID and not, for example, the database ID.
API Version Mismatch:
- Response shapes can differ between API versions. Make sure the Notion-Version header is set to a valid version that your code was written against; you can update it to the latest supported version.
Missing or Incorrect Notion API Token:
- Ensure that your API token is valid and that it’s being passed correctly in the headers.
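
To rule out the ID format quickly, you can run the ID through Python's uuid module before making any request. This is a minimal sketch, assuming the IDs are standard UUIDs as described above; the function name normalize_page_id is my own.

```python
import uuid


def normalize_page_id(raw):
    """Return the hyphenated form of a Notion ID, or None if it is not a valid UUID."""
    try:
        return str(uuid.UUID(raw.strip()))
    except (ValueError, AttributeError, TypeError):
        # ValueError: malformed string; AttributeError/TypeError: not a string at all
        return None


print(normalize_page_id("18a6716ca1e880998900c3d0df8b5cdd"))
print(normalize_page_id("not-a-page-id"))
print(normalize_page_id(None))
```

A None result (for example when the relation was empty) is a sign that the request should be skipped rather than sent.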

Let’s go step-by-step with some additional error handling and checks.

1. Check for API Token and Permissions:

Make sure that the Notion API integration has access to both databases.

2. Enhanced Debugging for Fetching a Single Page:

I’ll update the function to include better error handling and debug information. This will give more context when an error occurs:

Updated `fetch_source_metadata` Function:

# Fetch the metadata for the source material by directly using the source page ID
def fetch_source_metadata(source_id):
    url = f"https://api.notion.com/v1/pages/{source_id}"

    try:
        # Make the request to fetch the source metadata.
        # Retrieving a page is a GET request; notion_request() always sends a POST.
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        data = response.json()

        # Check if the response contains the expected data
        if not data:
            raise ValueError(f"Empty response from Notion for page ID {source_id}")

        # Extract and return the source metadata
        # ("or" also covers empty lists and null values, which .get() defaults do not)
        props = data['properties']
        source_metadata = {
            "title": (props.get('Full_Title', {}).get('title') or [{}])[0].get('text', {}).get('content', "No Title"),
            "author": (props.get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content', "Unknown Author"),
            "type": (props.get('Type', {}).get('select') or {}).get('name', "Unknown Type"),
            "year": props.get('Year', {}).get('number') or "Unknown Year",
            "url": props.get('URL', {}).get('url') or "No URL"
        }

        return source_metadata

    except requests.exceptions.RequestException as e:
        print(f"Error fetching source data for page ID {source_id}: {e}")
        return None
    except ValueError as e:
        print(f"Value error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

Key Changes:

Error Handling: Added try-except blocks to handle various types of errors, including network issues, unexpected responses, and missing values.
Fallback Values: If any of the properties (like Full_Title, Author, etc.) are missing, it falls back to a default value (e.g., "No Title", "Unknown Author"). This ensures that you always get some data back instead of an incomplete response.
Logging: Prints specific error messages to the console for better debugging.

Debugging Tips:

Check the actual ID: Confirm that the source_id you are passing is the correct ID of the source page. You can verify this in the Notion UI or by inspecting the response from the IONIQ_RAG_ITEMS database.
API Token: Ensure that the token has access to both databases. You can test this by trying to manually query both databases via the Notion API and checking the permissions of the integration.

Next Steps:

Print the source_id: Before making the request, print out the source_id to ensure that it’s being passed correctly.
```
print(f"Fetching source data for source_id: {source_id}")
```
Try the API Request Manually: You can use a tool like Postman or cURL to test the API manually by making a GET request to https://api.notion.com/v1/pages/{source_id} to see if it’s returning the expected data.
Check the Full Response: If the request fails, print out the full response from the API to gain more insight into what might be wrong.

print(f"Response: {data}")
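
When the request fails, the exception message from raise_for_status() only carries the status line. Notion's error body is JSON with code and message fields, which usually say more about what went wrong. Here is a sketch of a formatter for it. The function name describe_notion_error is my own, and the sample body is made up to match that shape.

```python
import json


def describe_notion_error(status_code, body_text):
    """Summarize an error response, using Notion's code/message fields when present."""
    try:
        body = json.loads(body_text)
    except (ValueError, TypeError):
        return f"HTTP {status_code}: {body_text!r}"   # body was not JSON
    if not isinstance(body, dict):
        return f"HTTP {status_code}: {body_text!r}"
    code = body.get("code", "unknown")
    message = body.get("message", "")
    return f"HTTP {status_code} [{code}]: {message}"


# Hypothetical error body in the shape Notion uses
sample_body = '{"object": "error", "status": 400, "code": "validation_error", "message": "path failed validation"}'
print(describe_notion_error(400, sample_body))
```

Inside an except requests.exceptions.HTTPError as e block, this could be called as describe_notion_error(e.response.status_code, e.response.text).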

Conclusion:

This method should help you identify the issue and resolve the error. If you are still encountering a 400 Bad Request, it might be due to an issue with the page ID or API permissions. Try the steps outlined above to troubleshoot. Let me know how it goes!

Yes, you can use the page ID to access the specific record in the Notion database using the Notion API. However, it’s important to note that the /v1/pages/{page_id} endpoint is specifically designed to fetch a single page’s metadata (which is a single record in a database), and it doesn’t allow querying a database using the page ID directly.

If you want to fetch data from the actual database (as opposed to a specific page), you would typically query the database and then filter results based on certain properties (such as matching the page ID) rather than directly accessing the database using the ID in a URL.

However, since your IONIQ_RAG_ITEMS database contains a relation to the IONIQ_RAG_SOURCES database, the source_id you retrieve from IONIQ_RAG_ITEMS refers to a specific page (or record) in the IONIQ_RAG_SOURCES database.

Here’s how you can query the source database (`IONIQ_RAG_SOURCES`) to fetch the page by ID:

Fetch the source database record using the source_id from the relation.
Query the source database to retrieve specific pages based on the relation.

Modified Approach to Fetch Data Based on Source ID (using relation):

You will query the source database and use the source_id you got from the IONIQ_RAG_ITEMS database to filter the results.

Example Python Code to Query the Source Database:

# Fetch metadata for a source by querying the source database with the source_id
def fetch_source_metadata(source_id):
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID_SOURCES}/query"

    # Query the source database to filter by the page_id (source_id) from the relation field
    query_filter = {
        "filter": {
            "property": "ID",  # Use the actual field name that contains the source page ID (adjust if needed)
            "rich_text": {
                "equals": source_id  # Match the source ID you have
            }
        }
    }

    try:
        # Make the request to fetch the source metadata
        data = notion_request(url, query_filter)

        # Check if any result was found
        if not data.get("results"):
            raise ValueError(f"No source found for ID {source_id}")

        source_data = data["results"][0]  # Assuming the ID is unique, we take the first result

        # Extract and return source metadata
        # "or" also covers empty lists and null values, which .get() defaults do not
        source_metadata = {
            "title": (source_data['properties'].get('Full_Title', {}).get('title') or [{}])[0].get('text', {}).get('content', "No Title"),
            "author": (source_data['properties'].get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content', "Unknown Author"),
            "type": (source_data['properties'].get('Type', {}).get('select') or {}).get('name', "Unknown Type"),
            "year": source_data['properties'].get('Year', {}).get('number') or "Unknown Year",
            "url": source_data['properties'].get('URL', {}).get('url') or "No URL"
        }

        return source_metadata

    except requests.exceptions.RequestException as e:
        print(f"Error fetching source data for source_id {source_id}: {e}")
        return None
    except ValueError as e:
        print(f"Value error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

Key Updates:

Querying the Source Database: You are now querying the source database using the /v1/databases/{database_id}/query endpoint instead of /v1/pages/{page_id}.
Filter by source_id: The filter matches a text property in the source database against the source_id. Here, I assumed the source database has a rich_text property called ID that stores the page ID (replace it with the actual property name). A page's own ID is not a property, so this only works if your database has such a field.

Debugging:

Check your source_id: Ensure that source_id is the actual page ID of the record in the IONIQ_RAG_SOURCES database.
Property Names: Double-check that you’re using the correct property names (like ID, Full_Title, etc.) in the source database.

Test the Query:

To ensure it works:

Print the Source ID before making the request:

print(f"Fetching source data for source_id: {source_id}")

Check the Raw Response: Print the full response from the API for debugging:
```
print(f"API response: {data}")
```

Summary:

This approach allows you to query the source database and filter by the source_id to retrieve specific records based on the relation. It’s an effective way to fetch metadata from a related database by querying the source records.
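
If the goal is to follow the relation itself rather than a custom ID field, the query endpoint also accepts a relation filter with a contains condition. For instance, it can find the items in IONIQ_RAG_ITEMS that link to a given source page. This is a sketch of the payload only; the helper name relation_contains_filter is hypothetical, and the property name must match your database.

```python
import json


def relation_contains_filter(property_name, page_id):
    """Build a query body matching rows whose relation property includes page_id."""
    return {
        "filter": {
            "property": property_name,
            "relation": {"contains": page_id},
        }
    }


payload = relation_contains_filter(
    "IONIQ_RAG_SOURCES", "18a6716c-a1e8-8099-8900-c3d0df8b5cdd"
)
print(json.dumps(payload, indent=2))
```

The resulting dict would be sent as the JSON body of a POST to /v1/databases/{DATABASE_ID_ITEMS}/query.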

Ah, I see! The issue likely lies in the way the URL is constructed or how the request is being made in Python. Let’s break down and troubleshoot this.

Common causes for this error:

URL Encoding: In some cases, special characters in the URL (like slashes, colons, etc.) need to be properly encoded.
Incorrect URL Format: The URL might be incorrectly constructed, causing the request to fail.
Headers or API Token: Ensure that the headers are correctly set, especially the Notion-Version and Authorization headers.
Request Method: Ensure that you’re using the correct request method (POST for querying a database).

Let’s go over the correct structure for the request and see how we can replicate it in Python.

Example cURL Command:

Here is a typical cURL command to query a Notion database:

curl -X POST https://api.notion.com/v1/databases/{DATABASE_ID}/query \
  -H "Authorization: Bearer {YOUR_API_TOKEN}" \
  -H "Notion-Version: 2022-06-28" \
  -H "Content-Type: application/json" \
  -d '{"filter": {"property": "ID", "rich_text": {"equals": "{SOURCE_ID}"}}}'

In this example:

Replace {DATABASE_ID} with your actual database ID.
Replace {YOUR_API_TOKEN} with your actual Notion API token.
Replace {SOURCE_ID} with the page ID you’re querying in the filter.

Translating to Python:

To correctly translate this to Python, ensure:

The headers are properly set.
The data (payload) is correctly structured.
The URL is correct and doesn’t have any issues.

Here’s the corrected Python code with proper error handling:

import requests
import json

# Define the Notion API token and the source database ID
NOTION_API_TOKEN = 'your_notion_api_token_here'
DATABASE_ID_SOURCES = 'your_source_database_id_here'

# Set the Notion API headers
headers = {
    "Authorization": f"Bearer {NOTION_API_TOKEN}",
    "Content-Type": "application/json",
    "Notion-Version": "2022-06-28"  # Make sure you're using the correct version
}

# Fetch metadata for a source by querying the source database using the source_id
def fetch_source_metadata(source_id):
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID_SOURCES}/query"

    # Define the query filter to search by the page ID (source_id)
    query_filter = {
        "filter": {
            "property": "ID",  # Adjust to the actual field name if different
            "rich_text": {
                "equals": source_id  # This should be the page ID you're filtering by
            }
        }
    }

    try:
        # Make the POST request to fetch the source metadata
        response = requests.post(url, headers=headers, json=query_filter)

        # Check if the request was successful
        response.raise_for_status()  # Raise an exception if there's an HTTP error

        # Parse the response JSON
        data = response.json()

        if not data.get("results"):
            raise ValueError(f"No source found for ID {source_id}")

        # Extract source metadata
        source_data = data["results"][0]
        # "or" also covers empty lists and null values, which .get() defaults do not
        source_metadata = {
            "title": (source_data['properties'].get('Full_Title', {}).get('title') or [{}])[0].get('text', {}).get('content', "No Title"),
            "author": (source_data['properties'].get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content', "Unknown Author"),
            "type": (source_data['properties'].get('Type', {}).get('select') or {}).get('name', "Unknown Type"),
            "year": source_data['properties'].get('Year', {}).get('number') or "Unknown Year",
            "url": source_data['properties'].get('URL', {}).get('url') or "No URL"
        }

        return source_metadata

    except requests.exceptions.RequestException as e:
        print(f"Error fetching source data for source_id {source_id}: {e}")
        return None
    except ValueError as e:
        print(f"Value error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

Key Points to Check:

URL Encoding: The URL is in the correct format: https://api.notion.com/v1/databases/{DATABASE_ID_SOURCES}/query. This matches what you’d use in cURL.
Headers:
- Ensure you’re using the correct Authorization header.
- The Notion-Version header should match the version you’re using (the current stable version at the time of writing is 2022-06-28, but you may need to update it if required).
Request Method:
- We’re using requests.post() to send a POST request, which is required for querying the database.
Request Data:
- The query filter is passed in the json parameter of the request, which correctly serializes it to JSON.
Error Handling:
- If the request fails, it will print the error message, which can help you troubleshoot further.

Troubleshooting Tips:

Print the Full URL: Before making the request, print the URL and the payload to check that everything is correct.
```
print(f"Request URL: {url}")
print(f"Payload: {json.dumps(query_filter, indent=2)}")
```
Check for Proper API Access: Ensure that the Notion API token has permission to access both databases.
Manual Testing: If the cURL command works, but Python doesn’t, compare the exact request headers and body to make sure they’re identical.

Conclusion:

This should replicate the cURL behavior using Python. Ensure that the headers, URL, and query parameters are correctly set, and the issue should be resolved.

Ah, now I understand the issue more clearly. You’re trying to fetch a specific page by its ID, not query the database. In this case, you need to use the /v1/pages/{page_id} endpoint directly, which is different from querying the database.

The key difference is that fetching a page by ID is a GET request to /v1/pages/{page_id} and not a POST request to /v1/databases/{database_id}/query.

Let’s fix your code accordingly.

Correct Request to Fetch a Page by ID:

The correct way to fetch a single page by its page ID is:

curl -X GET https://api.notion.com/v1/pages/{PAGE_ID} \
  -H "Authorization: Bearer {YOUR_API_TOKEN}" \
  -H "Notion-Version: 2022-06-28" \
  -H "Content-Type: application/json"

Python Code to Fetch Page by ID:

Now, here’s how you can do this in Python using the requests library.

Updated Python Code:

import requests

# Define Notion API token and page ID
NOTION_API_TOKEN = 'your_notion_api_token_here'
PAGE_ID = 'your_page_id_here'  # This should be the actual page ID you want to fetch

# Set the Notion API headers
headers = {
    "Authorization": f"Bearer {NOTION_API_TOKEN}",
    "Content-Type": "application/json",
    "Notion-Version": "2022-06-28"  # Update this to the correct version if needed
}

# Fetch a single page by ID
def fetch_page_by_id(page_id):
    url = f"https://api.notion.com/v1/pages/{page_id}"

    try:
        # Make the GET request to fetch the page
        response = requests.get(url, headers=headers)

        # Check if the request was successful
        response.raise_for_status()  # Raise an exception for any HTTP error

        # Parse the response JSON
        data = response.json()

        if not data:
            raise ValueError(f"No data found for page ID {page_id}")

        # Extract relevant data from the response (for example, title)
        # "or" also covers empty lists and null values, which .get() defaults do not
        page_metadata = {
            "title": (data['properties'].get('Name', {}).get('title') or [{}])[0].get('text', {}).get('content', "No Title"),
            "author": (data['properties'].get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content', "Unknown Author"),
            "type": (data['properties'].get('Type', {}).get('select') or {}).get('name', "Unknown Type"),
            "year": data['properties'].get('Year', {}).get('number') or "Unknown Year",
            "url": data['properties'].get('URL', {}).get('url') or "No URL"
        }

        return page_metadata

    except requests.exceptions.RequestException as e:
        print(f"Error fetching page data for page_id {page_id}: {e}")
        return None
    except ValueError as e:
        print(f"Value error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

# Example usage: Fetch page data
page_data = fetch_page_by_id(PAGE_ID)
if page_data:
    print(page_data)

Key Points:

GET Request: This time, the method is changed to requests.get(), as you’re making a GET request to the /v1/pages/{page_id} endpoint.
URL: The URL format is https://api.notion.com/v1/pages/{page_id}, where {page_id} should be replaced with the actual page ID you’re trying to fetch.
Headers: The headers include Authorization (with your API token) and Notion-Version. Make sure the Notion-Version is correct for the API you’re using (the current stable version is 2022-06-28).
Error Handling: We have proper error handling to catch HTTP errors and unexpected exceptions.

Common Troubleshooting:

Check the Page ID Format: The page_id should be a valid UUID-like string (e.g., 18a6716c-a1e8-8099-8900-c3d0df8b5cdd).
Permissions: Ensure that the Notion API token has permissions to access the page.
API Token: Make sure your API token is valid, and it’s correctly passed in the request header.
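
The page ID check above can be sketched with Python's standard uuid module. The helper name normalize_page_id is hypothetical, and the sketch assumes the ID may arrive either with or without dashes (for example, when copied from a page URL):

```python
import uuid

def normalize_page_id(raw_id: str) -> str:
    """Return the ID in dashed UUID form, or raise ValueError if it is not a valid UUID."""
    # uuid.UUID accepts 32 hex digits with or without dashes
    return str(uuid.UUID(raw_id.strip()))

print(normalize_page_id("18a6716ca1e880998900c3d0df8b5cdd"))
```

Calling this before building the request URL turns a malformed ID into a local ValueError instead of a failed API request.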

Example Response Data:

When fetching a page, you’ll typically get a response like this (simplified):

{
  "object": "page",
  "id": "18a6716c-a1e8-8099-8900-c3d0df8b5cdd",
  "properties": {
    "Name": {
      "title": [
        {
          "text": {
            "content": "Sample Page Title"
          }
        }
      ]
    },
    "Author": {
      "rich_text": [
        {
          "text": {
            "content": "John Doe"
          }
        }
      ]
    },
    "Type": {
      "select": {
        "name": "Article"
      }
    },
    "Year": {
      "number": 2024
    },
    "URL": {
      "url": "https://example.com"
    }
  }
}
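
As a rough sketch, the metadata can be pulled out of a response shaped like the one above without any network call. The helper names first_text and extract_metadata are hypothetical, and the property names (Name, Author, Type, Year, URL) are the ones assumed throughout this conversation:

```python
def first_text(rich_text_items, default):
    """Return the text content of the first rich-text item, or default if the list is empty."""
    if not rich_text_items:
        return default
    return rich_text_items[0].get("text", {}).get("content", default)

def extract_metadata(page):
    """Flatten the assumed properties of a page object into a plain dict."""
    props = page.get("properties", {})
    select = props.get("Type", {}).get("select") or {}
    year = props.get("Year", {}).get("number")
    return {
        "title": first_text(props.get("Name", {}).get("title"), "No Title"),
        "author": first_text(props.get("Author", {}).get("rich_text"), "Unknown Author"),
        "type": select.get("name", "Unknown Type"),
        "year": year if year is not None else "Unknown Year",
        "url": props.get("URL", {}).get("url") or "No URL",
    }

# Hand-written sample mirroring the simplified response
sample_page = {
    "object": "page",
    "properties": {
        "Name": {"title": [{"text": {"content": "Sample Page Title"}}]},
        "Author": {"rich_text": [{"text": {"content": "John Doe"}}]},
        "Type": {"select": {"name": "Article"}},
        "Year": {"number": 2024},
        "URL": {"url": "https://example.com"},
    },
}
print(extract_metadata(sample_page))
```

Unlike indexing with [0] directly, this version falls back to the defaults when a title or rich_text list is empty or a select value is null.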

Conclusion:

This should now properly fetch the page metadata using its page ID. If you’re getting a 400 Bad Request or other errors, double-check the Page ID and API token to ensure that they are valid and correctly formatted.

Let me know how it goes or if you need any further clarification!

Thank you for the clarification! The reason you’re seeing the Authorization: Bearer token wrapped in quotes in the cURL example is due to how the environment variable is being referenced within the shell script.

When you use "$NOTION_API_KEY" inside the shell script, the shell expands it to the value of the environment variable NOTION_API_KEY. The double quotes keep the expanded value as a single word, even if it contains spaces or special characters. The single quotes (') around the rest of the header mark literal text that the shell passes through unchanged.

In Python, you don’t need to do anything special like this because the string you pass to the Authorization header is simply a regular string. You just need to ensure the token is correctly passed as a string.

Here’s a Python version that matches the cURL example:

Updated Python Code (with API Token):

import requests

# Define the Notion API token and page ID
NOTION_API_TOKEN = 'your_notion_api_token_here'  # Ensure this is a string with no extra quotes
PAGE_ID = 'your_page_id_here'  # This should be the actual page ID you want to fetch

# Set the Notion API headers
headers = {
    "Authorization": f"Bearer {NOTION_API_TOKEN}",  # Correctly format the Authorization header
    "Content-Type": "application/json",
    "Notion-Version": "2022-06-28"  # Ensure you're using the correct version
}

# Fetch a single page by ID
def fetch_page_by_id(page_id):
    url = f"https://api.notion.com/v1/pages/{page_id}"

    try:
        # Make the GET request to fetch the page
        response = requests.get(url, headers=headers)

        # Check if the request was successful
        response.raise_for_status()  # Raise an exception for any HTTP error

        # Parse the response JSON
        data = response.json()

        if not data:
            raise ValueError(f"No data found for page ID {page_id}")

        # Extract relevant data from the response (for example, title)
        # The "or" fallbacks guard against empty lists and null values,
        # which the .get() defaults alone do not cover
        props = data['properties']
        page_metadata = {
            "title": (props.get('Name', {}).get('title') or [{}])[0].get('text', {}).get('content', "No Title"),
            "author": (props.get('Author', {}).get('rich_text') or [{}])[0].get('text', {}).get('content', "Unknown Author"),
            "type": (props.get('Type', {}).get('select') or {}).get('name', "Unknown Type"),
            "year": props.get('Year', {}).get('number') or "Unknown Year",
            "url": props.get('URL', {}).get('url') or "No URL"
        }

        return page_metadata

    except requests.exceptions.RequestException as e:
        print(f"Error fetching page data for page_id {page_id}: {e}")
        return None
    except ValueError as e:
        print(f"Value error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

# Example usage: Fetch page data
page_data = fetch_page_by_id(PAGE_ID)
if page_data:
    print(page_data)

Key Differences from the cURL Example:

Authorization Header: In cURL, you’re referencing the environment variable with "$NOTION_API_KEY". In Python, you directly use the token string (e.g., 'your_notion_api_token_here').
No Need for Extra Quotes: In Python, you don’t need to handle variable expansion with quotes like in shell scripts. You can just pass the string directly.
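
If you would rather mirror the cURL example and keep the token out of the source file, a minimal sketch is to read the same environment variable in Python. The function name build_headers is hypothetical, and NOTION_API_KEY is assumed to be the variable name used in the cURL command:

```python
import os

def build_headers(env=os.environ):
    """Build Notion request headers from the NOTION_API_KEY environment variable."""
    # strip() removes stray spaces or newlines that would corrupt the header
    token = env.get("NOTION_API_KEY", "").strip()
    if not token:
        raise RuntimeError("NOTION_API_KEY is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "Notion-Version": "2022-06-28",
    }
```

The env parameter exists only so the function can be tried with a plain dict; in normal use, call build_headers() with no arguments.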

Notes:

Token as String: Ensure that your Notion API token is correct and in string format, like 'your_notion_api_token_here'.
Header Structure: The token is passed as part of the Authorization header using the format "Authorization: Bearer {your_token_here}". This is how it is passed in the cURL command and similarly in the Python code.

If you’re still seeing errors, please check the following:

Ensure that your API token is valid: Make sure that there are no extra spaces or special characters causing issues.
Ensure the page ID is correct: Double-check the page ID format.

Let me know if you need further help!

Got it! You want to extract all the text blocks from a page in the ITEMS database by retrieving the page content for each item.

In Notion, pages consist of various blocks, including text blocks, images, etc. Each block is stored as an individual object with a unique ID and type. To achieve this, you can follow these steps:

Steps to Fetch Text Blocks from Each Item’s Page:

Get the Item Page ID: You will first need to get the page ID of each item from the ITEMS database.
Fetch Page Content: Once you have the page ID, you’ll fetch its content using the /v1/blocks/{block_id}/children endpoint. This will allow you to retrieve all the blocks (including text blocks) inside the page.
Extract Text Blocks: From the blocks returned, you can extract the text content.

Python Code to Fetch All Text Blocks for Each Item in the `ITEMS` Database:

Below is an example Python script that:

Retrieves all items from the ITEMS database.
For each item, fetches its page content.
Extracts all text blocks from the page.

import requests
import json

# Define your Notion API token and database ID
NOTION_API_TOKEN = 'your_notion_api_token_here'
DATABASE_ID_ITEMS = 'your_items_database_id_here'  # ID of your ITEMS database
DATABASE_ID_SOURCES = 'your_sources_database_id_here'  # ID of your SOURCES database

# Set the Notion API headers
headers = {
    "Authorization": f"Bearer {NOTION_API_TOKEN}",
    "Content-Type": "application/json",
    "Notion-Version": "2022-06-28"
}

# Helper function to make Notion API requests
def notion_request(url, params=None):
    response = requests.post(url, headers=headers, json=params)
    response.raise_for_status()  # Raise an exception for any failed request
    return response.json()

# Fetch all items from the ITEMS database
def fetch_items():
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID_ITEMS}/query"
    data = notion_request(url)
    items = []

    for result in data.get('results', []):
        props = result['properties']
        item = {
            "id": result['id'],
            # "or [{}]" guards against an empty title or rich_text list
            "name": (props.get('Name', {}).get('title') or [{}])[0].get('text', {}).get('content'),
            "page_id": (props.get('Page_ID', {}).get('rich_text') or [{}])[0].get('text', {}).get('content')
        }
        items.append(item)
    return items

# Fetch page content (blocks) for the given page ID
def fetch_page_blocks(page_id):
    url = f"https://api.notion.com/v1/blocks/{page_id}/children"

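    try:
        # Retrieving block children is a GET endpoint, so call requests.get directly
        # (notion_request() sends POST, which is only right for the database query)
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        data = response.json()
        blocks = []

        # Loop through all blocks and get text content from them
        for block in data.get('results', []):
            # Check if the block is a text block
            if block['type'] == 'paragraph':
                # Under Notion-Version 2022-06-28 the text lives in 'rich_text'
                block_content = block['paragraph']['rich_text']
                text = ''.join([t['text']['content'] for t in block_content if 'text' in t])
                blocks.append(text)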

        return blocks

    except requests.exceptions.RequestException as e:
        print(f"Error fetching page blocks for page_id {page_id}: {e}")
        return None

# Combine the data and flatten it
def combine_data():
    items = fetch_items()
    all_data = []

    for item in items:
        page_id = item.get("page_id")
        if page_id:
            text_blocks = fetch_page_blocks(page_id)
            if text_blocks:
                item_data = {
                    "item_id": item['id'],
                    "item_name": item['name'],
                    "text_blocks": text_blocks
                }
                all_data.append(item_data)

    return all_data

# Write the flattened data into a JSON file
def save_to_json():
    all_data = combine_data()
    with open('notion_items_text_blocks.json', 'w') as outfile:
        json.dump(all_data, outfile, indent=4)

# Run the function to save the data
save_to_json()

Key Details:

Fetching All Items:
- fetch_items() retrieves all items from the ITEMS database. It fetches the Name and Page_ID (the actual page ID) of each item. Ensure the property Page_ID exists and holds the correct page ID for each item in the ITEMS database.
Fetching Page Content:
- The fetch_page_blocks(page_id) function is used to fetch all blocks (including text blocks) from a page by its page_id. It checks for paragraph blocks and extracts the text content.
Extracting Text:
- It iterates through each block in the page, checking if it’s a text block (a paragraph block in this case). If the block contains text, the text is extracted and added to a list of text blocks.
Combining Data:
- The combine_data() function combines the item metadata and the text blocks into a flat structure, which includes the item_id, item_name, and the text_blocks.
Saving to JSON:
- The save_to_json() function saves the combined data into a JSON file (notion_items_text_blocks.json).

Notes:

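Ensure that your ITEMS database has a property called Page_ID (or whatever property holds the actual page ID) for each item. You can replace Page_ID with the correct property name in your database if it’s different. Because every database item is itself a page, the id returned for each query result (result['id']) can also be passed to the blocks endpoint directly if no such property exists.
The blocks could be of different types (e.g., heading_1, heading_2, bulleted_list_item, numbered_list_item, etc.). You can extend the logic to handle different block types.
Error Handling: If a request fails, the script prints an error message. A page with no paragraph blocks produces an empty list and is left out of the output without a message.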
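
One more point worth sketching: Notion list endpoints return at most 100 results per request, so a single database query does not necessarily return every item. The loop below follows the has_more and next_cursor fields of the response. The function name fetch_all_results and the fetch_page callback are hypothetical, and the fake pages stand in for real API responses so the sketch runs offline:

```python
def fetch_all_results(fetch_page):
    """Collect results from every page of a paginated Notion list response.

    fetch_page(cursor) must return one response dict; cursor is None for the first page.
    """
    results = []
    cursor = None
    while True:
        data = fetch_page(cursor)
        results.extend(data.get("results", []))
        if not data.get("has_more"):
            return results
        cursor = data.get("next_cursor")

# Stand-in for the real API call
fake_pages = {
    None: {"results": [1, 2], "has_more": True, "next_cursor": "abc"},
    "abc": {"results": [3], "has_more": False, "next_cursor": None},
}
print(fetch_all_results(lambda cursor: fake_pages[cursor]))
```

For the database query, the callback would send the cursor as start_cursor in the JSON body of the POST request; for block children, it goes in the query string of the GET request.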

Example Output JSON:

After running this code, you should get a JSON file (notion_items_text_blocks.json) with the following structure:

[
  {
    "item_id": "item_id_1",
    "item_name": "Item 1 Name",
    "text_blocks": [
      "Text block 1 content",
      "Text block 2 content",
      "Text block 3 content"
    ]
  },
  {
    "item_id": "item_id_2",
    "item_name": "Item 2 Name",
    "text_blocks": [
      "Text block 1 content",
      "Text block 2 content"
    ]
  }
]

This array will contain each item’s item_id, item_name, and an array of its text blocks.

Let me know if you need any further adjustments!

Got it! So if the page type is Quote_List, you want to specifically search for bullet list blocks, and for each bullet list item, treat its content as a separate block.

Here’s how we can implement this logic:

Updated Approach:

Check for Page Type: If the page type is Quote_List, we will search for bullet list blocks (bulleted_list_item type) within the page.
Extract Bullet List Content: For each bullet list item, we treat its content as a separate text block.
Handle Non-Quote_List Pages: If the page type is not Quote_List, we proceed with extracting normal text blocks (as previously).

Updated Python Code:

import requests
import json

# Define your Notion API token and database ID
NOTION_API_TOKEN = 'your_notion_api_token_here'
DATABASE_ID_ITEMS = 'your_items_database_id_here'  # ID of your ITEMS database
DATABASE_ID_SOURCES = 'your_sources_database_id_here'  # ID of your SOURCES database

# Set the Notion API headers
headers = {
    "Authorization": f"Bearer {NOTION_API_TOKEN}",
    "Content-Type": "application/json",
    "Notion-Version": "2022-06-28"
}

# Helper function to make Notion API requests
def notion_request(url, params=None):
    response = requests.post(url, headers=headers, json=params)
    response.raise_for_status()  # Raise an exception for any failed request
    return response.json()

# Fetch all items from the ITEMS database
def fetch_items():
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID_ITEMS}/query"
    data = notion_request(url)
    items = []

    for result in data.get('results', []):
        props = result['properties']
        item = {
            "id": result['id'],
            # "or [{}]" guards against an empty title or rich_text list
            "name": (props.get('Name', {}).get('title') or [{}])[0].get('text', {}).get('content'),
            "page_id": (props.get('Page_ID', {}).get('rich_text') or [{}])[0].get('text', {}).get('content')
        }
        items.append(item)
    return items

# Fetch page content (blocks) for the given page ID
def fetch_page_blocks(page_id):
    page_url = f"https://api.notion.com/v1/pages/{page_id}"
    blocks_url = f"https://api.notion.com/v1/blocks/{page_id}/children"

    try:
        # The Type property lives on the page object, not in the block list,
        # so retrieve the page first (both endpoints are GET requests)
        page_response = requests.get(page_url, headers=headers)
        page_response.raise_for_status()
        type_select = page_response.json().get('properties', {}).get('Type', {}).get('select') or {}
        page_type = type_select.get('name', '')

        blocks_response = requests.get(blocks_url, headers=headers)
        blocks_response.raise_for_status()
        block_results = blocks_response.json().get('results', [])

        # Quote_List pages contribute bullet list items; all other pages contribute paragraphs
        wanted_type = 'bulleted_list_item' if 'Quote_List' in page_type else 'paragraph'

        blocks = []
        for block in block_results:
            if block['type'] == wanted_type:
                # Under Notion-Version 2022-06-28 the text lives in 'rich_text'
                rich_text = block[wanted_type]['rich_text']
                text = ''.join([t['text']['content'] for t in rich_text if 'text' in t])
                blocks.append(text)

        return blocks

    except requests.exceptions.RequestException as e:
        print(f"Error fetching page blocks for page_id {page_id}: {e}")
        return None

# Combine the data and flatten it
def combine_data():
    items = fetch_items()
    all_data = []

    for item in items:
        page_id = item.get("page_id")
        if page_id:
            text_blocks = fetch_page_blocks(page_id)
            if text_blocks:
                item_data = {
                    "item_id": item['id'],
                    "item_name": item['name'],
                    "text_blocks": text_blocks
                }
                all_data.append(item_data)

    return all_data

# Write the flattened data into a JSON file
def save_to_json():
    all_data = combine_data()
    with open('notion_items_text_blocks.json', 'w') as outfile:
        json.dump(all_data, outfile, indent=4)

# Run the function to save the data
save_to_json()

Key Updates:

Check for Quote_List Type:
- We check if the Type property of the page is "Quote_List". If it is, we proceed to extract bulleted list items.
- If the page type is not Quote_List, the logic falls back to the previous behavior of extracting paragraph text blocks.
Extract Bulleted List Items:
- When the page type is Quote_List, the script looks for bulleted_list_item blocks in the page.
- Each bulleted_list_item block’s content is then extracted as a separate text block and added to the blocks list.
Text Extraction:
- For both paragraph and bulleted_list_item blocks, the script extracts the text from the block and joins it into a single string.
Handling Other Page Types:
- For non-Quote_List pages, we extract the text from normal paragraph blocks.

Example Output:

For a Quote_List page, the resulting JSON will look like this:

[
  {
    "item_id": "item_id_1",
    "item_name": "Item 1 Name",
    "text_blocks": [
      "Bullet 1 content",
      "Bullet 2 content",
      "Bullet 3 content"
    ]
  },
  {
    "item_id": "item_id_2",
    "item_name": "Item 2 Name",
    "text_blocks": [
      "Text block 1 content",
      "Text block 2 content"
    ]
  }
]

For non-Quote_List pages, you will get the regular text blocks from paragraphs.

Important Notes:

Type Property: Ensure that the Type property in your ITEMS database is correctly set and contains "Quote_List" when applicable. This logic checks that property to identify Quote_List pages.
Handling Block Types: The current code handles paragraph and bulleted_list_item blocks. If your pages contain other block types, you can easily extend the logic to handle them.
Error Handling: The script contains error handling to print any errors during the API request (e.g., network errors, invalid IDs).
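
As one possible way to extend the logic to more block types, the sketch below looks the rich text up by the block's own type name instead of adding a branch per type. The names TEXT_BLOCK_TYPES and extract_text are hypothetical, and the sketch reads the plain_text field of each rich-text item rather than text.content, so mentions and equations are not silently dropped:

```python
# Block types whose payload carries a rich_text list (a partial, assumed selection)
TEXT_BLOCK_TYPES = {
    "paragraph", "heading_1", "heading_2", "heading_3",
    "bulleted_list_item", "numbered_list_item", "toggle", "quote",
}

def extract_text(block):
    """Return the joined plain text of a supported block, or None for other block types."""
    block_type = block.get("type", "")
    if block_type not in TEXT_BLOCK_TYPES:
        return None
    rich_text = block.get(block_type, {}).get("rich_text", [])
    return "".join(part.get("plain_text", "") for part in rich_text)

# Hand-written sample block
sample_block = {
    "type": "numbered_list_item",
    "numbered_list_item": {"rich_text": [{"plain_text": "First "}, {"plain_text": "point"}]},
}
print(extract_text(sample_block))
```

Adding support for another text-bearing block type then means adding one entry to the set.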

Let me know if you need any further adjustments!

this is what I have. I also want to catch cases where the block is a toggle heading. I want the sub-content of each toggle heading to be parsed normally as blocks

    # Loop through all blocks and get text content from them

    for block in data.get('results', []):
        # Check if the block is a text block
        if block['type'] == 'paragraph':
            block_content = block['paragraph']['rich_text']
            text = ''.join([t['text']['content'] for t in block_content if 'text' in t])
            blocks.append(text)
        elif block['type'] == 'bulleted_list_item':
            block_content = block['bulleted_list_item']['rich_text']
            bullet_text = ''.join([t['text']['content'] for t in block_content if 'text' in t])
            blocks.append(bullet_text)

To parse the sub-content of a toggle as regular blocks, you will need to extend your current logic. In the Notion API, a plain toggle has the type toggle, while a toggle heading is a heading_1, heading_2, or heading_3 block whose is_toggleable field is true. In both cases the nested content (sub-blocks) is not included in the parent block; the parent only carries a has_children flag, and the children have to be fetched with a separate request.

How to Handle Toggle Headings:

Check if the block is a toggle: A toggle block contains a title (like a heading) and can have nested child blocks (the content inside the toggle). Toggle headings are identified by is_toggleable on the heading block.
Fetch and process nested blocks: If the block is a toggle and has_children is true, retrieve its child blocks from /v1/blocks/{block_id}/children and process them as normal text blocks.

Updated Python Code:

I will update your current loop to handle toggle blocks and recursively fetch the content inside them:

# Loop through all blocks and get text content from them
for block in data.get('results', []):
    # Check if the block is a text block (e.g., paragraph)
    if block['type'] == 'paragraph':
        block_content = block['paragraph']['rich_text']
        text = ''.join([t['text']['content'] for t in block_content if 'text' in t])
        blocks.append(text)

    # Check if the block is a bulleted list item
    elif block['type'] == 'bulleted_list_item':
        block_content = block['bulleted_list_item']['rich_text']
        bullet_text = ''.join([t['text']['content'] for t in block_content if 'text' in t])
        blocks.append(bullet_text)

    # Check if the block is a toggle or a toggle heading (a heading with is_toggleable set)
    elif block['type'] == 'toggle' or block.get(block['type'], {}).get('is_toggleable'):
        toggle_text = ''.join([t['text']['content'] for t in block[block['type']]['rich_text'] if 'text' in t])
        blocks.append(f"Toggle Heading: {toggle_text}")

        # Child blocks are not embedded in the parent block; has_children only flags them,
        # so fetch the sub-content with a separate GET request
        if block.get('has_children'):
            children_url = f"https://api.notion.com/v1/blocks/{block['id']}/children"
            children_response = requests.get(children_url, headers=headers)
            children_response.raise_for_status()

            # Process each child block under the toggle
            for child_block in children_response.json().get('results', []):
                if child_block['type'] == 'paragraph':
                    child_content = child_block['paragraph']['rich_text']
                    child_text = ''.join([t['text']['content'] for t in child_content if 'text' in t])
                    blocks.append(child_text)
                elif child_block['type'] == 'bulleted_list_item':
                    child_content = child_block['bulleted_list_item']['rich_text']
                    child_bullet_text = ''.join([t['text']['content'] for t in child_content if 'text' in t])
                    blocks.append(child_bullet_text)
                # Add more block types here as needed (e.g., heading, numbered_list_item)

Key Updates:

Toggle Block Handling:
- For each toggle block, we first extract the text of the toggle heading (the title).
- We then check if the toggle has child blocks under it (which are the nested content).
- We loop through these child blocks and treat them as normal blocks (e.g., paragraphs, bulleted list items, etc.), extracting the text from them.
Child Block Parsing:
- For each child block under the toggle, the script will extract the text in the same way it does for normal text blocks.
- You can extend this logic to handle more block types (e.g., heading_1, heading_2, numbered_list_item, etc.).
Appending Toggle Heading:
- We append the toggle heading text separately, prefixed with "Toggle Heading: " to indicate that this is the header of a toggle block.

Example Output:

For a page with toggle blocks, the output might look like this:

[
  {
    "item_id": "item_id_1",
    "item_name": "Item 1 Name",
    "text_blocks": [
      "Normal paragraph text",
      "Bullet list item 1",
      "Bullet list item 2",
      "Toggle Heading: Section 1",
      "Content under Toggle Heading",
      "Bullet under toggle",
      "Normal content after toggle"
    ]
  },
  {
    "item_id": "item_id_2",
    "item_name": "Item 2 Name",
    "text_blocks": [
      "Normal paragraph text"
    ]
  }
]

Additional Notes:

Block Types: If there are other types of blocks inside the toggle (e.g., images, links, or other rich content), you’ll need to handle those block types specifically as well. For now, I’ve handled paragraph and bulleted_list_item, but you can extend it to include more types as needed.
Nested Toggles: If toggles contain further nested toggles (toggle within toggle), you can recursively process the children of each toggle in a similar fashion.

Conclusion:

This update will ensure that toggle blocks are processed correctly, including their sub-content, while still parsing other block types as normal. Let me know if you need further adjustments!

Great question! Reducing code duplication is key to making your code more maintainable and readable. The main area where there is duplication is the logic for extracting text content from blocks. We can refactor this code into a helper function that will handle text extraction for any block type. This way, we can reuse the function across different block types, reducing redundancy.

Refactor Plan:

Create a helper function for text extraction: This function will handle extracting the text from paragraph, bulleted_list_item, or any other type of block that contains text. This reduces the repeated logic for each block type.
Handle child blocks recursively: For toggle blocks, we’ll process their child blocks recursively, applying the same logic for text extraction.

Refactored Code:

# Helper function to extract text from a block's rich_text
def extract_text_from_block(block):
    # Check if the block contains rich_text
    block_type = block.get('type', '')
    rich_text_field = None

    # Identify the correct field to extract text from
    if block_type == 'paragraph':
        rich_text_field = block.get('paragraph', {}).get('rich_text', [])
    elif block_type == 'bulleted_list_item':
        rich_text_field = block.get('bulleted_list_item', {}).get('rich_text', [])
    elif block_type == 'toggle':
        rich_text_field = block.get('toggle', {}).get('rich_text', [])

    # Extract the content from the rich_text field
    if rich_text_field:
        return ''.join([t['text']['content'] for t in rich_text_field if 'text' in t])
    return ''

# Function to process a toggle and its children
def process_toggle_block(block):
    toggle_heading_text = extract_text_from_block(block)
    blocks = [f"Toggle Heading: {toggle_heading_text}"]

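    # Child blocks are not embedded in the block object; has_children only flags them,
    # so fetch them with a separate GET request before recursing
    if block.get('has_children'):
        children_url = f"https://api.notion.com/v1/blocks/{block['id']}/children"
        response = requests.get(children_url, headers=headers)
        response.raise_for_status()
        for child_block in response.json().get('results', []):
            blocks.extend(process_block(child_block))  # Recurse for child blocks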
    return blocks

# Function to process any block
def process_block(block):
    block_type = block.get('type', '')

    if block_type == 'paragraph' or block_type == 'bulleted_list_item':
        # For paragraph or bulleted list, extract text using helper function
        return [extract_text_from_block(block)]

    elif block_type == 'toggle':
        # Process toggle blocks and their children
        return process_toggle_block(block)

    # Add more block types if needed
    return []

# Loop through all blocks and get text content from them
for block in data.get('results', []):
    # Process the block and append the result
    blocks.extend(process_block(block))

Explanation of Changes:

extract_text_from_block():
- This helper function abstracts away the logic of extracting the rich_text content from any block. We check the type of the block (paragraph, bulleted_list_item, or toggle) and extract the text accordingly. This function handles the text extraction uniformly and is reused for all block types.
process_toggle_block():
- When a toggle block is encountered, we extract its heading using extract_text_from_block() and then recursively process its child blocks (sub-content) using process_block().
process_block():
- This function handles any block and determines its type. It either:
  - Calls extract_text_from_block() for paragraph and bulleted_list_item blocks.
  - Calls process_toggle_block() for toggle blocks to handle both the heading and the child blocks inside them.
- You can add more block types (like numbered_list_item, heading_1, etc.) in this function as needed.
Reduce Duplication:
- Now the core logic for extracting the text from blocks is centralized in extract_text_from_block() and process_block(), reducing redundancy in the main loop and making the code more modular and reusable.
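
The recursion itself can be tried without touching the API by passing the child lookup in as a function. This is only a sketch under assumptions: collect_text and fetch_children are hypothetical names, the fake tree is hand-written, and unlike the code above it collects text from any block that carries rich_text and does not add the "Toggle Heading: " prefix:

```python
def collect_text(block_id, fetch_children):
    """Depth-first walk over a block tree, returning the text of every block in order.

    fetch_children(block_id) must return the list of child block dicts for that ID.
    """
    texts = []
    for block in fetch_children(block_id):
        block_type = block.get("type", "")
        rich_text = block.get(block_type, {}).get("rich_text", [])
        text = "".join(part.get("plain_text", "") for part in rich_text)
        if text:
            texts.append(text)
        # Children are fetched by ID because the parent block does not embed them
        if block.get("has_children"):
            texts.extend(collect_text(block["id"], fetch_children))
    return texts

# Hand-written stand-in for API responses: a toggle heading with one bullet inside
fake_tree = {
    "page": [
        {"id": "t1", "type": "heading_2", "has_children": True,
         "heading_2": {"is_toggleable": True, "rich_text": [{"plain_text": "Section 1"}]}},
        {"id": "p1", "type": "paragraph", "has_children": False,
         "paragraph": {"rich_text": [{"plain_text": "After toggle"}]}},
    ],
    "t1": [
        {"id": "b1", "type": "bulleted_list_item", "has_children": False,
         "bulleted_list_item": {"rich_text": [{"plain_text": "Bullet under toggle"}]}},
    ],
}
print(collect_text("page", lambda block_id: fake_tree[block_id]))
```

In a real run, fetch_children would issue the GET request to /v1/blocks/{block_id}/children and return the results list, which also covers toggles nested inside toggles.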

Example Output:

The output will still be the same as before:

[
  {
    "item_id": "item_id_1",
    "item_name": "Item 1 Name",
    "text_blocks": [
      "Normal paragraph text",
      "Bullet list item 1",
      "Bullet list item 2",
      "Toggle Heading: Section 1",
      "Content under Toggle Heading",
      "Bullet under toggle",
      "Normal content after toggle"
    ]
  },
  {
    "item_id": "item_id_2",
    "item_name": "Item 2 Name",
    "text_blocks": [
      "Normal paragraph text"
    ]
  }
]

Benefits of Refactoring:

Reduced Duplication: The extraction logic is now centralized, so if you need to change how text is extracted (e.g., if you need to handle additional rich text fields), you only need to update it in one place.
Readability: The code is now more modular, with each function having a clear responsibility.
Easier Maintenance: You can easily add new block types by modifying process_block(). Each new block type can either call extract_text_from_block() or be handled with custom logic in its respective function.

Conclusion:

By moving the text extraction logic into a helper function and introducing the process_block() function, we’ve greatly reduced code duplication and improved readability. This approach will be easy to extend as you encounter more block types in Notion. Let me know if you need further adjustments!

To ensure that punctuation marks (like single quotes, double quotes, and other special characters) appear correctly in the JSON output, we need to make sure that:

Text is properly encoded: Ensure that characters such as quotes, apostrophes, and other special characters are represented correctly in the string.
Escape sequences are handled: If needed, characters like \n, \r, or quotes are escaped properly, but we should also make sure that characters like ' or " are stored as-is.

Steps to Ensure Correct Text Handling:

UTF-8 Encoding: Python strings are Unicode, so punctuation and accented characters are preserved in memory; opening the output file with encoding='utf-8' ensures they are written correctly.
JSON Formatting: When dumping to JSON, we’ll make sure special characters are not escaped incorrectly.
Unescape any unwanted characters: We can check if there are unwanted HTML or Unicode escape codes in the text and clean them up.

Solution with Proper Text Processing:

We’ll handle escaping in such a way that punctuation marks like single quotes ('), double quotes ("), and others are included as-is.
We’ll also make sure we handle any edge cases such as unwanted escape sequences in the content.
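
To make the two mechanisms concrete, here is a small standard-library sketch; the sample strings are made up for illustration. It shows how json.dumps treats non-ASCII characters with and without ensure_ascii, and what html.unescape does to HTML entities:

```python
import html
import json

text = 'She said "déjà vu"'

# Default: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(text)
# ensure_ascii=False: non-ASCII characters are written as-is
readable = json.dumps(text, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same string; straight double quotes
# are escaped as \" in either case because JSON requires it
assert json.loads(escaped) == json.loads(readable) == text

# html.unescape only changes text that contains HTML entities
print(html.unescape("&quot;Hi&quot; &amp; bye"))
```

Either JSON form is valid and loads to identical data; ensure_ascii=False only makes the file easier to read.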

Updated Code with Improved Text Handling:

import requests
import json
import html

# Define your Notion API token and database ID
NOTION_API_TOKEN = 'your_notion_api_token_here'
DATABASE_ID_ITEMS = 'your_items_database_id_here'  # ID of your ITEMS database
DATABASE_ID_SOURCES = 'your_sources_database_id_here'  # ID of your SOURCES database

# Set the Notion API headers
headers = {
    "Authorization": f"Bearer {NOTION_API_TOKEN}",
    "Content-Type": "application/json",
    "Notion-Version": "2022-06-28"
}

# Helper function to make Notion API requests
def notion_request(url, params=None):
    response = requests.post(url, headers=headers, json=params)
    response.raise_for_status()  # Raise an exception for any failed request
    return response.json()

# Helper function to extract text from a block's rich_text
def extract_text_from_block(block):
    block_type = block.get('type', '')
    rich_text_field = None

    # Identify the correct field to extract text from
    if block_type == 'paragraph':
        rich_text_field = block.get('paragraph', {}).get('rich_text', [])
    elif block_type == 'bulleted_list_item':
        rich_text_field = block.get('bulleted_list_item', {}).get('rich_text', [])
    elif block_type == 'toggle':
        rich_text_field = block.get('toggle', {}).get('rich_text', [])

    # Extract the content from the rich_text field
    if rich_text_field:
        text = ''.join([t['text']['content'] for t in rich_text_field if 'text' in t])
        # Decode HTML entities (like &quot; for quotes)
        text = html.unescape(text)
        return text
    return ''

# Function to process a toggle and its children
def process_toggle_block(block):
    toggle_heading_text = extract_text_from_block(block)
    blocks = [f"Toggle Heading: {toggle_heading_text}"]

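    # Child blocks are not embedded in the block object; has_children only flags them,
    # so fetch them with a separate GET request before recursing
    if block.get('has_children'):
        children_url = f"https://api.notion.com/v1/blocks/{block['id']}/children"
        response = requests.get(children_url, headers=headers)
        response.raise_for_status()
        for child_block in response.json().get('results', []):
            blocks.extend(process_block(child_block))  # Recurse for child blocks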
    return blocks

# Function to process any block
def process_block(block):
    block_type = block.get('type', '')

    if block_type == 'paragraph' or block_type == 'bulleted_list_item':
        # For paragraph or bulleted list, extract text using helper function
        return [extract_text_from_block(block)]

    elif block_type == 'toggle':
        # Process toggle blocks and their children
        return process_toggle_block(block)

    # Add more block types if needed
    return []

# Loop through all blocks and get text content from them
def get_text_blocks_from_page(data):
    blocks = []
    for block in data.get('results', []):
        # Process the block and append the result
        blocks.extend(process_block(block))
    return blocks

# Fetch all items from the ITEMS database
def fetch_items():
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID_ITEMS}/query"
    data = notion_request(url)
    items = []

    for result in data.get('results', []):
        props = result['properties']
        # "or [{}]" guards against an empty title/rich_text list, which would make [0] raise IndexError
        item = {
            "id": result['id'],
            "name": (props.get('Name', {}).get('title') or [{}])[0].get('text', {}).get('content'),
            "page_id": (props.get('Page_ID', {}).get('rich_text') or [{}])[0].get('text', {}).get('content')
        }
        items.append(item)
    return items

# Combine the data and flatten it
def combine_data():
    items = fetch_items()
    all_data = []

    for item in items:
        page_id = item.get("page_id")
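        if page_id:
            url = f"https://api.notion.com/v1/blocks/{page_id}/children"
            # Retrieving block children is a GET endpoint, while notion_request() sends POST
            response = requests.get(url, headers=headers)
            response.raise_for_status()
            page_data = response.json()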
            text_blocks = get_text_blocks_from_page(page_data)
            if text_blocks:
                item_data = {
                    "item_id": item['id'],
                    "item_name": item['name'],
                    "text_blocks": text_blocks
                }
                all_data.append(item_data)

    return all_data

# Write the flattened data into a JSON file
def save_to_json():
    all_data = combine_data()
    with open('notion_items_text_blocks.json', 'w', encoding='utf-8') as outfile:
        json.dump(all_data, outfile, indent=4, ensure_ascii=False)

# Run the export only when this file is executed as a script
if __name__ == "__main__":
    save_to_json()
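
One thing to watch: the database query endpoint returns results in pages, and fetch_items() above reads only the first page. The response carries has_more and next_cursor fields for fetching the rest. Below is a minimal sketch of that loop. The names iter_all_results, fetch_page, and _FAKE_PAGES are illustrative and not part of the script above, and the fake pages stand in for real API responses so the loop can be checked without a token.

```python
from typing import Callable, Iterator, Optional


def iter_all_results(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every result across pages.

    fetch_page is a hypothetical callable: it takes a start_cursor (or None
    for the first page) and returns one decoded response page.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("results", [])
        if not page.get("has_more"):
            break
        cursor = page.get("next_cursor")


# Offline stand-in for the API: two pages keyed by the cursor that requests them
_FAKE_PAGES = {
    None: {"results": [{"id": "a"}, {"id": "b"}], "has_more": True, "next_cursor": "c1"},
    "c1": {"results": [{"id": "c"}], "has_more": False, "next_cursor": None},
}

all_ids = [r["id"] for r in iter_all_results(lambda cursor: _FAKE_PAGES[cursor])]
print(all_ids)
```

In the script above, fetch_page could plausibly be a small wrapper that calls notion_request(url, {"start_cursor": cursor}) when a cursor is present and notion_request(url) otherwise.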

Key Changes:

html.unescape():
- We use the html.unescape() function to decode any HTML entities, such as &quot; for quotes or &amp; for ampersands, ensuring that they appear as the correct characters (" and & respectively).
JSON Encoding:
- The json.dump() function now includes ensure_ascii=False, which writes non-ASCII characters (e.g., accented letters or curly quotes) as themselves instead of as \uXXXX escape sequences.
- The encoding='utf-8' argument ensures that the file is written with UTF-8 encoding, which supports all Unicode characters.
Text Handling:
- When we extract the text from each block using extract_text_from_block(), we make sure that any HTML-encoded characters are unescaped to get the correct punctuation marks and symbols.
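
To see the effect of these two changes in isolation, here is a small standalone sketch. The sample string is made up for illustration; only html.unescape() and json.dumps() from the standard library are used.

```python
import html
import json

# Made-up sample containing HTML entities and one non-ASCII character
raw = "She said &quot;café &amp; bar&quot;"

# Entities become the characters they stand for
text = html.unescape(raw)

# Default behaviour escapes non-ASCII characters as \uXXXX sequences
escaped_json = json.dumps([text])

# ensure_ascii=False keeps them readable in the output
readable_json = json.dumps([text], ensure_ascii=False)

print(text)
print(escaped_json)
print(readable_json)
```

Both JSON strings decode to the same data; the difference is only in how readable the file is when opened in a text editor.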

Example Output:

After processing, your JSON will look something like this:

[
  {
    "item_id": "item_id_1",
    "item_name": "Item 1 Name",
    "text_blocks": [
      "Normal paragraph text",
      "Bullet list item 1",
      "Bullet list item 2",
      "Toggle Heading: Section 1",
      "Content under Toggle Heading",
      "Bullet under toggle",
      "Normal content after toggle"
    ]
  },
  {
    "item_id": "item_id_2",
    "item_name": "Item 2 Name",
    "text_blocks": [
      "Normal paragraph text"
    ]
  }
]
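
If you want to check the flattening order without calling the API, the sketch below runs the same idea on a hand-built block tree. Everything here is an assumption for illustration: make_block, flatten, and children_by_id are hypothetical names, the sample blocks only mimic the fields the script reads, and child blocks are looked up from a dictionary rather than fetched from Notion.

```python
from typing import Callable, List


def make_block(block_id: str, kind: str, content: str) -> dict:
    """Build a minimal block dict with only the fields the flattener reads."""
    return {"id": block_id, "type": kind, kind: {"rich_text": [{"text": {"content": content}}]}}


def block_text(block: dict) -> str:
    """Join the text content of a block's rich_text entries."""
    rich_text = block.get(block.get("type", ""), {}).get("rich_text", [])
    return "".join(t["text"]["content"] for t in rich_text if "text" in t)


def flatten(blocks: List[dict], get_children: Callable[[dict], List[dict]]) -> List[str]:
    """Flatten blocks depth-first, labelling toggle headings."""
    out: List[str] = []
    for block in blocks:
        kind = block.get("type")
        if kind in ("paragraph", "bulleted_list_item"):
            out.append(block_text(block))
        elif kind == "toggle":
            out.append(f"Toggle Heading: {block_text(block)}")
            out.extend(flatten(get_children(block), get_children))
    return out


page = [
    make_block("b1", "paragraph", "Normal paragraph text"),
    make_block("b2", "toggle", "Section 1"),
    make_block("b3", "paragraph", "Normal content after toggle"),
]
# Stand-in for fetching child blocks: toggle b2 has two children
children_by_id = {
    "b2": [
        make_block("b4", "paragraph", "Content under Toggle Heading"),
        make_block("b5", "bulleted_list_item", "Bullet under toggle"),
    ]
}

flat = flatten(page, lambda block: children_by_id.get(block["id"], []))
print(flat)
```

Passing get_children as a parameter keeps the ordering logic testable on its own; in the real script that lookup would be a request to the block's children endpoint.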

Conclusion:

With this updated approach, punctuation marks and special characters (like quotes, apostrophes, etc.) will appear correctly in the JSON output. The code also ensures that any HTML entities are decoded and handled properly.

This solution should now meet your requirements for text handling. Let me know if you need further adjustments!