Monthly Temp Faceting Plots

2025-04-02 09:30 UTC o3-mini-high

This script originally processed hourly data, then grouped and aggregated it.

I now want an extra function that works with monthly averaged data. It needs to be robust, because some entries are marked with a different day of the month, and they all need to fall under the right month. I want to create multiple plots: the x axis is the year, the y axis is the temperature variable (exactly as it is in the script), and there is one plot per month. That makes 12 plots for min temperatures, 12 for max temperatures, and 12 for median. Is there a name for this? I feel I was too verbose. Here's the script:

import pandas as pd

def rank_temperature_columns(headers):
    keyword_scores = {
        'temperature': 10,  # Exact match gets the highest score
        'temp': 5,          # Partial match gets a moderate score
    }

    ranked_columns = []

    for header in headers:
        score = 0
        lower_header = header.lower()
        for keyword, keyword_score in keyword_scores.items():
            if keyword in lower_header:
                score += keyword_score
        if score > 0:  # Only include columns with a relevant score
            ranked_columns.append((header, score))

    # Sort columns based on the scores in descending order
    ranked_columns.sort(key=lambda x: x[1], reverse=True)
    return ranked_columns

def analyze_temperature_by_month_and_year(file_path, output_csv_path):
    try:
        # Read the CSV file
        data = pd.read_csv(file_path)

        # Display headers for verification
        print("Available headers in the CSV file:")
        print(list(data.columns))

        # Rank potential temperature columns
        ranked_columns = rank_temperature_columns(data.columns)

        if not ranked_columns:
            print("No column likely refers to temperature. Please check your CSV file.")
            return

        # Select the most probable temperature column
        temperature_column = ranked_columns[0][0]
        print(f"Using '{temperature_column}' as the temperature column. (Ranking: {ranked_columns})")

        # Identify and handle the time-related column
        if 'time' not in ''.join([col.lower() for col in data.columns]):
            print("No suitable 'time' column found. Please ensure a time-related column exists.")
            return

        time_column = [col for col in data.columns if 'time' in col.lower()][0]
        data[time_column] = pd.to_datetime(data[time_column])
        data['year'] = data[time_column].dt.year
        data['month'] = data[time_column].dt.month
        data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

        # Combine year and month to create a 'time' column with day=1 and hour=0
        data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

        # Group data by the new 'time' column and calculate statistics
        monthly_analysis = data.groupby('time')[temperature_column].agg(
            mean_temperature='mean',
            median_temperature='median',
            min_temperature='min',
            max_temperature='max'
        ).reset_index()

        # Save results into a new CSV file
        monthly_analysis.to_csv(output_csv_path, index=False)
        print(f"Monthly temperature statistics saved to: {output_csv_path}")

        # Save results into a new CSV file
        monthly_analysis.to_csv(output_csv_path, index=False)
        print(f"Monthly temperature statistics saved to: {output_csv_path}")

        # Create a pivot table for the second CSV file
        # Each column will have the min, max, and median temperatures for each month
        pivot_table = data.groupby(['year', 'month'])[temperature_column].agg(
            min_temperature='min',
            max_temperature='max',
            median_temperature='median'
        ).reset_index()

        # Reshape the table so that months become columns with their stats split
        reshaped_table = pivot_table.pivot(index='year', columns='month',
                                           values=['min_temperature', 'max_temperature', 'median_temperature'])

        # Optional: Rename columns for clarity (e.g., "Min_January", "Max_February")
        reshaped_table.columns = [f"{stat}_{month}" for stat, month in reshaped_table.columns]

        # Reset index to get 'year' back as a column (needed for proper CSV export)
        reshaped_table.reset_index(inplace=True)

        # Save the pivot table into the second CSV file
        reshaped_csv_path = output_csv_path.replace("_monthly_analysis", "_year_month_analysis")
        reshaped_table.to_csv(reshaped_csv_path, index=False)

    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage
file_path = "~/BR_VICOSA_TEMPERATURE_MONTHLY_1940-2024.csv"  # "~/BR_VICOSA_TEMPERATURE_20051231-20250101.csv"
output_csv_path = "BR_VICOSA_TEMPERATURE_1940-2024_monthly_analysis.csv"

# Call the function to analyze and export
analyze_temperature_by_month_and_year(file_path, output_csv_path)

Below is an updated version of your script that adds a new function, plot_temperature_by_month(), which aggregates the monthly data (taking into account that some entries may have days other than the first) and then creates faceted plots—one figure per statistic (min, max, and median), with 12 subplots (one per month) where the x-axis is the year and the y-axis is the temperature value.

import pandas as pd
import matplotlib.pyplot as plt

def rank_temperature_columns(headers):
    keyword_scores = {
        'temperature': 10,  # Exact match gets the highest score
        'temp': 5,        # Partial match gets a moderate score
    }
    ranked_columns = []
    for header in headers:
        score = 0
        lower_header = header.lower()
        for keyword, keyword_score in keyword_scores.items():
            if keyword in lower_header:
                score += keyword_score
        if score > 0:  # Only include columns with a relevant score
            ranked_columns.append((header, score))
    # Sort columns based on the scores in descending order
    ranked_columns.sort(key=lambda x: x[1], reverse=True)
    return ranked_columns

def analyze_temperature_by_month_and_year(file_path, output_csv_path):
    try:
        # Read the CSV file
        data = pd.read_csv(file_path)

        # Display headers for verification
        print("Available headers in the CSV file:")
        print(list(data.columns))

        # Rank potential temperature columns
        ranked_columns = rank_temperature_columns(data.columns)
        if not ranked_columns:
            print("No column likely refers to temperature. Please check your CSV file.")
            return

        # Select the most probable temperature column
        temperature_column = ranked_columns[0][0]
        print(f"Using '{temperature_column}' as the temperature column. (Ranking: {ranked_columns})")

        # Identify and handle the time-related column
        if 'time' not in ''.join([col.lower() for col in data.columns]):
            print("No suitable 'time' column found. Please ensure a time-related column exists.")
            return
        time_column = [col for col in data.columns if 'time' in col.lower()][0]
        data[time_column] = pd.to_datetime(data[time_column])

        # Extract year and month from the time column. This is robust to different day entries.
        data['year'] = data[time_column].dt.year
        data['month'] = data[time_column].dt.month

        # Create a 'time' column with day=1 and hour=0 to represent the month
        data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

        # Group data by the new 'time' column and calculate statistics
        monthly_analysis = data.groupby('time')[temperature_column].agg(
            mean_temperature='mean',
            median_temperature='median',
            min_temperature='min',
            max_temperature='max'
        ).reset_index()

        # Save results into a new CSV file
        monthly_analysis.to_csv(output_csv_path, index=False)
        print(f"Monthly temperature statistics saved to: {output_csv_path}")

        # Create a pivot table for further analysis: each column will have the min, max, and median temperatures for each month
        pivot_table = data.groupby(['year', 'month'])[temperature_column].agg(
            min_temperature='min',
            max_temperature='max',
            median_temperature='median'
        ).reset_index()

        # Reshape the table so that months become columns with their stats split
        reshaped_table = pivot_table.pivot(index='year', columns='month',
                                           values=['min_temperature', 'max_temperature', 'median_temperature'])
        # Optional: Flatten the (stat, month) column pairs into names such as "min_temperature_1"
        reshaped_table.columns = [f"{stat}_{month}" for stat, month in reshaped_table.columns]
        reshaped_table.reset_index(inplace=True)

        # Save the pivot table into the second CSV file
        reshaped_csv_path = output_csv_path.replace("_monthly_analysis", "_year_month_analysis")
        reshaped_table.to_csv(reshaped_csv_path, index=False)
        print(f"Pivot table saved to: {reshaped_csv_path}")

    except Exception as e:
        print(f"An error occurred: {e}")

def plot_temperature_by_month(file_path):
    """
    This function reads the CSV file (with potentially hourly entries), extracts the monthly aggregated data,
    and then creates faceted plots. For each temperature statistic (min, max, median), it creates a figure with 12 subplots,
    one for each month, where the x-axis is the year and the y-axis shows the temperature variable.
    """
    try:
        # Read data
        data = pd.read_csv(file_path)

        # Identify temperature column using the ranking function
        ranked_columns = rank_temperature_columns(data.columns)
        if not ranked_columns:
            print("No column likely refers to temperature. Please check your CSV file.")
            return
        temperature_column = ranked_columns[0][0]
        print(f"Using '{temperature_column}' as the temperature column for plotting. (Ranking: {ranked_columns})")

        # Identify the time column
        time_column_candidates = [col for col in data.columns if 'time' in col.lower()]
        if not time_column_candidates:
            print("No suitable 'time' column found for plotting.")
            return
        time_column = time_column_candidates[0]
        data[time_column] = pd.to_datetime(data[time_column])

        # Extract year and month from the time column. This handles entries with different day values correctly.
        data['year'] = data[time_column].dt.year
        data['month'] = data[time_column].dt.month

        # Group data by year and month and calculate statistics
        monthly_stats = data.groupby(['year', 'month'])[temperature_column].agg(
            min_temperature='min',
            max_temperature='max',
            median_temperature='median'
        ).reset_index()

        # Define the statistics to plot
        stats = ['min_temperature', 'max_temperature', 'median_temperature']

        # Create a faceted plot for each statistic (i.e., one figure per statistic with 12 subplots)
        for stat in stats:
            fig, axes = plt.subplots(3, 4, figsize=(15, 10), sharex=True, sharey=True)
            fig.suptitle(f"{stat.replace('_', ' ').title()} by Year for Each Month", fontsize=16)

            for ax, month in zip(axes.flatten(), range(1, 13)):
                subset = monthly_stats[monthly_stats['month'] == month]
                ax.plot(subset['year'], subset[stat], marker='o', linestyle='-')
                ax.set_title(f"Month: {month}")
                ax.set_xlabel("Year")
                ax.set_ylabel(temperature_column)
            fig.tight_layout(rect=[0, 0.03, 1, 0.95])
            plt.show()

    except Exception as e:
        print(f"An error occurred while plotting: {e}")

# Example usage:
file_path = "~/BR_VICOSA_TEMPERATURE_MONTHLY_1940-2024.csv"
output_csv_path = "BR_VICOSA_TEMPERATURE_1940-2024_monthly_analysis.csv"

# Analyze and export monthly statistics
analyze_temperature_by_month_and_year(file_path, output_csv_path)

# Create faceted plots for temperature statistics by month
plot_temperature_by_month(file_path)

Explanation

Robust Month Handling:
The script converts the time column into datetime format, then extracts the year and month regardless of the day value. This ensures that even if entries have different days, they still fall under the correct month.
Faceted Plots:
The new function, plot_temperature_by_month(), groups the data by year and month, calculates the min, max, and median, and then creates one figure per statistic. Each figure contains 12 subplots (facets) arranged in a 3×4 grid—one for each month. The x-axis represents the year, and the y-axis displays the chosen temperature column.
Reuse of Ranking Function:
The same rank_temperature_columns() function is used to automatically determine the temperature column, keeping your script consistent.

This approach, known as faceting (or using small multiples), allows you to visually compare trends across months with ease.
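To make the month handling described above concrete, here is a minimal sketch on synthetic data. The column names and values are made up for illustration and are not taken from your file. It shows that timestamps with different days still map to the same year and month, and that dt.to_period("M") is one possible single-step way to snap each timestamp to the start of its month.

```python
import pandas as pd

# Synthetic monthly records whose timestamps fall on different days (assumed layout).
raw = pd.DataFrame({
    "time": ["1950-01-01", "1950-02-15", "1950-03-31", "1951-01-28"],
    "temperature": [21.5, 22.1, 20.9, 21.8],
})
raw["time"] = pd.to_datetime(raw["time"])

# Year and month are read straight from the timestamp; the day is ignored.
raw["year"] = raw["time"].dt.year
raw["month"] = raw["time"].dt.month

# Alternative normalization: convert to a monthly period, then back to
# a timestamp, which lands on the first day of that month.
raw["month_start"] = raw["time"].dt.to_period("M").dt.to_timestamp()

print(raw[["time", "year", "month", "month_start"]])
```

Either representation can serve as the grouping key; the year/month pair is convenient for pivoting, while the month-start timestamp is convenient for time-series plots.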

Would you like any further customization or explanations on any part of the script?

Below is an updated version of your script. It now includes an extra function—export_monthly_faceted_csv()—that aggregates your hourly data into monthly statistics and then exports three CSV files: one for min temperatures, one for max temperatures, and one for median temperatures. In each CSV file, the rows represent years and the columns represent months (1 through 12).

import pandas as pd

def rank_temperature_columns(headers):
    keyword_scores = {
        'temperature': 10,  # Exact match gets the highest score
        'temp': 5,          # Partial match gets a moderate score
    }
    ranked_columns = []
    for header in headers:
        score = 0
        lower_header = header.lower()
        for keyword, keyword_score in keyword_scores.items():
            if keyword in lower_header:
                score += keyword_score
        if score > 0:  # Only include columns with a relevant score
            ranked_columns.append((header, score))
    # Sort columns based on the scores in descending order
    ranked_columns.sort(key=lambda x: x[1], reverse=True)
    return ranked_columns

def analyze_temperature_by_month_and_year(file_path, output_csv_path):
    try:
        # Read the CSV file
        data = pd.read_csv(file_path)

        # Display headers for verification
        print("Available headers in the CSV file:")
        print(list(data.columns))

        # Rank potential temperature columns
        ranked_columns = rank_temperature_columns(data.columns)
        if not ranked_columns:
            print("No column likely refers to temperature. Please check your CSV file.")
            return

        # Select the most probable temperature column
        temperature_column = ranked_columns[0][0]
        print(f"Using '{temperature_column}' as the temperature column. (Ranking: {ranked_columns})")

        # Identify and handle the time-related column
        time_column_candidates = [col for col in data.columns if 'time' in col.lower()]
        if not time_column_candidates:
            print("No suitable 'time' column found. Please ensure a time-related column exists.")
            return
        time_column = time_column_candidates[0]
        data[time_column] = pd.to_datetime(data[time_column])

        # Extract year and month from the time column. This handles entries with different day values correctly.
        data['year'] = data[time_column].dt.year
        data['month'] = data[time_column].dt.month

        # Create a 'time' column with day=1 and hour=0 to represent the month
        data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

        # Group data by the new 'time' column and calculate statistics
        monthly_analysis = data.groupby('time')[temperature_column].agg(
            mean_temperature='mean',
            median_temperature='median',
            min_temperature='min',
            max_temperature='max'
        ).reset_index()

        # Save results into a new CSV file
        monthly_analysis.to_csv(output_csv_path, index=False)
        print(f"Monthly temperature statistics saved to: {output_csv_path}")

        # Create a pivot table for a combined analysis (optional)
        pivot_table = data.groupby(['year', 'month'])[temperature_column].agg(
            min_temperature='min',
            max_temperature='max',
            median_temperature='median'
        ).reset_index()

        reshaped_table = pivot_table.pivot(index='year', columns='month',
                                           values=['min_temperature', 'max_temperature', 'median_temperature'])
        # Optional: Rename columns for clarity (e.g., "min_temperature_1", "max_temperature_1", etc.)
        reshaped_table.columns = [f"{stat}_{month}" for stat, month in reshaped_table.columns]
        reshaped_table.reset_index(inplace=True)

        reshaped_csv_path = output_csv_path.replace("_monthly_analysis", "_year_month_analysis")
        reshaped_table.to_csv(reshaped_csv_path, index=False)
        print(f"Pivot table saved to: {reshaped_csv_path}")

    except Exception as e:
        print(f"An error occurred: {e}")

def export_monthly_faceted_csv(file_path, output_csv_prefix):
    """
    This function reads the CSV file (which may contain hourly data), aggregates it by year and month,
    and then exports three CSV files containing pivot tables for min, max, and median temperatures.
    Each CSV file has years as rows and months (1-12) as columns.
    """
    try:
        # Read the CSV file
        data = pd.read_csv(file_path)

        # Identify temperature column using the ranking function
        ranked_columns = rank_temperature_columns(data.columns)
        if not ranked_columns:
            print("No column likely refers to temperature. Please check your CSV file.")
            return
        temperature_column = ranked_columns[0][0]
        print(f"Using '{temperature_column}' as the temperature column for export. (Ranking: {ranked_columns})")

        # Identify the time column
        time_column_candidates = [col for col in data.columns if 'time' in col.lower()]
        if not time_column_candidates:
            print("No suitable 'time' column found for export.")
            return
        time_column = time_column_candidates[0]
        data[time_column] = pd.to_datetime(data[time_column])

        # Extract year and month from the time column. This handles entries with different day values correctly.
        data['year'] = data[time_column].dt.year
        data['month'] = data[time_column].dt.month

        # Group data by year and month and calculate min, max, and median temperatures
        monthly_stats = data.groupby(['year', 'month'])[temperature_column].agg(
            min_temperature='min',
            max_temperature='max',
            median_temperature='median'
        ).reset_index()

        # Create pivot tables for each statistic with years as rows and months as columns
        min_pivot = monthly_stats.pivot(index='year', columns='month', values='min_temperature')
        max_pivot = monthly_stats.pivot(index='year', columns='month', values='max_temperature')
        median_pivot = monthly_stats.pivot(index='year', columns='month', values='median_temperature')

        # Define output file paths based on the provided prefix
        min_csv_path = f"{output_csv_prefix}_min.csv"
        max_csv_path = f"{output_csv_prefix}_max.csv"
        median_csv_path = f"{output_csv_prefix}_median.csv"

        # Export the pivot tables to CSV files
        min_pivot.to_csv(min_csv_path)
        max_pivot.to_csv(max_csv_path)
        median_pivot.to_csv(median_csv_path)

        print("Exported pivot tables to CSV:")
        print(f"Min temperatures: {min_csv_path}")
        print(f"Max temperatures: {max_csv_path}")
        print(f"Median temperatures: {median_csv_path}")

    except Exception as e:
        print(f"An error occurred during export: {e}")

# Example usage:
input_file_path = "~/BR_VICOSA_TEMPERATURE_MONTHLY_1940-2024.csv"
analysis_output_csv = "BR_VICOSA_TEMPERATURE_1940-2024_monthly_analysis.csv"
export_csv_prefix = "BR_VICOSA_TEMPERATURE_1940-2024_faceted"

# Generate aggregated monthly statistics and pivot table CSV files
analyze_temperature_by_month_and_year(input_file_path, analysis_output_csv)

# Export separate CSV files for min, max, and median temperatures with a faceted (year x month) structure
export_monthly_faceted_csv(input_file_path, export_csv_prefix)

How It Works

Reading and Ranking Columns:
The script first reads your CSV file and uses the rank_temperature_columns() function to determine which column most likely represents temperature.
Time Handling:
It converts the time-related column into a datetime format and extracts the year and month regardless of the actual day values. This ensures that all entries fall under the correct month.
Monthly Aggregation:
- The existing function analyze_temperature_by_month_and_year() aggregates the data by month and exports a CSV with overall statistics (mean, median, min, max) along with a reshaped pivot table.
- The new function, export_monthly_faceted_csv(), groups the data by year and month, calculates the min, max, and median, and then pivots each statistic into separate CSV files. Each file has years as rows and months as columns.
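
As an illustration of the pivot step, here is a small sketch on made-up numbers. The values and the two-year range are assumptions, not your data. It takes long-format statistics with one row per (year, month) and reshapes them into the year-by-month layout that each exported CSV uses.

```python
import pandas as pd

# Hypothetical long-format monthly statistics, one row per (year, month).
monthly_stats = pd.DataFrame({
    "year": [1950, 1950, 1951, 1951],
    "month": [1, 2, 1, 2],
    "min_temperature": [12.0, 13.5, 11.8, 13.1],
})

# Wide layout: one row per year, one column per month.
min_pivot = monthly_stats.pivot(index="year", columns="month", values="min_temperature")
print(min_pivot)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = min_pivot.to_csv()
print(csv_text)
```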

Feel free to adjust the output file paths or other parameters as needed.

From looking closely at the script you provided (and the symptoms in the plot), the most visible problem is that the code duplicates key lines in the analyze_temperature_by_month_and_year function. In particular:

data['time'] = pd.to_datetime(dict(...)) is repeated twice in a row.
monthly_analysis.to_csv(output_csv_path, index=False) is also called twice in a row.

Both statements are idempotent, so on their own the duplicates should not change the output. They are still a sign of a copy-paste slip and are worth removing before looking further into why “every second month” in your chart appears to have bizarre dips. Below is the relevant snippet showing the duplication:

# Identify and handle the time-related column
time_column = [col for col in data.columns if 'time' in col.lower()][0]
data[time_column] = pd.to_datetime(data[time_column])
data['year'] = data[time_column].dt.year
data['month'] = data[time_column].dt.month

# Here it is done once:
data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

# ...and here it is repeated again:
data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

...

# Also saving the CSV file twice:
monthly_analysis.to_csv(output_csv_path, index=False)
print(f"Monthly temperature statistics saved to: {output_csv_path}")

# Repeated immediately:
monthly_analysis.to_csv(output_csv_path, index=False)
print(f"Monthly temperature statistics saved to: {output_csv_path}")

Why This Causes Problems

Repeated data['time'] assignment
While assigning the same value to data['time'] twice might look harmless, it often indicates a copy-paste slip. If you ever modified data['year'] or data['month'] in between (or intended to), doing the same assignment again can scramble logic or cause confusion about which “time” you’re really grouping by.
Repeated CSV save
Calling to_csv twice to the same file does no further harm to the final CSV (it just overwrites itself), but it’s almost certainly not intentional and can be misleading if you intended to create two different outputs.
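
If you want to check whether the repeated assignment changes the grouped output at all, you can compare the statistics after one and after two assignments. The following is a minimal sketch on synthetic data; the column names, the values, and the helper month_start are assumptions for illustration only.

```python
import pandas as pd

# Synthetic data standing in for the real CSV (assumed column names).
data = pd.DataFrame({
    "time": pd.to_datetime(["1950-01-03", "1950-01-20", "1950-02-11", "1950-02-27"]),
    "temperature": [20.0, 22.0, 23.0, 25.0],
})
data["year"] = data["time"].dt.year
data["month"] = data["time"].dt.month

def month_start(df):
    # Same construction as in the script: first day of the month at midnight.
    return pd.to_datetime(dict(year=df["year"], month=df["month"], day=1, hour=0))

# Apply the assignment once, then apply it a second time on top of the first.
once = data.assign(month_time=month_start(data))
twice = once.assign(month_time=month_start(once))

stats_once = once.groupby("month_time")["temperature"].agg(["min", "max", "median"])
stats_twice = twice.groupby("month_time")["temperature"].agg(["min", "max", "median"])

print(stats_once.equals(stats_twice))
```

Running the same comparison on your real file is a quick way to confirm or rule out the duplicate line as the source of a plotting anomaly.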

How To Fix It

Remove the duplicated line that reconstructs the 'time' column. You only need to do this once:

data['time'] = pd.to_datetime(dict(year=data['year'],
                                   month=data['month'],
                                   day=1,
                                   hour=0))

…and delete the second identical line.

Remove one of the two CSV saves (or change the second one to a different filename if you really intended two separate files). For example, just keep:
```
monthly_analysis.to_csv(output_csv_path, index=False)
print(f"Monthly temperature statistics saved to: {output_csv_path}")
```
and delete the duplicate immediately below it.

Once you clean up these duplications, re-run the script and check the plot again. Because the repeated lines only redo identical work, a dip that persists afterwards would point to the timestamps or the raw data rather than to this code.

From your original script and these symptoms (missing values in the pivoted year-month table), the first thing to clean up is the duplicated line that sets data['time'], particularly if your source file already contains monthly (rather than hourly/daily) data. Note, though, that the pivot table is built from groupby(['year', 'month']) and not from data['time'], so “holes” in it more likely come from how the timestamps were parsed or from months that are missing in the source.

Where the Bug Hides

In the analyze_temperature_by_month_and_year() function, notice these lines:

# (1) Creating 'time' from year/month
data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

# (2) Repeated again, right after
data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

They appear twice in a row. The second one is redundant: it recomputes the same values, so it does not change the grouping by itself. Two things are still worth checking in this area:

Whether groupby('time') groups on the column you expect. If the source file already has a column named 'time', this assignment overwrites it.
Whether the original timestamps were parsed into the right month. A misparsed date puts a row in the wrong year/month cell and leaves its true cell empty.

Similarly, you also have two identical calls to monthly_analysis.to_csv(...) in a row. While that just overwrites the same file, it’s usually unintentional and can mask other mistakes.

How To Fix It

Remove the extra data['time'] = ... line.
Keep only one block that sets data['time'] from the extracted year and month. If your data is truly monthly, you may not need this step at all (unless you want a consistent “day=1” timestamp). But definitely don’t do it twice.
Remove or rename the duplicate CSV export line.
Having two identical calls to to_csv(output_csv_path, index=False) one after another is almost always a copy-paste bug.
Double-check that you truly have data for every month.
If your source data is missing certain months (for example, no data in 1955 for February), then the pivot table will show blank cells in those spots. That’s normal. If you want to see all months and all years filled, even if no data exists, you can explicitly reindex after pivoting:
```
all_years = range(data['year'].min(), data['year'].max() + 1)
all_months = range(1, 13)

min_pivot = (monthly_stats
    .pivot(index='year', columns='month', values='min_temperature')
    .reindex(index=all_years, columns=all_months))
```
Then any genuinely missing data will appear as NaN (empty) rather than simply omitting the row or column.
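
To list exactly which months are absent before reindexing, a small helper along these lines may help. The function name missing_year_months and the sample frame are hypothetical, introduced only for this sketch; it assumes the frame already has integer 'year' and 'month' columns as in the script.

```python
import pandas as pd

def missing_year_months(df, year_col="year", month_col="month"):
    """Return the (year, month) pairs absent from df, across its full year range."""
    present = set(zip(df[year_col].tolist(), df[month_col].tolist()))
    if not present:
        return []
    first_year = min(year for year, _ in present)
    last_year = max(year for year, _ in present)
    return [
        (year, month)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
        if (year, month) not in present
    ]

# Made-up example: 1955 has no February, 1956 is complete.
monthly_stats = pd.DataFrame({
    "year": [1955] * 11 + [1956] * 12,
    "month": [m for m in range(1, 13) if m != 2] + list(range(1, 13)),
})

print(missing_year_months(monthly_stats))
```

An empty result means every month in the year range has at least one row, so any remaining blank cells would have to come from missing temperature values rather than missing rows.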

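Once you remove the extra “day=1/hour=0” assignment and the duplicate CSV calls, the function performs each step exactly once. Any remaining “every-other-month” or “holes everywhere” anomalies can then be traced to the data itself: either months that are actually missing or timestamps that landed in the wrong month.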

provide a fixed version of this function

def analyze_temperature_by_month_and_year(file_path, output_csv_path):
    try:
        # Read the CSV file
        data = pd.read_csv(file_path)

        # Display headers for verification
        print("Available headers in the CSV file:")
        print(list(data.columns))

        # Rank potential temperature columns
        ranked_columns = rank_temperature_columns(data.columns)
        if not ranked_columns:
            print("No column likely refers to temperature. Please check your CSV file.")
            return

        # Select the most probable temperature column
        temperature_column = ranked_columns[0][0]
        print(f"Using '{temperature_column}' as the temperature column. (Ranking: {ranked_columns})")

        # Identify and handle the time-related column
        time_column_candidates = [col for col in data.columns if 'time' in col.lower()]
        if not time_column_candidates:
            print("No suitable 'time' column found. Please ensure a time-related column exists.")
            return
        time_column = time_column_candidates[0]
        data[time_column] = pd.to_datetime(data[time_column])

        # Extract year and month from the time column. This handles entries with different day values correctly.
        data['year'] = data[time_column].dt.year
        data['month'] = data[time_column].dt.month

        # Create a 'time' column with day=1 and hour=0 to represent the month
        data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

        # Group data by the new 'time' column and calculate statistics
        monthly_analysis = data.groupby('time')[temperature_column].agg(
            mean_temperature='mean',
            median_temperature='median',
            min_temperature='min',
            max_temperature='max'
        ).reset_index()

        # Save results into a new CSV file
        monthly_analysis.to_csv(output_csv_path, index=False)
        print(f"Monthly temperature statistics saved to: {output_csv_path}")

        # Create a pivot table for a combined analysis (optional)
        pivot_table = data.groupby(['year', 'month'])[temperature_column].agg(
            min_temperature='min',
            max_temperature='max',
            median_temperature='median'
        ).reset_index()

        reshaped_table = pivot_table.pivot(index='year', columns='month',
                                           values=['min_temperature', 'max_temperature', 'median_temperature'])
        # Optional: Rename columns for clarity (e.g., "min_temperature_1", "max_temperature_1", etc.)
        reshaped_table.columns = [f"{stat}_{month}" for stat, month in reshaped_table.columns]
        reshaped_table.reset_index(inplace=True)

        reshaped_csv_path = output_csv_path.replace("_monthly_analysis", "_year_month_analysis")
        reshaped_table.to_csv(reshaped_csv_path, index=False)
        print(f"Pivot table saved to: {reshaped_csv_path}")

    except Exception as e:
        print(f"An error occurred: {e}")

Below is a revised version of your function with added comments for clarity. It sets the 'time' column and writes each CSV file exactly once, so no processing step is duplicated:

def analyze_temperature_by_month_and_year(file_path, output_csv_path):
    try:
        # Read the CSV file
        data = pd.read_csv(file_path)
        print("Available headers in the CSV file:")
        print(list(data.columns))

        # Rank potential temperature columns and select the most likely one
        ranked_columns = rank_temperature_columns(data.columns)
        if not ranked_columns:
            print("No column likely refers to temperature. Please check your CSV file.")
            return
        temperature_column = ranked_columns[0][0]
        print(f"Using '{temperature_column}' as the temperature column. (Ranking: {ranked_columns})")

        # Identify and convert the time-related column
        time_column_candidates = [col for col in data.columns if 'time' in col.lower()]
        if not time_column_candidates:
            print("No suitable 'time' column found. Please ensure a time-related column exists.")
            return
        time_column = time_column_candidates[0]
        data[time_column] = pd.to_datetime(data[time_column])

        # Extract year and month (this works even if days differ)
        data['year'] = data[time_column].dt.year
        data['month'] = data[time_column].dt.month

        # Create a standardized 'time' column with day=1 and hour=0 for grouping by month
        data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

        # Group data by the new 'time' column and calculate statistics
        monthly_analysis = data.groupby('time')[temperature_column].agg(
            mean_temperature='mean',
            median_temperature='median',
            min_temperature='min',
            max_temperature='max'
        ).reset_index()

        # Save the monthly analysis CSV file
        monthly_analysis.to_csv(output_csv_path, index=False)
        print(f"Monthly temperature statistics saved to: {output_csv_path}")

        # Create a pivot table with years as rows and months as columns for further analysis
        pivot_table = data.groupby(['year', 'month'])[temperature_column].agg(
            min_temperature='min',
            max_temperature='max',
            median_temperature='median'
        ).reset_index()

        reshaped_table = pivot_table.pivot(index='year', columns='month',
                                           values=['min_temperature', 'max_temperature', 'median_temperature'])
        # Rename columns for clarity (e.g., "min_temperature_1", "max_temperature_1", etc.)
        reshaped_table.columns = [f"{stat}_{month}" for stat, month in reshaped_table.columns]
        reshaped_table.reset_index(inplace=True)

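        # Determine output path for the pivot table CSV file.
        # Note: if output_csv_path does not contain "_monthly_analysis", replace() is a no-op
        # and the pivot table overwrites the monthly CSV saved above.
        reshaped_csv_path = output_csv_path.replace("_monthly_analysis", "_year_month_analysis")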
        reshaped_table.to_csv(reshaped_csv_path, index=False)
        print(f"Pivot table saved to: {reshaped_csv_path}")

    except Exception as e:
        print(f"An error occurred: {e}")

Key Fixes:

Removed duplicate assignments:
Only one assignment of data['time'] is performed (using year and month with day=1, hour=0).
Removed duplicate CSV export:
The CSV file is now saved only once for each output.

These changes ensure that data grouping is handled correctly so that each month is properly represented without unwanted empty spots.
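To see why normalizing to the first of the month matters, here is a minimal sketch on made-up data (the column names and values are illustrative, not taken from your file). It uses a period conversion instead of rebuilding the date from year and month, which should give the same grouping key:

```python
import pandas as pd

# Toy monthly records whose timestamps fall on different days of the month.
df = pd.DataFrame({
    "time": pd.to_datetime(["2001-01-01", "2001-01-16", "2001-02-15", "2002-01-31"]),
    "temperature": [20.0, 22.0, 25.0, 21.0],
})

# Snap every timestamp to the first day of its month, then group on that key.
df["month_start"] = df["time"].dt.to_period("M").dt.to_timestamp()
monthly = df.groupby("month_start")["temperature"].agg(["min", "max", "median"])
print(monthly)
```

Both January 2001 rows end up in the same group even though one is dated the 1st and the other the 16th.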

Below is the updated function. In the pivot table section, after resetting the index, the “year” column is renamed to “time” before saving the CSV.

def analyze_temperature_by_month_and_year(file_path, output_csv_path):
    try:
        # Read the CSV file
        data = pd.read_csv(file_path)
        print("Available headers in the CSV file:")
        print(list(data.columns))

        # Rank potential temperature columns and select the most likely one
        ranked_columns = rank_temperature_columns(data.columns)
        if not ranked_columns:
            print("No column likely refers to temperature. Please check your CSV file.")
            return
        temperature_column = ranked_columns[0][0]
        print(f"Using '{temperature_column}' as the temperature column. (Ranking: {ranked_columns})")

        # Identify and convert the time-related column
        time_column_candidates = [col for col in data.columns if 'time' in col.lower()]
        if not time_column_candidates:
            print("No suitable 'time' column found. Please ensure a time-related column exists.")
            return
        time_column = time_column_candidates[0]
        data[time_column] = pd.to_datetime(data[time_column])

        # Extract year and month (this works even if days differ)
        data['year'] = data[time_column].dt.year
        data['month'] = data[time_column].dt.month

        # Create a standardized 'time' column with day=1 and hour=0 for grouping by month
        data['time'] = pd.to_datetime(dict(year=data['year'], month=data['month'], day=1, hour=0))

        # Group data by the new 'time' column and calculate statistics
        monthly_analysis = data.groupby('time')[temperature_column].agg(
            mean_temperature='mean',
            median_temperature='median',
            min_temperature='min',
            max_temperature='max'
        ).reset_index()

        # Save the monthly analysis CSV file
        monthly_analysis.to_csv(output_csv_path, index=False)
        print(f"Monthly temperature statistics saved to: {output_csv_path}")

        # Create a pivot table with years as rows and months as columns for further analysis
        pivot_table = data.groupby(['year', 'month'])[temperature_column].agg(
            min_temperature='min',
            max_temperature='max',
            median_temperature='median'
        ).reset_index()

        reshaped_table = pivot_table.pivot(index='year', columns='month',
                                           values=['min_temperature', 'max_temperature', 'median_temperature'])
        # Rename columns for clarity (e.g., "min_temperature_1", "max_temperature_1", etc.)
        reshaped_table.columns = [f"{stat}_{month}" for stat, month in reshaped_table.columns]
        reshaped_table.reset_index(inplace=True)

        # Rename 'year' to 'time' before saving the pivot table
        reshaped_table.rename(columns={'year': 'time'}, inplace=True)

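        # Determine output path for the pivot table CSV file.
        # Note: if output_csv_path does not contain "_monthly_analysis", replace() is a no-op
        # and the pivot table overwrites the monthly CSV saved above.
        reshaped_csv_path = output_csv_path.replace("_monthly_analysis", "_year_month_analysis")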
        reshaped_table.to_csv(reshaped_csv_path, index=False)
        print(f"Pivot table saved to: {reshaped_csv_path}")

    except Exception as e:
        print(f"An error occurred: {e}")

Explanation of Changes:

Renaming “year” column:
After resetting the index of the pivot table, the rename method is used to change the “year” column to “time”. This change reflects your requirement to have the final CSV use “time” instead of “year”.

Below is a fresh start. This function reads your file (assumed to be monthly data), extracts the year and month from your time column, groups by these fields (using the mean if multiple records exist per month), pivots the data into a table with years as rows and months (1–12) as columns, reindexes to force all months for all years (so that “gaps” are explicit as NaN rather than missing rows/columns), and finally renames the “year” column to “time” before saving.

import pandas as pd

def rank_temperature_columns(headers):
    # Substring ranking: a header containing 'temperature' also contains 'temp',
    # so it scores 15; a header containing only 'temp' scores 5
    keyword_scores = {'temperature': 10, 'temp': 5}
    ranked = []
    for header in headers:
        score = 0
        lower = header.lower()
        for keyword, weight in keyword_scores.items():
            if keyword in lower:
                score += weight
        if score > 0:
            ranked.append((header, score))
    ranked.sort(key=lambda x: x[1], reverse=True)
    return ranked

def export_faceted_monthly_csv(file_path, output_csv_path):
    try:
        # Read the CSV file
        data = pd.read_csv(file_path)
        print("Headers found:", list(data.columns))

        # Identify the temperature column using our ranking function
        ranked = rank_temperature_columns(data.columns)
        if not ranked:
            print("No temperature column found. Check the CSV file.")
            return
        temp_col = ranked[0][0]
        print(f"Using '{temp_col}' as the temperature column. (Ranking: {ranked})")

        # Identify the time-related column (assumes a header containing 'time')
        time_candidates = [col for col in data.columns if 'time' in col.lower()]
        if not time_candidates:
            print("No time column found. Check the CSV file.")
            return
        time_col = time_candidates[0]
        data[time_col] = pd.to_datetime(data[time_col])

        # Extract year and month from the time column
        data['year'] = data[time_col].dt.year
        data['month'] = data[time_col].dt.month

        # Group data by year and month.
        # If multiple entries exist per month, take the mean temperature.
        monthly = data.groupby(['year', 'month'])[temp_col].mean().reset_index()

        # Pivot the table so rows are years and columns are months (1-12)
        pivot = monthly.pivot(index='year', columns='month', values=temp_col)

        # Reindex to force all years and all months (1-12) to appear.
        all_years = range(data['year'].min(), data['year'].max() + 1)
        all_months = list(range(1, 13))
        pivot = pivot.reindex(index=all_years, columns=all_months)

        # Reset index to convert the year index into a column,
        # and then rename 'year' to 'time' per your request.
        pivot = pivot.reset_index().rename(columns={'year': 'time'})

        # Save the complete faceted pivot table to CSV.
        pivot.to_csv(output_csv_path, index=False)
        print(f"Faceted CSV saved to: {output_csv_path}")

    except Exception as e:
        print("An error occurred:", e)

# Example usage:
input_file_path = "BR_VICOSA_TEMPERATURE_MONTHLY_1940-2024.csv"
output_csv_path = "BR_VICOSA_TEMPERATURE_faceted.csv"
export_faceted_monthly_csv(input_file_path, output_csv_path)

Explanation

Temperature Column Detection:
The function uses rank_temperature_columns() to automatically choose the column that likely contains temperature data.
Time Parsing & Extraction:
It identifies a time column (based on headers containing "time"), converts it to datetime, and then extracts the year and month.
Grouping & Pivoting:
Data are grouped by year and month (averaging the temperature if needed) and then pivoted so that each row represents a year and each column represents a month (1 to 12).
Reindexing ensures every year and every month appears—even if some months have no data (those cells will be NaN).
Renaming & Saving:
Finally, after resetting the index, the "year" column is renamed to "time" before the pivot table is saved as a CSV.

This should eliminate unwanted gaps due to missing reindexing and provide a complete faceted view of your monthly data.
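As a small illustration of the reindexing step, here is a sketch on invented data (three rows, with a whole year missing) showing how absent year/month cells become explicit NaN values:

```python
import pandas as pd

# Toy data: 2001 has only January and March, 2002 is absent, 2003 has only December.
monthly = pd.DataFrame({
    "year": [2001, 2001, 2003],
    "month": [1, 3, 12],
    "temperature": [20.5, 22.0, 24.1],
})

pivot = monthly.pivot(index="year", columns="month", values="temperature")

# Force every year in the range and every month 1-12 to exist; absent cells become NaN.
all_years = range(monthly["year"].min(), monthly["year"].max() + 1)
pivot = pivot.reindex(index=all_years, columns=range(1, 13))
print(pivot.shape)
```

Before reindexing, the pivot has only two rows and three columns; afterwards it has one row per year and twelve month columns.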

Below is a “from‐scratch” version that tries to robustly read the file, ensure the time and temperature columns are parsed correctly (forcing numeric conversion on temperature), and then pivots the data so that rows are years (renamed to “time”) and columns are months (1–12). This version also reindexes to force every year/month combination to appear (if data is missing for some month, that cell will be NaN):

import pandas as pd

def rank_temperature_columns(headers):
    # Score headers by substring: 'temperature' also matches 'temp', so it scores 15; 'temp' alone scores 5.
    keyword_scores = {'temperature': 10, 'temp': 5}
    ranked = []
    for header in headers:
        score = 0
        lower = header.lower()
        for keyword, weight in keyword_scores.items():
            if keyword in lower:
                score += weight
        if score > 0:
            ranked.append((header, score))
    ranked.sort(key=lambda x: x[1], reverse=True)
    return ranked

def export_faceted_monthly_csv(file_path, output_csv_path):
    try:
        # Read CSV file
        data = pd.read_csv(file_path)
        print("Headers found:", data.columns.tolist())

        # Identify the time column.
        # If a column is literally named "time", use that; otherwise, pick the first that contains "time"
        if 'time' in data.columns:
            time_col = 'time'
        else:
            time_candidates = [col for col in data.columns if 'time' in col.lower()]
            if not time_candidates:
                raise ValueError("No time column found in the CSV file.")
            time_col = time_candidates[0]

        # Convert the time column to datetime; unparseable values become NaT
        data[time_col] = pd.to_datetime(data[time_col], errors='coerce')
        # Drop any rows where the time conversion failed
        data = data.dropna(subset=[time_col])

        # Extract year and month from the time column
        data['year'] = data[time_col].dt.year
        data['month'] = data[time_col].dt.month

        # Identify the temperature column using our ranking function
        ranked = rank_temperature_columns(data.columns)
        if not ranked:
            raise ValueError("No temperature column found in the CSV file.")
        temp_col = ranked[0][0]
        print(f"Using temperature column: {temp_col}")

        # Ensure the temperature column is numeric (in case it was read as a string)
        data[temp_col] = pd.to_numeric(data[temp_col], errors='coerce')

        # Create a pivot table: index=year, columns=month, value=temperature
        # We use mean in case there are multiple entries for the same year/month.
        pivot = pd.pivot_table(data, index='year', columns='month', values=temp_col, aggfunc='mean')

        # Reindex the pivot table so that every year (from min to max) and every month (1-12) appears.
        all_years = range(data['year'].min(), data['year'].max() + 1)
        all_months = list(range(1, 13))
        pivot = pivot.reindex(index=all_years, columns=all_months)

        # Reset index so that year becomes a column, then rename it to "time"
        pivot = pivot.reset_index().rename(columns={'year': 'time'})

        # Save the pivot table to CSV
        pivot.to_csv(output_csv_path, index=False)
        print(f"Faceted monthly CSV saved to: {output_csv_path}")

    except Exception as e:
        print("An error occurred:", e)

# Example usage:
input_file_path = "BR_VICOSA_TEMPERATURE_MONTHLY_1940-2024.csv"
output_csv_path = "BR_VICOSA_TEMPERATURE_faceted.csv"
export_faceted_monthly_csv(input_file_path, output_csv_path)

How This Works

Time Column Processing:
The function searches for a column named “time” (or containing “time”) and converts it to datetime with error‑coercion. Any rows with invalid dates are dropped.
Year/Month Extraction:
It then extracts the year and month from the datetime column.
Temperature Column Identification:
Using a simple ranking based on header keywords, it picks the most likely temperature column and forces its values to be numeric.
Pivot Table Creation:
A pivot table is built using pd.pivot_table(), with years as the index and months as the columns. The table is reindexed to include every year (from the minimum to maximum) and every month (1–12).
Renaming and Saving:
After resetting the index, the “year” column is renamed to “time” before saving the complete faceted CSV.

If you still see gaps (NaN values) in the output, it likely means that your source file truly has missing data for those year/month combinations. You can fill those gaps with a method such as forward-fill (e.g., pivot.ffill(), which carries the previous year's value for the same month forward) if that is acceptable for your analysis.
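Here is a rough sketch of two gap-filling options on a made-up pivot (years as rows, months as columns). Whether either is appropriate depends on your analysis, so treat it as an illustration only:

```python
import numpy as np
import pandas as pd

# Toy pivot: rows are years, columns are months; 2001 has no February value.
pivot = pd.DataFrame(
    {1: [20.0, 21.0, 22.0], 2: [23.0, np.nan, 25.0]},
    index=pd.Index([2000, 2001, 2002], name="year"),
)

# Forward-fill down each month column: the gap takes the previous year's value.
filled_ffill = pivot.ffill()

# Alternative: linear interpolation between neighbouring years of the same month.
filled_interp = pivot.interpolate(method="linear")

print(filled_ffill.loc[2001, 2], filled_interp.loc[2001, 2])
```

Forward-fill repeats the earlier year, while interpolation takes the midpoint of the two neighbouring years.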

Below is a fresh version of the function. It reads your file, automatically detects the temperature column, checks whether the median temperature suggests Fahrenheit values (using a heuristic threshold), converts them to Celsius if needed, and then pivots the data so that rows (renamed to “time”) are years and columns are months (1–12).

import pandas as pd

def rank_temperature_columns(headers):
    # Score headers by substring: 'temperature' also matches 'temp', so it scores 15; 'temp' alone scores 5.
    keyword_scores = {'temperature': 10, 'temp': 5}
    ranked = []
    for header in headers:
        score = 0
        lower = header.lower()
        for keyword, weight in keyword_scores.items():
            if keyword in lower:
                score += weight
        if score > 0:
            ranked.append((header, score))
    ranked.sort(key=lambda x: x[1], reverse=True)
    return ranked

def export_faceted_monthly_csv(file_path, output_csv_path):
    try:
        # Read CSV file
        data = pd.read_csv(file_path)
        print("Headers found:", data.columns.tolist())

        # Identify the time column.
        if 'time' in data.columns:
            time_col = 'time'
        else:
            time_candidates = [col for col in data.columns if 'time' in col.lower()]
            if not time_candidates:
                raise ValueError("No time column found in the CSV file.")
            time_col = time_candidates[0]

        # Convert the time column to datetime; drop rows that fail conversion.
        data[time_col] = pd.to_datetime(data[time_col], errors='coerce')
        data = data.dropna(subset=[time_col])

        # Extract year and month from the time column.
        data['year'] = data[time_col].dt.year
        data['month'] = data[time_col].dt.month

        # Identify the temperature column using our ranking function.
        ranked = rank_temperature_columns(data.columns)
        if not ranked:
            raise ValueError("No temperature column found in the CSV file.")
        temp_col = ranked[0][0]
        print(f"Using temperature column: {temp_col}")

        # Ensure the temperature column is numeric.
        data[temp_col] = pd.to_numeric(data[temp_col], errors='coerce')

        # Check if the temperature values are in Fahrenheit.
        # Heuristic: if the median value is above 45, assume Fahrenheit.
        median_temp = data[temp_col].median()
        if median_temp > 45:
            print(f"Median temperature ({median_temp}) is high; converting from Fahrenheit to Celsius.")
            data[temp_col] = (data[temp_col] - 32) * 5/9
        else:
            print(f"Median temperature ({median_temp}) seems to be in Celsius; no conversion applied.")

        # Create a pivot table: rows=year and columns=month, using the average temperature for each cell.
        monthly = data.groupby(['year', 'month'])[temp_col].mean().reset_index()
        pivot = monthly.pivot(index='year', columns='month', values=temp_col)

        # Reindex to force every year (min to max) and every month (1-12) to appear.
        all_years = range(data['year'].min(), data['year'].max() + 1)
        all_months = list(range(1, 13))
        pivot = pivot.reindex(index=all_years, columns=all_months)

        # Reset index so that the year becomes a column, and rename it to "time".
        pivot = pivot.reset_index().rename(columns={'year': 'time'})

        # Save the faceted pivot table to CSV.
        pivot.to_csv(output_csv_path, index=False)
        print(f"Faceted monthly CSV saved to: {output_csv_path}")

    except Exception as e:
        print("An error occurred:", e)

# Example usage:
input_file_path = "BR_VICOSA_TEMPERATURE_MONTHLY_1940-2024.csv"
output_csv_path = "BR_VICOSA_TEMPERATURE_faceted.csv"
export_faceted_monthly_csv(input_file_path, output_csv_path)

How This Works

Time Column Processing:
The function finds and converts the time column (using either the column named "time" or one that contains "time"). Invalid dates are dropped.
Year/Month Extraction:
It extracts the year and month from the time column.
Temperature Column Identification & Conversion:
The function identifies the temperature column via a simple keyword ranking. It forces numeric conversion and then checks the median value. If the median exceeds 45, it assumes the data is in Fahrenheit and converts it to Celsius using
$ C = \frac{5}{9}(F - 32) $
Otherwise, it leaves the values as they are.
Pivot Table Creation:
Data are grouped by year and month (using the mean for months with multiple entries), pivoted so that rows are years and columns are months, and then reindexed so that every year/month combination appears. After that, the index is reset and the “year” column is renamed to “time”.
Saving CSV:
Finally, the pivot table is saved to a CSV file.

This version should now automatically detect Fahrenheit values and convert them while producing a complete faceted CSV without unwanted gaps (except where data truly are missing).
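The conversion heuristic can be tried in isolation. The sketch below wraps it in a helper, `fahrenheit_to_celsius_if_needed`, which is a hypothetical name introduced here for illustration; the sample values are invented:

```python
import pandas as pd

def fahrenheit_to_celsius_if_needed(values: pd.Series, threshold: float = 45.0) -> pd.Series:
    """Convert to Celsius when the median exceeds `threshold` (a heuristic, not a guarantee)."""
    if values.median() > threshold:
        return (values - 32) * 5 / 9
    return values

warm_f = pd.Series([68.0, 77.0, 86.0])  # median 77, so treated as Fahrenheit
mild_c = pd.Series([18.0, 22.0, 26.0])  # median 22, so left unchanged

print(fahrenheit_to_celsius_if_needed(warm_f).tolist())
print(fahrenheit_to_celsius_if_needed(mild_c).tolist())
```

Keep in mind that Kelvin data also has a median above 45, so this check alone would mistake Kelvin for Fahrenheit.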

Below is the updated version that checks if the temperature values are in Kelvin (using a heuristic based on the median value) and converts them to Celsius. The conversion used is:

$ \text{Celsius} = \text{Kelvin} - 273.15 $

If the median value is above 150 (a threshold that indicates Kelvin values for typical Earth temperatures), the script will subtract 273.15 from all temperature readings.

import pandas as pd

def rank_temperature_columns(headers):
    # Score headers by substring: 'temperature' also matches 'temp', so it scores 15; 'temp' alone scores 5.
    keyword_scores = {'temperature': 10, 'temp': 5}
    ranked = []
    for header in headers:
        score = 0
        lower = header.lower()
        for keyword, weight in keyword_scores.items():
            if keyword in lower:
                score += weight
        if score > 0:
            ranked.append((header, score))
    ranked.sort(key=lambda x: x[1], reverse=True)
    return ranked

def export_faceted_monthly_csv(file_path, output_csv_path):
    try:
        # Read CSV file
        data = pd.read_csv(file_path)
        print("Headers found:", data.columns.tolist())

        # Identify the time column.
        if 'time' in data.columns:
            time_col = 'time'
        else:
            time_candidates = [col for col in data.columns if 'time' in col.lower()]
            if not time_candidates:
                raise ValueError("No time column found in the CSV file.")
            time_col = time_candidates[0]

        # Convert the time column to datetime; drop rows that fail conversion.
        data[time_col] = pd.to_datetime(data[time_col], errors='coerce')
        data = data.dropna(subset=[time_col])

        # Extract year and month from the time column.
        data['year'] = data[time_col].dt.year
        data['month'] = data[time_col].dt.month

        # Identify the temperature column using our ranking function.
        ranked = rank_temperature_columns(data.columns)
        if not ranked:
            raise ValueError("No temperature column found in the CSV file.")
        temp_col = ranked[0][0]
        print(f"Using temperature column: {temp_col}")

        # Ensure the temperature column is numeric.
        data[temp_col] = pd.to_numeric(data[temp_col], errors='coerce')

        # Check if the temperature values are in Kelvin.
        # Heuristic: If the median temperature is greater than 150, assume Kelvin.
        median_temp = data[temp_col].median()
        if median_temp > 150:
            print(f"Median temperature ({median_temp}) suggests data is in Kelvin; converting to Celsius.")
            data[temp_col] = data[temp_col] - 273.15
        else:
            print(f"Median temperature ({median_temp}) appears to be in Celsius; no conversion applied.")

        # Create a pivot table: rows=year and columns=month, using the average temperature for each cell.
        monthly = data.groupby(['year', 'month'])[temp_col].mean().reset_index()
        pivot = monthly.pivot(index='year', columns='month', values=temp_col)

        # Reindex to force every year (min to max) and every month (1-12) to appear.
        all_years = range(data['year'].min(), data['year'].max() + 1)
        all_months = list(range(1, 13))
        pivot = pivot.reindex(index=all_years, columns=all_months)

        # Reset index so that the year becomes a column, and rename it to "time".
        pivot = pivot.reset_index().rename(columns={'year': 'time'})

        # Save the faceted pivot table to CSV.
        pivot.to_csv(output_csv_path, index=False)
        print(f"Faceted monthly CSV saved to: {output_csv_path}")

    except Exception as e:
        print("An error occurred:", e)

# Example usage:
input_file_path = "BR_VICOSA_TEMPERATURE_MONTHLY_1940-2024.csv"
output_csv_path = "BR_VICOSA_TEMPERATURE_faceted.csv"
export_faceted_monthly_csv(input_file_path, output_csv_path)

How This Works

Time Column Processing:
The function identifies and converts the time column (using a column named "time" or one that contains "time") to datetime. Rows that fail conversion are dropped. Then it extracts the year and month.
Temperature Column Identification & Conversion:
The temperature column is selected via a ranking function based on header keywords. After converting it to numeric, the script checks if the median value is above 150. If so, it assumes the data are in Kelvin and subtracts 273.15 from all values to convert them to Celsius.
Pivot Table Creation:
The script groups data by year and month (averaging temperatures where needed), pivots the data so that rows are years and columns are months (1–12), and reindexes to ensure that all combinations are present. Finally, it resets the index and renames the “year” column to “time”.
Saving CSV:
The complete pivot table is saved to a CSV file.

This version should now correctly convert Kelvin to Celsius and produce a faceted CSV without unintended gaps (unless data are actually missing).
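If your files might arrive in any of the three units, one possible arrangement is to test for Kelvin first and Fahrenheit second, since Kelvin values would also pass the Fahrenheit threshold. The sketch below combines both thresholds used in this conversation; `to_celsius` is a hypothetical helper and the combination itself is an assumption, not something your script already does:

```python
import pandas as pd

def to_celsius(values: pd.Series) -> tuple[pd.Series, str]:
    """Guess the unit from the median and return (values in Celsius, guessed unit)."""
    median = values.median()
    if median > 150:   # far above any plausible Celsius or Fahrenheit median
        return values - 273.15, "kelvin"
    if median > 45:    # too warm for a typical Celsius median
        return (values - 32) * 5 / 9, "fahrenheit"
    return values, "celsius"

kelvin_values = pd.Series([293.15, 295.15, 297.15])
converted, unit = to_celsius(kelvin_values)
print(unit, converted.round(2).tolist())
```

The thresholds remain heuristics: a very hot Celsius series or a very cold Fahrenheit series could still be misclassified.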

I’m processing individual files, each one contains the daily values for February for 12 consecutive years.

I want to process all files, and make a csv where all those values are merged in columns. one column for each february of each year named “February {YEAR}”. Each row is a day in february.

here’s my code which needs updating

# Load and concatenate datasets
datasets = [open_dataset_file(f) for f in tqdm(selected_files, desc="Opening files")]

dataframes = []
for dataset in datasets:
    # Select the nearest grid point
    dataset_point = dataset.interp(latitude=LATITUDE, longitude=LONGITUDE, method='linear', kwargs={'fill_value': 'extrapolate'})

    # Process variable information
    primary_var = list(dataset_point.data_vars)[0]
    long_name = dataset_point[primary_var].attrs.get("long_name", primary_var)
    units = dataset_point[primary_var].attrs.get("units", "")

    dataset_point = refactor_dataset(dataset_point, primary_var).sortby("time")

    # Adjust temperature if necessary
    if VARIABLE == "TEMPERATURE":
        dataset_point = dataset_point - 273.15
        units = "C"

    pretty_name = f"{long_name} ({units})"

    # Convert to a pandas DataFrame and remove any rows with NaN values
    df = dataset_point.to_dataframe().reset_index().dropna()

    # 3. Prettify the Data Header using descriptive names and units
    # --------------------------------------------------------------
    # Loop over each data variable and, if available, rename using its 'long_name' and 'units'
    df = df.rename(columns={primary_var: pretty_name})

    # Ensure the time column is first
    cols = list(df.columns)
    if "time" in cols:
        cols.remove("time")
        cols = ["time"] + cols
    df = df[cols]

    # ---------------------------
    # FORMAT THE TIME COLUMN & CLEAN UP
    # ---------------------------
    # Convert `time` or `valid_time` to datetime and fill missing times with default '00:00:00'
    if 'valid_time' in df.columns:
        df['time'] = pd.to_datetime(df['valid_time'], errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')
        df = df.drop(columns=['valid_time'])  # Drop valid_time after use
    else:
        # Default time to '00:00:00' if only a date is present
        df['time'] = pd.to_datetime(df['time'], errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')

    # Convert back to datetime for full validation
    df['time'] = pd.to_datetime(df['time'], errors='coerce')

    dataframes.append(df)

Below is an updated version of your code that processes each file (each containing daily February values for 12 consecutive years) and then creates a CSV where each column is named “February {YEAR}” and each row is a day in February. This version does the following:

Opens each dataset and processes it (including linear interpolation to the target latitude/longitude, temperature conversion, and header prettification).
Filters for February data.
Extracts the day and year from the time column.
Pivots the DataFrame so that the index is the day of February and each column is named “February {YEAR}” (with the variable’s pretty name as the value).
Concatenates all resulting pivot tables (from all files) horizontally.
Saves the merged DataFrame to CSV.

Below is the complete code:

import pandas as pd
from tqdm import tqdm

def rank_temperature_columns(headers):
    # Score headers by substring: 'temperature' also matches 'temp', so it scores 15; 'temp' alone scores 5.
    keyword_scores = {'temperature': 10, 'temp': 5}
    ranked = []
    for header in headers:
        score = 0
        lower = header.lower()
        for keyword, weight in keyword_scores.items():
            if keyword in lower:
                score += weight
        if score > 0:
            ranked.append((header, score))
    ranked.sort(key=lambda x: x[1], reverse=True)
    return ranked

def process_all_february_files(selected_files, output_csv):
    all_pivot_dfs = []  # list to hold the pivoted DataFrame for each file

    for f in tqdm(selected_files, desc="Processing files"):
        # Open dataset (assumed to be defined elsewhere)
        dataset = open_dataset_file(f)

        # Interpolate linearly to the target point (extrapolating if it lies outside the grid)
        dataset_point = dataset.interp(latitude=LATITUDE, longitude=LONGITUDE,
                                       method='linear', kwargs={'fill_value': 'extrapolate'})

        # Process variable information
        primary_var = list(dataset_point.data_vars)[0]
        long_name = dataset_point[primary_var].attrs.get("long_name", primary_var)
        units = dataset_point[primary_var].attrs.get("units", "")

        # Refactor and sort by time
        dataset_point = refactor_dataset(dataset_point, primary_var).sortby("time")

        # Adjust temperature if necessary
        if VARIABLE == "TEMPERATURE":
            dataset_point = dataset_point - 273.15
            units = "C"

        pretty_name = f"{long_name} ({units})"

        # Convert dataset to a pandas DataFrame and drop rows with NaN values
        df = dataset_point.to_dataframe().reset_index().dropna()

        # Rename the primary variable column to the pretty name
        df = df.rename(columns={primary_var: pretty_name})

        # Ensure the time column is first
        cols = list(df.columns)
        if "time" in cols:
            cols.remove("time")
            cols = ["time"] + cols
        df = df[cols]

        # ---------------------------
        # FORMAT THE TIME COLUMN & CLEAN UP
        # ---------------------------
        # Convert 'time' (or 'valid_time' if available) to datetime; set default time if needed.
        if 'valid_time' in df.columns:
            df['time'] = pd.to_datetime(df['valid_time'], errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')
            df = df.drop(columns=['valid_time'])
        else:
            df['time'] = pd.to_datetime(df['time'], errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')
        df['time'] = pd.to_datetime(df['time'], errors='coerce')

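        # Filter only February values (month == 2); copy so the columns added below
        # go onto an independent DataFrame rather than a slice of the original
        df = df[df['time'].dt.month == 2].copy()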

        # Extract day of month and year
        df['day'] = df['time'].dt.day
        df['year'] = df['time'].dt.year

        # Pivot the data: index = day, columns = year (renamed to "February {YEAR}")
        pivot_df = df.pivot(index='day', columns='year', values=pretty_name)
        pivot_df = pivot_df.rename(columns=lambda y: f"February {y}")

        # Sort the pivot index (day)
        pivot_df = pivot_df.sort_index()

        # Append the pivot table to our list
        all_pivot_dfs.append(pivot_df)

    # Concatenate all pivot tables horizontally
    # (Assumes that the day index aligns across all files)
    final_df = pd.concat(all_pivot_dfs, axis=1)

    # Optionally, sort the columns alphabetically (or customize the order)
    final_df = final_df.reindex(sorted(final_df.columns), axis=1)

    # Save the final merged DataFrame to CSV.
    final_df.to_csv(output_csv, index=True)
    print(f"Merged CSV saved to: {output_csv}")

# Example usage:
# selected_files should be a list of file paths.
selected_files = [...]  # list of file paths to process
output_csv = "merged_february_values.csv"
process_all_february_files(selected_files, output_csv)

Explanation

File Processing:
Each file is opened and processed to extract the desired variable, linearly interpolated to the target latitude/longitude. Temperature values are adjusted (Kelvin to Celsius) if necessary.
Time Formatting:
The time column is converted to datetime. If the file contains only a date, a default time is added. Then we filter for February (month == 2).
Pivoting:
The DataFrame is pivoted so that the index is the day (1, 2, …, 28/29) and each column represents one February year (renamed to “February {YEAR}”).
Merging:
All pivoted DataFrames from the individual files are concatenated horizontally. Finally, the merged DataFrame is saved to CSV.

This should provide you with a CSV where each column corresponds to the daily values for February of a particular year, and each row represents a day in February.
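To check the pivot shape without opening any real files, here is a small sketch on synthetic daily data. The column name `t2m` and the values are placeholders, and it assumes exactly one value per day:

```python
import pandas as pd

# Synthetic daily series covering two Februaries (2000 is a leap year, 2001 is not).
times = pd.date_range("2000-02-01", "2000-02-29").append(
    pd.date_range("2001-02-01", "2001-02-28")
)
df = pd.DataFrame({"time": times, "t2m": range(len(times))})

# Keep February only, then derive the pivot keys on an independent copy.
feb = df[df["time"].dt.month == 2].copy()
feb["day"] = feb["time"].dt.day
feb["year"] = feb["time"].dt.year

# One row per day of February, one column per year.
wide = feb.pivot(index="day", columns="year", values="t2m")
wide = wide.rename(columns=lambda y: f"February {y}")
print(wide.tail(3))
```

Day 29 exists as a row because of the leap year, and it is NaN in the column for the non-leap year. If a file held more than one value per day, `pivot` would raise an error and the data would need aggregating first.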

Grab the maximum temperature for each February and build a new dataframe where the time column is each year (formatted as a full date) and there is just one column named February Max Temp

# Select the nearest grid point
dataset_point = dataset.interp(latitude=LATITUDE, longitude=LONGITUDE, method='linear', kwargs={'fill_value': 'extrapolate'})

# Process variable information
primary_var = list(dataset_point.data_vars)[0]
long_name = dataset_point[primary_var].attrs.get("long_name", primary_var)
units = dataset_point[primary_var].attrs.get("units", "")

# Adjust temperature if necessary
if VARIABLE == "TEMPERATURE":
    dataset_point = dataset_point - 273.15
    units = "C"

pretty_name = f"{long_name} ({units})"

# Convert to a pandas DataFrame and remove any rows with NaN values
df = dataset_point.to_dataframe().reset_index().dropna()

# Prettify the data header using the variable's descriptive name and units
df = df.rename(columns={primary_var: pretty_name})

# Ensure the time column is first
cols = list(df.columns)
if time_var in cols:
    cols.remove(time_var)
    cols = [time_var] + cols
df = df[cols]

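# Filter only February values (month == 2); copy so the new columns below
# are added to an independent DataFrame, not a view of the original
df = df[df[time_var].dt.month == 2].copy()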

# Extract day of month and year
df['day'] = df[time_var].dt.day
df['year'] = df[time_var].dt.year

# Pivot the data: index = day, columns = year (renamed to "February {YEAR}")
pivot_df = df.pivot(index='day', columns='year', values=pretty_name)
pivot_df = pivot_df.rename(columns=lambda y: f"February {y}")

# Sort the pivot index (day)
pivot_df = pivot_df.sort_index()

# Append the pivot table to our list
dataframes.append(pivot_df)

Below is an updated version of your code that, instead of pivoting by day, groups the February data by year and extracts the maximum temperature for each February. In the resulting DataFrame, the “time” column is built as a full date (using February 1st of each year), and there is a single column named “February Max Temp”.

# Process each file and build a new dataframe for February maximum temperatures
# (Assuming time_var is defined, e.g., time_var = "time")

# Interpolate linearly to the target latitude/longitude
dataset_point = dataset.interp(latitude=LATITUDE, longitude=LONGITUDE,
                               method='linear', kwargs={'fill_value': 'extrapolate'})

# Process variable information
primary_var = list(dataset_point.data_vars)[0]
long_name = dataset_point[primary_var].attrs.get("long_name", primary_var)
units = dataset_point[primary_var].attrs.get("units", "")

# Adjust temperature if necessary (Kelvin to Celsius)
if VARIABLE == "TEMPERATURE":
    dataset_point = dataset_point - 273.15
    units = "C"

pretty_name = f"{long_name} ({units})"

# Convert to a pandas DataFrame and remove any rows with NaN values
df = dataset_point.to_dataframe().reset_index().dropna()

# Prettify the Data Header using descriptive names and units
df = df.rename(columns={primary_var: pretty_name})

# Ensure the time column is first
cols = list(df.columns)
if time_var in cols:
    cols.remove(time_var)
    cols = [time_var] + cols
df = df[cols]

# Normalize the time column to datetime: use 'valid_time' if present, otherwise 'time'
if 'valid_time' in df.columns:
    df['time'] = pd.to_datetime(df['valid_time'], errors='coerce')
    df = df.drop(columns=['valid_time'])
else:
    df['time'] = pd.to_datetime(df['time'], errors='coerce')

# Filter only February values (month == 2)
df = df[df[time_var].dt.month == 2]

# Group by year and get maximum temperature for February
feb_max = df.groupby(df[time_var].dt.year)[pretty_name].max().reset_index()

# Create a full date for each year (using February 1st)
feb_max["time"] = pd.to_datetime(feb_max["year"].astype(str) + "-02-01")

# Build the final dataframe: time column and one column named "February Max Temp"
result_df = feb_max[["time", pretty_name]].rename(columns={pretty_name: "February Max Temp"})

# Append this result dataframe to your list (or process it further)
dataframes.append(result_df)

Explanation

Interpolation & Variable Processing:
The code linearly interpolates the dataset to the target latitude and longitude, extracts the primary variable, adjusts temperature values (Kelvin to Celsius if needed), and renames the column to include the descriptive name and units.
Time Column Formatting:
The time column is standardized (using either "time" or "valid_time" if present) and converted to a proper datetime.
Filtering for February:
Only rows where the month equals 2 are kept.
Grouping and Aggregation:
The code groups the data by the year (extracted from the time column) and computes the maximum temperature for each February.
Building a Full Date:
For each group (year), a full date is created (using February 1st as a representative date) so that the “time” column contains a full date.
Final DataFrame:
The resulting DataFrame has two columns: “time” (the full date representing the year) and “February Max Temp” (the maximum temperature for that February). This DataFrame is then appended to your list of dataframes.

You can then merge or further process the dataframes list as needed for your overall workflow.
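
As a small illustration of the time-column step, here is a hedged sketch of a helper that always leaves a single datetime column called "time". The function name `normalize_time_column` and the sample data are assumptions made for this example, not part of your script. Once the column has been normalized this way, the later filtering and grouping steps should reference "time" (so `time_var` should be "time" as well).

```python
import pandas as pd

def normalize_time_column(df):
    """Return a copy with a single datetime64 'time' column.

    Prefers 'valid_time' when present, otherwise uses 'time'.
    Rows whose timestamp cannot be parsed are dropped.
    """
    out = df.copy()
    source = "valid_time" if "valid_time" in out.columns else "time"
    out["time"] = pd.to_datetime(out[source], errors="coerce")
    if source != "time":
        out = out.drop(columns=[source])
    return out.dropna(subset=["time"])

# Hypothetical input: one row has an unparseable timestamp.
raw = pd.DataFrame({
    "valid_time": ["2024-02-01 00:00:00", "2024-02-02 06:00:00", "not a date"],
    "value": [1.0, 2.0, 3.0],
})
clean = normalize_time_column(raw)
print(clean)
```

Dropping unparseable rows is a design choice in this sketch; if silent data loss is a concern, you could count the NaT values and print a warning instead.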

The error occurs because grouping by df[time_var].dt.year keeps the name of the time column (for example "time") as the group key, so after reset_index() the result has no "year" column and feb_max["year"] raises a KeyError. To fix this, you can explicitly name the series "year" before grouping. Below is the updated code snippet:

# Filter only February values (month == 2)
df = df[df[time_var].dt.month == 2]

# Extract the year from the time column and explicitly name the series "year"
year_series = df[time_var].dt.year.rename("year")

# Group by the named year series and get the maximum temperature for February
feb_max = df.groupby(year_series)[pretty_name].max().reset_index()

# Create a full date for each year (using February 1st)
feb_max["time"] = pd.to_datetime(feb_max["year"].astype(str) + "-02-01")

# Build the final dataframe: time column and one column named "February Max Temp"
result_df = feb_max[["time", pretty_name]].rename(columns={pretty_name: "February Max Temp"})

# Append this result dataframe to your list (or process it further)
dataframes.append(result_df)

Explanation

Naming the Year Series:
We extract the year from the time column using
```
year_series = df[time_var].dt.year.rename("year")
```
This ensures that when we do the groupby, the resulting DataFrame has a column named "year".
Grouping and Aggregation:
We group by the year_series and compute the maximum temperature for February, then reset the index so that "year" becomes a column.
Building the Final DataFrame:
A new full date is created (using February 1st of each year) and then we build the final DataFrame with two columns: "time" (the full date) and "February Max Temp" (the maximum temperature).

This should resolve the KeyError and produce the desired output.
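
The naming behaviour is easy to check on a tiny example. The sketch below uses made-up data with a "temp" column; it only demonstrates how the group key's name ends up as the column name after `reset_index()`.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2020-02-01", "2020-02-15", "2021-02-10"]),
    "temp": [3.0, 7.5, 5.0],
})

# Grouping by the bare .dt.year Series keeps the source column's name ("time"),
# so the reset index column is called "time", not "year".
unnamed = df.groupby(df["time"].dt.year)["temp"].max().reset_index()

# Renaming the key first yields a "year" column after reset_index().
named = df.groupby(df["time"].dt.year.rename("year"))["temp"].max().reset_index()

print(list(unnamed.columns))
print(list(named.columns))
```

An equivalent option is to add a "year" column to the DataFrame and group by its name, which avoids the separate series altogether.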

Below is a complete code snippet that processes each file, extracts the maximum temperature for February, and builds a single DataFrame with two columns: a “time” column (using February 1st of each year as a representative full date) and a “February Max Temp” column.

import pandas as pd
from tqdm import tqdm

# Assume these functions and variables are defined elsewhere:
# - open_dataset_file(file_path): opens the dataset file.
# - refactor_dataset(dataset, primary_var): refactors the dataset as needed.
# - LATITUDE, LONGITUDE, VARIABLE, time_var (e.g., time_var = "time")

def process_february_max_temp(selected_files):
    results = []  # Will hold the per-file result DataFrames

    for f in tqdm(selected_files, desc="Processing files"):
        # Open the dataset file
        dataset = open_dataset_file(f)

        # Interpolate linearly to the target latitude/longitude
        dataset_point = dataset.interp(latitude=LATITUDE, longitude=LONGITUDE,
                                       method='linear', kwargs={'fill_value': 'extrapolate'})

        # Process variable information
        primary_var = list(dataset_point.data_vars)[0]
        long_name = dataset_point[primary_var].attrs.get("long_name", primary_var)
        units = dataset_point[primary_var].attrs.get("units", "")

        # Adjust temperature if necessary (e.g., Kelvin to Celsius)
        if VARIABLE == "TEMPERATURE":
            dataset_point = dataset_point - 273.15
            units = "C"

        pretty_name = f"{long_name} ({units})"

        # Convert to a pandas DataFrame and remove any rows with NaN values
        df = dataset_point.to_dataframe().reset_index().dropna()

        # Rename the primary variable column to a descriptive name
        df = df.rename(columns={primary_var: pretty_name})

        # Ensure the time column is first
        cols = list(df.columns)
        if time_var in cols:
            cols.remove(time_var)
            cols = [time_var] + cols
        df = df[cols]

        # Convert the time column to datetime
        if 'valid_time' in df.columns:
            df['time'] = pd.to_datetime(df['valid_time'], errors='coerce')
            df = df.drop(columns=['valid_time'])
        else:
            df['time'] = pd.to_datetime(df['time'], errors='coerce')

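        # Filter only February values (month == 2); copy so the "year" column
        # below is added to an independent DataFrame, not a view of the original
        df = df[df[time_var].dt.month == 2].copy()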

        # Extract the year and create a new "year" column
        df['year'] = df[time_var].dt.year

        # Group by year and get the maximum temperature for February
        feb_max = df.groupby('year')[pretty_name].max().reset_index()

        # Create a full date for each year using February 1st (as representative)
        feb_max["time"] = pd.to_datetime(feb_max["year"].astype(str) + "-02-01")

        # Build the resulting DataFrame with two columns: "time" and "February Max Temp"
        result_df = feb_max[["time", pretty_name]].rename(columns={pretty_name: "February Max Temp"})

        results.append(result_df)

    # Combine the results from all files vertically into a single DataFrame.
    final_df = pd.concat(results, ignore_index=True)
    final_df = final_df.sort_values("time").reset_index(drop=True)
    return final_df

# Example usage:
selected_files = [...]  # Replace with your list of file paths
final_df = process_february_max_temp(selected_files)
final_df.to_csv("merged_february_max_temp.csv", index=False)
print("Merged CSV saved to merged_february_max_temp.csv")

Explanation

File Processing:
Each file is opened and the data are linearly interpolated to the target latitude and longitude. The primary variable is determined, and the temperature is converted from Kelvin to Celsius when VARIABLE is "TEMPERATURE".
DataFrame Preparation:
The dataset is converted to a DataFrame. The primary variable column is renamed to a descriptive name (including units). The time column is converted to datetime.
Filtering and Grouping:
The DataFrame is filtered for February (month == 2). A new column "year" is created from the time column, and the data are grouped by year to extract the maximum temperature for each February.
Building the Final DataFrame:
For each year, a full date is constructed using February 1st. The result is then stored as a two‑column DataFrame (with columns "time" and "February Max Temp"). Finally, the results from all files are concatenated into a single DataFrame that is sorted by time.

This code produces a single CSV file named "merged_february_max_temp.csv" containing your desired two columns.
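
One thing to watch when combining results: if two files cover the same February (for example overlapping downloads), a plain vertical concat produces two rows for that year. The sketch below works on synthetic DataFrames rather than real dataset files, and the names `february_max_by_year`, `combine_results`, and the "temp" column are hypothetical. It keeps the larger of the two maxima per year, which is one reasonable way to resolve the overlap.

```python
import pandas as pd

def february_max_by_year(df, time_col="time", value_col="temp"):
    """Return one row per year: February 1st as the date, plus the February maximum."""
    feb = df[df[time_col].dt.month == 2].copy()
    feb["year"] = feb[time_col].dt.year
    out = feb.groupby("year")[value_col].max().reset_index()
    out["time"] = pd.to_datetime(out["year"].astype(str) + "-02-01")
    return out[["time", value_col]].rename(columns={value_col: "February Max Temp"})

def combine_results(results):
    """Stack per-file results and collapse duplicate years to a single maximum."""
    combined = pd.concat(results, ignore_index=True)
    # groupby sorts by "time", so the output is already in chronological order.
    return combined.groupby("time", as_index=False)["February Max Temp"].max()

# Two hypothetical "files" that both contain February 2021 data.
file_a = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-31", "2020-02-10", "2020-02-20", "2021-02-05"]),
    "temp": [9.0, 4.0, 6.5, 2.0],
})
file_b = pd.DataFrame({
    "time": pd.to_datetime(["2021-02-25", "2022-02-01", "2022-03-01"]),
    "temp": [8.0, 1.5, 12.0],
})

final_df = combine_results([february_max_by_year(f) for f in (file_a, file_b)])
print(final_df)
```

If your files never overlap in time, the simpler concat-and-sort in the function above gives the same result.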