Average Monthly Rainfall Calculation

2025-07-01 15:01 UTC o4-mini-high

Below is the approach commonly followed by national meteorological services and recommended in the WMO Guide to Climatological Practices for constructing a long‑term mean (climatology) of monthly precipitation when your raw data are daily totals.

1  Define the quantity you want

Daily data available: 1 value per calendar day, units = mm day⁻¹ (or inches day⁻¹).
Target statistic: “Average monthly rainfall for March” over 1940‑2025, units = mm per month (sometimes called the “normal” March precipitation).

Because you ultimately want a monthly amount, the daily values must first be aggregated within each year, then averaged across years.

2  Quality‑control and completeness checks

Screen the daily series for impossible or suspect values (negative rainfall, duplicate dates, etc.).
Count valid days for each March:
- n_valid_y  =  number of days in March with an accepted observation in year y.
- March always has 31 days, so the maximum is 31.
Decide an acceptance threshold. A common rule is to accept a month if at least 90 % of its days are present:

$ \text{Accept March}_y \text{ if } n_{\text{valid},y} \ge 0.9 \times 31 = 27.9, \text{ i.e. at least 28 days}. $

Keep only those years that pass the threshold. Let N be the resulting number of usable years.
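
The same threshold arithmetic can be applied to any calendar month. The following is a minimal sketch; the helper name `min_valid_days` is hypothetical and the 0.9 default simply mirrors the rule quoted above.

```python
import calendar
import math


def min_valid_days(year: int, month: int, threshold: float = 0.9) -> int:
    """Smallest whole number of valid days that meets the completeness threshold."""
    days_in_month = calendar.monthrange(year, month)[1]
    # round() guards against floating-point noise in the product before ceil()
    return math.ceil(round(threshold * days_in_month, 9))


print(min_valid_days(1940, 3))  # March: 0.9 * 31 = 27.9, so 28 days are needed
print(min_valid_days(1941, 2))  # February, non-leap year: 0.9 * 28 = 25.2, so 26
```

Rounding up matters: accepting a month with 27 valid days would fall just short of the 90 % rule for a 31-day month.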

3  Compute a total for each year

For every accepted year y:

$ P_{y,\text{March}} = \sum_{d=1}^{n_{\text{valid},y}} p_{y,d} \tag{1} $

where $p_{y,d}$ is the daily total on the d‑th valid day of March in year y.

If you must use a year with missing days (not recommended), scale the partial sum:

$ P_{\text{scaled},y} = P_{y,\text{observed}} \times \frac{31}{n_{\text{valid},y}}. $
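
As a rough illustration of the scaling adjustment, the sketch below applies it to made-up numbers. The function name `scaled_total` is hypothetical; the implicit assumption is that the missing days were, on average, as wet as the observed ones.

```python
def scaled_total(observed_total: float, n_valid: int, days_in_month: int = 31) -> float:
    """Scale a partial monthly sum up to a full-month estimate."""
    if n_valid <= 0:
        raise ValueError("need at least one valid day to scale")
    return observed_total * days_in_month / n_valid


# 58.0 mm observed over 29 valid days of March: 58.0 * 31 / 29 = 62.0 mm
print(scaled_total(58.0, 29))
```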

4  Average those yearly totals

The climatological mean precipitation for March is then

$ \boxed{\; \overline{P}_{\text{March}} = \frac{1}{N} \sum_{y=1}^{N} P_{y,\text{March}} \;} \tag{2} $

Units are mm month⁻¹ (or inches month⁻¹).
Every year is weighted equally, which is standard practice in climatology.

5  Alternative—but less preferred—day‑weighted method

Some software packages simply average all daily March values across the entire period and multiply by 31:

$ \overline{P}_{\text{March}} = 31 \times \frac{1}{\sum_{y} n_{\text{valid},y}} \sum_{y,d} p_{y,d}. $

Mathematically this is equivalent only if every year contributes the same number of days (complete data). When some years have more missing days than others, this day‑weighted approach gives heavier weight to years with more valid observations. For long‑term climate normals the year‑weighted formula (2) is preferred.
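
To see the difference in numbers, here is a small sketch with synthetic (made-up) data: one complete, wetter March and one drier March that is missing its last three days. The variable names are illustrative only.

```python
import pandas as pd

# Synthetic data: 1950 has 31 days of 2.0 mm, 1951 has only 28 valid days of 1.0 mm.
days_1950 = pd.date_range("1950-03-01", "1950-03-31", freq="D")
days_1951 = pd.date_range("1951-03-01", "1951-03-28", freq="D")
precip = pd.concat([
    pd.Series(2.0, index=days_1950),
    pd.Series(1.0, index=days_1951),
])

# Year-weighted (formula 2): total per year first, then average the totals.
yearly_totals = precip.groupby(precip.index.year).sum()
year_weighted = yearly_totals.mean()

# Day-weighted: pool all valid days, average, multiply by 31.
day_weighted = 31 * precip.mean()

print(f"year-weighted: {year_weighted:.2f}")  # (62 + 28) / 2 = 45.00
print(f"day-weighted:  {day_weighted:.2f}")   # 31 * 90 / 59 = 47.29
```

The complete year contributes 31 of the 59 pooled days, so the day-weighted figure is pulled towards it.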

6  Putting it in code (illustrative)

import pandas as pd

# df has a DatetimeIndex and a column 'precip'
march = df[df.index.month == 3].copy()   # copy so that adding a column below does not trigger SettingWithCopyWarning

# flag complete months
march['year'] = march.index.year
valid_counts = march.groupby('year')['precip'].count()
good_years = valid_counts[valid_counts >= 28].index        # ≥90 % complete

# monthly totals per good year
march_totals = march[march['year'].isin(good_years)].groupby('year')['precip'].sum()

# climatological mean
march_climatology = march_totals.mean()    # mm per month

7  Report clearly

“The 1940‑2025 climatological mean precipitation for March is XX mm month⁻¹ (N = ## complete years).”

Include metadata: station or region name, period of record, completeness criterion, and the method (year‑weighted mean of monthly totals).

Key take‑aways

Aggregate first, average second.
Weight each year equally unless you have explicit reason not to.
Handle missing data with a completeness threshold or a scaling adjustment.
State your method so results can be reproduced and compared.

By following these steps you produce a statistically sound, internationally comparable “average monthly rainfall” for March.

Below is a compact, self‑contained Python script that will

read any number of CSV files that each contain at least two columns: a date/time column and a rainfall column;
concatenate them into one DataFrame;
convert the date/time column to a proper datetime index;
calculate the monthly precipitation total for every year in the record;
compute the long‑term (multi‑year) mean for each calendar month.

#!/usr/bin/env python3
"""
Compute monthly rainfall totals for every year and the long‑term mean
--------------------------------------------------------------------

Assumptions
-----------
* Each CSV file has:
    - a column called 'date' (or pass --date-col to override)
    - a column called 'rain' (or pass --rain-col to override)
* Daily data are in mm (or any consistent unit).
* Rows with duplicate dates are dropped (the first occurrence is kept), and non‑numeric rain values are coerced to NaN.
* A month is considered 'usable' for a given year if it has ≥ 90 % of its
  possible calendar days (28/29 for Feb, 30 or 31 otherwise).  You can change
  the completeness threshold with --threshold.

Usage
-----
$ python monthly_rainfall.py file1.csv file2.csv ...

Optional flags:
    --date-col DATECOL    name of the date column (default: date)
    --rain-col RAINCOL    name of the rainfall column (default: rain)
    --threshold T         completeness ratio (0–1), default 0.9

Outputs
-------
* Prints a table of (year, month, total) to stdout.
* Prints the long‑term mean for each month at the end.
* Returns exit status 0 on success.

Author: <you>
Date:   2025‑07‑01
"""
from __future__ import annotations  # lets the `list[str] | None` hint work on Python < 3.10

import argparse
from pathlib import Path

import pandas as pd

def read_and_concat(files: list[Path], date_col: str, rain_col: str) -> pd.DataFrame:
    """Read multiple CSVs and concatenate them into one DataFrame."""
    frames = []
    for f in files:
        df = (
            pd.read_csv(f, parse_dates=[date_col])  # infer_datetime_format is deprecated since pandas 2.0
            .rename(columns={date_col: "date", rain_col: "rain"})
            .loc[:, ["date", "rain"]]
        )
        frames.append(df)
    if not frames:
        raise ValueError("No CSV files read.")
    data = (
        pd.concat(frames, ignore_index=True)
        .drop_duplicates(subset="date")
        .set_index("date")
        .sort_index()
    )
    # ensure numeric, force errors to NaN
    data["rain"] = pd.to_numeric(data["rain"], errors="coerce")
    return data

def monthly_totals(df: pd.DataFrame, completeness: float = 0.9) -> pd.DataFrame:
    """
    Aggregate daily rainfall to monthly totals *per calendar year*.

    A month is kept only if at least `completeness` fraction of days are present.
    """
    # 1. daily counts and sums (a day's sum is NaN only if it has no valid observation)
    daily_counts = df["rain"].resample("D").count()           # non‑missing daily obs
    daily_sum = df["rain"].resample("D").sum(min_count=1)     # NaN if all missing

    # 2. calculate counts and sums per month
    count_monthly = daily_counts.resample("MS").sum()
    sum_monthly = daily_sum.resample("MS").sum(min_count=1)

    # 3. filter months that do not meet completeness requirement
    month_lengths = count_monthly.index.days_in_month
    good = (count_monthly / month_lengths) >= completeness

    totals = (
        sum_monthly[good]
        .to_frame(name="monthly_total")
        .assign(year=lambda x: x.index.year, month=lambda x: x.index.month)
        .reset_index(drop=True)
    )
    return totals[["year", "month", "monthly_total"]]

def long_term_mean(monthly_df: pd.DataFrame) -> pd.Series:
    """Return the long‑term mean precipitation for each calendar month."""
    return (
        monthly_df.groupby("month")["monthly_total"]
        .mean()
        .sort_index()  # months 1‑12
    )

def main(argv: list[str] | None = None) -> None:
    parser = argparse.ArgumentParser(description="Calculate monthly and climatological rainfall.")
    parser.add_argument("csv_files", nargs="+", type=Path, help="CSV file(s) with date and rain columns")
    parser.add_argument("--date-col", default="date", help="Name of the date column (default: 'date')")
    parser.add_argument("--rain-col", default="rain", help="Name of the rainfall column (default: 'rain')")
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.9,
        help="Fraction of days that must be present for a month to be used (default: 0.9)",
    )
    args = parser.parse_args(argv)

    # 1. read data
    df = read_and_concat(args.csv_files, args.date_col, args.rain_col)

    # 2. monthly totals per year
    monthly = monthly_totals(df, completeness=args.threshold)

    # 3. long‑term monthly means
    climatology = long_term_mean(monthly)

    # ---- output ----
    print("Monthly totals (one row per good month):")
    print(monthly.to_string(index=False, formatters={"monthly_total": "{:.2f}".format}))

    print("\nLong‑term mean precipitation by calendar month (mm month⁻¹):")
    for m, val in climatology.items():
        print(f"{m:02d}: {val:.2f}")

    print("\nDone.")

if __name__ == "__main__":  # pragma: no cover
    main()

How the script works – key steps

Step	Purpose	`pandas` operations involved
1. Read & clean	Combine all CSVs, parse dates, coerce non‑numeric rain to `NaN`.	`read_csv`, `concat`, `set_index`, `to_numeric`
2. Daily counts/sums	For each day, count the valid observations and sum the rainfall.	`resample("D")`
3. Monthly aggregation	For every calendar month: total rainfall and number of valid days.	`resample("MS").sum()`
4. Completeness filter	Keep month only if ≥ 90 % of days were observed (default, adjustable).	Boolean mask
5. Year‑by‑year totals	Each retained month produces a row `(year, month, total)`.	Add columns from index, `reset_index`
6. Long‑term mean	Average those year‑specific monthly totals across all years.	`groupby("month").mean()`

By separating the yearly totals from the multi‑year average, every year contributes one vote to the climatology, matching WMO‑recommended practice. You can now extend the script—for example, writing the outputs to CSV or plotting the climatology—without touching the core logic.
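
The completeness filter is the step most worth checking by hand. The sketch below runs the same idea on a tiny synthetic series (values are made up) in which April is missing ten days. It resamples straight to months rather than via daily counts, which is equivalent when there is at most one observation per day.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: 1.0 mm every day in March and April 2000.
idx = pd.date_range("2000-03-01", "2000-04-30", freq="D")
rain = pd.Series(1.0, index=idx)
rain.loc["2000-04-01":"2000-04-10"] = np.nan   # 10 missing days in April

counts = rain.resample("MS").count()            # valid days per month
totals = rain.resample("MS").sum(min_count=1)   # monthly totals
month_lengths = np.asarray(counts.index.days_in_month)

complete = (counts / month_lengths) >= 0.9
print(totals[complete])   # only March survives: 31 of 31 days vs 20 of 30 in April
```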

Below is an updated, single‑file script that

reads and concatenates the input CSVs,
sorts the combined data chronologically,
writes the merged series to a new CSV (merged_rainfall.csv by default), and then
performs the same monthly‑total and long‑term‑mean calculations as in the previous answer.

Save the code as, for example, rainfall_climatology.py.

#!/usr/bin/env python3
"""
rainfall_climatology.py
-----------------------

Create a merged, chronologically ordered rainfall time series out of multiple
CSV files and compute (a) monthly totals for every year and (b) the long‑term
mean for each calendar month.

Usage
-----
$ python rainfall_climatology.py file1.csv file2.csv ...

Optional arguments
------------------
  --date-col COL        name of the date column   (default: date)
  --rain-col COL        name of the rainfall col  (default: rain)
  --threshold X         min fraction of observed days to accept a month
                        (0 < X ≤ 1, default 0.9 → 90 %)
  --merged-out FILE     path for merged CSV       (default: merged_rainfall.csv)
  --quiet               suppress row‑by‑row screen output
"""
from __future__ import annotations

import argparse
from pathlib import Path

import pandas as pd

# ──────────────────────────────── helpers ────────────────────────────────
def read_concat_sort(
    files: list[Path], date_col: str, rain_col: str
) -> pd.DataFrame:
    """Read, standardise columns, concatenate, and sort chronologically."""
    frames = []
    for f in files:
        df = (
            pd.read_csv(f, parse_dates=[date_col])  # infer_datetime_format is deprecated since pandas 2.0
            .rename(columns={date_col: "date", rain_col: "rain"})
            .loc[:, ["date", "rain"]]
        )
        frames.append(df)

    if not frames:
        raise ValueError("No CSV files supplied.")

    merged = (
        pd.concat(frames, ignore_index=True)
        .drop_duplicates(subset="date")          # keep first occurrence
        .dropna(subset=["date"])                 # guard against malformed rows
        .set_index("date")
        .sort_index()                            # chronological order
    )

    merged["rain"] = pd.to_numeric(merged["rain"], errors="coerce")
    return merged

def monthly_totals(df: pd.DataFrame, min_frac: float = 0.9) -> pd.DataFrame:
    """
    Convert a daily rainfall series to monthly totals,
    keeping only months with ≥ `min_frac` of days present.
    """
    daily_count = df["rain"].resample("D").count()
    daily_sum = df["rain"].resample("D").sum(min_count=1)

    m_count = daily_count.resample("MS").sum()                       # days available
    m_sum = daily_sum.resample("MS").sum(min_count=1)                # monthly total

    month_length = m_count.index.days_in_month
    mask = (m_count / month_length) >= min_frac

    good_months = (
        m_sum[mask]
        .to_frame("monthly_total")
        .assign(year=lambda x: x.index.year, month=lambda x: x.index.month)
        .reset_index(drop=True)
    )
    return good_months[["year", "month", "monthly_total"]]

def climatology(monthly_df: pd.DataFrame) -> pd.Series:
    """Average the monthly totals across all usable years (year‑weighted)."""
    return (
        monthly_df.groupby("month")["monthly_total"]
        .mean()
        .sort_index()
    )

# ──────────────────────────────── main ────────────────────────────────
def main(argv: list[str] | None = None) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series]:
    p = argparse.ArgumentParser(description="Merge rainfall CSVs and compute climatology.")
    p.add_argument("csv_files", nargs="+", type=Path, help="Input CSV file(s)")
    p.add_argument("--date-col", default="date", help="Date column name (default: date)")
    p.add_argument("--rain-col", default="rain", help="Rainfall column name (default: rain)")
    p.add_argument(
        "--threshold",
        type=float,
        default=0.9,
        metavar="X",
        help="Fraction of days required for a month to be accepted (default: 0.9)",
    )
    p.add_argument(
        "--merged-out",
        default="merged_rainfall.csv",
        metavar="FILE",
        help="Filename for the merged output CSV (default: merged_rainfall.csv)",
    )
    p.add_argument("--quiet", action="store_true", help="Suppress screen output")
    args = p.parse_args(argv)

    # 1. concatenate and sort
    merged = read_concat_sort(args.csv_files, args.date_col, args.rain_col)

    # 2. write merged file
    merged.to_csv(args.merged_out, index_label="date", float_format="%.3f")
    if not args.quiet:
        print(f"[info] Merged series saved to {args.merged_out} "
              f"({len(merged):,} rows, {merged.index[0].date()}–{merged.index[-1].date()}).")

    # 3. monthly totals per year
    monthly_df = monthly_totals(merged, min_frac=args.threshold)
    if not args.quiet:
        print("\nMonthly totals (filtered by completeness):")
        print(monthly_df.to_string(index=False, formatters={"monthly_total": "{:.2f}".format}))

    # 4. long‑term mean for each calendar month
    clim = climatology(monthly_df)
    if not args.quiet:
        print("\nLong‑term mean precipitation (mm per month):")
        for m, v in clim.items():
            print(f"{m:02d}: {v:.2f}")

    # Return the results so that another program importing this module can use
    # them; when the script is run from the shell the return value is discarded.
    return merged, monthly_df, clim

if __name__ == "__main__":
    main()

What changed compared with the previous version?

Section	Purpose	Key lines
`read_concat_sort()`	Replaces `read_and_concat()`: concatenates all files, sorts rows by timestamp, de‑duplicates, drops rows with missing dates, ensures numeric rainfall.	`pd.concat`, `.drop_duplicates`, `.dropna`, `.sort_index()`
Write merged CSV	Saves the unified time‑ordered series so you have a permanent, reproducible input for later analyses or visualisation.	`merged.to_csv(args.merged_out, …)`
Command‑line options	Added `--merged-out` and `--quiet`.	Argument parser
Return values	`main()` optionally returns the three key DataFrames/Series, making the module import‑friendly.	`return merged, monthly_df, clim`

The analysis pipeline itself (monthly aggregation + climatological mean) is unchanged; we simply inserted the merge‑and‑save step ahead of it. You can now call the script on dozens of source files, safe in the knowledge that the downstream statistics all refer to the exact, archived merged data set.

Here’s the updated rainfall_climatology.py with the merged CSV date column now written out in MM-DD-YYYY format:

#!/usr/bin/env python3
"""
rainfall_climatology.py
-----------------------

Create a merged, chronologically ordered rainfall time series out of multiple
CSV files (with dates reformatted as MM-DD-YYYY) and compute (a) monthly totals
for every year and (b) the long-term mean for each calendar month.

Usage
-----
$ python rainfall_climatology.py file1.csv file2.csv ...

Optional arguments
------------------
  --date-col COL        name of the date column   (default: date)
  --rain-col COL        name of the rainfall col  (default: rain)
  --threshold X         min fraction of observed days to accept a month
                        (0 < X ≤ 1, default 0.9 → 90 %)
  --merged-out FILE     path for merged CSV       (default: merged_rainfall.csv)
  --quiet               suppress row-by-row screen output
"""
from __future__ import annotations

import argparse
from pathlib import Path

import pandas as pd

def read_concat_sort(
    files: list[Path], date_col: str, rain_col: str
) -> pd.DataFrame:
    """Read, standardise columns, concatenate, and sort chronologically."""
    frames = []
    for f in files:
        df = (
            pd.read_csv(f, parse_dates=[date_col])  # infer_datetime_format is deprecated since pandas 2.0
              .rename(columns={date_col: "date", rain_col: "rain"})
              .loc[:, ["date", "rain"]]
        )
        frames.append(df)

    if not frames:
        raise ValueError("No CSV files supplied.")

    merged = (
        pd.concat(frames, ignore_index=True)
          .drop_duplicates(subset="date")
          .dropna(subset=["date"])
          .set_index("date")
          .sort_index()
    )
    merged["rain"] = pd.to_numeric(merged["rain"], errors="coerce")
    return merged

def monthly_totals(df: pd.DataFrame, min_frac: float = 0.9) -> pd.DataFrame:
    """
    Convert a daily rainfall series to monthly totals,
    keeping only months with ≥ `min_frac` of days present.
    """
    daily_count = df["rain"].resample("D").count()
    daily_sum   = df["rain"].resample("D").sum(min_count=1)

    m_count = daily_count.resample("MS").sum()
    m_sum   = daily_sum.resample("MS").sum(min_count=1)

    month_length = m_count.index.days_in_month
    mask = (m_count / month_length) >= min_frac

    good = (
        m_sum[mask]
        .to_frame("monthly_total")
        .assign(year=lambda x: x.index.year,
                month=lambda x: x.index.month)
        .reset_index(drop=True)
    )
    return good[["year", "month", "monthly_total"]]

def climatology(monthly_df: pd.DataFrame) -> pd.Series:
    """Average the monthly totals across all usable years (year-weighted)."""
    return (
        monthly_df.groupby("month")["monthly_total"]
        .mean()
        .sort_index()
    )

def main(argv: list[str] | None = None) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series]:
    p = argparse.ArgumentParser(description="Merge rainfall CSVs and compute climatology.")
    p.add_argument("csv_files", nargs="+", type=Path, help="Input CSV file(s)")
    p.add_argument("--date-col",   default="date", help="Date column name (default: date)")
    p.add_argument("--rain-col",   default="rain", help="Rainfall column name (default: rain)")
    p.add_argument("--threshold",  type=float, default=0.9,
                   metavar="X", help="Fraction of days required per month (default: 0.9)")
    p.add_argument("--merged-out", default="merged_rainfall.csv", metavar="FILE",
                   help="Filename for the merged output CSV (default: merged_rainfall.csv)")
    p.add_argument("--quiet", action="store_true", help="Suppress screen output")
    args = p.parse_args(argv)

    # 1. concatenate and sort
    merged = read_concat_sort(args.csv_files, args.date_col, args.rain_col)

    # 2. write merged file with MM-DD-YYYY date format
    merged.to_csv(
        args.merged_out,
        index_label="date",
        date_format="%m-%d-%Y",
        float_format="%.3f"
    )
    if not args.quiet:
        print(f"[info] Merged series saved to {args.merged_out} "
              f"({len(merged):,} rows, {merged.index[0].date():%m-%d-%Y}–"
              f"{merged.index[-1].date():%m-%d-%Y}).")

    # 3. monthly totals per year
    monthly_df = monthly_totals(merged, min_frac=args.threshold)
    if not args.quiet:
        print("\nMonthly totals (filtered by completeness):")
        print(monthly_df.to_string(index=False, formatters={"monthly_total": "{:.2f}".format}))

    # 4. long-term mean for each calendar month
    clim = climatology(monthly_df)
    if not args.quiet:
        print("\nLong-term mean precipitation (mm per month):")
        for m, v in clim.items():
            print(f"{m:02d}: {v:.2f}")

    return merged, monthly_df, clim

if __name__ == "__main__":
    main()

What’s changed?

In the to_csv() call we added
```
date_format="%m-%d-%Y"
```
so that the date index is written out as MM-DD-YYYY.
The info-print also reflects that format in its start/end dates.

Now when you run:

python rainfall_climatology.py data1.csv data2.csv

it will produce a merged_rainfall.csv whose date column looks like:

date,rain
03-01-1940,5.200
03-02-1940,0.000
...
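
One thing to keep in mind: MM-DD-YYYY can be confused with DD-MM-YYYY when the file is read back. A hedged sketch of a safe round trip follows, using an inline sample in place of the real merged_rainfall.csv and passing the format explicitly.

```python
import io

import pandas as pd

# Inline stand-in for merged_rainfall.csv (sample values only).
csv_text = "date,rain\n03-01-1940,5.200\n03-02-1940,0.000\n"

merged = pd.read_csv(io.StringIO(csv_text))
# An explicit format removes any month/day ambiguity.
merged["date"] = pd.to_datetime(merged["date"], format="%m-%d-%Y")
merged = merged.set_index("date")

print(merged.index[1])   # 2 March 1940, not 3 February
```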

Below is the updated rainfall_climatology.py that now validates the total rainfall sum in your merged DataFrame against the sum of the individual inputs before de-duplication. If the two sums differ, it will print a warning (and you can choose to raise an error instead).

#!/usr/bin/env python3
"""
rainfall_climatology.py
-----------------------

Merge multiple rainfall CSVs, validate the merge, reformat dates,
and compute monthly totals + long-term climatology.

Usage
-----
$ python rainfall_climatology.py file1.csv file2.csv ...

Optional arguments
------------------
  --date-col COL        name of the date column   (default: date)
  --rain-col COL        name of the rainfall col  (default: rain)
  --threshold X         frac of days to accept month (default: 0.9)
  --merged-out FILE     filename for merged CSV       (default: merged_rainfall.csv)
  --quiet               suppress screen output
"""
from __future__ import annotations

import argparse
from pathlib import Path

import pandas as pd

def read_concat_sort(
    files: list[Path], date_col: str, rain_col: str
) -> tuple[pd.DataFrame, float]:
    """
    Read & concatenate all CSVs, sort chronologically, drop dupes.
    Returns (merged_df, raw_sum), where raw_sum is the sum of rain from
    all inputs prior to de-duplication.
    """
    frames = []
    raw_sum = 0.0

    for f in files:
        df = (
            pd.read_csv(f, parse_dates=[date_col])  # infer_datetime_format is deprecated since pandas 2.0
              .rename(columns={date_col: "date", rain_col: "rain"})
              .loc[:, ["date", "rain"]]
        )
        # accumulate raw sum (coerce non-numeric to NaN, skip NaNs)
        df["rain"] = pd.to_numeric(df["rain"], errors="coerce")
        raw_sum += df["rain"].sum(skipna=True)
        frames.append(df)

    if not frames:
        raise ValueError("No CSV files supplied.")

    merged = (
        pd.concat(frames, ignore_index=True)
          .drop_duplicates(subset="date")  # keep only one entry per date
          .dropna(subset=["date"])
          .set_index("date")
          .sort_index()
    )
    # ensure numeric
    merged["rain"] = pd.to_numeric(merged["rain"], errors="coerce")

    return merged, raw_sum

def validate_merge(merged: pd.DataFrame, raw_sum: float) -> None:
    """
    Compare the sum of 'rain' in merged vs. the raw inputs.
    Warn if they differ.
    """
    merged_sum = merged["rain"].sum(skipna=True)
    if abs(merged_sum - raw_sum) < 1e-6:
        print(f"[ok] Validation passed: merged sum = {merged_sum:.3f}")
    else:
        print(f"[warn] Mismatch detected:")
        print(f"       sum of inputs = {raw_sum:.3f}")
        print(f"       sum after merge = {merged_sum:.3f}")
        # Uncomment to enforce failure:
        # raise RuntimeError("Merged sum does not match sum of inputs")

def monthly_totals(df: pd.DataFrame, min_frac: float = 0.9) -> pd.DataFrame:
    daily_count = df["rain"].resample("D").count()
    daily_sum   = df["rain"].resample("D").sum(min_count=1)

    m_count = daily_count.resample("MS").sum()
    m_sum   = daily_sum.resample("MS").sum(min_count=1)

    month_length = m_count.index.days_in_month
    mask = (m_count / month_length) >= min_frac

    good = (
        m_sum[mask]
        .to_frame("monthly_total")
        .assign(year=lambda x: x.index.year,
                month=lambda x: x.index.month)
        .reset_index(drop=True)
    )
    return good[["year", "month", "monthly_total"]]

def climatology(monthly_df: pd.DataFrame) -> pd.Series:
    return monthly_df.groupby("month")["monthly_total"].mean().sort_index()

def main(argv: list[str] | None = None) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series]:
    p = argparse.ArgumentParser(description="Merge rainfall CSVs and compute climatology.")
    p.add_argument("csv_files", nargs="+", type=Path, help="Input CSV file(s)")
    p.add_argument("--date-col",   default="date", help="Date column name (default: date)")
    p.add_argument("--rain-col",   default="rain", help="Rainfall column name (default: rain)")
    p.add_argument("--threshold",  type=float, default=0.9,
                   metavar="X", help="Fraction of days required per month (default: 0.9)")
    p.add_argument("--merged-out", default="merged_rainfall.csv", metavar="FILE",
                   help="Filename for the merged output CSV (default: merged_rainfall.csv)")
    p.add_argument("--quiet", action="store_true", help="Suppress screen output")
    args = p.parse_args(argv)

    # 1. read, concat, sort ➔ get raw sum of all inputs
    merged, raw_sum = read_concat_sort(args.csv_files, args.date_col, args.rain_col)

    # 2. validate
    if not args.quiet:
        validate_merge(merged, raw_sum)

    # 3. write merged file with MM-DD-YYYY date format
    merged.to_csv(
        args.merged_out,
        index_label="date",
        date_format="%m-%d-%Y",
        float_format="%.3f"
    )
    if not args.quiet:
        print(f"[info] Merged series saved to {args.merged_out} "
              f"({len(merged):,} rows, {merged.index[0].date():%m-%d-%Y}–"
              f"{merged.index[-1].date():%m-%d-%Y}).")

    # 4. monthly totals per year
    monthly_df = monthly_totals(merged, min_frac=args.threshold)
    if not args.quiet:
        print("\nMonthly totals (filtered by completeness):")
        print(monthly_df.to_string(index=False, formatters={"monthly_total": "{:.2f}".format}))

    # 5. long-term mean for each calendar month
    clim = climatology(monthly_df)
    if not args.quiet:
        print("\nLong-term mean precipitation (mm per month):")
        for m, v in clim.items():
            print(f"{m:02d}: {v:.2f}")

    return merged, monthly_df, clim

if __name__ == "__main__":
    main()

What was added?

read_concat_sort() now also returns the raw sum of all input files’ rainfall before deduplication.
A new helper validate_merge() computes the sum of merged["rain"] and compares it to the raw sum, printing a pass/fail message.
In main(), we call validate_merge() right after merging, so you immediately know if anything was dropped or changed unexpectedly.

With this in place, you’ll see:

[ok] Validation passed: merged sum = 12345.678
[info] Merged series saved to merged_rainfall.csv (### rows, 03-01-1940–03-31-2025).
...

—or, if there’s a discrepancy:

[warn] Mismatch detected:
       sum of inputs = 12350.000
       sum after merge = 12345.678

Feel free to uncomment the raise RuntimeError(...) in validate_merge() if you’d rather stop execution on mismatch.
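
Because the raw sum is taken before de-duplication, any date that appears in more than one input file will trigger the warning. If that happens, something like the sketch below can show which rows are involved; `duplicate_report` is a hypothetical helper, not part of the script above, and the sample values are made up.

```python
import pandas as pd


def duplicate_report(raw: pd.DataFrame) -> pd.DataFrame:
    """Return every row whose date occurs more than once, ordered by date."""
    dupes = raw[raw.duplicated(subset="date", keep=False)]
    return dupes.sort_values("date")


# Concatenated input before de-duplication (sample values).
raw = pd.DataFrame({
    "date": pd.to_datetime(["1940-03-01", "1940-03-02", "1940-03-02"]),
    "rain": [5.2, 0.0, 1.5],
})

report = duplicate_report(raw)
# Rows that drop_duplicates(subset="date") would discard (all but the first per date).
dropped = raw[raw.duplicated(subset="date", keep="first")]

print(report)
print("rain removed by de-duplication:", dropped["rain"].sum())
```

If the duplicated rows carry different rainfall values, as in this sample, it is worth deciding deliberately which source to trust rather than keeping whichever file was read first.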

What you’re seeing isn’t a logic bug in pandas or Python—it’s just ordinary floating-point noise.

Whenever you store decimal fractions in a binary float (the default for pandas), numbers like

0.000356278

aren’t exact in base-2, so any arithmetic (summing, subtracting, grouping, resampling…) leaves you with tiny residuals on the order of 1e-16 to 1e-9. When you do

col3 = col1 - col2

you expect all zeros, but when the two columns were computed by different routes, each subtraction can leave a residual such as

0.000356278000000000000…_approx – 0.000356278000000000000…_approx = 1.8e-17

and when you sum hundreds or thousands of those, you get something like -1.3e-9 in your total.

How to “fix” it

Round to a fixed precision before comparing or summing:

df["monthly_total"] = df["monthly_total"].round(6)  # or however many decimal places you need

Use np.isclose or pandas.testing.assert_series_equal(..., atol=1e-8) for comparisons instead of ==.
If you need exact decimal arithmetic, switch to Python’s decimal.Decimal (at a performance cost).

But unless you’re doing financial or ISO-compliant metrology, a tiny 10⁻⁹ mm discrepancy in a 0.1 mm total is essentially zero.
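
A short, self-contained demonstration of the effect and of tolerance-based comparison (the values are arbitrary):

```python
import math

import numpy as np

values = [0.1] * 10

naive = sum(values)        # plain left-to-right summation accumulates rounding error
exact = math.fsum(values)  # fsum tracks the lost low-order bits

print(naive == 1.0)              # False
print(exact == 1.0)              # True
print(np.isclose(naive, 1.0))    # True: equal within the default tolerance
```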

There’s almost always a very prosaic reason for something like this – it isn’t pandas or Python mis-adding things, it’s almost certainly one of:

You’re summing two different ranges.
In your screenshot the very last row is itself the “SUM” row. If your Excel formula for the second column accidentally included that bottom summary cell in its own sum, you’ll get a completely meaningless number. (e.g. =SUM(B2:B30) when your data run B2:B29 and B30 is the SUM itself.)
Different numbers of rows in the two series.
Even though Col 1 and Col 2 look identical for the rows you can see, one of them may have an extra hidden row (or be missing one). Summing 28 rows vs 29 rows, even if the extra row is zero, will give you different totals.
Mixed data types or stray text/blank cells.
If one column has a blank or text cell in the middle, Excel will silently skip it when you do SUM, but when the column is converted to numbers in Python that cell is coerced to NaN; sum(skipna=False) then returns NaN, while the default skipna=True sums one fewer value.

How to pinpoint it

1) Check your row counts

In Excel, number the rows from 1 → n for both columns, and make sure they match:

row	col1	col2
1	0.000356278	0.000356278
2	…	…
…	…	…
28	0.002418643	0.002418643
29	SUM: 0.08435871	SUM: 0.132322022

If you see that one of your SUM formulas is pointing to a different start/end row than the other, fix the range so they both say, for example,

=SUM(B2:B29)
=SUM(C2:C29)

(not through …:B30 if row 30 is the SUM itself).

2) Recompute the sums in Python

If you already have your two columns in Python, this will tell you immediately whether the true sums differ:

import pandas as pd

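# assume df is your Excel data read in, with columns 'col1' and 'col2'
# to_numeric(errors="coerce") turns stray text such as a "SUM: ..." cell into NaN
sum1 = pd.to_numeric(df['col1'], errors='coerce').sum()
sum2 = pd.to_numeric(df['col2'], errors='coerce').sum()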
print(f"sum1 = {sum1:.9f}")
print(f"sum2 = {sum2:.9f}")
print(f"difference = {sum1 - sum2:.9f}")

If that shows a tiny difference (1e-9 or so), your Python sums agree and the problem is in how the Excel ranges were specified. If it shows the big difference (0.084 vs 0.132), it means your DataFrame really does have a different number of non-NaN entries in col2 than in col1, so you’d next check:

print("rows in col1:", df['col1'].notna().sum())
print("rows in col2:", df['col2'].notna().sum())

…to see if one column has more valid values.

3) Watch out for that summary row

Because you visibly have your SUM sitting in the same table, make sure you exclude it when you do your final SUM in Excel (or in Python). In pandas you’d do:

data = pd.read_csv('your_excel_export.csv')
# drop the summary row by filtering out anything that isn’t a number,
# or by slicing off the last row if you know it’s the SUM:
data = data.iloc[:-1]

Once you line up the ranges exactly (and remove the SUM-row from the data), you’ll find that your two sums match to within floating-point precision.
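
If you would rather not rely on the summary row being last, filtering out non-numeric rows is an alternative. A minimal sketch, using an inline sample in place of your export (the values are placeholders):

```python
import io

import pandas as pd

# Inline stand-in for the exported sheet; the last row is a text "SUM:" row.
csv_text = (
    "col1,col2\n"
    "0.000356278,0.000356278\n"
    "0.002418643,0.002418643\n"
    "SUM: 0.002774921,SUM: 0.002774921\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# Coerce every column to numbers; anything that is not a number becomes NaN.
numeric = data.apply(pd.to_numeric, errors="coerce")
# Drop rows containing NaN, which removes the summary row wherever it sits.
clean = numeric.dropna(how="any")

print(len(data), "->", len(clean))
print(clean.sum())
```

Note that this also drops genuinely blank data rows, so compare the row counts before and after.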

Here’s a small addition to the end of your existing script that will write out the monthly totals and the climatology to two CSV files:

    # 6. write out the monthly sums
    monthly_out = "monthly_totals.csv"
    monthly_df.to_csv(monthly_out, index=False, float_format="%.6f")
    if not args.quiet:
        print(f"[info] Monthly totals saved to {monthly_out} "
              f"({len(monthly_df):,} rows).")

    # 7. write out the long-term mean per calendar month
    clim_out = "monthly_climatology.csv"
    # make it a DataFrame so it writes month as a column
    (
        clim.to_frame(name="mean_precipitation_mm")
        .reset_index()
        .rename(columns={"month": "calendar_month"})
        .to_csv(clim_out, index=False, float_format="%.6f")
    )
    if not args.quiet:
        # report the actual row count; months with no usable year are absent
        print(f"[info] Climatology saved to {clim_out} ({len(clim)} rows).")

Just drop that into your main() after you compute monthly_df and clim, and before the return statement.

– monthly_totals.csv will look like:

year,month,monthly_total
1940,3,12.345000
1941,3,10.123000
...

– monthly_climatology.csv will look like:

calendar_month,mean_precipitation_mm
1,23.456000
2,18.234000
...
12,15.789000

Now when you run:

python rainfall_climatology.py data1.csv data2.csv

you’ll get three output files:

merged_rainfall.csv
monthly_totals.csv
monthly_climatology.csv

Here’s how you can turn your separate year / month columns back into a single “date” column (using the 1st of each month) and write that out:

    # 6. write out the monthly sums with a single date column
    monthly_out = "monthly_totals.csv"
    # combine year+month into a proper Timestamp at the first day of each month;
    # assign() returns a new frame, so the monthly_df returned by main() is unchanged
    monthly_out_df = monthly_df.assign(
        date=pd.to_datetime(
            dict(year=monthly_df["year"], month=monthly_df["month"], day=1)
        )
    )[["date", "monthly_total"]]
    # write with MM-DD-YYYY formatting on that single date column
    monthly_out_df.to_csv(
        monthly_out,
        index=False,
        date_format="%m-%d-%Y",
        float_format="%.6f"
    )
    if not args.quiet:
        print(f"[info] Monthly totals saved to {monthly_out} "
              f"({len(monthly_out_df):,} rows).")

That will give you a CSV like:

date,monthly_total
03-01-1940,12.345678
04-01-1940,8.901234
...

If you also want your climatology file to have a “date” column (say, using some arbitrary year like 2000), you can do the same trick:

clim_df = clim.reset_index().rename(columns={"month":"m", "monthly_total":"mean_precip"})
clim_df["date"] = pd.to_datetime(
    dict(year=2000, month=clim_df.m, day=1)
)
clim_df[["date","mean_precip"]].to_csv(
    "monthly_climatology.csv",
    index=False,
    date_format="%m-%d-%Y",
    float_format="%.6f"
)

…but usually for climatologies people keep just the month number.