# Copernicus weather-data flipbook generator: trims variables to an area
# bounding box, resamples each time frame to 8x8, and writes flipbook GeoTIFFs.
import datetime
import glob
import math
import os
import tempfile
import concurrent.futures
import pandas as pd
import xarray as xr
import numpy as np
import rasterio
from rasterio.transform import from_bounds
from rasterio.warp import reproject, Resampling
from tqdm import tqdm
from constants import Area
from data_utils import open_dataset_file
# --- Run configuration -------------------------------------------------------
LOCATION = Area.CZ_TRAVNY_EXTENDED
VARIABLE = "TEMPERATURE"
TIME_VAR = "valid_time"

# Per-area processing window and bounding box.
# DATES: (start, end) inclusive datetime window for frames to keep.
# BOX: (top, left, bottom, right) in degrees — note this ordering is consumed
#      as `top, left, bottom, right = bbox` by the processing functions.
if LOCATION is Area.BR_VICOSA_EXTENDED:
    DATES = (datetime.datetime(2025, 2, 18), datetime.datetime(2025, 3, 28))
    BOX = (-20.3, -43.40, -21.30, -42.40)
elif LOCATION is Area.CZ_TRAVNY_EXTENDED:
    DATES = (datetime.datetime(2025, 2, 4), datetime.datetime(2025, 3, 21))
    BOX = (50.60, 17.80, 49.00, 19.40)
elif LOCATION is Area.SK_GANGWON_EXTENDED:
    DATES = (datetime.datetime(2025, 2, 24), datetime.datetime(2025, 3, 21))
    BOX = (38.25, 127.50, 37.35, 128.60)
# --- Input discovery and local caching ---------------------------------------
root = "F:/Github/ULC/ioniq-geography/gis/data/copernicus/"
directory = f"{root}{LOCATION.name}_{VARIABLE}_2025-2-01-2025-3-29/"

# Symlink the inputs into a temp-dir cache so downstream readers see short,
# local paths.  NOTE(review): `cached_files` is built but the driver below
# opens `files` directly — confirm which list is intended.
cache_directory = os.path.join(tempfile.gettempdir(), "cache")
os.makedirs(cache_directory, exist_ok=True)

# Only NetCDF / GRIB payloads; ignore any other files in the directory.
files = [file for file in glob.glob(os.path.join(directory, "*"))
         if file.endswith(('.nc', '.grib'))]

cached_files = []
for file in files:
    cache_path = os.path.join(cache_directory, os.path.basename(file))
    if not os.path.exists(cache_path):
        # May raise OSError on Windows without symlink privilege — TODO confirm.
        os.symlink(file, cache_path)
    cached_files.append(cache_path)
def process_variable(ds, var_name, file, bbox):
    """
    Process a single variable from an already opened dataset:
    - Trims to the given bounding box.
    - Resamples each time frame to an 8x8 grid.
    - Assembles the frames into a flipbook laid out on an 8-column grid.
    - Saves each frame and the flipbook as GeoTIFFs.

    Parameters
    ----------
    ds : xarray.Dataset
        Open dataset containing `var_name` plus `longitude`/`latitude` coords.
    var_name : str
        Name of the data variable to process.
    file : str
        Source file path; its basename seeds the output filenames.
    bbox : tuple
        (top, left, bottom, right) bounding box in degrees.

    Returns
    -------
    dict
        Properties of the produced flipbook (dates, min/max, layout) for the
        summary CSV.
    """
    da = ds[var_name]
    # Determine which time coordinate is available.
    time_dim = "time" if "time" in da.dims else "valid_time"
    time_values = da[time_dim].values
    num_frames = len(time_values)

    # Target bounding box; consumed as top, left, bottom, right.
    top, left, bottom, right = bbox

    # Destination grid dimensions (8x8 pixels).
    dst_width, dst_height = 8, 8
    dst_transform = from_bounds(left, bottom, right, top, dst_width, dst_height)

    # Build source transform from the dataset's longitude/latitude.
    # Assumes a regular lon/lat grid with latitude descending — TODO confirm.
    lons = ds['longitude'].values
    lats = ds['latitude'].values
    src_width, src_height = len(lons), len(lats)
    src_left, src_right = float(np.min(lons)), float(np.max(lons))
    src_bottom, src_top = float(np.min(lats)), float(np.max(lats))
    src_transform = from_bounds(src_left, src_bottom, src_right, src_top,
                                src_width, src_height)

    # Process each time frame: reproject onto the 8x8 destination grid.
    resampled_frames = []
    for t in time_values:
        src_array = da.sel(**{time_dim: t}).values.astype(np.float32)
        dst_array = np.empty((dst_height, dst_width), dtype=np.float32)
        reproject(
            source=src_array,
            destination=dst_array,
            src_transform=src_transform,
            src_crs='EPSG:4326',
            dst_transform=dst_transform,
            dst_crs='EPSG:4326',
            resampling=Resampling.bilinear,
        )
        resampled_frames.append(dst_array)

        # Write this frame as its own GeoTIFF under a per-variable folder,
        # e.g. "temperature_2025-03-18.tif".
        os.makedirs(os.path.join(directory, var_name), exist_ok=True)
        out_file = os.path.join(
            directory, var_name,
            f"{os.path.splitext(os.path.basename(file))[0]}__{var_name}__{str(t)[:10]}.tif")
        with rasterio.open(
                out_file, 'w', driver='GTiff',
                height=dst_array.shape[0],
                width=dst_array.shape[1],
                count=1,
                dtype=dst_array.dtype,
                crs='EPSG:4326',  # adjust CRS if needed
                transform=dst_transform) as dst:
            dst.write(dst_array, 1)

    # Assemble the flipbook from the resampled frames: 8 columns, as many
    # rows as needed.
    cols = 8  # math.ceil(math.sqrt(num_frames))
    rows = math.ceil(num_frames / cols)

    # Compute padded dimensions (power of two).  NOTE(review): the padded
    # sizes are computed but the flipbook below is allocated at the unpadded
    # size — confirm whether padding was intended to be applied.
    def next_power_of_two(n):
        return 2 ** math.ceil(math.log2(n))

    orig_width = cols * dst_width
    orig_height = rows * dst_height
    flipbook_width = next_power_of_two(orig_width)
    flipbook_height = next_power_of_two(orig_height)

    # --- Compute global min and max across all frames ---
    all_frames = np.stack(resampled_frames, axis=0)
    global_min = float(np.min(all_frames))
    global_max = float(np.max(all_frames))

    # --- Normalize each frame using the global min/max ---
    # Wind components (u10/v10) keep their signed raw values.
    if var_name == "u10" or var_name == "v10":
        normalized_frames = resampled_frames
    else:
        normalized_frames = [
            (frame - global_min) / (global_max - global_min)
            for frame in resampled_frames
        ]

    # --- Assemble normalized frames into a flipbook grid ---
    flipbook = np.zeros((orig_height, orig_width), dtype=np.float32)
    for idx, frame in enumerate(normalized_frames):
        r, c = divmod(idx, cols)
        flipbook[r * dst_height:(r + 1) * dst_height,
                 c * dst_width:(c + 1) * dst_width] = frame

    # --- Write the flipbook as a GeoTIFF ---
    # Filename combines the file base, variable name, layout and time range.
    start_date = str(time_values[0])[:10]
    end_date = str(time_values[-1])[:10]
    out_filename = (f"{os.path.splitext(os.path.basename(file))[0]}__{var_name}"
                    f"__{rows}x{cols}_{start_date}_{end_date}.tif")
    out_file = os.path.join(directory, out_filename)
    with rasterio.open(
            out_file, 'w', driver='GTiff',
            height=flipbook.shape[0],
            width=flipbook.shape[1],
            count=1,
            dtype=flipbook.dtype,
            crs='EPSG:4326',
            transform=dst_transform  # Using the destination transform as a placeholder.
    ) as dst:
        dst.write(flipbook, 1)

    # Prepare properties to store in the CSV.
    properties = {
        "file": os.path.basename(file),
        "variable": var_name,
        "start_date": start_date,
        "end_date": end_date,
        "global_min": global_min,
        "global_max": global_max,
        "num_frames": num_frames,
        "rows": rows,
        "cols": cols,
        "dst_width": dst_width,
        "dst_height": dst_height,
    }
    print(f"Saved flipbook for variable '{var_name}' from file '{file}' as {out_file}")
    return properties
def process_variable_multi_ds(_datasets, _var_name, file, bbox, dates=None):
    """
    Process a single variable across multiple already opened datasets:
    - Trims to the given bounding box.
    - Resamples each time frame to an 8x8 grid.
    - Optionally filters frames to the `dates` window.
    - Assembles all frames into one flipbook and saves it as a GeoTIFF.

    Parameters
    ----------
    _datasets : list
        Open xarray datasets, each holding `_var_name` and lon/lat coords.
    _var_name : str
        Name of the data variable to process.
    file : str
        Source file path; its basename seeds the output filename.
    bbox : tuple
        (top, left, bottom, right) bounding box in degrees.
    dates : tuple or None
        Optional (start, end) datetime window; frames outside it are skipped.

    Returns
    -------
    dict
        Properties of the produced flipbook for the summary CSV.
    """
    top, left, bottom, right = bbox

    # Destination grid dimensions (8x8 pixels).
    dst_width, dst_height = 8, 8
    dst_transform = from_bounds(left, bottom, right, top, dst_width, dst_height)

    # Accumulators across all datasets.
    total_frames = []
    total_time_values = []
    total_frame_count = 0

    # Process each dataset and append to the accumulators.
    for dataset in _datasets:
        da = dataset[_var_name]
        # Determine which time coordinate is available.
        time_dim = "time" if "time" in da.dims else "valid_time"
        time_values = da[time_dim].values

        # Build source transform from the dataset's longitude/latitude.
        # Assumes a regular lon/lat grid with latitude descending — TODO confirm.
        lons = dataset['longitude'].values
        lats = dataset['latitude'].values
        src_width, src_height = len(lons), len(lats)
        src_left, src_right = float(np.min(lons)), float(np.max(lons))
        src_bottom, src_top = float(np.min(lats)), float(np.max(lats))
        src_transform = from_bounds(src_left, src_bottom, src_right, src_top,
                                    src_width, src_height)

        # Process each time frame.
        resampled_frames = []
        real_time_values = []
        for t in time_values:
            # Convert ns-epoch timestamp to datetime.datetime.
            # NOTE(review): fromtimestamp uses the local timezone — confirm UTC
            # is not required here.
            dt = datetime.datetime.fromtimestamp(t.tolist() / 1e9)
            if dates is not None and (dt < dates[0] or dt > dates[1]):
                print(f"Skipping {dt.date()}")
                continue
            src_array = da.sel(**{time_dim: t}).values.astype(np.float32)
            dst_array = np.empty((dst_height, dst_width), dtype=np.float32)
            reproject(
                source=src_array,
                destination=dst_array,
                src_transform=src_transform,
                src_crs='EPSG:4326',
                dst_transform=dst_transform,
                dst_crs='EPSG:4326',
                resampling=Resampling.bilinear,
            )
            resampled_frames.append(dst_array)
            real_time_values.append(t)

        real_num_frames = len(resampled_frames)
        total_frame_count += real_num_frames
        total_time_values.extend(real_time_values)
        total_frames.extend(resampled_frames)

    # Assemble the flipbook: 8 columns, as many rows as needed.
    cols = 8  # math.ceil(math.sqrt(num_frames))
    rows = math.ceil(total_frame_count / cols)

    # Compute padded dimensions (power of two).  NOTE(review): computed but
    # the flipbook below is allocated at the unpadded size — confirm intent.
    def next_power_of_two(n):
        return 2 ** math.ceil(math.log2(n))

    orig_width = cols * dst_width
    orig_height = rows * dst_height
    flipbook_width = next_power_of_two(orig_width)
    flipbook_height = next_power_of_two(orig_height)

    # --- Compute global min and max across all frames ---
    all_frames = np.stack(total_frames, axis=0)
    global_min = float(np.min(all_frames))
    global_max = float(np.max(all_frames))
    print(global_min, global_max)

    # --- Normalize each frame using the global min/max ---
    # Wind components (u10/v10) keep their signed raw values.
    if _var_name == "u10" or _var_name == "v10":
        normalized_frames = total_frames
    else:
        normalized_frames = [
            (frame - global_min) / (global_max - global_min)
            for frame in total_frames
        ]

    # --- Assemble normalized frames into a flipbook grid ---
    flipbook = np.zeros((orig_height, orig_width), dtype=np.float32)
    for idx, frame in enumerate(normalized_frames):
        r, c = divmod(idx, cols)
        flipbook[r * dst_height:(r + 1) * dst_height,
                 c * dst_width:(c + 1) * dst_width] = frame

    # --- Write the flipbook as a GeoTIFF ---
    # Filename combines the file base, variable name, layout and time range.
    start_date = str(total_time_values[0])[:10]
    # Bugfix: end_date must come from the kept frames, not the last dataset's
    # raw (unfiltered) time axis.
    end_date = str(total_time_values[-1])[:10]
    out_filename = (f"{os.path.splitext(os.path.basename(file))[0]}__{_var_name}"
                    f"__{rows}x{cols}_{start_date}_{end_date}.tif")
    out_file = os.path.join(directory, out_filename)
    with rasterio.open(
            out_file, 'w', driver='GTiff',
            height=flipbook.shape[0],
            width=flipbook.shape[1],
            count=1,
            dtype=flipbook.dtype,
            crs='EPSG:4326',
            transform=dst_transform  # Using the destination transform as a placeholder.
    ) as dst:
        dst.write(flipbook, 1)

    # Prepare properties to store in the CSV.
    properties = {
        "file": os.path.basename(file),
        "variable": _var_name,
        "start_date": start_date,
        "end_date": end_date,
        "global_min": global_min,
        "global_max": global_max,
        "num_frames": total_frame_count,
        "rows": rows,
        "cols": cols,
        "dst_width": dst_width,
        "dst_height": dst_height,
    }
    print(f"Saved flipbook for variable '{_var_name}' from file '{file}' as {out_file}")
    return properties
# --- Driver: fan out one task per data variable, collect properties ----------
all_properties = []
datasets = [open_dataset_file(f) for f in tqdm(files, desc="Opening files")]
# Concatenate along the time axis only to enumerate the variable names.
ds = xr.concat(datasets, dim=TIME_VAR)

futures = []
with concurrent.futures.ThreadPoolExecutor() as executor:
    for var_name in ds.data_vars:
        # NOTE(review): `file` here is the leftover loop variable from the
        # caching loop above (i.e. the last input file); it only seeds the
        # output filename, but confirm that is intended.
        futures.append(executor.submit(
            process_variable_multi_ds, datasets, var_name, file, BOX, DATES))
    for future in concurrent.futures.as_completed(futures):
        try:
            props = future.result()
            if props is not None:
                all_properties.append(props)
        except Exception as e:
            # Keep processing remaining variables; report the failure.
            print("Error processing variable:", e)

# Persist per-variable flipbook metadata for downstream consumers.
df = pd.DataFrame(all_properties)
csv_file = os.path.join(directory, "variable_properties.csv")
df.to_csv(csv_file, index=False)
print("Saved variable properties CSV:", csv_file)