import glob
import os
from urllib.request import urlretrieve
from zipfile import ZipFile

import cv2
import matplotlib.pyplot as plt
import numpy as np
import torch
from PIL import Image
from tqdm.auto import tqdm
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
# Compute device for model inference: first CUDA GPU if available, else CPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
def download_and_unzip(url, save_path):
    """Download a ZIP archive from `url` to `save_path` and extract it.

    The archive contents are extracted into the directory that contains
    `save_path`. Failures are reported rather than raised, so a bad
    download does not abort the whole script.

    :param url: String, URL of the ZIP archive to download.
    :param save_path: String, local path where the archive is saved.
    """
    print("Downloading and extracting assets....", end="")
    try:
        # Downloading zip file using urllib package.
        urlretrieve(url, save_path)
        # Extracting zip file using the zipfile package.
        with ZipFile(save_path) as z:
            # Extract ZIP file contents in the same directory.
            z.extractall(os.path.split(save_path)[0])
        print("Done")
    except Exception as e:
        # Best-effort: report the failure (bad URL, corrupt archive, ...)
        # instead of crashing.
        print("\nInvalid file.", e)
def read_image(image_path):
    """Load an image from disk and convert it to RGB.

    :param image_path: String, path to the input image.
    :return: PIL.Image in RGB mode.
    """
    image = Image.open(image_path).convert('RGB')
    return image
def segment_into_lines(image):
    """Segment a page image into horizontal text-line crops.

    :param image: PIL.Image of a page containing text.
    :return: Tuple (line_images, bounding_boxes). `line_images` is a list
             of full-width PIL.Image crops, one per detected text line,
             ordered top to bottom; `bounding_boxes` is the matching list
             of (x1, y1, x2, y2) boxes in the original image coordinates.
    """
    # Convert the PIL image to a NumPy array and grayscale.
    img_array = np.array(image.convert('L'))
    # Invert the image (since text is usually darker than background).
    img_array = 255 - img_array
    # Binarize the image using Otsu's thresholding.
    _, binary = cv2.threshold(img_array, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Full-width horizontal kernel so dilation merges each text line
    # into a single connected blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (binary.shape[1], 1))
    dilated = cv2.dilate(binary, kernel, iterations=5)
    # Find the outer contour of every merged line blob.
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Sort contours from top to bottom by their bounding-box y coordinate.
    contours = sorted(contours, key=lambda ctr: cv2.boundingRect(ctr)[1])
    line_images = []
    bounding_boxes = []
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        # Crop the full width of the original image for this line.
        line_img = image.crop((0, y, image.width, y + h))
        line_images.append(line_img)
        bounding_boxes.append((0, y, image.width, y + h))
    return line_images, bounding_boxes
def ocr_multiline(image, processor, model):
    """Run OCR line by line over a multi-line page image.

    :param image: PIL.Image of the page.
    :param processor: TrOCRProcessor used for pre- and post-processing.
    :param model: VisionEncoderDecoderModel that generates token ids.
    :return: Tuple (recognized_text, bounding_boxes) — the newline-joined
             text of all detected lines, plus the per-line bounding boxes
             from `segment_into_lines`.
    """
    # Segment the image into lines and get bounding boxes.
    line_images, bounding_boxes = segment_into_lines(image)
    recognized_text = ""
    for line_image in line_images:
        # Preprocess the line crop and move it to the compute device.
        pixel_values = processor(line_image, return_tensors='pt').pixel_values.to(device)
        generated_ids = model.generate(pixel_values)
        line_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
        recognized_text += line_text + "\n"
    return recognized_text, bounding_boxes
def eval_new_data(data_path=None, num_samples=4, model=None):
    """Run OCR on sample images and visualize the detected lines.

    :param data_path: Glob pattern selecting the images to evaluate.
    :param num_samples: Maximum number of images to process.
    :param model: VisionEncoderDecoderModel used for generation
                  (the module-level `processor` is used for decoding).
    """
    image_paths = glob.glob(data_path)
    for i, image_path in tqdm(enumerate(image_paths), total=len(image_paths)):
        # Stop after the requested number of samples.
        if i == num_samples:
            break
        image = read_image(image_path)
        text, bounding_boxes = ocr_multiline(image, processor, model)
        plt.figure(figsize=(10, 6))
        plt.imshow(image)
        # Draw a red rectangle around every detected text line.
        for bbox in bounding_boxes:
            x1, y1, x2, y2 = bbox
            rect = plt.Rectangle((x1, y1), x2 - x1, y2 - y1, fill=False, edgecolor='red', linewidth=2)
            plt.gca().add_patch(rect)
        plt.title(text)
        plt.axis('off')
        plt.show()
# Location of the tutorial's sample-image archive and its local save path.
URL = r"https://www.dropbox.com/scl/fi/jz74me0vc118akmv5nuzy/images.zip?rlkey=54flzvhh9xxh45czb1c8n3fp3&dl=1"
asset_zip_path = os.path.join(os.getcwd(), "images.zip")
# Download if assets ZIP does not exist.
if not os.path.exists(asset_zip_path):
    download_and_unzip(URL, asset_zip_path)
# Load the pretrained TrOCR processor and model and move the model to
# the selected compute device.
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-handwritten')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-handwritten').to(device)
# Run inference and visualization on the downloaded handwritten samples.
eval_new_data(
    data_path=os.path.join('images', 'handwritten', '*'),
    num_samples=4,
    model=model,
)