
Castoro

Castoro is a PDF processing service that works alongside Farfalla. It runs each PDF through a fixed sequence of steps (optimization, metadata extraction, annotation extraction, and page processing) and is tuned for reliability on long-running, resource-heavy jobs.

Overview

Castoro provides advanced PDF processing capabilities including:

  • Metadata extraction (table of contents, page labels, page count)
  • Annotation layer extraction (internal/external links within text)
  • Page processing (high/low resolution image conversion and text extraction)
  • PDF optimization using Ghostscript
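
As an illustration of the annotation layer extraction listed above, the following minimal sketch separates internal from external links. It assumes annotation objects shaped like those pdf.js returns from page.getAnnotations() (subtype, url, dest, rect); Castoro's pdf-extractor fork may expose a different shape, and the classifyLinks name and the sample data are hypothetical.

```javascript
// Hypothetical helper, not Castoro's actual code: classify link annotations
// shaped like pdf.js annotation data (subtype, url, dest, rect).
function classifyLinks(annotations) {
  const links = [];
  for (const a of annotations) {
    if (a.subtype !== "Link") continue; // skip highlights, notes, widgets, etc.
    if (typeof a.url === "string" && a.url.length > 0) {
      // A URL means the link points outside the document.
      links.push({ kind: "external", target: a.url, rect: a.rect });
    } else if (a.dest != null) {
      // A destination means the link jumps within the document.
      links.push({ kind: "internal", target: a.dest, rect: a.rect });
    }
  }
  return links;
}

// Made-up sample input for illustration.
const sample = [
  { subtype: "Link", url: "https://example.com/", rect: [72, 700, 200, 712] },
  { subtype: "Link", dest: "chapter-2", rect: [72, 680, 180, 692] },
  { subtype: "Highlight", rect: [72, 660, 300, 672] },
];
console.log(classifyLinks(sample));
```

Link annotations that carry neither a URL nor a destination are dropped in this sketch; a real extractor might log them instead.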

Architecture

Core Components

  • Service Integration: Receives PDF processing requests from Farfalla via HTTP POST
  • Processing Pipeline: Sequential processing using both PHP and Node.js
  • Infrastructure: Deployed on Laravel Vapor with optimized configurations for long-running operations
  • Technology Stack:
    • PHP 8.3 for core operations
    • Node.js 16.20.2 for PDF processing
    • Ghostscript v10 for PDF optimization
    • CPDF v2.6 for PDF manipulation
    • Custom fork of pdf-extractor (based on pdf.js)
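
To make the Ghostscript optimization step in the stack above more concrete, here is a minimal sketch of how a Node.js process might assemble arguments for a pdfwrite pass. The flags are standard Ghostscript options, but the exact set Castoro uses is not documented here, so the selection, the buildGhostscriptArgs name, and the file paths are all assumptions.

```javascript
// Hypothetical helper, not Castoro's actual code: build the argument list
// for a Ghostscript pass that rewrites a PDF through the pdfwrite device.
function buildGhostscriptArgs(inputPath, outputPath) {
  return [
    "-sDEVICE=pdfwrite", // re-emit the document as a PDF
    "-dNOPAUSE",         // do not prompt between pages
    "-dBATCH",           // exit once the input has been processed
    "-dQUIET",           // suppress startup messages
    `-sOutputFile=${outputPath}`,
    inputPath,
  ];
}

// Placeholder paths for illustration.
const args = buildGhostscriptArgs("/tmp/input.pdf", "/tmp/optimized.pdf");
console.log(args.join(" "));

// With Ghostscript installed, the list could be passed to the gs binary, e.g.
//   require("child_process").execFileSync("gs", args);
```

Passing the arguments as an array rather than a shell string avoids quoting problems with unusual file names.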

Infrastructure Notes

  • Configured for extended processing times (900s timeout)
  • Optimized for CPU/memory-intensive operations
  • Single-threaded processing (Ghostscript/pdf.js limitations)
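
The interaction between single-threaded processing and the 900s timeout can be sketched as a blocking loop with a time budget. This is an illustrative assumption about how such a guard could look, not a description of Castoro's code; the runSequentially name, the step objects, and the injectable clock are hypothetical.

```javascript
// Hypothetical sketch: run blocking steps one at a time and refuse to start
// a new step once the time budget has been used up.
const TIMEOUT_MS = 900 * 1000; // the 900s timeout noted above

function runSequentially(steps, budgetMs = TIMEOUT_MS, now = Date.now) {
  const startedAt = now();
  const completed = [];
  for (const step of steps) {
    const elapsed = now() - startedAt;
    if (elapsed >= budgetMs) {
      throw new Error(`budget exhausted before "${step.name}" (${elapsed} ms elapsed)`);
    }
    step.run(); // blocks until the step finishes; nothing runs in parallel
    completed.push(step.name);
  }
  return completed;
}

// Usage with placeholder steps that do no real work.
const done = runSequentially([
  { name: "optimize", run: () => {} },
  { name: "metadata", run: () => {} },
]);
console.log(done);
```

The check only runs between steps, so it cannot interrupt a step that is already in progress; the platform timeout remains the hard limit.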

Development Setup

Prerequisites

  • Docker

PDF Processing Pipeline

  1. PDF Optimization

    • Initial processing using Ghostscript v10
    • Improves reliability and processing speed
  2. Metadata Extraction

    • Extracts:
      • Table of contents
      • Page labels
      • Page count
  3. Annotation Processing

    • Extracts text-embedded links
    • Processes both internal and external URLs
  4. Page Processing

    • Generates multiple image resolutions
    • Extracts raw text for indexing
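
For the page processing step, generating several image resolutions comes down to scaling the page size. The sketch below converts a page size in PDF points (72 per inch) to pixel dimensions at a target DPI. The DPI values are placeholders, since the resolutions Castoro actually renders are not stated here, and pixelSize is a hypothetical helper.

```javascript
// Hypothetical helper: convert a page size in PDF points to pixels.
const POINTS_PER_INCH = 72; // default PDF user-space unit

function pixelSize(widthPt, heightPt, dpi) {
  const scale = dpi / POINTS_PER_INCH;
  return {
    scale,
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// Assumed resolutions, for illustration only.
const RESOLUTIONS = { low: 72, high: 144 };

// US Letter page: 612 x 792 points.
for (const [name, dpi] of Object.entries(RESOLUTIONS)) {
  console.log(name, pixelSize(612, 792, dpi));
}
```

In pdf.js, a scale factor like this is what page.getViewport({ scale }) takes; whether the fork renders pages that way is an assumption.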

Deployment

Deployed using Laravel Vapor. Production and staging environments use similar configurations, optimized for:

  • Extended processing times (900s timeout)
  • High memory availability

Future Improvements

  • Node.js upgrade to 20.x
  • Migration to TypeScript
  • pdf-extractor upgrade to pdf.js 4.x

Technical Notes

  • Supports all PDF formats
  • No specific input PDF requirements
  • Processing limits are enforced by Farfalla, not by Castoro
  • CPDF binaries included in ./app directory
  • Ghostscript, Node.js, and PHP managed via Docker configuration
