Castoro
Castoro is a specialized PDF processing service that works in conjunction with Farfalla to handle complex PDF operations. It's designed to process PDFs through a series of steps while maintaining high reliability and performance.
Overview
Castoro provides advanced PDF processing capabilities including:
- Metadata extraction (table of contents, page labels, page count)
- Annotation layer extraction (internal/external links within text)
- Page processing (high/low resolution image conversion and text extraction)
- PDF optimization using Ghostscript
Architecture
Core Components
- Service Integration: Receives PDF processing requests from Farfalla via HTTP POST
- Processing Pipeline: Sequential processing using both PHP and Node.js
- Infrastructure: Deployed on Laravel Vapor with optimized configurations for long-running operations
- Technology Stack:
  - PHP 8.3 for core operations
  - Node.js 16.20.2 for PDF processing
  - Ghostscript v10 for PDF optimization
  - CPDF v2.6 for PDF manipulation
  - Custom fork of pdf-extractor (based on pdf.js)
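Since the extractor is based on pdf.js, the split between internal and external links can be illustrated against the annotation shape that upstream pdf.js returns from `page.getAnnotations()`. This is a minimal sketch, not Castoro's actual code: the function name `classifyLinks` is hypothetical, and it assumes the fork keeps the upstream `subtype`, `url`, and `dest` fields.

```javascript
// Split link annotations into external links (URLs) and internal links
// (destinations inside the same document).
// Assumption: objects are shaped like upstream pdf.js annotation data,
// i.e. { subtype, url, dest }. The custom fork may differ.
function classifyLinks(annotations) {
  const result = { external: [], internal: [] };
  for (const annotation of annotations) {
    // Ignore non-link annotations such as highlights or notes.
    if (annotation.subtype !== 'Link') continue;
    if (typeof annotation.url === 'string' && annotation.url !== '') {
      result.external.push(annotation.url);
    } else if (annotation.dest != null) {
      result.internal.push(annotation.dest);
    }
    // Links with neither a URL nor a destination are skipped.
  }
  return result;
}
```

Returning two plain arrays keeps the result easy to serialize as JSON for whatever consumes it downstream.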
Infrastructure Notes
- Configured for extended processing times (900s timeout)
- Optimized for CPU/memory-intensive operations
- Single-threaded processing (Ghostscript/pdf.js limitations)
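To make the timeout constraint concrete, the sketch below describes a single Ghostscript call with a child-process timeout kept under the 900s limit. It is only an illustration under stated assumptions: the helper name `buildGhostscriptInvocation`, the 30-second safety margin, and the choice of flags are not taken from Castoro. The flags themselves are standard Ghostscript options.

```javascript
const PROCESSING_TIMEOUT_MS = 900 * 1000; // the 900s limit noted above

// Hypothetical helper: describes a Ghostscript call without running it.
// The safety margin is an assumed value, leaving time for cleanup.
function buildGhostscriptInvocation(inputPath, outputPath, safetyMarginMs = 30 * 1000) {
  if (inputPath === outputPath) {
    throw new Error('Input and output paths must differ');
  }
  if (safetyMarginMs < 0 || safetyMarginMs >= PROCESSING_TIMEOUT_MS) {
    throw new RangeError('safetyMarginMs must be between 0 and the processing timeout');
  }
  return {
    command: 'gs',
    args: [
      '-sDEVICE=pdfwrite', // write the result as a PDF
      '-dNOPAUSE',         // do not pause between pages
      '-dBATCH',           // exit once the input is processed
      '-dQUIET',           // suppress startup messages
      `-sOutputFile=${outputPath}`,
      inputPath,
    ],
    // Node's child_process terminates the process after this many milliseconds.
    options: { timeout: PROCESSING_TIMEOUT_MS - safetyMarginMs },
  };
}
```

The returned object could be passed to Node's `child_process.execFile(command, args, options, callback)`. Building the argument list separately from running it keeps the invocation easy to inspect and test.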
Development Setup
Prerequisites
- Docker
PDF Processing Pipeline
- PDF Optimization
  - Initial processing using Ghostscript v10
  - Improves reliability and processing speed
- Metadata Extraction
  - Extracts:
    - Table of contents
    - Page labels
    - Page count
- Annotation Processing
  - Extracts text-embedded links
  - Processes both internal and external URLs
- Page Processing
  - Generates multiple image resolutions
  - Extracts raw text for indexing
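For the page processing step, the relationship between a target resolution and the rendered image size can be sketched as follows. PDF page dimensions are expressed in points (1/72 inch by default), so the render scale is the target DPI divided by 72. The preset values in `RESOLUTIONS` and the function name `renderSize` are hypothetical; the resolutions Castoro actually uses are not stated here.

```javascript
const POINTS_PER_INCH = 72; // default PDF unit: 1 point = 1/72 inch

// Hypothetical presets, for illustration only.
const RESOLUTIONS = { low: 72, high: 300 };

// Compute the scale factor and pixel dimensions for a page rendered at `dpi`.
function renderSize(widthPt, heightPt, dpi) {
  if (!(dpi > 0)) {
    throw new RangeError('dpi must be a positive number');
  }
  const scale = dpi / POINTS_PER_INCH;
  return {
    scale,
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}
```

For example, a US Letter page (612 by 792 points) rendered with `renderSize(612, 792, RESOLUTIONS.high)` comes out at 2550 by 3300 pixels. The `scale` value is the kind of factor a pdf.js-style viewport expects.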
Deployment
Deployed using Laravel Vapor with configurations optimized for:
- Extended processing times (900s timeout)
- High memory availability
- Production and staging environments with similar configurations
Future Improvements
- Node.js upgrade to 20.x
- Migration to TypeScript
- pdf-extractor upgrade to pdf.js 4.x
Technical Notes
- Supports all PDF formats
- No specific input PDF requirements
- Processing limitations handled by Farfalla
- CPDF binaries included in ./app directory
- Ghostscript managed via Docker configuration
- Node.js managed via Docker configuration
- PHP managed via Docker configuration