File Storage Architecture
Introduction
This document outlines the new file storage architecture that we will adopt in 2024-Q3.
Problem we had before
The old file storage architecture used one bucket for all the different use cases. While it was simple, it had some downsides:
- Makes it hard to understand what’s actually consuming storage
- Makes it very hard to apply different levels of security
- Makes it hard to apply different levels of permissions
- Makes it very hard to obscure some paths and not others
It could be argued that a single bucket is simpler than multiple ones. In practice, that simplicity is negated by the fact that developers still need to reference the different locations multiple times, either inline or by configuring multiple views of the bucket, so the overall complexity ends up being roughly the same and depends on the actual setup.
The latest changes
The main change is the move from a single bucket to multiple purpose-specific buckets. This allows us to use the best provider for each purpose and to easily set up least-privilege access on a bucket-by-bucket basis.
This also includes the use of Cloudflare's R2, their S3-compatible service. This is beneficial for us in terms of pricing, as R2 is cheaper per GB and has no bandwidth costs.
Objectives
- Optimize file storage: Move from a single bucket to multiple buckets specific to each purpose.
- Improve security: Apply different security levels and permissions more effectively.
- Reduce costs: Integrate Cloudflare's R2 service to take advantage of its benefits in terms of pricing and bandwidth costs.
- Consistency across environments: Use the same architecture in production, staging, and local.
Context and motivators for change
The file storage architecture we had used a single bucket for all the different use cases. Although this configuration was simple, it presented several significant disadvantages:
- Difficulty in storage management: It's complicated to understand what is consuming the storage.
- Limitations in security and permissions: Applying different levels of security and permissions is difficult.
- Difficulty in path obfuscation: It's very difficult to hide some paths without affecting others.
Technology Summary
- Multiple specific buckets: Each bucket will be dedicated to a specific purpose, allowing the use of the best provider and easy configuration of least privilege access.
- Use of Cloudflare's R2 service: Integrating this S3-compatible service will benefit in terms of costs, as R2 is cheaper per GB and has no bandwidth costs.
- Dockerization of local services: We will run local services in Docker containers, using technologies similar or identical to those used in the staging and production environments.
With this new structure, we expect a significant improvement in the management, security, and cost of file storage in Farfalla, as well as a setup that is simpler for the development team to understand and scale.
Bucket Structure
Naming Convention: Bucket names will be formed as follows: {name}-{provider}-{environment}[.{host/domain}] (host/domain is only required for buckets that need an associated CNAME and where there is no conflict with AWS)
Providers: aws, cf, and minio
Environments: local, staging, and production.
Examples:
- storage, aws, production, and publica.la must be: storage-aws-production.publica.la
- assets, cf, and staging must be: assets-cf-staging
- farfalla-io, minio, and local must be: farfalla-io-minio-local
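As a minimal sketch, a hypothetical helper (not part of any existing codebase) that composes bucket names under this convention could look like this:

// Hypothetical helper illustrating the naming convention above.
type Provider = "aws" | "cf" | "minio";
type Environment = "local" | "staging" | "production";

function bucketName(name: string, provider: Provider, environment: Environment, domain?: string): string {
  // domain is only passed for buckets that need an associated CNAME
  const base = `${name}-${provider}-${environment}`;
  return domain ? `${base}.${domain}` : base;
}

// bucketName("storage", "aws", "production", "publica.la") === "storage-aws-production.publica.la"
// bucketName("assets", "cf", "staging") === "assets-cf-staging"
// bucketName("farfalla-io", "minio", "local") === "farfalla-io-minio-local"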
Security Configuration and Policies
ACLs Configuration
(Access Control Lists, official AWS documentation: English and Spanish)
We must allow access to content only from Cloudflare, because the micelio project will be responsible for encrypting and securing the content; all traffic is routed through that path. For objects that we want to leave publicly available (e.g., covers), a specific ACL will be defined on the object itself; a sketch of this follows the bucket policy example below.
Example of a bucket policy in AWS:
{
  "Version": "2012-10-17",
  "Id": "S3PolicyId1",
  "Statement": [
    {
      "Sid": "IPAllow - Cloudflare range IPs",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::BUCKET-NAME/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "2400:cb00::/32",
            "2606:4700::/32",
            ...
          ]
        }
      }
    }
  ]
}
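For the per-object public access mentioned above, a minimal sketch using the AWS SDK v3 (the bucket name, object key and region are placeholders, and this assumes ACLs are enabled on the bucket):

import { S3Client, PutObjectAclCommand } from "@aws-sdk/client-s3";

// Sketch: mark a single object (e.g. a cover) as publicly readable while the
// rest of the bucket stays restricted to Cloudflare's IP ranges by the policy above.
const s3 = new S3Client({ region: "us-east-1" }); // placeholder region

await s3.send(
  new PutObjectAclCommand({
    Bucket: "storage-aws-production.publica.la",
    Key: "covers/example-cover.jpg", // hypothetical object key
    ACL: "public-read",
  }),
);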
Expiration Policies: Define lifecycle rules for objects in buckets
To keep bucket growth under control and ensure stability, we define expiration dates for content that should not be permanent.
Examples:
- expire_all_90_days: Permanently delete all content after 90 days.
- expire_exports_7_days: Permanently delete export content after 7 days.
- expire_tmp_2_days: Permanently delete tmp content after 2 days.
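As a sketch, these rules could be applied with the AWS SDK v3; the exports/ and tmp/ prefixes used to scope the rules are assumptions about the bucket layout:

import { S3Client, PutBucketLifecycleConfigurationCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" }); // placeholder region

// Sketch: lifecycle rules matching the expiration examples above.
await s3.send(
  new PutBucketLifecycleConfigurationCommand({
    Bucket: "farfalla-io-aws-production",
    LifecycleConfiguration: {
      Rules: [
        // Permanently delete exports after 7 days (assumes an "exports/" prefix).
        { ID: "expire_exports_7_days", Status: "Enabled", Filter: { Prefix: "exports/" }, Expiration: { Days: 7 } },
        // Permanently delete temporary files after 2 days (assumes the "tmp/" prefix used by Vapor).
        { ID: "expire_tmp_2_days", Status: "Enabled", Filter: { Prefix: "tmp/" }, Expiration: { Days: 2 } },
      ],
    },
  }),
);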
CORS Policies
We must define which origins, methods, and headers are allowed to interact with the platform's buckets. Example from AWS:
[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["HEAD", "GET", "PUT", "POST"],
    "AllowedOrigins": ["*"],
    "ExposeHeaders": []
  }
]
Example from Cloudflare R2:
[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["HEAD", "GET"],
    "AllowedOrigins": ["*"]
  }
]
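A sketch of how the AWS CORS rules above could be applied programmatically with the AWS SDK v3 (bucket name and region are placeholders):

import { S3Client, PutBucketCorsCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" }); // placeholder region

// Sketch: apply the AWS S3 CORS rules shown above to a bucket.
await s3.send(
  new PutBucketCorsCommand({
    Bucket: "storage-aws-production.publica.la",
    CORSConfiguration: {
      CORSRules: [
        {
          AllowedHeaders: ["*"],
          AllowedMethods: ["HEAD", "GET", "PUT", "POST"],
          AllowedOrigins: ["*"],
          ExposeHeaders: [],
        },
      ],
    },
  }),
);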
Local Environment Simulation
- Nginx will do the work that Cloudflare does in production.
- Micelio will be the project in charge of running the Workers with Hono, as we do within Cloudflare in production.
- Minio will simulate the storage structure in our local environment; it is S3-compatible and lets us mirror the production environment closely. This is where we will create all of our buckets and define each policy.
- A Docker container for each project/service, running permanently (this differs from production, where we have a microservices structure with on-demand Lambdas).
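A minimal sketch of pointing an S3 client at the local Minio instance; the endpoint, port, and credentials are assumptions about the local Docker setup:

import { S3Client, CreateBucketCommand } from "@aws-sdk/client-s3";

// Sketch: the same S3 API used against AWS/R2, pointed at the local Minio container.
const minio = new S3Client({
  region: "us-east-1",               // Minio accepts any region value
  endpoint: "http://localhost:9000", // assumed local Minio port
  forcePathStyle: true,              // path-style addressing, required by Minio
  credentials: {
    accessKeyId: "minioadmin",       // assumed default local credentials
    secretAccessKey: "minioadmin",
  },
});

// Create one of the local buckets defined in this document.
await minio.send(new CreateBucketCommand({ Bucket: "assets-minio-local" }));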
Final Bucket Configuration
Content
Provider: AWS S3
Names: storage-aws-production.publica.la, storage-staging.publica.la, storage-minio-local
Tenant scoped: Yes
Expiration: No
Purpose: Original files (PDF, EPUB, MP3) and post-processing files:
- PDF: optimized PDF, large images, thumb images, text layer, annotation layer
- EPUB: unpacked EPUB files
- MP3: no change at the moment, but we might start converting to lower bitrates and/or other codecs
Content Downloads
Provider: Cloudflare R2
Names: content-downloads-cf-production, content-downloads-cf-staging, content-downloads-minio-local
Tenant scoped: No
Expiration: Yes, 48 hours (expire_all_2_days)
Purpose: Encrypted files for downloads; at the moment this only includes EPUB tarballs
Assets
Provider: Cloudflare R2
Names: assets-cf-production, assets-cf-staging, assets-minio-local
Tenant scoped: Yes
Expiration: No
Purpose: Tenant's logos and icons, and their users' branding logos
IO (Imports, Exports, Temporal)
Provider: AWS S3
Names: farfalla-io-aws-production, farfalla-io-aws-staging, farfalla-io-minio-local
Tenant scoped: No
Expiration:
- all: Yes, 3 months (expire_all_90_days)
- exports: Yes, 1 week (expire_exports_7_days)
- tmp: Yes, 48 hours (expire_tmp_2_days)
Purpose:
- import: Tenant's imports from the dashboard and embedded Nova resources
- exports: Tenant's exports from the dashboard and embedded Nova resources, and publica.la's exports from Nova
- tmp: temporary files during content, logo, icon, and any other uploads through the dashboard (this name is required and used by Vapor)
Logs
Provider: AWS S3
Names: farfalla-logs-aws-production, farfalla-logs-aws-staging, farfalla-logs-minio-local
Tenant scoped: No
Expiration: Yes, 12 months (expire_all_365_days)
Purpose: Farfalla's general logs
Policies and permissions in AWS Buckets
{
  "Version": "2012-10-17",
  "Id": "S3PolicyId1",
  "Statement": [
    {
      "Sid": "IPAllow - Cloudflare range IPs",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::BUCKET-NAME/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "2400:cb00::/32",
            "2606:4700::/32",
            "2803:f800::/32",
            "2405:b500::/32",
            "2405:8100::/32",
            "2a06:98c0::/29",
            "2c0f:f248::/32",
            "173.245.48.0/20",
            "103.21.244.0/22",
            "103.22.200.0/22",
            "103.31.4.0/22",
            "141.101.64.0/18",
            "108.162.192.0/18",
            "190.93.240.0/20",
            "188.114.96.0/20",
            "197.234.240.0/22",
            "198.41.128.0/17",
            "162.158.0.0/15",
            "104.16.0.0/13",
            "172.64.0.0/13",
            "104.24.0.0/14",
            "131.0.72.0/22"
          ]
        }
      }
    }
  ]
}
CORS in AWS S3 Buckets
[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["HEAD", "GET", "PUT", "POST"],
    "AllowedOrigins": ["*"],
    "ExposeHeaders": []
  }
]
CORS in Cloudflare R2 Buckets
[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["HEAD", "GET"],
    "AllowedOrigins": ["*"]
  }
]
Cloudflare CNAMEs to specific Buckets
Domain: publica.la
- storage-aws-production -> storage-aws-production.publica.la.s3.amazonaws.com
- storage-staging -> storage-staging.publica.la.s3.amazonaws.com