Project:S3-K8s

MediaWiki S3 Image Storage with Proxy

This documentation describes how to configure MediaWiki to store images in an S3 bucket that's only accessible within the ZIB VPN, using a public proxy service to make images accessible to external users.

Architecture Overview

The solution consists of three main components:

  • MediaWiki with AWS Extension: Stores uploaded images directly in S3 instead of on the local filesystem
  • S3 Bucket: Private storage accessible only within ZIB's VPN
  • S3 Proxy Service: Public-facing service that authenticates with S3 and serves images to external users
User Upload → MediaWiki (AWS Extension) → Private S3 Bucket
                                               ↓
External User   →  S3 Proxy Service     → Private S3 Bucket

Benefits

  • Scalable Storage: Images are stored in S3 instead of on the local filesystem
  • Automatic Backup: S3 provides built-in redundancy and backup
  • Stateless MediaWiki: MediaWiki replicas can be created or destroyed without data loss
  • Public Access: External users can view images despite private S3 bucket
  • Security: S3 bucket remains private within institution network

MediaWiki Configuration

Install AWS Extension

Install the AWS extension by following its documentation on GitHub.

Add the configuration in LocalSettings.php. Our specific configuration currently includes:

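// Load the AWS extension so that uploads go to S3 instead of the local filesystem.
wfLoadExtension( 'AWS' );

// Credentials are read from the environment rather than hard-coded here.
$wgAWSCredentials = [
	'key' => getenv('S3_IMAGES_KEY'),
	'secret' => getenv('S3_IMAGES_SECRET'),
	'token' => false
];

// S3_ENDPOINT is expected as a bare hostname; the scheme is prepended here.
$s3endpoint = getenv('S3_ENDPOINT');
$wgFileBackends['s3']['endpoint'] = 'https://' . $s3endpoint;
// Address the bucket as <endpoint>/<bucket> instead of <bucket>.<endpoint>.
$wgFileBackends['s3']['use_path_style_endpoint'] = true;

$wgAWSRegion = 'default';
$wgAWSBucketName = 'mardi-portal';

// Public image URLs point at the proxy domain rather than at S3 directly.
$wgAWSBucketDomain = 'images.' . getenv('WIKIBASE_HOST');

// Keep each environment's files under its own top-level prefix in the bucket.
$wgAWSBucketTopSubdirectory = "/" . getenv('S3_ENVIRONMENT');
// Number of hashed directory levels for public and for deleted files.
$wgAWSRepoHashLevels = '2';
$wgAWSRepoDeletedHashLevels = '3';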
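
To make the effect of these settings concrete, the following Python sketch reconstructs the public URL of an original (non-thumbnail) upload. It assumes that the URL is the bucket domain, followed by the top subdirectory, followed by MediaWiki's usual hashed path (prefixes of the MD5 digest of the normalized file name). The function name `public_image_url` and the example host and environment values are hypothetical, and URL encoding of special characters is left out.

```python
import hashlib


def public_image_url(host, environment, filename, hash_levels=2):
    """Sketch of the URL under which an uploaded original is expected to be served."""
    # MediaWiki normalizes file names: spaces become underscores and the
    # first letter is capitalized.
    name = filename.replace(" ", "_")
    name = name[:1].upper() + name[1:]
    # Each hashed directory level is a growing prefix of the MD5 hex digest,
    # for example "a/ab/" for two levels.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    hash_path = "".join(digest[:i] + "/" for i in range(1, hash_levels + 1))
    # Assumed layout: $wgAWSBucketDomain + $wgAWSBucketTopSubdirectory + hash path.
    return f"https://images.{host}/{environment}/{hash_path}{name}"


# Hypothetical values for WIKIBASE_HOST and S3_ENVIRONMENT.
print(public_image_url("portal.example.org", "staging", "my diagram.png"))
```

Comparing the printed URL with the one MediaWiki renders for the same file is a quick way to check that the hash levels and the top subdirectory are set as intended.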

S3 Proxy Service

The S3 proxy service acts as a public gateway to our private S3 bucket: it authenticates against S3 and serves images with appropriate content-type, caching, and security headers. The proxy's source code is available in a public repository.

Key Features

  • Content Type Detection: Automatically determines MIME types based on file extensions
  • Caching Headers: Sets appropriate cache headers for better performance
  • Conditional Requests: Supports ETag and Last-Modified headers for efficient caching
  • Security Headers: Adds security headers for different content types
  • Health Checks: Provides health check endpoint for Kubernetes monitoring
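
The first and third features can be illustrated with a minimal Python sketch. This is not the proxy's actual code; the function names are hypothetical, and the headers are assumed to arrive as a plain dictionary with the exact key `If-None-Match` (real web frameworks usually offer case-insensitive header access).

```python
import mimetypes


def guess_content_type(key):
    """Map an object key to a MIME type based on its file extension."""
    content_type, _encoding = mimetypes.guess_type(key)
    # Fall back to a generic binary type when the extension is unknown.
    return content_type or "application/octet-stream"


def _strip_weak(tag):
    """Drop the weak-validator prefix so weak and strong tags compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, etag):
    """Return True when the client's cached copy matches the current ETag."""
    header = request_headers.get("If-None-Match")
    if header is None:
        return False
    if header.strip() == "*":
        return True
    # The header may list several validators separated by commas.
    candidates = [_strip_weak(value) for value in header.split(",")]
    return _strip_weak(etag) in candidates
```

When `is_not_modified` returns `True`, a server can answer with `304 Not Modified` and an empty body instead of fetching and resending the object.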

Environment Variables

The proxy service requires these environment variables:

S3_REGION=region
S3_ENDPOINT=https://s3-endpoint
S3_BUCKET_NAME=bucket-name
S3_ACCESS_KEY_ID=access-key
S3_SECRET_ACCESS_KEY=secret-key
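
A service that depends on these variables benefits from failing fast at startup when one of them is missing. The sketch below shows one way to do that in Python; `load_s3_settings` is a hypothetical helper, not a function taken from the proxy. Note that `S3_ENDPOINT` here includes the `https://` scheme, whereas the MediaWiki configuration prepends the scheme itself.

```python
import os

REQUIRED_VARS = (
    "S3_REGION",
    "S3_ENDPOINT",
    "S3_BUCKET_NAME",
    "S3_ACCESS_KEY_ID",
    "S3_SECRET_ACCESS_KEY",
)


def load_s3_settings(environ=os.environ):
    """Return the required settings, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        # Report every missing variable at once instead of one per restart.
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in REQUIRED_VARS}
```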

Kubernetes Deployment

The S3 proxy container image is deployed using a Helm chart defined in our Kubernetes repository.

The environment variables listed above must also be stored as secrets in the cluster.

Configuration Values

The deployment can be customized by modifying values.yaml:

image:
  repository: ghcr.io/mardi4nfdi/s3-proxy
  tag: main
  pullPolicy: Always
replicas: 2
servicePort: 80
containerPort: 8000
ingress:
  host: "images.your-domain.com"
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Verification

Test S3 Proxy

# Check health endpoint
curl https://images.your-domain.com/health

# Test image access
curl -I https://images.your-domain.com/path/to/image.jpg
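
The second check can also be scripted. The following Python sketch sends a `HEAD` request and reports which caching-related headers are absent from the response. The list of expected headers is an assumption derived from the feature list above (`Cache-Control` in particular is a guess at what "cache headers" means), and both function names are hypothetical.

```python
import urllib.request

# Headers the proxy is assumed to set on image responses.
EXPECTED_HEADERS = ("Content-Type", "Cache-Control", "ETag", "Last-Modified")


def missing_headers(headers):
    """Return the expected header names not present (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [name for name in EXPECTED_HEADERS if name.lower() not in present]


def head(url, timeout=10):
    """Send a HEAD request and return the status code and response headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, dict(response.headers.items())


# Example usage against a deployed proxy (placeholder URL):
#   status, headers = head("https://images.your-domain.com/path/to/image.jpg")
#   print(status, missing_headers(headers))
```

An empty list from `missing_headers` together with a `200` status suggests that the proxy is serving the image with the headers described earlier.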

Test MediaWiki Integration

  • Check that thumbnails are shown under Special:ListFiles
  • Upload an image through the MediaWiki interface at Special:Upload
  • Verify the image appears correctly on wiki pages
  • Check that the image URL points to the proxy domain
  • Confirm the image is stored in the S3 bucket using s3cmd:
s3cmd --host=<endpoint> --host-bucket=<endpoint> --region=<region> --access_key=<access_key> --secret_key=<secret_key> ls s3://your-bucket-name/

Monitoring

The S3 proxy includes health check endpoints and structured logging. Monitor the following metrics in Grafana:

  • Response times and error rates
  • S3 API call success/failure rates
  • Cache hit/miss ratios
  • Resource usage (CPU/memory)