
How EditClips.online got off Cloudflare Pages and Modal

I ran EditClips.online on Cloudflare Pages with Modal for AI compute. The combination was elegant on paper and unsustainable in practice. Here's the migration story: what I replaced it with, what I kept, and the trade-offs along the way.

I run EditClips.online, a free browser-based video editing site. A good chunk of the catalog runs on AI — background removal, license-plate blur, frame interpolation, upscaling, generative inpainting, transcription — and the rest is FFmpeg transcoding.

The short version of this post: I tried running the heavy compute on managed cloud twice, and got surprised by the bill both times. The first surprise was a $174 Cloudflare bill after one month of doing video transcoding on Cloudflare Containers. The second was ~$120/month on Modal for the AI recipes, with credits draining faster than I was comfortable with. Both times the fix was the same: move the work onto hardware I already pay for.

Today the website and transcoding run in Docker on a dedicated server I own through my hosting company, and all 47 AI recipes that used to run on Modal run on an RTX 2080 Ti on a rack in my office. Cloudflare is down to DNS proxy and R2 object storage.

This post is about both migrations — what drove them, what the real numbers were, and when you should not copy this.

Where it started

EditClips began life on Cloudflare’s stack, and for a static-ish site it was a great fit:

Web frontend + API: Astro with React islands on Cloudflare Pages, with Workers handling the API layer.

Video transcoding: Cloudflare Containers. When someone stretched a video, inverted colors, or compressed a clip, the FFmpeg job ran in a Container. This is the part that bit me.

File storage: Cloudflare R2, S3-compatible API. Uploads, processed outputs, thumbnails.

AI compute (added later): Modal. Each recipe was a Modal class with @app.cls(gpu=..., ...); the web layer called Recipe.process.remote(payload) and Modal spun up a container with the right CUDA wheels.

On paper it was clean — edge-cached frontend, serverless transcoding, cold-start-aware GPU, cheap object egress. In practice the metered-compute pieces (Containers, then Modal) were the ones that hurt.

The two metered pieces broke on two different timelines — Cloudflare Containers first, in April, then Modal about seven weeks later.

Round 1: the $174 Cloudflare bill

The first month I ran video transcoding on Cloudflare Containers, the bill came to $174.

That’s not a Cloudflare horror story — Containers are priced fairly for what they are, and the per-job cost is reasonable in isolation. The problem is the workload. EditClips is free and ad-supported, so a tool page gets a lot of one-off visitors who drop a clip, run a single FFmpeg job, and leave. Metered per-second compute against that traffic pattern is brutal: you’re paying for a cold container spin-up on every bounce. $174 in month one, on a site making a fraction of that in ad revenue, was not a curve I wanted to see compound.

I already pay for dedicated servers — I run a small hosting company, FadeHost, so spare capacity on a box is something I have lying around. The fix was obvious: stop renting metered compute and run the transcoding on a server I already own. Over the first few weeks of April 2026 I moved the whole Cloudflare stack off, piece by piece:

  1. Containers → a Bun worker on my own box. FFmpeg jobs now run in a long-lived Node/Bun process on a FadeHost dedicated server instead of a per-job Cloudflare Container. (Remove CF containers, April 2.)
  2. Workers → a self-hosted Bun server. The API layer moved off Workers into the same process. (Migrate from Cloudflare Workers to self-hosted Bun server, April 4.)
  3. CI → a self-hosted build runner, to stop burning GitHub Actions minutes on Docker image builds. (April 10.)
  4. Pages → full SSR. Switched Astro to output: 'server' with the @astrojs/node adapter and dropped prerender = true everywhere; the dynamic [slug].astro route renders all ~120 tools server-side now. (Full SSR migration, April 23.)

The end state: everything runs in Docker behind Caddy (TLS termination + reverse proxy) on a FadeHost server, with SQLite as a local file on a mounted volume. Cloudflare stays in front purely as DNS proxy + WAF, plus R2 for storage.

What moving off Pages bought me, beyond killing the Containers bill:

  • A persistent WebSocket server for live job-progress updates — something Workers can’t host, and which I’d otherwise have needed Durable Objects for.
  • A persistent file system for chunked uploads (8MB chunks reassembled before the R2 upload).
  • A single SQLite file as the source of truth, with daily backups rsynced to another machine.
  • The job dispatcher now lives in the same process as the web server, which made progress reporting and cancellation much simpler.
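The chunked-upload bullet above can be sketched roughly as follows. This is an illustration only: the real web tier runs on Node/Bun, not Python, and the file naming, the `store_chunk` and `reassemble` helpers, and the checksum step are all assumptions rather than the site's actual code.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 8 * 1024 * 1024  # 8MB, the chunk size mentioned in the post


def store_chunk(upload_dir: Path, index: int, data: bytes) -> None:
    """Persist one chunk to local disk under a zero-padded, index-based name."""
    upload_dir.mkdir(parents=True, exist_ok=True)
    (upload_dir / f"{index:06d}.part").write_bytes(data)


def reassemble(upload_dir: Path, total_chunks: int, dest: Path) -> str:
    """Concatenate chunks 0..total_chunks-1 into dest and return its SHA-256.

    Chunks may have arrived in any order; a missing one raises before the
    file is handed on to object storage.
    """
    digest = hashlib.sha256()
    with dest.open("wb") as out:
        for index in range(total_chunks):
            part = upload_dir / f"{index:06d}.part"
            if not part.exists():
                raise FileNotFoundError(f"missing chunk {index}")
            data = part.read_bytes()
            out.write(data)
            digest.update(data)
    return digest.hexdigest()
```

The point of the persistent file system is visible here: chunks survive between requests on local disk, so the server only talks to R2 once, with the complete file.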

What I gave up was edge caching by default — but I got most of it back by setting Cache-Control: s-maxage=3600 on tool pages and letting Cloudflare’s proxy cache them for anonymous visitors. Same effect, just configured explicitly instead of for free.

R2 I kept, and have no plans to move. Free egress, solid S3 API, and at my volume it runs about $4/month. It’s the one piece of the original Cloudflare stack that was never the problem.

Round 2: the Modal bill

The AI recipes landed on Modal because the alternative — running heavy GPU models on a Cloudflare Container — was exactly the cost pattern I’d just escaped. Modal is purpose-built for GPU and far better suited to it, and for a while it was the right call.

But the underlying economics are the same as Round 1: metered compute against a free, high-bounce traffic pattern. Modal was costing me roughly $120/month at steady state. The per-recipe cost is reasonable in isolation (on the order of half a cent per background-removal image, a couple of cents per inpaint, a few cents per transcription), but the volume compounds fast when a large share of free users open an AI tool, run one job, and leave. The bill is also sensitive to peak load: the same number of jobs costs more when they arrive in a burst, because a busy hour provisions more containers and bills more container-seconds than a quiet one.

It was the same lesson twice. I’d already proven to myself, in April, that for this workload owning the hardware beats renting compute. The only question left was whether I had a GPU that could do the job.

What I built is unglamorous on purpose. I bought nothing: the GPU was already in a Windows desktop I was using to test PyTorch jobs. I:

  1. Installed Ubuntu 26.04 LTS on the second SSD in that machine. Dual-boot, so I can still use Windows for testing.
  2. Installed Tailscale and gave the box a stable hostname (ai-box).
  3. Set up a Python venv with uv and installed PyTorch with CUDA 12.6 for the RTX 2080 Ti.
  4. Wrote a single FastAPI server (gpu-worker/server.py) that exposes one /process endpoint and dispatches internally to per-recipe modules.
  5. Wrote a ModelRegistry with a strict single-model invariant: only one model is allowed in VRAM at a time. When a job needs a different model, the registry evicts the current one (del, gc.collect(), torch.cuda.empty_cache()) before loading the next.
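Step 4's "one endpoint, many recipes" shape might look something like the sketch below. It is framework-free so it runs anywhere; in a FastAPI server the `process` function would be the body of the `@app.post("/process")` handler. The request fields (`recipe`, `payload`), the `recipe` decorator, and the `echo` handler are hypothetical names, not taken from gpu-worker/server.py.

```python
from typing import Any, Callable, Dict

# Hypothetical handler signature: job payload in, result dict out.
Handler = Callable[[Dict[str, Any]], Dict[str, Any]]

RECIPES: Dict[str, Handler] = {}


def recipe(name: str) -> Callable[[Handler], Handler]:
    """Register a per-recipe handler under a name the web tier can request."""
    def decorator(fn: Handler) -> Handler:
        RECIPES[name] = fn
        return fn
    return decorator


@recipe("echo")
def echo(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in recipe; a real one would run a model on the input."""
    return {"echo": payload.get("input")}


def process(body: Dict[str, Any]) -> Dict[str, Any]:
    """Single entry point: look up the recipe and run it, reporting errors as data."""
    name = body.get("recipe")
    handler = RECIPES.get(name)
    if handler is None:
        return {"ok": False, "error": f"unknown recipe: {name!r}"}
    try:
        return {"ok": True, "result": handler(body.get("payload", {}))}
    except Exception as exc:  # keep one failing recipe from crashing the worker
        return {"ok": False, "error": str(exc)}
```

Keeping dispatch in a plain dictionary means adding a recipe is one decorated function, and the HTTP surface never changes.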

The single-model rule is non-obvious but matters enormously on an 11GB GPU. The first version of the registry let multiple small models coexist, and OOM errors started showing up after about 30 jobs as the allocator fragmented and refused to find contiguous memory. Strict-evict trades startup latency (each model takes 3-15 seconds to load cold) for predictable behavior. Worth it.
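A minimal sketch of the strict single-model invariant, assuming each model is described by a zero-argument loader function. The class layout and method names are my guess at a reasonable shape, not the actual ModelRegistry; only the eviction sequence (drop the reference, `gc.collect()`, `torch.cuda.empty_cache()`) comes from the description above.

```python
import gc
from typing import Any, Callable, Dict, Optional

try:
    import torch  # optional, so the sketch also runs on a machine without PyTorch
except ImportError:
    torch = None


class ModelRegistry:
    """Holds at most one model; asking for a different one evicts the current one first."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._name: Optional[str] = None
        self._model: Any = None

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        self._loaders[name] = loader

    @property
    def resident(self) -> Optional[str]:
        """Name of the model currently held, or None if nothing is loaded."""
        return self._name

    def evict(self) -> None:
        if self._name is None:
            return
        self._model = None  # drop the registry's reference to the model
        self._name = None
        gc.collect()  # collect reference cycles that may still pin tensors
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver

    def get(self, name: str) -> Any:
        if name == self._name:
            return self._model  # warm path: no load, no eviction
        if name not in self._loaders:
            raise KeyError(f"no loader registered for {name!r}")
        self.evict()  # strict: never two models resident at once
        self._model = self._loaders[name]()
        self._name = name
        return self._model
```

One caveat with this shape: eviction only frees VRAM if nothing else still holds a reference to the old model, so recipe code should fetch the model from the registry per job rather than caching it.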

Then I ported the recipes. Eight days of porting work, split into phases:

  • Phase 1: RIFE (frame interpolation) and BiRefNet (background removal). These were simple in-process PyTorch models, mostly mechanical.
  • Phase 2: Klein 9B (FLUX.2 derivative for inpainting) running through ComfyUI as a managed subprocess. The registry spawns ComfyUI on first use via systemctl start editclips-comfy, holds it warm for subsequent jobs, and evicts it when something else needs the VRAM.
  • Phase 3: Whisper and Parakeet for transcription, both in-process (Whisper via faster-whisper).
  • Phase 4: SeedVR2 video upscaling, also through ComfyUI.

Bonus rounds:

  • SAM 3.1 for segmentation, ported from a Modal sidecar to ComfyUI’s native SAM3_Detect and SAM3_VideoTrack nodes (which got added to ComfyUI in v0.22 — nicely timed).
  • Gemma 4 E4B for the small LLM tasks (auto-classification, prompt enrichment), running via llama.cpp’s llama-server with the Unsloth GGUF build.

When everything was ported and stable, I stopped the Modal apps on May 25 and deleted the Modal code. That single file — modal/app.py — was 11,177 lines, and it all went. I kept scripts/deploy-modal.sh for two more days as a safety blanket, then deleted it too.

The architecture today

                ┌─────────────────┐
   Browser ─→   │ Cloudflare      │ (DNS proxy, WAF, R2 storage)
                │ orange cloud    │
                └────────┬────────┘

                ┌────────▼──────────────────────┐
                │ eu1.fadehost.net              │
                │ ┌───────────────────────────┐ │
                │ │ Caddy (TLS, :443)         │ │
                │ └────┬──────────────────┬───┘ │
                │      │                  │     │
                │ ┌────▼──────────┐ ┌─────▼───┐ │
                │ │ editclips-web │ │ ytdlp   │ │
                │ │ Astro+Node    │ │ sidecar │ │
                │ │ (:4321 + WS)  │ └─────────┘ │
                │ └────┬──────────┘             │
                │      │ SQLite                 │
                │      │ /data/editclips.db     │
                └──────┼─────────────────────────┘

                       │ cloudflared tunnel
                       │ (jobs dispatched here)

                ┌──────────────────────────────┐
                │ ai-box (home office rack)    │
                │ Ubuntu 26.04 LTS             │
                │ RTX 2080 Ti (11GB)           │
                │                              │
                │ ┌────────────────────────┐   │
                │ │ editclips-gpu service  │   │
                │ │ - ModelRegistry        │   │
                │ │ - strict single-model  │   │
                │ │ - 35 recipes           │   │
                │ └──┬─────────────────────┘   │
                │    │ spawn/evict             │
                │    ▼                         │
                │ ┌────────────┐ ┌──────────┐  │
                │ │ ComfyUI    │ │ llama-   │  │
                │ │ (lazy)     │ │ server   │  │
                │ │ port 8001  │ │ port 8005│  │
                │ └────────────┘ └──────────┘  │
                └──────────────────────────────┘

When a user clicks “Remove Background” on the site:

  1. The browser uploads the image in 8MB chunks to editclips-web on eu1.
  2. eu1 stores it in R2, then dispatches a job via the local SQLite queue.
  3. The dispatcher POSTs to https://ai-box-tunnel.editclips.online/process (a Cloudflare tunnel that terminates at ai-box).
  4. The GPU worker on ai-box loads BiRefNet (evicting whatever was in VRAM), runs the model, uploads the result back to R2.
  5. The browser, listening on the WebSocket, receives progress updates and then the download URL.
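Step 2's "local SQLite queue" could be as simple as one table with a status column. The sketch below is an assumption about how such a queue might work, not the site's schema: the table layout, column names, and the `enqueue` and `claim_next` helpers are hypothetical, and the real dispatcher lives in the web process rather than in Python. The HTTP POST to the GPU worker is left out.

```python
import sqlite3
from typing import Optional, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    recipe  TEXT NOT NULL,
    r2_key  TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'queued'
);
"""


def enqueue(db: sqlite3.Connection, recipe: str, r2_key: str) -> int:
    """Add a job pointing at an already-uploaded R2 object; return its id."""
    cur = db.execute(
        "INSERT INTO jobs (recipe, r2_key) VALUES (?, ?)", (recipe, r2_key)
    )
    db.commit()
    return cur.lastrowid


def claim_next(db: sqlite3.Connection) -> Optional[Tuple[int, str, str]]:
    """Claim the oldest queued job, or return None if the queue is empty.

    The UPDATE is guarded by status = 'queued', so a job that was claimed
    elsewhere in the meantime is not handed out twice.
    """
    with db:  # commits on success, rolls back on exception
        row = db.execute(
            "SELECT id, recipe, r2_key FROM jobs "
            "WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        updated = db.execute(
            "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'queued'",
            (row[0],),
        )
        return row if updated.rowcount == 1 else None
```

With a single GPU that runs one job at a time, the dispatcher only needs to call `claim_next` again when the previous job finishes, which is what turns a traffic burst into a queue instead of a fan-out.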

A warm simple job (background removal) finishes in a few seconds; a Klein inpaint is tens of seconds of actual GPU work. Cold start adds the model-load time on top — a few seconds for the small models, longer for Klein. The one structural advantage over serverless GPU is that the model files live on a local NVMe instead of being pulled from object storage on every cold container, so the cold path is consistently shorter.

Numbers, before and after

I want to be honest about the accounting here, because there’s an easy version of this story that overstates the savings.

The web tier now runs on a dedicated server I own through my hosting company, FadeHost. My out-of-pocket cost for that box is effectively zero — but that’s not a fair comparison for anyone who’d have to rent one. At market rate, a dedicated server like this is about $60/month, so that’s what I’ll put in the “after” column. The AI runs on an RTX 2080 Ti I already had on the shelf; if you were buying one today it’s about $200 used, which is a one-time investment, not a recurring cost.

Cost                   | Before                          | After (market rate)
Web hosting            | $0 (Cloudflare Pages free tier) | ~$60/mo (dedicated server)
AI compute             | ~$120/mo (Modal)                | $0
GPU hardware           | -                               | ~$200 one-time (already owned)
Electricity for ai-box | -                               | ~$8/mo (mostly idle)
R2 storage             | ~$4/mo                          | ~$4/mo
Recurring monthly      | ~$124                           | ~$72

So at market rate the recurring bill drops from ~$124 to ~$72 — about 42%. The big win is killing Modal entirely; moving the web tier off Cloudflare Pages onto a real server actually added cost on paper, but it bought the WebSocket server and the persistent file system I needed anyway.

Because I happen to own the hosting company, my actual out-of-pocket is closer to ~$12/month (electricity + R2). I’m not counting that as the headline number — it’s a quirk of my situation, not a repeatable result. If you don’t own a rack, the honest figure is the $72.

The $200 GPU is the interesting line. Treat it as capital expenditure: against the ~$120/month Modal bill it replaces, a used 2080 Ti pays for itself in under two months and then runs for years. For a workload that fits in 11GB, buying the card is dramatically cheaper than renting equivalent GPU-seconds over any horizon longer than a quarter.
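For anyone who wants to plug in their own numbers, the arithmetic behind the table and the payback claim is short enough to write down. The figures are the approximate ones quoted above; the variable names are mine.

```python
# Approximate monthly figures from the cost table (USD)
before = {"web_hosting": 0, "ai_compute": 120, "r2_storage": 4}
after = {"web_hosting": 60, "ai_compute": 0, "electricity": 8, "r2_storage": 4}

recurring_before = sum(before.values())  # 124
recurring_after = sum(after.values())    # 72

# Relative drop in the recurring bill at market rate
savings_pct = 100 * (recurring_before - recurring_after) / recurring_before

# One-time GPU cost against the Modal bill it replaces
gpu_price = 200
payback_months = gpu_price / before["ai_compute"]

# Same payback, but net of the electricity the box now consumes
payback_months_net = gpu_price / (before["ai_compute"] - after["electricity"])

print(f"recurring: ${recurring_before} -> ${recurring_after} ({savings_pct:.1f}% lower)")
print(f"GPU payback: {payback_months:.2f} months ({payback_months_net:.2f} net of power)")
```

Even after charging the electricity against the savings, the card pays for itself in under two months, so the conclusion does not hinge on how the power bill is counted.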

The honest latency picture is mixed. Warm AI jobs are fast and the cold path is shorter than serverless (local model files), but I gave up autoscaling — there’s one GPU running one job at a time, so a burst of traffic queues instead of fanning out across containers. For my volume that’s fine; a queue that drains in a minute or two is an acceptable trade for killing the bill. If I were latency-critical at peak, it wouldn’t be.

The real thing I gave up is the operational safety net. Modal and Cloudflare handle hardware failure, power, and networking for you. My setup is a single box I’m responsible for — if it goes down, the AI tools go down until I notice. For a free, ad-supported tool I’ve decided that’s an acceptable risk; for anything with a paid SLA it would not be. Know which one you’re running before you copy this.

When not to do this

I’m spelling out the trade-offs because I think people read posts like this and assume they should follow.

Don’t migrate off Cloudflare Pages if your site is a static marketing site with a small CRUD admin layer. Pages does that beautifully and the operational overhead of running your own server is real. If you’re not blocked on something Pages can’t do, stay on Pages.

Don’t kill Modal if your workload is genuinely GPU-heavy (real research training jobs, for example) or if you’d need to buy hardware to replace it. A single 2080 Ti can fit a 9B-parameter model with aggressive quantization, but it can’t fit a 70B-parameter model at any practical quantization. If your workload genuinely needs a 4090 or H100, you’re either paying Modal/Runpod for elastic capacity or you’re paying for the hardware. The math gets a lot less friendly.

Don’t run AI from your home internet if uptime matters more than savings. My homelab has fiber and a UPS, but a single residential ISP outage still takes the AI tools down. For a product with a paid SLA this is a no-go.

The reason it worked for EditClips is the workload shape: lots of small AI jobs (background remove a photo, inpaint a 1024×1024 region) that are easy to fit in 11GB, run for 5-60 seconds each, and where users tolerate a few seconds of latency. A 2080 Ti at home pinned to one job at a time can serve that pattern indefinitely.

What I’d do differently

Two things.

First, I should have built the ModelRegistry on day one. The first month of ai-box, I had a per-recipe Python process model — each recipe owned its own CUDA context and held its model resident. The CUDA context overhead alone was costing me 2-3GB of “missing” VRAM, plus the GPU was OOM’ing whenever two recipes tried to coexist. Consolidating to one Python process with strict eviction recovered that VRAM and made the failure mode “predictable wait for eviction” instead of “intermittent OOM.”

Second, I should have set up the per-job telemetry — GPU utilization, VRAM usage, model load/evict markers, CPU — on day one as well. I built it later, after a series of mystery slow jobs, and once I had the data the root cause (a model that was thrashing in and out of VRAM because two queued jobs alternated needing it) was obvious in a 60-second dashboard read. I would have caught it in a day instead of a week.
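The thrashing pattern is easy to see once model loads are counted per job sequence. The sketch below is a toy version of that one signal, assuming strict single-model eviction; the function name is hypothetical and the model names are just labels. The grouped ordering is shown only for comparison, not as a description of how the queue actually schedules jobs.

```python
from typing import Iterable, Optional


def count_model_loads(job_models: Iterable[str]) -> int:
    """Count cold loads when only one model can be resident at a time."""
    loads = 0
    resident: Optional[str] = None
    for model in job_models:
        if model != resident:
            loads += 1  # the previous model is evicted and this one loaded cold
            resident = model
    return loads


# Two users' jobs interleaved in the queue: every job pays a cold load
alternating = ["birefnet", "klein", "birefnet", "klein", "birefnet", "klein"]

# The same six jobs grouped by model, for comparison
grouped = sorted(alternating)

print(count_model_loads(alternating))  # 6
print(count_model_loads(grouped))      # 2
```

A loads-per-job ratio near 1.0 over a window is the kind of number that makes this failure obvious on a dashboard, which is why the load and evict markers earn their place in the telemetry.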

Both of these are “single-machine SaaS thinking” instead of “serverless thinking,” and the gap between them was where my own confusion lived for the first month.

Closing

The most surprising thing about this migration isn’t the cost savings — it’s how much more I understand my own product now. When the GPU is in your office, you watch it work. You see when a recipe is slow and you know exactly which one. You add a feature, you can profile it end-to-end on real hardware. You don’t have to mentally simulate what Modal is doing on the other side of the API.

I’m not advocating for self-hosting as a moral position. Modal is a great product. Cloudflare Pages is a great product. They didn’t fit the shape of what EditClips needed to become, and the migration was worth doing. That’s the whole story.

If you’re in a similar spot — small AI product, growing past the comfortable serverless zone, with hardware sitting around — the answer might be the same. If you’re not, it probably isn’t.

If you want to look at the code, the worker that runs all of this is in gpu-worker on GitHub. I’m reachable at bernis.dev or on the EditClips site at editclips.online/about.