We Automated YouTube Shorts in a Weekend. Here's the Full System.

Consistent short-form video is one of the best trust signals a business can build right now. The problem is the production cost. Script. Find footage. Edit. Mix audio. Export. Upload. Write the caption. Repeat — forever, every week, if you want the algorithm to care.

There's a better model. We built it, it runs itself, and this post covers exactly how to replicate it. The core of the system is Blotato — a video rendering and social publishing platform with an API that lets you automate the entire production pipeline. Pair it with Claude Code for orchestration and the Pexels API for free stock footage, and you can go from idea to published YouTube Short without touching a timeline editor.

We've published 24 episodes this way. Here's the full system.

The Format: POV Lifestyle Shorts

Before the tech, the creative brief. The format that works for us is POV-style aspirational content — no pain states, no before/after contrast. Just the dream moment, already happening, with the business outcome landing quietly in the voiceover.

Each Short follows a 4-beat arc:

1. You're in the dream moment: beach, surf, gym, golf course, wherever.
2. Meanwhile, your business is running, framed in ambient lifestyle footage.
3. The specific outcome lands: a booking, a payment, a lead response.
4. Close with: "Comment automate below if you want your time back."

The voiceover carries 100% of the business narrative. The visuals just need to feel like the life. You can adapt this arc to any niche — the format is the frame, not the content.

How to Build the Pipeline: 7 Steps from Script to YouTube

Here's exactly how to set this up. The orchestration layer is a Claude Code skill, but the logic maps directly to any automation tool.

Step 1 — Manage your episode backlog

Keep your episode hooks and scene scripts in a JSON file. Each entry has a status (ready, published, scheduled), a hook line, a dream scene, and a scenes array. When you're ready to produce, load the backlog, pick an episode, and generate the scene scripts if they're not already written. Scripts should be reviewed before any footage is searched — the voiceover has to land on paper before it lands on video.
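A minimal sketch of the backlog step, assuming a file shaped like the description above. The field names (id, status, hook, dream_scene, scenes), the sample entries, and the helpers next_ready_episode and needs_scripts are illustrative, not a fixed schema.

```python
import json

# Hypothetical backlog; in practice this would be read from a JSON file on disk.
BACKLOG_JSON = """
[
  {"id": 1, "status": "published", "hook": "POV: invoices send themselves",
   "dream_scene": "surf at sunrise", "scenes": ["scene one voiceover"]},
  {"id": 2, "status": "ready", "hook": "POV: a booking lands mid-round",
   "dream_scene": "golf course at sunset", "scenes": []}
]
"""

def next_ready_episode(episodes):
    """Return the first episode whose status is 'ready', or None."""
    for episode in episodes:
        if episode.get("status") == "ready":
            return episode
    return None

def needs_scripts(episode):
    """True when the scenes array is empty and scripts still have to be written."""
    return not episode.get("scenes")

episodes = json.loads(BACKLOG_JSON)
episode = next_ready_episode(episodes)
```

Keeping the "are scripts written yet?" check separate from episode selection makes it easy to pause for the review step before any footage is searched.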

Step 2 — Search Pexels for footage

The Pexels API is free and can return portrait-oriented stock clips. Search for each scene using lifestyle-specific terms ("beach golden hour slow motion", "gym morning light", "golf course aerial sunset") and cap results at 1080 pixels wide (more on why below). For single-scene episodes, where one long clip covers the entire voiceover, filter by minimum duration so the clip doesn't end before the script does:

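# Duration estimate for single-scene episodes
TTS_WPM = 180        # approximate TTS speaking rate, in words per minute
BUFFER_SECONDS = 5   # safety margin so the clip outlasts the voiceover
words = len(script.split())  # script: the full voiceover text as one string
min_seconds = words / TTS_WPM * 60 + BUFFER_SECONDS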

Step 3 — Mute the source video

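Before anything goes to Blotato, strip the audio from the stock clip. If you skip this, Blotato keeps the ambient audio (wind, waves, crowd noise) underneath the voiceover. The fix is one ffmpeg command piped directly from curl, so the source file never saves to disk. If ffmpeg can't parse a particular clip from the pipe, which can happen when an MP4 stores its index at the end of the file, download it to a temp file first and run the same flags on that: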

curl -s "<pexels_url>" | ffmpeg -y -i pipe:0 -an -c:v copy /tmp/pexels_muted.mp4

Step 4 — Render via Blotato

Upload the muted clip to Blotato via a presigned URL, then kick off the render. Blotato handles everything from here — AI voiceover (multiple voice options including Australian accents), captions with custom highlight colours, transitions, and timing. The key config settings to lock in:

aspectRatio: "9:16" — vertical for Shorts/Reels
trimToVoiceover: true — clips trim to the exact voiceover length
animateAiImages: false — if using real footage, this must be off (more below)
captionPosition: "bottom" and highlightColor: "#10b981" — brand colours

Wait 3 minutes before your first status poll, then poll every 60 seconds after that. Typical render time with real footage is 3–5 minutes.
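The polling cadence can be sketched as follows. The config keys are the ones named above, but the surrounding payload shape is an assumption, and get_status is a hypothetical callable: wire it to whatever status lookup your Blotato client exposes. The status strings "done" and "failed" are also placeholders.

```python
import time

# Render settings named in the post; how they are nested in the request is assumed.
RENDER_CONFIG = {
    "aspectRatio": "9:16",
    "trimToVoiceover": True,
    "animateAiImages": False,
    "captionPosition": "bottom",
    "highlightColor": "#10b981",
}

def wait_for_render(get_status, sleep=time.sleep,
                    first_wait=180, interval=60, max_polls=15):
    """Poll get_status() until it returns 'done' or 'failed' (assumed values).

    Returns the final status, or 'timeout' if max_polls is exhausted.
    """
    sleep(first_wait)                  # hold off 3 minutes before the first poll
    for attempt in range(max_polls):
        status = get_status()
        if status in ("done", "failed"):
            return status
        if attempt < max_polls - 1:
            sleep(interval)            # then check once a minute
    return "timeout"
```

Passing sleep in as a parameter keeps the loop testable without actually waiting, and the max_polls ceiling stops a stalled job from blocking the pipeline forever.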

Step 5 — Mix background music via ffmpeg

Blotato's template doesn't support background music natively, so you add it post-render. Download the rendered video and mix in a background track with ffmpeg. Set the music to 0.7 volume and start it 2 seconds into the track. Use normalize=0 on the amix filter so the voiceover stays at full volume:

ffmpeg -y \
  -i /tmp/ep_raw.mp4 \
  -ss 2 -i background.mp3 \
  -filter_complex "[1:a]volume=0.7[music];[0:a][music]amix=inputs=2:duration=first:normalize=0[aout]" \
  -map 0:v -map "[aout]" \
  -c:v copy -c:a aac -shortest \
  /tmp/ep_mixed.mp4
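If the pipeline is driven from Python rather than a shell, the same command can be assembled as an argument list. This is a sketch only: build_mix_command and mix_music are hypothetical helper names, and the flags simply mirror the command above.

```python
import subprocess

def build_mix_command(video_in, music_in, video_out,
                      music_volume=0.7, music_offset=2):
    """Return the ffmpeg argument list for mixing music under the voiceover."""
    filter_graph = (
        f"[1:a]volume={music_volume}[music];"
        "[0:a][music]amix=inputs=2:duration=first:normalize=0[aout]"
    )
    return [
        "ffmpeg", "-y",
        "-i", video_in,
        "-ss", str(music_offset), "-i", music_in,   # skip the start of the track
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[aout]",
        "-c:v", "copy", "-c:a", "aac", "-shortest",
        video_out,
    ]

def mix_music(video_in, music_in, video_out):
    """Run the mix; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(build_mix_command(video_in, music_in, video_out), check=True)
```

Building a list instead of a shell string avoids quoting problems with the filter graph, and check=True surfaces a failed mix immediately instead of publishing a broken file.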

Step 6 — Publish to YouTube (and TikTok)

Upload the mixed video back to Blotato via another presigned URL, then use blotato_create_post to fire it to YouTube with the title, caption, hashtags, and music attribution. Blotato also supports TikTok, Instagram Reels, and LinkedIn video from the same API call — same asset, multiple platforms, no extra work.

Step 7 — Log the episode

Update your episode record with the published status, Blotato visual ID, and the YouTube URL. This gives you a history to pull from — you can see what's been published, what's scheduled, and which automation types you've already covered.
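A sketch of the logging step, assuming the backlog is a list of dicts loaded from JSON. The field names blotato_visual_id and youtube_url and both helper functions are illustrative, not part of any Blotato schema.

```python
import json

def mark_published(episodes, episode_id, visual_id, youtube_url):
    """Update one episode record in place and return it."""
    for episode in episodes:
        if episode["id"] == episode_id:
            episode["status"] = "published"
            episode["blotato_visual_id"] = visual_id
            episode["youtube_url"] = youtube_url
            return episode
    raise KeyError(f"no episode with id {episode_id}")

def save_backlog(episodes, path):
    """Write the backlog back to disk as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(episodes, f, indent=2)
```

Raising on an unknown id, rather than silently appending, keeps the history honest if an episode is ever published without a matching backlog entry.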

The whole thing (footage search, muting, render, music mix, publish) runs from a single prompt. No timeline editor. No manual uploads. No caption writing. Blotato is what makes the render and publish steps possible without building your own video infrastructure.

Five Things That'll Catch You Out

1. UHD clips break Blotato renders

The Pexels API returns every resolution it has for a clip, and for many clips that includes 4K UHD variants at 2160p or higher. Feeding those to Blotato causes the render to fail silently. Filter your footage search to cap results at 1080 pixels wide and pick the best portrait clip under that ceiling.

If your renders are randomly failing with no clear error, check the source resolution first. This was the cause every single time.
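A sketch of the resolution filter, assuming each Pexels video exposes a list of file renditions as dicts with width, height, and link keys. The helper name best_portrait_file is hypothetical.

```python
def best_portrait_file(video_files, max_width=1080):
    """Pick the largest portrait rendition no wider than max_width, or None."""
    candidates = [
        f for f in video_files
        if f.get("width") and f.get("height")
        and f["height"] > f["width"]      # portrait only
        and f["width"] <= max_width       # skip UHD renditions
    ]
    if not candidates:
        return None
    # Largest pixel area under the ceiling gives the sharpest safe clip.
    return max(candidates, key=lambda f: f["width"] * f["height"])
```

Returning None when nothing qualifies lets the caller move on to the next search result instead of sending an oversized file to the renderer.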

2. Muting the source isn't optional

The audio layer in Blotato's template sits on top of the source — it doesn't replace it. If the stock clip has audio, it comes through underneath the voiceover. Muting before upload takes 3 extra seconds and saves the render.

3. Single-scene clips must cover the full voiceover

In multi-scene episodes, trimToVoiceover: true handles timing automatically — each clip trims to its scene's voiceover length. In single-scene mode, the clip must be long enough to cover the entire TTS output. If it's shorter, the video ends before the script finishes. Always estimate the voiceover duration before searching and filter clips by minimum duration.
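A sketch of that pre-search check, reusing the 180 wpm estimate from Step 2. It assumes each search result is a dict with a duration field in seconds; the function names are hypothetical.

```python
def min_clip_seconds(script, wpm=180, buffer_seconds=5):
    """Estimate the shortest clip that can cover the voiceover."""
    words = len(script.split())
    return words / wpm * 60 + buffer_seconds

def long_enough(videos, script):
    """Keep only the clips whose duration covers the estimated voiceover."""
    needed = min_clip_seconds(script)
    return [v for v in videos if v.get("duration", 0) >= needed]
```

For example, a 90-word script needs a clip of at least 35 seconds: 30 seconds of speech plus the 5-second buffer.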

4. animateAiImages must be false for real footage

Leaving animateAiImages: true when you're using real Pexels clips causes script generation to stall indefinitely — no render, no error, just a job that never moves. Set it to false whenever you're using real footage instead of AI-generated images.

5. Use Python requests for large uploads on Mac

macOS ships a curl built against LibreSSL rather than OpenSSL. For large video files, a curl PUT to a presigned S3 URL fails partway through with a TLS error that's hard to trace. Switch to Python's requests library instead. This is a macOS-specific issue; Linux isn't affected.

python3 -c "
import requests
with open('/tmp/pexels_muted.mp4', 'rb') as f:
    r = requests.put('<presignedUrl>', data=f, headers={'Content-Type': 'video/mp4'})
print(r.status_code)
"

What We'd Do Differently

Schedule from day one. We published the first 10 episodes the same day. That's 10 videos in one drop with no distribution spread. Blotato has a scheduling API — stagger your releases 24–48 hours apart and the same batch of content runs for two weeks instead of one day.
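A sketch of the stagger arithmetic. It only computes the publish times; how they are passed to Blotato's scheduling API is not shown here, and the start date and 36-hour gap are arbitrary example values.

```python
from datetime import datetime, timedelta, timezone

def staggered_schedule(start, count, gap_hours=36):
    """Return `count` publish times spaced gap_hours apart, beginning at start."""
    return [start + timedelta(hours=gap_hours * i) for i in range(count)]

start = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)  # arbitrary example start
slots = staggered_schedule(start, count=10)

# Days between the first and last release in the batch.
span_days = (slots[-1] - slots[0]).total_seconds() / 86400
```

With 10 episodes and a 36-hour gap, the batch spans 13.5 days, which is where the "two weeks instead of one day" figure comes from.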

Cross-post from the start. We added TikTok on episode 11. The first 10 only went to YouTube. Blotato supports YouTube, TikTok, Instagram, and LinkedIn from the same API call, so there's no reason to leave that reach on the table.

Track engagement per automation type. Each episode covers a different outcome: booking, quoting, invoicing, lead response, follow-up. Rotating types keeps the content varied, but you should measure which ones drive the most engagement and let that data feed back into your backlog prioritisation.
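One way to close that loop, sketched under the assumption that each logged episode carries an automation_type label and a single engagement number (views, comments, or whatever metric you settle on). Both field names and the helper are hypothetical.

```python
from collections import defaultdict

def average_engagement_by_type(episodes):
    """Return (automation_type, mean engagement) pairs, highest first."""
    totals = defaultdict(lambda: [0, 0])  # type -> [engagement sum, episode count]
    for ep in episodes:
        bucket = totals[ep["automation_type"]]
        bucket[0] += ep["engagement"]
        bucket[1] += 1
    averages = {t: total / n for t, (total, n) in totals.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
```

Averaging per type rather than summing keeps a type you've covered five times from outranking one you've only tried once.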

Tools in this build

Blotato: Video rendering, AI voiceover, captions, and YouTube/TikTok publishing, all via MCP. The layer that turns a script and a video clip into a published Short.

Pexels: Free stock footage via API. Portrait-oriented lifestyle clips. No affiliate program, just genuinely the best free source for cinematic stock video.

Claude Code: The orchestration layer. The pov-shorts skill lives here, covering episode management, footage search, render coordination, and publishing logic, all in one place.

If you're spending time on content that could be running itself — or you want to build a system like this for your business — that's exactly what we do.

Get in touch →