AI video generation · April 29, 2026 · 6 min read

Sora: how to start making AI video

Access, prompt anatomy, clip length, remix and common failure modes. A practical start with Sora.

Sora turns a text description into video. Sounds simple, but week one is a stream of clips where hands have six fingers, a car phases through a wall, and the text on a shirt morphs mid-shot. That is not the tool’s fault — it is the lack of method. This guide gives you one: how to get access, how to build a prompt, where Sora beats Runway and Kling and where it loses, and how to iterate so that out of ten clips, two are actually good.

Access and plans

Sora runs through an OpenAI account, usually bundled with a ChatGPT subscription. In practice you will meet two tiers: a cheaper one (Plus-style) with a few dozen generations a month, shorter clips and lower resolution, and a pricier one (Pro-style) with a higher cap, longer clips, higher resolution and fewer watermarks in some regions. Treat these numbers as estimates — OpenAI changes limits and pricing more than once a quarter, so verify the current state in your own dashboard before paying.

Check availability in your country — the rollout is regional and occasionally paused.
Start on the cheaper plan for a week of testing before paying for the higher one.
Generation limits count every attempt, including failed ones — plan prompts, do not fire blind.
Export from the pricier plan usually gives cleaner footage for downstream editing.
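
Because failed attempts also count against the cap, it can help to do the arithmetic before generating. The sketch below is only an illustration: the function name plan_budget, the cap of 50 and the five attempts per shot are made-up placeholders, not OpenAI figures, and the default of two keepers in ten simply reuses the rough hit rate mentioned in the introduction.

```python
def plan_budget(monthly_cap: int, attempts_per_shot: int, keepers_per_ten: int = 2) -> dict:
    """Estimate how far a monthly generation cap stretches.

    Every argument is an assumption you supply; none is an OpenAI figure.
    """
    if monthly_cap < 0 or attempts_per_shot <= 0 or not 0 <= keepers_per_ten <= 10:
        raise ValueError("invalid budget inputs")
    return {
        # Distinct shots you can afford if each one takes this many attempts.
        "shots": monthly_cap // attempts_per_shot,
        # Clips likely to be usable at the assumed keeper rate, rounded down.
        "expected_keepers": monthly_cap * keepers_per_ten // 10,
    }


# Hypothetical plan: 50 generations a month, 5 attempts per shot.
budget = plan_budget(monthly_cap=50, attempts_per_shot=5)
print(budget)
```

Swap in the limits shown in your own dashboard; if the number of shots comes out lower than the project needs, that is a sign to plan prompts more carefully rather than to generate more.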

Anatomy of a prompt

A good video prompt is not one sentence but five layers assembled into a coherent description. Each layer drives a different part of the frame, and any layer you skip gets guessed by the model — usually badly.

Subject — who or what is in frame, with specifics (age, clothing, material, colour).
Action — what it does, one motion per clip; do not pack three actions into five seconds.
Camera — shot type and movement: static wide shot, slow dolly in, handheld tracking, top-down.
Lighting — time of day and character: soft morning light, harsh noon sun, neon night.
Style — aesthetic and medium: 35mm film, anime cel, documentary, claymation.

Order matters — lead with subject and action, because they define what the model animates. Style and lighting at the end act like a filter laid over the scene. Avoid negations (“no crowd in the background”) — video models parse them poorly and often give you exactly what you forbade. Instead, describe positively what should be there: “an empty street at dawn”.
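
One way to keep the five layers and their order honest is to assemble the prompt from named parts instead of typing it freehand. This is a minimal sketch, not anything Sora requires: build_prompt, find_negations and the list of negation words are all hypothetical helpers, and the negation check is deliberately crude.

```python
import re

# Layer order from the text: subject and action first, style last.
LAYER_ORDER = ("subject", "action", "camera", "lighting", "style")

# A rough, non-exhaustive list of negation words (an assumption).
NEGATION_PATTERN = re.compile(r"\b(no|not|without|never|don't|avoid)\b", re.IGNORECASE)


def build_prompt(layers: dict) -> str:
    """Join the five layers in a fixed order; refuse to guess a missing one."""
    missing = [name for name in LAYER_ORDER if not layers.get(name, "").strip()]
    if missing:
        raise ValueError("missing layers: " + ", ".join(missing))
    return ", ".join(layers[name].strip() for name in LAYER_ORDER)


def find_negations(prompt: str) -> list:
    """Return negation words found in the prompt so they can be rephrased."""
    return [match.group(0).lower() for match in NEGATION_PATTERN.finditer(prompt)]


prompt = build_prompt({
    "subject": "a ceramic mug with steaming coffee",
    "action": "steam rising slowly",
    "camera": "slow dolly in",
    "lighting": "soft morning light from a window",
    "style": "35mm, shallow depth of field",
})
print(prompt)
print(find_negations("no crowd in the background"))  # flags "no"
print(find_negations("an empty street at dawn"))     # nothing to flag
```

A flagged word is a prompt to rewrite the phrase positively, not proof that the model will misread it.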

Clip length and aspect ratios

Sora generates short clips, roughly from a few seconds to a dozen or so per generation, depending on plan and resolution. It is not a tool for shooting one three-minute take; it is a generator of building blocks you later assemble in an editor.

16:9 — YouTube, presentations, “cinematic” landscape shots.
9:16 — Reels, TikTok, Shorts; compose the action in the centre and upper part of the frame.
1:1 — feed, thumbnails, when you do not yet know the target format.

The shorter the clip, the lower the chance of morphing and physics drift, because the model has fewer frames to get wrong. If you need a longer scene, generate several short shots of the same situation and cut them together — it comes out steadier than one long generation.
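
Splitting a longer scene into short shots is simple arithmetic, sketched below. The function plan_shots and the target labels in ASPECT_BY_TARGET are hypothetical; the ratios come from the list above, and the maximum shot length is a number you choose, not a Sora limit.

```python
# Aspect ratio per publishing target, following the list above.
ASPECT_BY_TARGET = {
    "youtube": "16:9",
    "reels": "9:16",
    "tiktok": "9:16",
    "shorts": "9:16",
    "feed": "1:1",
}


def plan_shots(total_seconds: int, max_shot_seconds: int, target: str) -> list:
    """Split a scene into the fewest shots that each fit under the chosen maximum."""
    if total_seconds <= 0 or max_shot_seconds <= 0:
        raise ValueError("durations must be positive")
    aspect = ASPECT_BY_TARGET[target.lower()]
    # Ceiling division: the smallest number of shots that covers the scene.
    count = (total_seconds + max_shot_seconds - 1) // max_shot_seconds
    # Spread the seconds evenly; the first `extra` shots get one more second.
    base, extra = divmod(total_seconds, count)
    return [
        {"shot": index + 1, "seconds": base + 1 if index < extra else base, "aspect": aspect}
        for index in range(count)
    ]


# A 17-second scene with shots capped at 5 seconds, for a landscape video.
for shot in plan_shots(17, 5, "youtube"):
    print(shot)
```

Even shot lengths are a design choice here: they keep every generation equally short instead of leaving one long shot and one fragment.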

Storyboard and remix

Two features separate shooting “blind” from directing. Storyboard lets you lay a clip out as points in time and assign each its own description: the subject walks, then stops, then turns. That gives control over the sequence of events that a single prompt cannot.

Remix takes an existing clip and changes one parameter while keeping the rest: the same scene but in rain; the same character but in different clothing. It is the cheapest way to iterate, because you do not start from zero. A workflow that works: one strong generation as the base, then three or four remixes changing one thing at a time. Changing many things at once means you never learn what improved the result.
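
The one-change-at-a-time rule is easy to break by accident, so it may be worth checking mechanically. The sketch below works on your own notes, not on Sora itself: make_remix and changed_layers are hypothetical helpers that treat a prompt as a dictionary of five layers.

```python
LAYERS = ("subject", "action", "camera", "lighting", "style")


def make_remix(base: dict, layer: str, new_value: str) -> dict:
    """Copy the base prompt and change exactly one layer."""
    if layer not in LAYERS:
        raise ValueError("unknown layer: " + layer)
    variant = dict(base)
    variant[layer] = new_value
    return variant


def changed_layers(base: dict, variant: dict) -> list:
    """List the layers that differ between two prompt descriptions."""
    return [name for name in LAYERS if base.get(name) != variant.get(name)]


base = {
    "subject": "a runner in a red jacket",
    "action": "running along the waterfront",
    "camera": "handheld tracking from the side",
    "lighting": "golden hour",
    "style": "documentary, light grain",
}

# Three remixes of the same base, each touching a single layer.
remixes = [
    make_remix(base, "lighting", "overcast, light rain"),
    make_remix(base, "subject", "a runner in a yellow raincoat"),
    make_remix(base, "camera", "static wide shot"),
]
for remix in remixes:
    print(changed_layers(base, remix))
```

If changed_layers ever returns more than one name, the comparison against the base no longer tells you which change helped.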

Common failure modes and how to mitigate

Video models have a repeatable set of artifacts. Knowing them sidesteps most of the frustration.

Morphing — objects smoothly change shape. Shorten the clip, simplify the scene, reduce the number of moving elements.
Physics glitches — objects phasing through each other, odd gravity. Pick simple, natural motions; avoid complex collisions and crowds.
Text artifacts — captions and logos “dance” and turn illegible. Do not count on correct text in the generation; add captions in the edit.
Inconsistent hands and faces — especially in motion. Keep the subject in a medium shot and avoid fast hand gestures in the foreground.
Style drift — the aesthetic shifts mid-clip. Shorten the clip and use one dominant style description instead of three contradictory ones.

The general rule: every artifact worsens with clip length and scene complexity. The cheapest mitigation is less — fewer seconds, less motion, fewer elements.

Iteration workflow

Do not treat a generation like a lottery. Treat it like an experiment with one variable. A simple loop works well:

Write the prompt in five layers and generate a base version.
Judge the single weakest thing (motion? lighting? subject?).
Change only that one element — via prompt or remix.
Compare against the base; keep the better, discard the worse.
Repeat until you have an edit-ready candidate, not merely a “good enough” one.
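
The loop above can be written down as a small experiment log. This is a sketch under clear assumptions: the function iterate is hypothetical, it never calls Sora, and the score argument stands in for your own rating after watching each generated clip.

```python
from typing import Callable


def iterate(base: dict, changes: list, score: Callable[[dict], float]) -> tuple:
    """Try one change at a time and keep a candidate only if it beats the current best."""
    best = dict(base)
    best_score = score(best)
    log = []
    for layer, new_value in changes:
        candidate = dict(best)
        candidate[layer] = new_value  # exactly one variable changes per step
        candidate_score = score(candidate)
        kept = candidate_score > best_score
        log.append({"layer": layer, "value": new_value, "score": candidate_score, "kept": kept})
        if kept:
            best, best_score = candidate, candidate_score
    return best, log


def demo_score(prompt: dict) -> float:
    """Stand-in for a human rating; in real use you would enter the score by hand."""
    points = 0
    if prompt["camera"] == "static wide shot":
        points += 2
    if prompt["lighting"] == "golden hour":
        points += 1
    return points


base = {"camera": "handheld tracking from the side", "lighting": "golden hour"}
changes = [("lighting", "harsh noon sun"), ("camera", "static wide shot")]
best, log = iterate(base, changes, demo_score)
print(best)
for entry in log:
    print(entry)
```

The log is the useful part: it records which single change was tried, how it rated, and whether it replaced the base.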

Save the prompts that worked — that is your private library, worth more than any guide. After two weeks you will have a set of proven shot templates you only need to tweak in one place.
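
A private prompt library can be as simple as a JSON file. The sketch below assumes a file layout and helper names (load_library, save_prompt, tweak) invented for illustration; it uses a temporary directory so it can run anywhere, where real use would point at a permanent file.

```python
import json
import tempfile
from pathlib import Path


def load_library(path: Path) -> dict:
    """Read the library, or start an empty one if the file does not exist yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_prompt(path: Path, name: str, layers: dict, note: str = "") -> None:
    """Store a proven prompt under a short name, with a note on why it worked."""
    library = load_library(path)
    library[name] = {"layers": layers, "note": note}
    path.write_text(json.dumps(library, indent=2, ensure_ascii=False), encoding="utf-8")


def tweak(path: Path, name: str, layer: str, new_value: str) -> dict:
    """Load a saved template and change one layer, leaving the stored copy untouched."""
    layers = dict(load_library(path)[name]["layers"])
    layers[layer] = new_value
    return layers


with tempfile.TemporaryDirectory() as tmp:
    library_path = Path(tmp) / "prompts.json"
    save_prompt(
        library_path,
        "ink-drop",
        {
            "subject": "a drop of ink in water",
            "action": "dissolving into tendrils",
            "camera": "macro, top-down",
            "lighting": "even, studio",
            "style": "slow motion, clean background",
        },
        note="steady at short clip lengths",
    )
    variant = tweak(library_path, "ink-drop", "lighting", "soft morning light")
    reloaded = load_library(library_path)

print(variant["lighting"])
print(reloaded["ink-drop"]["layers"]["lighting"])
```

Keeping a note next to each entry matters as much as the prompt: it records which weakness that template solved.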

Example prompt structures

Four skeletons you can fill with your own details. These are not paste-ready scripts — they are patterns of order and level of specificity.

Product on a table: subject (a ceramic mug with steaming coffee), action (steam rising slowly), camera (slow dolly in), lighting (soft morning light from a window), style (35mm, shallow depth of field).
Character in motion: subject (a runner in a red jacket), action (running along the waterfront), camera (handheld tracking from the side), lighting (golden hour), style (documentary, light grain).
Mood of a place: subject (an empty old-town alley), action (light drizzle, reflections on the cobblestones), camera (static wide shot), lighting (neon at night), style (cinematic, high contrast).
Abstract shot: subject (a drop of ink in water), action (dissolving into tendrils), camera (macro, top-down), lighting (even, studio), style (slow motion, clean background).

Sora vs Runway vs Kling

None of these tools wins at everything. You pick per shot, not once and for all.

Sora wins on scene coherence, natural motion and understanding complex, narrative descriptions — strong on “cinematic” mood shots.
Runway wins on precise control (motion brush, masks, editing existing video) and a fast, iterative workflow for editors.
Kling often wins on longer, smooth character motion and tends to be more generous with free generations, which makes it good for volume testing.

In practice many creators generate variants of the same scene in two tools and pick the better shot. That is not waste — it is cheaper than ten fixes in a single tool that happens to struggle with that type of motion.

TL;DR

Start on the cheaper plan with a week of testing. Build the prompt in five layers: subject, action, camera, lighting, style — positively, no negations. Generate short clips and assemble them in the edit instead of fighting for one long take. Iterate one variable at a time, use remix, save the prompts that work. Add text and logos outside the generation. Reach for Sora on coherent, atmospheric shots; Runway for control and editing; Kling for long motion and cheap tests.