Cognitive theory: viewers process image and sound simultaneously on separate neural channels. It explains why strong visuals can override weak dialogue.
Sensory impressions are processed simultaneously. That is the core principle, and it explains why a three-second image can convey more than a minute of dialogue. The viewer does not absorb picture and sound one after the other, but at the same time, through separate neurological channels. Those who understand this on set spare themselves problems in post-production and shoot more efficiently.
In practice, this means that strong visual composition (lighting, depth of field, color dramaturgy) can compensate for weak text. You know this from experience: an actor stands in the right light, in the right position in the frame, and suddenly the scene works, even though the dialogue is interchangeable. The eye is working hard; the brain is busy processing spatial context, body language, and visual tension. The soundtrack can then be minimal, or even work against the image, without the viewer perceiving it as disruptive. Think of thriller scenes: the sound is often pared back, and the visuals carry the entire emotional load.
The reverse also works: strong sound (voice-over, music, atmosphere) can carry a weak or even static image. Anyone who has shot a scene with poor lighting but great dialogue has noticed that viewers accept it, because attention is distributed across both channels. This doesn't mean you can shoot carelessly. It means you decide strategically whether the image or the sound should carry the emphasis.
This matters in the edit: a long take can be sustained if the sound is interesting. Conversely, a cut can be abrupt if the image is dramatic enough. Many young editors don't understand this and think they have to cut every few seconds. Those who keep the Dual-Capacity Model in mind edit more deliberately, guided not by feel but by the viewer's cognitive load.