How client-side rendering works
Very experimental feature - expect bugs and breaking changes at any time.
Track progress on GitHub and discuss in the #web-renderer channel on Discord.
The biggest challenge of client-side rendering is that it is not possible to capture the browser viewport.
Only certain HTML elements such as <canvas>, <img>, <video> or <svg> can be captured natively.
Unlike in server-side rendering, where a pixel-perfect screenshot is made, in client-side rendering Remotion places all elements on a canvas based on how it determines they are positioned and appear in the DOM.
For this, Remotion has developed a sophisticated algorithm for calculating the placement of the elements on the canvas.
Of course, we cannot support all web features, so only a specific subset of elements and styles are supported.
Rendering process
Initialization
First, the component is mounted in the DOM in a place where it is not visible to the user.
Simultaneously, an empty canvas is initialized.
Frame capture process
For each frame that needs to be rendered, the renderer uses document.createTreeWalker() to find all elements and text nodes in the DOM. Nodes that have display: none are skipped, along with their children.
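The walk can be sketched with the browser's TreeWalker API. This is an illustrative sketch rather than Remotion's actual code (filterNode and collectRenderableNodes are hypothetical names); note that returning NodeFilter.FILTER_REJECT for an element skips its entire subtree, which is how display: none children are excluded:

```typescript
// Illustrative sketch of the DOM walk. The filter decision is factored
// into a pure function so it can be exercised outside the browser.
// nodeType 1 = element node. FILTER_REJECT (2) skips the node and its
// entire subtree; FILTER_ACCEPT (1) keeps it.
function filterNode(node: {nodeType: number}, display: string | null): number {
  if (node.nodeType === 1 && display === 'none') {
    return 2; // NodeFilter.FILTER_REJECT - also skips all children
  }
  return 1; // NodeFilter.FILTER_ACCEPT
}

// Browser-side usage: collect all elements and text nodes under `root`.
function collectRenderableNodes(root: Element): Node[] {
  const walker = document.createTreeWalker(
    root,
    NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT,
    {
      acceptNode: (node) =>
        filterNode(
          node,
          node instanceof Element ? getComputedStyle(node).display : null,
        ),
    },
  );
  const nodes: Node[] = [];
  for (let n = walker.nextNode(); n; n = walker.nextNode()) {
    nodes.push(n);
  }
  return nodes;
}
```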
For each capturable element, the renderer:
- Goes up the DOM tree and resets all transform CSS properties to none.
- Gets the bounding box using .getBoundingClientRect(), as well as the bounding boxes of the parent elements.
- Adds up the transforms and positions to determine the original placement of the element in the DOM.
- Gets the pixels of the element - for <svg>, <canvas> and <img> elements, those can be captured natively. For text nodes, the layout is reconstructed manually.
- Draws them to the canvas according to the calculated placement.
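Adding up the transforms amounts to composing 2D affine matrices in CSS matrix() order. A minimal sketch of that composition (the names and the Matrix representation are illustrative, not Remotion's internals):

```typescript
// A 2D affine transform in CSS matrix() order: matrix(a, b, c, d, e, f).
type Matrix = [number, number, number, number, number, number];

const IDENTITY: Matrix = [1, 0, 0, 1, 0, 0];

// Compose two transforms: the result applies m2 first, then m1,
// matching how nested CSS transforms combine from the outside in.
function multiply(m1: Matrix, m2: Matrix): Matrix {
  const [a1, b1, c1, d1, e1, f1] = m1;
  const [a2, b2, c2, d2, e2, f2] = m2;
  return [
    a1 * a2 + c1 * b2,
    b1 * a2 + d1 * b2,
    a1 * c2 + c1 * d2,
    b1 * c2 + d1 * d2,
    a1 * e2 + c1 * f2 + e1,
    b1 * e2 + d1 * f2 + f1,
  ];
}

// Fold the transforms collected from the root down to the element into
// a single matrix describing the element's final placement.
function accumulateTransforms(chain: Matrix[]): Matrix {
  return chain.reduce((acc, m) => multiply(acc, m), IDENTITY);
}
```

For example, a translate(10px, 20px) on an ancestor followed by a nested scale(2) composes to [2, 0, 0, 2, 10, 20].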
Audio capture
Audio from mounted <Audio> and <Video> elements is captured, mixed together, and added to the audio track of the video.
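Conceptually, mixing comes down to summing the samples of each track and clamping the result to full scale. A simplified mono sketch, not Remotion's actual audio pipeline:

```typescript
// Simplified mono mixdown: sum the samples of every track, then clamp
// to [-1, 1] so the mix cannot exceed full scale.
function mixAudio(tracks: Float32Array[], length: number): Float32Array {
  const out = new Float32Array(length);
  for (const track of tracks) {
    const n = Math.min(length, track.length);
    for (let i = 0; i < n; i++) {
      out[i] += track[i];
    }
  }
  for (let i = 0; i < length; i++) {
    out[i] = Math.max(-1, Math.min(1, out[i]));
  }
  return out;
}
```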
Encoding
Mediabunny is used to encode the frames and processed audio into a video file.
Capturing pixels
For <svg>, <canvas> and <img> elements, the pixels can be captured natively using well-documented Canvas 2D techniques.
For rendering other types of elements, only a subset of properties are supported such as background, border and border-radius. These styles are drawn to the canvas manually with the Canvas 2D API.
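As an example of drawing these styles manually, a background with border-radius can be painted with the Canvas 2D roundRect() method. CSS also requires radii that would overlap on an edge to be scaled down uniformly; the sketch below includes that clamping (clampRadii and drawBackground are illustrative names, not Remotion's API):

```typescript
// Per CSS, if two radii on the same edge would overlap, all four radii
// are scaled down by the same factor. Radii are [top-left, top-right,
// bottom-right, bottom-left], matching the roundRect() array order.
function clampRadii(radii: number[], width: number, height: number): number[] {
  const [tl, tr, br, bl] = radii;
  const f = Math.min(
    1,
    width / (tl + tr || 1), // top edge
    width / (bl + br || 1), // bottom edge
    height / (tl + bl || 1), // left edge
    height / (tr + br || 1), // right edge
  );
  return radii.map((r) => r * f);
}

// Minimal structural type for the parts of the 2D context we touch.
type Ctx2D = {
  fillStyle: string;
  beginPath(): void;
  roundRect(x: number, y: number, w: number, h: number, radii: number[]): void;
  fill(): void;
};

// Paint a solid background with rounded corners at the calculated placement.
function drawBackground(
  ctx: Ctx2D,
  x: number,
  y: number,
  w: number,
  h: number,
  color: string,
  radii: number[],
): void {
  ctx.fillStyle = color;
  ctx.beginPath();
  ctx.roundRect(x, y, w, h, clampRadii(radii, w, h));
  ctx.fill();
}
```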
Capturing text nodes
For text nodes, more layout calculations need to be made.
A text node does not expose .getBoundingClientRect() directly, but by wrapping it in a <span> element, we can call .getBoundingClientRect() on the span to get the bounding box and resolve the transforms as described above.
Then Intl.Segmenter is used to split the text into words, and each token is again wrapped in a <span>. For each token, .getBoundingClientRect() is called and the tokens are drawn to the canvas.
In the end, the DOM is reset to its original state.
Context isolation
Renders happen in the same browser environment as your app. This means CSS and Tailwind variables will automatically work, but you run the risk of conflicts with the host page.
See Limitations for more details to ensure your code works with client-side rendering.
Contributing
If you are interested in improving the web renderer, for example by adding new styles, see Contributing to client-side rendering.