Some time ago we at Igalia embarked on the journey to ship a GStreamer-powered WebRTC backend. It is a long journey and it is not over, but we have made some progress. This post is the first of a series providing some insights into the challenges we are facing and our plans for the next release cycle(s).
Most web engines nowadays bundle a version of LibWebRTC, which is indeed a pragmatic approach. WebRTC is a huge spec, spanning many protocols, RFCs and codecs, and LibWebRTC is in fact a multimedia framework of its own, with a very big code-base. I still remember the surprised face of Emilio at the 2022 WebEngines conference when I told him we had unusual plans regarding WebRTC support in the GStreamer WebKit ports. There are several reasons for this plan, explained in the WPE FAQ. We worked on a LibWebRTC backend for the WebKit GStreamer ports, which my colleague Thibault Saunier blogged about, but unfortunately that backend has remained disabled by default and is not shipped in the tarballs, for those same reasons.
The GStreamer project nowadays provides a library and a plugin allowing applications to interact with third-party WebRTC actors. This is, in my opinion, a paradigm shift, because it enables new ways for the so-called Web and traditional native applications to interoperate. Since the GstWebRTC announcement back in late 2017, I’ve been experimenting with the idea of shipping an alternative to LibWebRTC in WebKitGTK and WPE. The initial GstWebRTC WebKit backend was merged upstream on March 18, 2022.
As you might already know, before any audio/video call your browser might ask for permission to access your webcam and microphone, and during the call you can now even share your screen. From the WebKitGTK/WPE perspective the procedure is of course the same. Let’s dive in.
WebCam/Microphone capture
Back in 2018, for the LibWebRTC backend, Thibault added support for GStreamer-powered media capture to WebKit, meaning that capture devices such as microphones and webcams became accessible from WebKit applications through the getUserMedia spec. Under the hood, a GStreamer source element is created using the GstDevice API. This implementation is now re-used for the GstWebRTC backend; it works fine and still has room for improvement, but that’s a topic for a follow-up post.
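For instance, a page can list the capture devices the backend exposes through the standard enumerateDevices() API. Here is a minimal sketch (note that device labels are only populated once capture permission has been granted):

navigator.mediaDevices.enumerateDevices().then((devices) => {
  for (const device of devices) {
    // On the GStreamer ports each entry maps to a GstDevice under the hood.
    if (device.kind === 'audioinput' || device.kind === 'videoinput')
      console.log(`${device.kind}: ${device.label}`);
  }
});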
A MediaStream can be rendered in an <audio> or <video> element through a custom GStreamer source element that we also provide in WebKit. This is all internally wired up so that the following JS code will trigger the WebView into natively capturing and rendering a WebCam device using GStreamer:
<html>
  <head>
    <script>
      navigator.mediaDevices.getUserMedia({ video: true, audio: false }).then((mediaStream) => {
        const video = document.querySelector('video');
        video.srcObject = mediaStream;
        video.onloadedmetadata = () => {
          video.play();
        };
      });
    </script>
  </head>
  <body>
    <video></video>
  </body>
</html>
When this web page is rendered, and after the user has granted access to the capture devices, the GStreamer backends will create not one but two pipelines.
Capture pipeline: pipewiresrc → videoscale → videoconvert → videorate → valve → appsink

Playback pipeline: mediastreamsrc ( appsrc → src ghost pad ) → playbin3 ( decodebin3 → webkitglsink )
The first pipeline routes video frames from the capture device using pipewiresrc to an appsink. From the appsink, our capturer, leveraging the Observer design pattern, notifies its observers. In this case there is only one observer, a GStreamer source element internal to WebKit called mediastreamsrc. The playback pipeline shown above is heavily simplified; in reality more elements are involved, but what matters most is that thanks to the flexibility of GStreamer we can leverage the existing MediaPlayer backend, which we at Igalia have been maintaining for more than 10 years, to render MediaStreams. All we needed was a custom source element; the rest of our MediaPlayer didn’t need many changes to support this use-case.
One notable change we made since the initial implementation, though, is that for us a MediaStream can be raw, encoded, or even encapsulated in an RTP payload. Depending on which component is going to render the MediaStream, we have enough flexibility to allow zero-copy in most scenarios. In the example above, the stream will typically be raw from source to renderer. However, some webcams can provide encoded streams; WPE and WebKitGTK will be able to leverage these internally and, in some cases, allow direct streaming from the hardware device to the outgoing PeerConnection without an extra encoding step.
Desktop capture
There is another JS API, called getDisplayMedia, that allows capturing from your screen or a window, and yes, we also support it! Thanks to the ground-breaking progress of the Linux desktop in recent years, such as PipeWire and xdg-desktop-portal, we can now stream your favorite desktop environment over WebRTC. Under the hood, when the WebView is granted access to the desktop capture through the portal, our backend creates a pipewiresrc GStreamer element, configured to source from the file descriptor provided by the portal, and we have a healthy raw video stream.
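From the page’s point of view, it mirrors getUserMedia. A minimal sketch, assuming a <video> element is already present on the page:

navigator.mediaDevices.getDisplayMedia({ video: true }).then((stream) => {
  // The portal dialog lets the user pick a screen or window to share.
  const video = document.querySelector('video');
  video.srcObject = stream;
  video.onloadedmetadata = () => {
    video.play();
  };
});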
Here’s a demo:
WebAudio capture
What’s more, you can also create a MediaStream from a WebAudio node. On the backend side, the GStreamerMediaStreamAudioSource fills GstBuffers from the audio bus channels and notifies third parties internally observing the MediaStream, such as outgoing media sources, or simply an <audio> element that was configured to source from the given MediaStream. I have no demo for this, you’ll have to take my word for it.
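For the curious, here is a minimal JS sketch of that use-case: an oscillator routed into a MediaStreamAudioDestinationNode, whose stream then feeds an <audio> element assumed to be on the page:

const audioContext = new AudioContext();
const oscillator = audioContext.createOscillator();
// createMediaStreamDestination() yields a node exposing a MediaStream.
const destination = audioContext.createMediaStreamDestination();
oscillator.connect(destination);
oscillator.start();

const audio = document.querySelector('audio');
audio.srcObject = destination.stream;
audio.play();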
Canvas capture
But wait, there is more. Did I hear canvas? Yes, we can feed your favorite <canvas> into a MediaStream. The JS API is called captureStream; its code is actually cross-platform but defers to the HTMLCanvasElement::toVideoFrame() method, which has a GStreamer implementation. The code is not the most optimal yet, though, due to shortcomings of our current graphics pipeline implementation.
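Usage-wise it is straightforward. A minimal sketch, assuming a <canvas> and a <video> element on the page:

const canvas = document.querySelector('canvas');
const context = canvas.getContext('2d');

// Keep drawing something so there are frames to capture.
function draw(timestamp) {
  context.fillStyle = `hsl(${(timestamp / 20) % 360}, 80%, 50%)`;
  context.fillRect(0, 0, canvas.width, canvas.height);
  requestAnimationFrame(draw);
}
requestAnimationFrame(draw);

// Capture the canvas at 25 frames per second into a MediaStream.
const video = document.querySelector('video');
video.srcObject = canvas.captureStream(25);
video.onloadedmetadata = () => {
  video.play();
};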
Here is a demo of Canvas to WebRTC running in the WebKitGTK MiniBrowser:
Wrap-up
So we’ve got MediaStream support covered. This is only one part of the puzzle, though; we are now facing challenges on the PeerConnection implementation. MediaStreams are cool, but it’s even better when you can share them with your friends on the fancy A/V conferencing websites, and we’re not entirely ready for this yet in WebKitGTK and WPE. For this reason, WebRTC is not yet enabled by default in the upcoming WebKitGTK and WPE 2.40 releases. We’re just not there yet. In the next part of this series I’ll tackle the PeerConnection backend, which we’re working hard on these days, both in WebKit and in GStreamer.
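As a teaser, this is the kind of page-side code the PeerConnection backend has to support. A minimal sketch, with the signaling channel deliberately left out:

const peerConnection = new RTCPeerConnection();
navigator.mediaDevices.getUserMedia({ video: true, audio: true }).then(async (stream) => {
  // Hand each captured track over to the PeerConnection.
  for (const track of stream.getTracks())
    peerConnection.addTrack(track, stream);
  const offer = await peerConnection.createOffer();
  await peerConnection.setLocalDescription(offer);
  // The offer (and ICE candidates) would then be exchanged with the
  // remote peer over a signaling channel of your choice.
});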
Happy hacking, and as always, all my gratitude goes to my fellow Igalia comrades for allowing me to keep working on these domains, and to Metrological for funding some of this work. Is your organization or company interested in leveraging modern WebRTC APIs from WebKitGTK and/or WPE? If so, please get in touch with us to help us speed up the implementation work.