Hacker News

Um ... from the paper ...

"6.1 Limitations of Salsify

No audio. Salsify does not encode or transmit audio."

Claiming that you beat a bunch of codecs that have synchronized audio (even though they disable it) is kind of misleading ...



Co-author here. Totally reasonable reaction, and we've heard this when the paper was posted elsewhere (e.g. on Reddit), but have not heard it from specialists, and honestly we suspect it's probably a red herring. Salsify's gains on the "delay" metric are mostly coming from two things: (1) the way that it restrains its compressed video to avoid building up in-network queues (which audio must also transit) and provoking packet loss, and (2) the way that it recovers more quickly from network glitches (check out the video).

If you wanted to add audio to Salsify, you would want to control a receiver-side video and audio buffer to reduce audio gaps and keep a/v in sync during periods of happy network, but this is unlikely to affect the system's ability to recover more quickly from glitches or to avoid building up in-network queues that delay audio and video alike. If you watch the video (or see Figure 6(f), Figure 7, and Figure 8), I don't think there's much reason to think audio can justify what the Chrome/webrtc.org codebase is doing -- WebRTC's frame delays are distributed over a broad range (so it's not like they're synchronized to some fixed timebase either) and are very high, especially in the seconds after a network glitch.
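To make the receiver-side buffering idea concrete, here is a toy sketch of a de-jitter buffer (this is not from the Salsify paper; the class and parameter names are made up for illustration). Each frame is held until a playout deadline derived from its capture time plus a fixed target delay, and a "gap" is counted whenever the buffer is dry at playout time -- the kind of event an audio evaluation would have to measure:

```python
import heapq

class DejitterBuffer:
    """Toy receiver-side de-jitter buffer (illustrative only).

    Holds each frame until its playout deadline (capture timestamp plus a
    fixed target delay) and counts a gap whenever the buffer is dry at a
    playout instant, e.g. after a network glitch."""

    def __init__(self, target_delay_ms=100):
        self.target_delay_ms = target_delay_ms
        self.heap = []   # min-heap of (playout_deadline_ms, frame)
        self.gaps = 0    # playout instants where no frame was ready

    def on_arrival(self, capture_ts_ms, frame):
        # Schedule the frame for capture time + target delay.
        heapq.heappush(self.heap, (capture_ts_ms + self.target_delay_ms, frame))

    def on_tick(self, now_ms):
        # Called at each playout instant; returns the frame due now,
        # or None (and records a gap) if the buffer went dry.
        if self.heap and self.heap[0][0] <= now_ms:
            return heapq.heappop(self.heap)[1]
        self.gaps += 1
        return None

# Example: two frames arrive on time, then the network glitches.
buf = DejitterBuffer(target_delay_ms=100)
buf.on_arrival(0, "f0")    # due at t=100
buf.on_arrival(40, "f1")   # due at t=140
buf.on_tick(100)           # -> "f0"
buf.on_tick(140)           # -> "f1"
buf.on_tick(180)           # -> None; buffer dry, gap recorded
```

A real system would adapt the target delay to observed jitter rather than fixing it, which is exactly the compromise between delay and gap frequency described above.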

More to the point for our academic work, it would have been trivial to add shitty audio that made no difference to the metrics. The hard-but-necessary part is designing an evaluation metric to assess (1) the quality of the reconstructed audio (including how many gaps/rebuffering delays there were when the de-jitter buffer went dry), (2) the delay of the reconstructed audio, keeping in mind this is not constant over time, and (3) the quality of the audio/video synchronization, which also will not be constant over time. Then measuring that in a fair way across Skype/Facetime/Hangouts/WebRTC/Salsify, and then trying to decide which compromise on those three axes is desirable. Somebody should do all that work at some point, but it's a major piece of work to bite off and pretty far from anything we've done so far.


Opus, with its low delay and solid rate control, would seem to be the natural pair here. But I agree audio is likely not the real problem in this space.

Any reason you didn’t choose to start from VP9? Is the encoder still too slow overall?



