A first look at AWS Media Connect and RIST

Previously on random mutterings… as well as streaming with a deliberate 20% loss to an OBD, I’m using RIST to stream to AWS Media connect, same loss profile, same 1 second buffer.

OBD can cope with 10 million packets and no loss. Can AWS Media connect?

AWS statistics are lacking – they are a graph, hiding numbers away, and they only update every minute! I prefer working at a millisecond resolution in streams of text, but we should be able to salvage something from this.

There’s something seriously wrong here, hundreds, if not thousands, of packets being dropped every minute. That’s 10-20 a second.

Am I really dropping packets that much over the internet to AWS? Time to turn off the failures. Lets just leave the ISP circuit to AWS at the only output, and monitor RTP stream leaving the encoder

OK, there’s a few retransmits then, with packet 13959 being resent between 13964 and 65, then 15941 slotting in between 15948 and 49. AWS is far happier with their graph.

So time to introduce packet loss. 20% as before. There is no reason for AWS to perform worse than the OBD — it’s only a few milliseconds away rather than 83, retransmits tend to be 3 milliseconds after the initial packet, and the natural circuit loss is minimal. Looking at plain simply TCP Dump you can see packet 32565 was sent at 344.5ms, never made it to AWS, and AWS asked for a retransmit, with the request arriving at 347.6ms. The very next packet was a resend of 32565. In theory AWS should be able to cope with 250 resends, meaning that with 20% random packet loss you would never have loss in the entire lifespan of the universe.

So throwing back 20% of packet loss and we see the retransmits. Remember we’re looking at a capture post-loss here.

13202 and 13204 is sent, a retransmit is requested (arriving 18:58:36.375312) for 13203, which is successfuly sent between packets 13209 and 13210. 05 and 06 are missing too, they are requested at 18:58:36.376396 and delivered. Clearly AWS is sending requests for retransmits, and OBE is sending those retransmits.

The tricky bit now is to find a request for a retransmit that isn’t send (in theory 1 in 5 retransmits will fail to be sent due to the 20% packet loss). I’ll capture about 20 seconds of this, then prune the log files to capture RTP packet numbers 10,000 through 50,000.

Here we see we dropped packet 10145 about .2936, retransmit request at .2962 – 2.6ms later, then dispatched within 40us, job done.

The retransmit request is 2.6ms after the expected time. Lets look at packet 10148. That should have been sent at .2951, so I’d expect a retransmit about .2977. A retransmit request came in at .2983, so 600us later, seems reasonable what with the internet, router queues, etc. I suspect that packet 19:02:26.298336 was a request for retransmit of packet 10148. That request was ignored (or rather the packet was dropped by the kernel before our monitoring point). Another request arrived a few us later for a retransmit of packet 10150, and that was dealt with

But what about packet 10148? The next retransmit request (after 10150 had been dealt with) was also for 10148. After all at this stage we’d sent 10149, 50 (on a retransmit), 51, 52, 53, 54, 55, 56, 57 59, 60, 63, 64 and 65. We were missing 10148, 58, 61, and 62. The retransmit for packet 10158 would be due to arrive 2.6ms after .3001 – about .3027. 1ms later than that time at .3037, we had a request for a retransmit, then sent packet 10148 and 10158.

So it looks like AWS is asking for retransmits of missing retransmits, that’s good. But that doesn’t explain this:

Why aren’t these packets being recovered!?

Taking the c. 40,000 RTP packets that should have been sent between 19:02.26 and 19:02:46 we can see how many were actually sent. grep c33 /tmp/rtp.log|sed -e ‘s/.*c33 *//’ -e ‘s/ .*//’ does the job for that, producing a list of the sequence numbers 40,023 RTP packets transmitted. Piping into sort -n|uniq|wc -l produces a list of 39,998 unique packet numbers that were sent, including packet 9997 (a retransmit), 50,000, and almost every packets 10000 through 50000. The missing packets were all at the end, 49986, 49989, 49995 and 49996 (all of which would have been retransmitted shortly after).

While we’re at it we can look at retransmits caused by loss on the internet, not the artificial loss being introduced by tc. grep c33 /tmp/rtp.log|sed -e ‘s/.*c33 *//’ -e ‘s/ .*//’|sort -n|uniq -c|sort -n|grep -v ” 1 “ does that for us, showing up 25 packets in the 10,000 that were transmitted by us twice. That’s a reasonable number, 0.25% packet loss on this hop.

Linux plumbing, it involves pipes

So we know every RTP packet in this 20 second period was sent to AWS, we know AWS was asking for retransmits of packets that failed twice (and perhaps more), we know that any downstream losses were being retransmitted. We know AWS is happy if we turn off the forced 20% loss. Why does AWS have such bouncy “Source not recovered packets” graphs? It’s not constant (despite pretty much constant 20% loss) – the numbers reported at 18:02 GMT was 153, and at 18:03 GMT was 30, but at 18:10 it was 1505, and at 18:30 it was zero!

A wider view

Now we only had a 20 second capture in a “low loss” time. I wish AWS had a half-decent resolution on its logs, as 1 minute is awful. 10 seconds maybe, but ideally I’d be after 1 second.

sudo tcpdump -i any -nn host -T rtp | head -50000 > /tmp/rtp.log; head -1 /tmp/rtp.log; grep c33 /tmp/rtp.log|sed -e ‘s/.*c33 *//’ -e ‘s/ .*//’|sort -n|uniq |sed -e ‘s/…$/xxx/’|uniq -c; tail -1 /tmp/rtp.log;

Generates an interesting output. You should see 1000 for most sequence number blocks above 1000. One block at the start will be <1000 as it starts partway through, another at the end will be <1000 for the same reason, stuff under 1000 will be different, and 65xxx will only go upto 536 as RTP sequence numbers top out at 65535.

But you should see a solid list of about 40 blocks of 1,000. In the view below the packets started just after 30,200, ran to 65535, then wrapped round to about 5200. But what on earth happened between 42,000 and 44,000? There should have been 2,000 RTP packets sent, even if it took 5 or 6 attempts – there was plenty of time for retransmits to be in the capture

Zooming in on 42,xxx (grep c33 /tmp/rtp.log|sed -e ‘s/.*c33 *//’ -e ‘s/ .*//’|sort -n|uniq |sed -e ‘s/.$/x/’|grep ^42|uniq -c) shows up some major flakiness. What happened?

All went Pete Tong about 42260

The problem started at 4226x, where 2 of the 10 packets were never sent even after seconds worth of retransmits, but at say 4269x only 5 packets got through. Looking at the capture, we can see a fairly standard packets going through, 0,1,3,5,6,7,8,9. This is post 20% loss remember, so nothing unusual in losing 2 of the 10 on the first go. What we should be seeing though are retransmit requests, and we do. What we don’t see is the packets actually retransmitted

Rememebr that OBE into OBD runs absolutely fine with no losses and plenty of retransmits. The order is 42260, 42254(retransmit), 42261, 42256(r), 42263,65. But look at the source port. Retransmit requests come in from source port 2001 to the ephemeral 33040.

The packets coming back from AWS in this instance are from port 2000 (the RIST port, not the RIST control port). No wonder OBE is ignoring them.

From 20:19:45 when this capture starts, until 20:19:51, all traffic from AWS to the encoder is from port 2001 to port 33020. That’s good. The final message just after we send packet 42261, and it’s a retransmit of 42256. It’s then radio silence on that port until 20:19:53, when the requests for retransmits come back in and start being handled

So why does AWS suddenly switch to sending packets for retransmits from the wrong port? Is it the wrong port? What does the RIST spec say?

Well the spec is at and is an easy read.

1. RIST senders shall transmit the RTP media packets to the configured IP address of the RIST receiver and a user-selected UDP destination port P, where P is an even number between 2 and 65534. RIST receivers shall listen on UDP port P for media. This transmission is unidirectional, from sender to receiver. The sender may choose any arbitrary source port M for the RTP flow.

2. RIST senders shall periodically transmit the compound RTCP packets specified in section 5.2.1to the configured IP address of the RIST receiver and UDP port P+1. The sender may choose any arbitrary source port R for the RTCP packets. RIST senders shall listen on port R for RTCP packets from the RIST receiver.

3.RIST receivers shall listen on UDP port P+1 for RTCP packets from the sender. These packets will have a source IP address S and a source port R’ (in the absence of NAT devices between the sender and receiver, R and R’ will be the same). RIST receivers shall send the RTCP packets they generate to IP address S and destination UDP port R’, with a source UDP port of P+1. A receiver shall use S and R’ from the last valid RTCP packet it has received from the sender.

4.RIST senders may offer the user the ability to manually configure source ports M and R.

Section 4’s fairly meaningless with many source-NAT devices scramble the source port.

That suggests to me that the receiver should not be responding with any packets to the sender on the even number port.

Update Aug 6th

AWS have confirmed there’s a bug there, so at least I’m not going mad.


A quick look at RIST retransmits

In an effort to understand RIST retransmits a little better, I setup a single RIST stream of 20mbit (2,000 packets per second), with a 1 second RIST buffer – about 10 times round trip time, from an OBE into and OBD, and looked at the RTP traffic on the decoder. The ping time from the encoder and decoder is about 83ms.

I then applied a massive 20% packet loss on the encoder with sudo tc qdisc del dev eno1 root; sudo tc qdisc add dev eno1 root netem loss 20%

On the decoder, run up “tcpdump -nn -T rtp” and look at the sequence numbers coming in. Take 65,000 lines (about 30 seconds) of output.

These are the packets coming in. the majority are in the right order, packet 0, 2, 4, 6, 7, 8, 9, 10, 12, 13, 14, 15, 17 etc. But some are missing. Manually counting the first 50 packets shows 9 missing, which is about expected from 20% random loss.

The decoder asks for these packets again, and here’s some of the retransmits at 183-76 ms later — 108 ms.
That suggests a timeline along this:

  • 14:31:11.75ms: expected arrival of packet 1,2
  • 76ms: expected arrival of packet 3,4
  • 92ms: Packet 1 is MIA, ask for retransmit
  • 100ms: Packet 3,5,11,16 is MIA, ask for a retransmit. Expected arrival of packet 49 and 50 at this stage.
  • 134ms: Retransmit request arrives for packet 1 at encoder, packet sent again
  • 142ms: Retransmit request arrives for packet 3,4,11,16 at encoder, packet sent again
  • 175ms: Packet 1 arrives on decoder
  • 183ms: Packet 3,4,11,16 arrives
  • 191ms: Packet 31 resend arrives
  • 199ms: Packet 49 resend arrives
  • 297ms: Packet 33 resend arrives
  • 413ms: packet 41 resent arrives

So it takes 25ms to realise the packet is missing, and ask for the retransmit. Now this might only be 20ms, and there’s a 5ms delay at the far end before sending.

So do we get all our retransmits? In this block we’re missing packets 1,3,5,11,16,31,33,41,49.

Packed 1 arrived at 175ms. Packet 3 arrives at 183ms, followed by 5, 11 and 16 all in the same millisecond. Packet 31 arrives at 191ms, Packet 49 at 199ms. But where are packet 33 and 41?

Well I’d expect of those 9 retranmits, about 20% would be lost, that’s packet 33 and 41 this time.

Packet 49 turned up at 199ms, on it’s first retransmit, with the retransmission sent at 116ms, 16ms after its expected arrival.

Packet 33 turned up 297ms later. I would guess that the first retransmit failed.

Packet 33 story:

  • Expected 91.5ms, between packet 32 and 34, 1ms after packet 31
  • Both packets were re-requested about 110ms, with the request arriving on the encoder about 152ms.
  • Encoder sent both packets out, but packet 33 was dropped. Packet 31 arrived on the first resend attempt at 191.7ms – 100 ms after it should have arrived (with an 83ms rtt and 15-20ms of overhead/buffer)
  • Packet 33 arrived at t=297ms, another 105ms after packet 31. That must have been a second retransmit attempt.

As such with RIST (on OBEs) it seems you need minimum of RTT+20ms of buffer to recover a single packet loss, 2xRTT+40ms for one retransmit loss, and so on.

With 20% loss how many retransmits are we expecting? Well at 20mbit that’s 2000 packets a second, or 7,200,000 packets an hour.

1.44 million will be lost on first transmit and need a retransmit (20% of 7.2M), and 288k will be lost on first and second transmit (20% of 1.44M).

57,600 will fail even with 3 transmits. 2304 with 5 transmits, 18 will fail with 9 retransmits, so you need more than 10 retransmits (RTT+15ms) to average a completely clean line over an hour, 12 over the course of a day, and 16 for the course of a year.

On this particular experiment I should theoretically be able to cope with about 10 retransmits. That would mean about either 1 or 5 unrecovered packets in 10 million. That’s on average of course.

3 in 28 million, about 1 in 10 million, so seems like I’m getting 10 resends attempts.

NB: some of these numbers look very familiar to those of us from the 90s (1.44M floppy, 28.8k baud).


When FEC can’t cope

When you are streaming low latency video (<1 second glass-glass), FEC is a great way of coping with a little packet loss along the way.

In theory Pro-MPEG FEC can cope with upto 25% packet loss, but in practice it can struggle with 0.001% loss.

FEC works in a similar way to RAID, it sends parity data along with the real data, increasing the amount of data sent, but meaning you can recover if a packet goes AWOL.

We’ve found from much observation over the years that we sometimes suffer internet problems where a circuit stops transmitting for 10, 20, even 50ms – perhaps as routes or interfaces change in an ISP you have no control over. A 10mbit RTP stream sends 1 packet every millisecond, so this can mean 50 lost packets in a row, and the standard FEC matrix sizes can’t cope with more than 20 (even assumes all the parity packets get through).

If your encoder and decoder are relatively close to each other, nowadays you’d use RIST or SRT to cope with packet loss – you lose 100 packets, but you then ask for retransmits and get them back, assuming you have a high enough receive buffer for your round trip.

If you’re trying to do an interview down the line from Austraila and your studio is 350ms away in Europe, these protocols introduce unacceptable delay though — 5xrtt on top of the normal encode process would be nearly a 2 second delay, far beyond what’s acceptable.

OBEs have a “Duplicate Stream” option. It transmits every packet again, from the same source IP, to the same destination IP, just a little later. By setting a duplicate stream of 100ms, and a receive buffer of say 120ms, the stream will be immune to outages of . The nice thing about this is that it also works into other RTP based decoders which have a configurable receiver buffer, as they toss the duplicate packet, and just treat it as out-of-order.


Cisco Firepower, Traceroute, and “Unsupported CLI”

We have a couple of Cisco firepower firewalls inside our network, protecting certain circuits that arrive from buildings we don’t control

One of the many major annoyances with them is their invisibility to traceroute. Paul Stewart’s blog post about enabling them (forcing them to decrement the TTL as packets travel through them) seems just what the doctor ordered:

Sadly our firepower management server has other opinions on the matter

While perusing some of the many, many, posts of complaint about Firepowers, I came across a similar problem with http redirects. It seems that the cisco firepower gui blacklists certain commands, but the workaround of using htt redirect outside 80 ratner than http redirect outside 80 was mentioned.

Brilliant. Sure enough adding the flexconfng command of

policy-map global_policy
class class-default
set connectio decrement-ttl

Does the same job, but avoids the GUI blocking the “set connection” line.

There are many problems with these terrible devices, but at least now my traceroutes are working.