Previously on random mutterings… as well as streaming with a deliberate 20% loss to an OBD, I’m using RIST to stream to AWS Media connect, same loss profile, same 1 second buffer.
OBD can cope with 10 million packets and no loss. Can AWS Media connect?
AWS statistics are lacking – they are a graph, hiding numbers away, and they only update every minute! I prefer working at a millisecond resolution in streams of text, but we should be able to salvage something from this.
There’s something seriously wrong here, hundreds, if not thousands, of packets being dropped every minute. That’s 10-20 a second.
Am I really dropping packets that much over the internet to AWS? Time to turn off the failures. Lets just leave the ISP circuit to AWS at the only output, and monitor RTP stream leaving the encoder
OK, there’s a few retransmits then, with packet 13959 being resent between 13964 and 65, then 15941 slotting in between 15948 and 49. AWS is far happier with their graph.
So time to introduce packet loss. 20% as before. There is no reason for AWS to perform worse than the OBD — it’s only a few milliseconds away rather than 83, retransmits tend to be 3 milliseconds after the initial packet, and the natural circuit loss is minimal. Looking at plain simply TCP Dump you can see packet 32565 was sent at 344.5ms, never made it to AWS, and AWS asked for a retransmit, with the request arriving at 347.6ms. The very next packet was a resend of 32565. In theory AWS should be able to cope with 250 resends, meaning that with 20% random packet loss you would never have loss in the entire lifespan of the universe.
So throwing back 20% of packet loss and we see the retransmits. Remember we’re looking at a capture post-loss here.
13202 and 13204 is sent, a retransmit is requested (arriving 18:58:36.375312) for 13203, which is successfuly sent between packets 13209 and 13210. 05 and 06 are missing too, they are requested at 18:58:36.376396 and delivered. Clearly AWS is sending requests for retransmits, and OBE is sending those retransmits.
The tricky bit now is to find a request for a retransmit that isn’t send (in theory 1 in 5 retransmits will fail to be sent due to the 20% packet loss). I’ll capture about 20 seconds of this, then prune the log files to capture RTP packet numbers 10,000 through 50,000.
Here we see we dropped packet 10145 about .2936, retransmit request at .2962 – 2.6ms later, then dispatched within 40us, job done.
The retransmit request is 2.6ms after the expected time. Lets look at packet 10148. That should have been sent at .2951, so I’d expect a retransmit about .2977. A retransmit request came in at .2983, so 600us later, seems reasonable what with the internet, router queues, etc. I suspect that packet 19:02:26.298336 was a request for retransmit of packet 10148. That request was ignored (or rather the packet was dropped by the kernel before our monitoring point). Another request arrived a few us later for a retransmit of packet 10150, and that was dealt with
But what about packet 10148? The next retransmit request (after 10150 had been dealt with) was also for 10148. After all at this stage we’d sent 10149, 50 (on a retransmit), 51, 52, 53, 54, 55, 56, 57 59, 60, 63, 64 and 65. We were missing 10148, 58, 61, and 62. The retransmit for packet 10158 would be due to arrive 2.6ms after .3001 – about .3027. 1ms later than that time at .3037, we had a request for a retransmit, then sent packet 10148 and 10158.
So it looks like AWS is asking for retransmits of missing retransmits, that’s good. But that doesn’t explain this:
Taking the c. 40,000 RTP packets that should have been sent between 19:02.26 and 19:02:46 we can see how many were actually sent. grep c33 /tmp/rtp.log|sed -e ‘s/.*c33 *//’ -e ‘s/ .*//’ does the job for that, producing a list of the sequence numbers 40,023 RTP packets transmitted. Piping into sort -n|uniq|wc -l produces a list of 39,998 unique packet numbers that were sent, including packet 9997 (a retransmit), 50,000, and almost every packets 10000 through 50000. The missing packets were all at the end, 49986, 49989, 49995 and 49996 (all of which would have been retransmitted shortly after).
While we’re at it we can look at retransmits caused by loss on the internet, not the artificial loss being introduced by tc. grep c33 /tmp/rtp.log|sed -e ‘s/.*c33 *//’ -e ‘s/ .*//’|sort -n|uniq -c|sort -n|grep -v ” 1 “ does that for us, showing up 25 packets in the 10,000 that were transmitted by us twice. That’s a reasonable number, 0.25% packet loss on this hop.
So we know every RTP packet in this 20 second period was sent to AWS, we know AWS was asking for retransmits of packets that failed twice (and perhaps more), we know that any downstream losses were being retransmitted. We know AWS is happy if we turn off the forced 20% loss. Why does AWS have such bouncy “Source not recovered packets” graphs? It’s not constant (despite pretty much constant 20% loss) – the numbers reported at 18:02 GMT was 153, and at 18:03 GMT was 30, but at 18:10 it was 1505, and at 18:30 it was zero!
A wider view
Now we only had a 20 second capture in a “low loss” time. I wish AWS had a half-decent resolution on its logs, as 1 minute is awful. 10 seconds maybe, but ideally I’d be after 1 second.
sudo tcpdump -i any -nn host 188.8.131.52 -T rtp | head -50000 > /tmp/rtp.log; head -1 /tmp/rtp.log; grep c33 /tmp/rtp.log|sed -e ‘s/.*c33 *//’ -e ‘s/ .*//’|sort -n|uniq |sed -e ‘s/…$/xxx/’|uniq -c; tail -1 /tmp/rtp.log;
Generates an interesting output. You should see 1000 for most sequence number blocks above 1000. One block at the start will be <1000 as it starts partway through, another at the end will be <1000 for the same reason, stuff under 1000 will be different, and 65xxx will only go upto 536 as RTP sequence numbers top out at 65535.
But you should see a solid list of about 40 blocks of 1,000. In the view below the packets started just after 30,200, ran to 65535, then wrapped round to about 5200. But what on earth happened between 42,000 and 44,000? There should have been 2,000 RTP packets sent, even if it took 5 or 6 attempts – there was plenty of time for retransmits to be in the capture
Zooming in on 42,xxx (grep c33 /tmp/rtp.log|sed -e ‘s/.*c33 *//’ -e ‘s/ .*//’|sort -n|uniq |sed -e ‘s/.$/x/’|grep ^42|uniq -c) shows up some major flakiness. What happened?
The problem started at 4226x, where 2 of the 10 packets were never sent even after seconds worth of retransmits, but at say 4269x only 5 packets got through. Looking at the capture, we can see a fairly standard packets going through, 0,1,3,5,6,7,8,9. This is post 20% loss remember, so nothing unusual in losing 2 of the 10 on the first go. What we should be seeing though are retransmit requests, and we do. What we don’t see is the packets actually retransmitted
Rememebr that OBE into OBD runs absolutely fine with no losses and plenty of retransmits. The order is 42260, 42254(retransmit), 42261, 42256(r), 42263,65. But look at the source port. Retransmit requests come in from source port 2001 to the ephemeral 33040.
The packets coming back from AWS in this instance are from port 2000 (the RIST port, not the RIST control port). No wonder OBE is ignoring them.
From 20:19:45 when this capture starts, until 20:19:51, all traffic from AWS to the encoder is from port 2001 to port 33020. That’s good. The final message just after we send packet 42261, and it’s a retransmit of 42256. It’s then radio silence on that port until 20:19:53, when the requests for retransmits come back in and start being handled
So why does AWS suddenly switch to sending packets for retransmits from the wrong port? Is it the wrong port? What does the RIST spec say?
Well the spec is at https://vsf.tv/download/technical_recommendations/VSF_TR-06-1_2018_10_17.pdf and is an easy read.
1. RIST senders shall transmit the RTP media packets to the configured IP address of the RIST receiver and a user-selected UDP destination port P, where P is an even number between 2 and 65534. RIST receivers shall listen on UDP port P for media. This transmission is unidirectional, from sender to receiver. The sender may choose any arbitrary source port M for the RTP flow.
2. RIST senders shall periodically transmit the compound RTCP packets specified in section 5.2.1to the configured IP address of the RIST receiver and UDP port P+1. The sender may choose any arbitrary source port R for the RTCP packets. RIST senders shall listen on port R for RTCP packets from the RIST receiver.
3.RIST receivers shall listen on UDP port P+1 for RTCP packets from the sender. These packets will have a source IP address S and a source port R’ (in the absence of NAT devices between the sender and receiver, R and R’ will be the same). RIST receivers shall send the RTCP packets they generate to IP address S and destination UDP port R’, with a source UDP port of P+1. A receiver shall use S and R’ from the last valid RTCP packet it has received from the sender.
4.RIST senders may offer the user the ability to manually configure source ports M and R.Section 4’s fairly meaningless with many source-NAT devices scramble the source port.
That suggests to me that the receiver should not be responding with any packets to the sender on the even number port.
Update Aug 6th
AWS have confirmed there’s a bug there, so at least I’m not going mad.