  • Diagnosing network connectivity issues

    I am working on the JANOS v2.4.1 release candidate and hoping to add some capabilities to quantify and then diagnose network connectivity issues. You know, from the perspective of the JNIORs. I am wondering if you guys have any ideas?

    Searches lead you to PING. In fact, later versions of JANOS have the PING -F flood command, which basically pings until you stop it, leaving a trail of decimal points on the screen when pings or their responses are lost. That tests the connection, but only at the moment; it doesn't help if you had a network issue a couple of hours ago. There is also PING -V, which pings all of the configured servers (like the gateway) so you can see if you are connected and your configuration is good to go. It also pings us here to test Internet connectivity (or the lack thereof). None of this helps with what may have happened in the past or what will happen in the future. A BW test is obviously not the right thing either.

    So far I have implemented three things. First, while you can tell from the log when the JNIOR reboots, and even whether that was specifically a commanded reboot, there is no indication of periods of power-down. Now there will be a table of the last 32 operational sessions. Why? Because if you can't reach the JNIOR over the network, is the network down or the JNIOR?

    Note that the Syslog tab in the WebUI displays events from most recent to oldest. These screen captures are from that. At the command line, CAT displays from oldest to newest (unless you use the newer CAT -R command to reverse the line order).

    shutdown.png

    This was an update and the unit came right back up, but if it had been powered off for a while you would see:

    powered_down.png

    My development unit has seen some reboots, eh?

    And if that isn't enough, there is the (currently undocumented) PS -H command that displays the past. It shows each boot time and the duration of operation.
    Code:
    jrbarn /> ps -h
    Process History
    
       0  07/28/23 14:14:39  1 Day 20 Hours 04:48.940
       1  07/30/23 10:19:55  6 Hours 03:10.303
       2  07/30/23 16:30:20  41:01.713
       3  07/30/23 17:11:45  18 Hours 48:55.328
       4  07/31/23 12:01:03  58:26.291
       5  07/31/23 12:59:53  21 Hours 52:07.940
       6  08/01/23 10:52:26  22 Hours 49:53.002
       7  08/02/23 09:42:44  1 Hour 08:40.623
    
    jrbarn />
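    For anyone curious, the bookkeeping behind such a table is simple. Here is a minimal Python sketch (purely illustrative; JANOS is not written in Python and all of these names are my own) of a 32-entry session history that keeps the last session's duration even through an abrupt power-down:
    Code:
    # Illustrative sketch only; not the actual JANOS internals.
    import time
    from collections import deque

    MAX_SESSIONS = 32

    class SessionHistory:
        def __init__(self):
            # deque silently drops the oldest entry once 32 are stored
            self.sessions = deque(maxlen=MAX_SESSIONS)

        def record_boot(self):
            # called once at startup
            self.sessions.append({"boot": time.time(), "duration": 0.0})

        def tick(self):
            # called periodically; persisting the elapsed runtime is what
            # lets the final duration survive a sudden loss of power
            last = self.sessions[-1]
            last["duration"] = time.time() - last["boot"]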
    On the network side I am adding round-trip and retransmission information to the NETSTAT connection table.
    Code:
    jrbarn /> netstat
    LAN connection active (100 Mbps)
    Server/Connection count: 12
       LocPrt  RemPrt  Remote IP        rtt  var  rtx  State
     1:    21    ----  ---------                       LISTEN        FTP
     2:  9220    ----  ---------                       LISTEN        JMP Service
     3:  9200    ----  ---------                       LISTEN        JNIOR Protocol
     4:    80    ----  ---------                       LISTEN        HTTP
     5:   443*   ----  ---------                       LISTEN        HTTPS
     6:    23    ----  ---------                       LISTEN        Telnet
     7:    80   38046  192.168.2.100      3            ESTABLISHED
     8: 62623     502  192.168.2.91      18    1       ESTABLISHED
     9: 64604     502  192.168.2.92      18    1       ESTABLISHED
    10: 54604     502  192.168.2.93      18    3       ESTABLISHED
    11: 50936     502  192.168.2.94      16    2       ESTABLISHED
    12:    80   49247  209.195.188.17   152   61       ESTABLISHED
    * TLSv1.2 encrypted socket
    
    jrbarn />
    Here 'rtt' is the SRTT (smoothed round-trip time) and 'var' is the RTTVAR (round-trip time variation) as defined by RFC 6298, both in milliseconds. Along with these there is a count of packet retransmissions for that connection (rtx). So you can see if something is slow or struggling. This has already been used to identify a poor network bridge and have it replaced.
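    For reference, the arithmetic behind those two columns is straightforward. A Python sketch of the RFC 6298 update rule (the constants are from the RFC; the function name is mine):
    Code:
    ALPHA, BETA = 1/8, 1/4      # smoothing gains from RFC 6298
    G = 0.001                   # assumed clock granularity, seconds

    def update_rtt(srtt, rttvar, r):
        """Fold a new round-trip measurement r (seconds) into SRTT/RTTVAR."""
        if srtt is None:                  # first measurement (RFC 6298 2.2)
            srtt, rttvar = r, r / 2
        else:                             # later measurements (RFC 6298 2.3)
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
            srtt = (1 - ALPHA) * srtt + ALPHA * r
        rto = max(1.0, srtt + max(G, 4 * rttvar))   # RTO with the 1 s floor
        return srtt, rttvar, rto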

    Um, this unit is using MODBUS (port 502) to monitor 4 solar inverters.

    And, last (so far) but not least, I have added 4 new network availability and BW statistics to the NETSTAT -A adapter statistics, including a packet retransmission rate per hour. These new statistics are accumulated from the last 24 hours of operation and are logged by the minute. The other adapter stats count from power-up.
    Code:
    jrbarn /> netstat -a
    LAN connection active (100 Mbps)
               Connects : 1
       Packets Received : 23481
           Packets Sent : 19506
       Packets Captured : 34021
       Multicast Frames : 8951
         Bytes Received : 4255.3 KB
             Bytes Sent : 2870.6 KB
           Ping Replies : 0
         Receive Errors : 0
               Overruns : 0
           Availability : 99.9966 %
             Average BW : 9.0 Kbps
                Peak BW : 186.7 Kbps
        Retransmit Rate : 0.2 per hour
    
    jrbarn />
    There is a non-volatile bandwidth table covering the past 24 hours by the minute. It seems that a plot of this would be helpful: you could see network down/dead times, or periods of high RTX. But there is no plot yet. Some of these statistics you will likely never get from any other TCP/IP stack. Just saying.
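    To illustrate, the NETSTAT -A figures above could be boiled out of that per-minute table roughly like this (a Python sketch of a hypothetical structure, not the actual implementation):
    Code:
    from collections import deque

    MINUTES = 24 * 60
    # one record per minute: (bytes_in, bytes_out, retransmits, link_up)
    table = deque(maxlen=MINUTES)

    def summarize(table):
        hours = len(table) / 60.0
        up_minutes = sum(1 for m in table if m[3])
        # average rate over each minute, in kbps
        kbps = [(m[0] + m[1]) * 8 / 1000.0 / 60 for m in table]
        return {
            "availability_pct": 100.0 * up_minutes / len(table),
            "avg_bw_kbps": sum(kbps) / len(kbps),
            "peak_bw_kbps": max(kbps),
            "retransmits_per_hour": sum(m[2] for m in table) / hours,
        }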

    So if you have any other ideas or thoughts I am open to suggestions.
    Last edited by Bruce Cloutier; 08-02-2023, 09:30 AM.

  • #2
    Neat! I can't speak to its usefulness in your product, but iperf3 can be a useful tool, if there is a straightforward way to implement that. Something like tcpdump or Wireshark would be great, too, but is probably overkill for what I am assuming is an embedded device and not a general-purpose computer.



    • #3
      I am not familiar with iperf3. Is that Linux or Windblows? Given that JANOS is neither (with not one byte of code in common, not to mention a different processor), I would need to pick out the attractive features and look to replicate those. Is there one aspect of iperf3 that you rely upon? We are trying to create passive diagnostic tools/statistics so you can notice network connectivity issues, perhaps before they impact your operation, without having to go out of your way looking for them.

      As for the other suggestion I will refer you to this topic http://www.film-tech.com/vbb/forum/m...1184#post31184.

      Along these lines, the idea of augmenting the serial transmission logs supported by our IOLOG command has come up. You can display/dump the recent serial communications for the AUX port, but there isn't a real-time watch-it-as-it-happens mode for that. There is no such log for the COM/RS-232 port, which is also sometimes used for interfacing. You can see the Sensor Port communications, but that takes way too much work to interpret (binary).

      On the network side, the retransmission count I am presenting counts when the JNIOR retransmits a packet after timing out waiting for the associated ACK. I could also count when we receive a duplicate transmission, as you might see if the remote client retransmits after not receiving our ACK. But there are some cases where packets are purposely repeated to enhance reliability. I am not at all certain about that, but I believe some wireless routers can do that.
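      If we ever do count inbound duplicates, the core test is roughly this (a simplified sketch that ignores sequence-number wraparound):
      Code:
      def is_duplicate(seg_seq, seg_len, rcv_nxt):
          # a segment whose payload ends at or before what we have already
          # acknowledged (RCV.NXT) is a retransmission of data we hold
          return seg_seq + seg_len <= rcv_nxt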



      • #4
        iperf3 is freely available under the BSD license. It should run on pretty much anything: https://iperf.fr/ -- it is mostly a network speed/performance testing tool. You can run "iperf3 -s" on one device on your network and "iperf3 -c <ip addr. of other device>" on another device and it will send a bunch of traffic and take speed measurements. It does many other things as well.



        • #5
          It still seems a bit focused on available BW. The RTTVAR (var) value we've added to NETSTAT is related to jitter. There is also the fact that there is only one processor, and the timing is affected by the activity of all the other processes. Obviously less so on a multi-GHz machine; the JNIOR runs down at 100 MHz. So there might appear to be more jitter when really you are just in the middle of a show and have run a few macros. (Series 5, if we ever get that moving, is targeted to run on a 300 MHz processor.)

          We want to measure/monitor the network without adding additional traffic to the equation.

          For instance, right now I have inbound and outbound kbps for each minute of the past 24 hours of operation. Typically a minute doesn't go by without some inbound activity (broadcasts). So would it be a loggable event if suddenly there is a minute or more of zero inbound? Did the network go away? Note that we also know whether the LAN is physically connected. At that point should JANOS get concerned and maybe try to ping something to see if the lights have gone out?
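          As a sketch of that quiet-minute check (the names are hypothetical, and the one-minute threshold is just the example above):
          Code:
          QUIET_MINUTES = 1   # a full minute of zero inbound packets

          def check_inbound(rx_per_minute, log, ping_gateway):
              """rx_per_minute: recent per-minute inbound packet counts."""
              recent = rx_per_minute[-QUIET_MINUTES:]
              if recent and all(n == 0 for n in recent):
                  log("No inbound traffic for %d minute(s)" % QUIET_MINUTES)
                  ping_gateway()   # actively confirm the network is really gone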

          I have to be conscious of those running JNIORs out on cellular modems where network traffic costs real money. Yeah, they are out there on remote gas meters and the like.

          Also, if the media server, core or whatever normally makes a persistent connection, could the JNIOR become aware of that and then log an event if that server now drops off or reconnects abnormally?

          And by "logging" I mean into the system log (jniorsys.log), but a case could be made that it would be worth an email alert or something more aggressive. You can configure a SYSLOG server if you have one set up (there are freeware servers), and it will also receive the messages added to the local system log. If you want to play with a syslog server, support here will be glad to help you out. Kevin runs one he wrote himself and has most of the gazillion JNIORs on our network reporting to it.
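          The wire format for remote syslog is trivial, by the way. A minimal Python sketch of a BSD-style (RFC 3164) UDP message to a collector (the server address is a placeholder):
          Code:
          import socket

          def send_syslog(host, message, facility=1, severity=6, port=514):
              # PRI = facility * 8 + severity; 1/6 = user-level, informational
              payload = "<%d>%s" % (facility * 8 + severity, message)
              sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
              sock.sendto(payload.encode("ascii"), (host, port))
              sock.close()

          send_syslog("192.168.2.10", "jnior: LAN inbound traffic resumed")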

          You know, the JNIORs you all have could also be monitoring other things and, when one suspects an issue, toggle on its own version of red tail lights. You can add a temperature sensor and monitor projector operation for overheating (or even the lack of heating, e.g. bulb out).

          I think it was Carsten that referred to the JNIOR as being under-utilized.





          • #6
            Just a note that the features I described in the original post above have been included in the JANOS v2.4.1 release (coming soon - 9/5/23) for the Series 4 JNIOR.

            If any of you have questions or suggestions please feel free to contact our support. We are always looking to add to, or augment, the diagnostic tools in the product line. Not only should our product perform reliably, it should also be helpful.



            • #7
              A network issue I see on occasion is where the switch sends my equipment a lot of packets that are not addressed to it and are not broadcast packets. I had a captioning system fail every night at 7 pm. It turned out that was when an Atmos movie in another auditorium was starting, and the captioning equipment was receiving the Atmos data (I don't recall if it was the audio data, but it was a LOT). This prevented the captioning equipment from being able to reach the server.

              I found this with Wireshark. But some simple program that identified packets that are not addressed to this piece of equipment and are not broadcast, and thus should not be sent to it, might be nice.
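              For what it's worth, something like that can be hacked together on a Linux laptop with a raw socket. A sketch (needs root, the NIC in promiscuous mode, and the MAC below replaced with your own):
              Code:
              import socket

              MY_MAC = bytes.fromhex("aabbccddeeff")   # placeholder: this machine's MAC

              # AF_PACKET delivers whole Ethernet frames (Linux only); put the
              # NIC in promiscuous mode first (ip link set eth0 promisc on) so
              # it doesn't discard foreign unicast frames in hardware
              s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))

              while True:
                  frame = s.recv(2048)
                  dst = frame[0:6]
                  if dst == MY_MAC:
                      continue        # addressed to us: expected
                  if dst[0] & 1:
                      continue        # broadcast/multicast bit set: expected
                  print("foreign unicast frame for", dst.hex(":"))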

              I just recently heard of issues with various pieces of equipment on "busy networks." Unless there is a lot of traffic for me, I should not see a busy network.

              Harold



              • #8
                Harold, I would presume this is due to someone using a "flat network" with no IGMP snooping, allowing multicast traffic to flood the switch. Thus far, all of my Dolby Atmos systems have used Q-SYS to finish the audio chain. As such, I've used managed switches with IGMP query/snooping set so only ports that deal with such traffic remain open. I do NOT put it on the normal LAN network where the UPC28c systems reside. Oddly, Dolby products do not respond to IGMP queries (particularly the IMS3000 with 3.5.13 software)! As such, I have to permanently lock those ports open (declare them a static router port) or the switch will close the Atmos port.

                Now, I've seen people set it all up as a flat network, but with switches that have both the bandwidth to handle the heavier loads (video and audio) and features like IGMP and QoS to keep things from getting flooded while time-sensitive data still makes it through. Those switches are notably more expensive, though.



                • #9
                  Interesting. We have had a couple of situations where something on the network caused JNIOR performance issues, in some cases triggering repeated product reboots.

                  I have not been able to create a network environment in which I can definitively generate specific JUMBO packets so as to test our network stack with those present. With these recent network issues I had first assumed use of the huge packets, thinking that we needed further debugging on that front. The hope is that switches are configured to pass those only to devices that use them.

                  It turned out, however, to be some ambitious IT people aggressively running vulnerability scans on their network. These were the kind of scans that basically attack everything in every way known to them. Kudos go out to Kevin, who distilled the issue down at the customer's site and then replicated it here so it could be debugged. It turned out to be one very malicious URL format in an HTTP request that tripped an unknown memory issue we had. The attack wasn't successful in our case, unless its purpose was just to crash the target.

                  That debugging effort led to some advanced memory-checking features that we have implemented beginning with JANOS v2.4. Most memory issues, like small buffer overruns, normally go unnoticed: either the allocated block is still big enough to accommodate overrunning the logical buffer size, or the memory beyond the block is not in use. Those memory tests quickly identified the URL issue and tagged a couple of other non-issue issues. Since v2.4 was released only one other memory issue has been tagged, and that has been fixed in this week's v2.4.1 release. Definitely update your (Series 4) JNIORs.

                  But in your case Harold, a switch will forward traffic it receives to all of its ports until it can determine which port services that particular destination. After that it will just pass the traffic down that route. Good switches have logic to handle people swapping and otherwise moving ports around. So it sounds like the Atmos system was transmitting to another device that hadn't yet been located on your theatre-wide network. So the traffic flooded the whole place.
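                  The forwarding decision in a learning switch boils down to a few lines (a toy sketch):
                  Code:
                  mac_table = {}   # learned source MAC -> port

                  def forward(src_mac, dst_mac, in_port, all_ports, send):
                      mac_table[src_mac] = in_port       # learn where the sender lives
                      out = mac_table.get(dst_mac)
                      if out is not None:
                          send(out)                      # known destination: one port
                      else:
                          for p in all_ports:
                              if p != in_port:
                                  send(p)                # unknown: flood everywhere else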

                  With the current OS on the JNIOR you can set the IpConfig/Promiscuous Registry key to true to allow us to capture traffic not addressed to the JNIOR. Then you could use the NETSTAT -S scanner from the command line with a filter to exclude the JNIOR's MAC address and see anything not addressed to the unit. You would probably need to filter broadcasts too.

                  The new bandwidth statistics in NETSTAT -A reflect traffic addressed to, and therefore handled by, the JNIOR. We could calculate the bandwidth for packets not addressed or broadcast to the unit and display that there as well. But you would have to know to check there; no one would be looking at that.

                  Networks are complicated, dynamic environments, and likely every network carries problematic traffic, misconfigured devices, and whatnot that no one ever sees. IT people want you to think they have it under control (job security). But in some cases (as we have seen) they are the cause of the problem. Even if there were some AI-based network oversight device, it would have to be everywhere in the network (basically part of every switch). It would drive paranoid people crazy (crazier?). At a minimum its notifications would be annoying and eventually ignored anyway.



                  • #10
                    Originally posted by Bruce Cloutier View Post
                    But in your case Harold, a switch will forward traffic it receives to all of its ports until it can determine which port services that particular destination. After that it will just pass the traffic down that route. Good switches have logic to handle people swapping and otherwise moving ports around. So it sounds like the Atmos system was transmitting to another device that hadn't yet been located on your theatre-wide network. So the traffic flooded the whole place.
                    I know enough about network details to be dangerous, but it SEEMS that if a device is looking for another device on the network, it would send out broadcast ARP requests and not send any real data until it knows where it is going (the Ethernet address of the destination). Further, the switch should know the address of any device on a particular port by sniffing any packet originated by the device(s) on that port. In general, it seems the switch should not send any data (except broadcasts) out a port until it knows who's there.

                    Harold



                    • #11
                      Of course that's correct too. The best thing about standards is that there are plenty to choose from.

                      Equipment doesn't always work the way you think it should, or it may until tables fill or it gets confused. It is the boundary situations that sometimes aren't handled so elegantly. If you designed a switch, you might think there is something uncomfortable about just discarding packets because some sequence hadn't occurred as you think it should. You wouldn't want your switch to cause someone to think the network wasn't functional. Maybe it is better to pass the traffic than not. But I am not trying to defend the equipment. Like Steve says, it might be the way your network is configured.

