Troubleshooting The Public Switched Telephone Network
The Public Switched Telephone Network (PSTN) is truly one of the marvels of the 20th century. Broadcasters rely on this technology extensively, not just for call-in programming, but for high-quality links over ISDN and for low-cost IFB. Broadcast Engineers are forced to become familiar with this technology to do their jobs.
Many readers will be familiar with the ins and outs of troubleshooting ISDN and POTS (plain old telephone service) circuits. However, once your line and equipment have been eliminated as the source of problems, troubleshooting can get very difficult. Only through a thorough understanding of the network is it possible to conclusively locate the source of the problem. Moreover, since the Telco is bound to be skeptical, it is best to be sure of the problem before going to them.
Figure 1 - Simplified view of the telephone network
Usually we can ignore the network when considering Telephone problems. In figure 1, we see the usual simplified view of the network. Normally, this simplistic view is sufficient to allow us to solve our difficulties - After all, everything in the cloud is the Telco's responsibility, right? True enough, and network problems will frequently resolve of their own accord. However, sometimes we can't wait for someone else to discover and fix the problem. Put on your Sherlock Holmes hat and let's investigate!
Trunk groups & hunting
The key to understanding network behavior is an understanding of trunk groups and how they function. Let's look at the simplest telephone connection that includes network trunking. Figure 2 shows a diagram of a simple connection between telephone "line A" and telephone "line B". Line A is served by Telephone Central Office Exchange "CO1" while Line B is served by a second Central Office, which we'll call "CO2". The call between A and B is a local call, so there is no long distance carrier involved.
Figure 2 - Local telephone network - one direction shown
When A dials B's telephone number, CO1 consults a routing table to determine a path or "route" to CO2. In this example, there is a "trunk group" directly between these two central offices. The switch's next step is to choose an available channel on that trunk group for this call. This process is called "hunting", a term that dates back to the days when an Operator scanned or "hunted" through the trunks until finding a vacant channel. The usual hunt method is to begin at the lowest numbered channel, and then sequentially check each channel to see if it is in use. Once a vacant channel is located, the call can proceed. This is called "bottom up" hunting. "Top down" hunting works identically, except that the highest numbered trunk is tested first, and the switch hunts downward.
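The hunt described above can be sketched in a few lines of Python. This is an illustration only: the function name `hunt`, the list-of-booleans picture of a trunk group, and the 1-based member numbering are assumptions made for the example, not a description of how any particular switch is implemented.

```python
from typing import List, Optional

def hunt(busy: List[bool], top_down: bool = False) -> Optional[int]:
    """Return the 1-based number of the first idle member, or None if all are busy.

    busy[i] is True when member i+1 is carrying a call (hypothetical model).
    """
    if top_down:
        order = range(len(busy) - 1, -1, -1)  # highest numbered member first
    else:
        order = range(len(busy))              # lowest numbered member first
    for i in order:
        if not busy[i]:
            return i + 1
    return None  # every member is in use

# Members 1 and 2 are busy, members 3 to 6 are idle.
group = [True, True, False, False, False, False]
print(hunt(group))                 # bottom up selects member 3
print(hunt(group, top_down=True))  # top down selects member 6
```

Note that with bottom-up hunting the low-numbered members carry most of the traffic, which is why the position of a bad member matters so much in what follows.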
Trunk group behavior when a trunk is bad
Trunk hunting is key to understanding the patterns of behavior typically experienced when the network is the cause of the problems. Using the example shown in Figure 2, let's note what would happen if different "members" of the trunk group were having problems. Recall that this is a bottom up trunk group. First assume that the first member of this group has a problem where it has audio in one direction only - trunks are nearly always 4-wire circuits, so problems that occur in only one direction are the usual case. Line A dials B, and gets no response. S/he then hangs up and dials again. Since this is the first member of the group, this will happen repeatedly until, by chance, another call is on member 1 while A is dialing, causing A's call to hunt up to member 2 or higher.
In this case, the symptoms will occur frequently, regardless of how busy the network is. During the busiest time of day (called the "busy hour") the problem will happen slightly less often, since the odds will be better that someone else will have the bad channel when A makes the call.
Now, let's assume that instead of member 1 that has the problem, member 6 has the problem. In this case, during slow times of the day, the problem will not occur; it will only occur once 5 other calls are up. Once this occurs, the symptoms will be similar to the first case. The average incidence will be lower than the previous example, even during the busy hour, since whenever a party on members 1-5 hangs up, the call will go through rather than hunting all the way to member 6.
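A rough model makes the contrast between these two cases concrete. Assume, purely for illustration, that each member below the bad one is busy with the same probability at the moment A dials, independently of the others, and that the bad member itself is occupied with some smaller probability (callers who land on it tend to hang up quickly). Real traffic is not this tidy, and the function name and parameters below are hypothetical. Under those assumptions, a bottom-up hunt reaches member k only when members 1 through k-1 are all busy.

```python
def hit_probability(bad_member: int, p_busy: float, p_bad_busy: float = 0.0) -> float:
    """Chance that a bottom-up hunt lands on the bad member (toy model).

    p_busy     -- assumed chance that any lower-numbered member is in use
    p_bad_busy -- assumed chance that someone else already holds the bad member
    """
    all_lower_busy = p_busy ** (bad_member - 1)
    return all_lower_busy * (1.0 - p_bad_busy)

# Compare a bad member 1 with a bad member 6 at light, medium and heavy load.
for load in (0.2, 0.5, 0.8):
    print(load, hit_probability(1, load), hit_probability(6, load))
```

With member 1 bad, the result does not depend on how busy the lower members are, because there are none. With member 6 bad, the result is negligible at light load and climbs steeply as load increases, which matches the behavior described above.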
Figure 3 - Trunk groups between local central offices
One-way versus Two-way trunk groups
In Figure 3 we see our example again, however this time we have included provisions for B to call A. The most common case is where two "one-way" hunt groups are used between CO1 and CO2. Trunk group 1-2 is used for calls from CO1 to CO2, and trunk group 2-1 is used for calls in the opposite direction. In this case, a bad trunk in group 1-2 would result in some percentage of failures when A calls B. However, since trunk group 2-1 is unaffected, 100% of calls from B to A will succeed. This behavior, where the direction of the call consistently affects the results, is a classic symptom of network-related problems.
Figure 4 - Local central offices with two-way trunks
If traffic between CO1 and CO2 is small enough, a "two-way" trunk group may be used instead of two one-way groups as shown in Figure 4. This increases network efficiency. In this case, calls in the two directions hunt in opposite directions: Calls from CO1 to CO2 hunt bottom up and calls from CO2 to CO1 hunt top down. If member 2 were bad, we would see, on average, many more failed calls when A calls B versus B calling A. On the other hand, if member 3 were bad, we'd see similar failure rates in both directions. Fortunately, two-way trunk groups are rare, which makes life simpler when troubleshooting.
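The asymmetry of a two-way group can be sketched with a toy model. The assumptions here are ours, not the article's: the group has five members (so that member 3 is the middle one), every other member is busy independently with the same probability, and the bad member is idle when the call arrives. The function name is hypothetical.

```python
from typing import Tuple

def two_way_hit_probability(bad_member: int, group_size: int,
                            p_busy: float) -> Tuple[float, float]:
    """Return (A-to-B, B-to-A) chances of landing on the bad member.

    CO1 hunts bottom up, so members 1..bad_member-1 must all be busy.
    CO2 hunts top down, so members bad_member+1..group_size must all be busy.
    """
    a_to_b = p_busy ** (bad_member - 1)
    b_to_a = p_busy ** (group_size - bad_member)
    return a_to_b, b_to_a

print(two_way_hit_probability(2, 5, 0.5))  # member 2 bad: A-to-B is hit far more often
print(two_way_hit_probability(3, 5, 0.5))  # middle member bad: both directions alike
```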
Introduction to troubleshooting network problems
The key to troubleshooting network problems is persistence. If we make enough calls, and we eventually get one that does not fail, this tells us several things:
The problem is not our equipment. Terminal equipment (such as a telephone or codec) should not care how many calls you make - it should act similarly in each case.
The problem is "acting like" a "network" (e.g. a "trunk") problem in that it is non-absolute; rather, it is probabilistic.
Generally, we will want to make 15 calls, carefully keeping track of the number of calls where the problem occurs (we can then calculate a "success rate" from this raw data). Next, we reverse the direction of the call, and place 15 calls. If the success rates are markedly different, we can be very suspicious this is a network problem. The logic for this conclusion is as follows: On each call the same customer equipment and same Central Office switches will be used. However, as we have seen, trunk selection is dynamic. Another clue that the problem may be network related is a success rate that varies substantially depending on the time of day. You will also sometimes note that the success rate of Circuit Switched Data (CSD) calls at 56 kbps differs from that of CSD calls at 64 kbps, and both will usually behave differently from voice calls.
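Keeping the tally is simple enough to do on paper, but a short sketch shows the arithmetic. The call logs below are made-up numbers for illustration, and `success_rate` is a hypothetical helper name.

```python
from typing import List

def success_rate(calls: List[bool]) -> float:
    """Fraction of test calls that worked; each entry is True for a good call."""
    if not calls:
        raise ValueError("no test calls recorded")
    return sum(calls) / len(calls)

# Hypothetical tallies: 15 calls placed in each direction.
a_to_b = [True] * 9 + [False] * 6   # 9 good, 6 bad
b_to_a = [True] * 15                # all good

print(f"A to B: {success_rate(a_to_b):.0%}")
print(f"B to A: {success_rate(b_to_a):.0%}")
```

Record the time of day and the call type (voice, 56 kbps CSD, 64 kbps CSD) alongside each batch, since those are the other variables worth comparing.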
THE NETWORK - THE BIG PICTURE
Before we go on to more detailed troubleshooting, let's examine the network in greater detail. We will examine the USA network, but you will find similar topology in other parts of the world.
Long distance access
Tandem switches and trunks, Figure 5, add the network facilities to allow A to make long distance calls from CO1. USA telecommunications policy requires that users be permitted "equal access" to various competing long distance carriers. Therefore, the local Telcos have something called an "Access Tandem Switch" that allows for this flexibility. This means that you may observe the somewhat paradoxical situation where a problem that occurs only with long distance calls is actually due to the local Telco.
Figure 5 - Local central offices with connection to the Interexchange carriers’ long distance networks
All long distance calls are routed by the CO switch to the Tandem Switch over "Tandem Trunks". These trunks are never used for local calls, but are used for all long distance calls, regardless of the long distance carrier. The local CO informs the Tandem Switch of the user's PIC (Primary Interexchange Carrier - the established long distance carrier) when the call is placed. Or, if a "casual access" carrier identification code is included when dialing (a 101xxxx code), the tandem routes the call to a carrier based on that information. The tandem switch then examines the routing tables for that carrier, and routes the call over a trunk to their long distance network. At the far end, the reverse process occurs; the long distance carrier routes the call to a tandem switch, which then routes it over tandem trunks to the destination CO of the called party.
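The carrier selection step can be sketched as follows. This is a simplified illustration, not switch logic: the function name, the PIC value, and the dialed number (taken from the fictional 555-01xx range) are all hypothetical. It assumes the casual access code is the seven digits 101 followed by a four-digit carrier code.

```python
from typing import Tuple

def select_carrier(dialed: str, pic: str) -> Tuple[str, str]:
    """Return (carrier code, remaining dialed digits).

    If the caller dialed a 101xxxx casual access code, use its four-digit
    carrier code; otherwise fall back to the line's presubscribed carrier.
    """
    digits = "".join(ch for ch in dialed if ch.isdigit())
    if digits.startswith("101") and len(digits) > 7:
        return digits[3:7], digits[7:]
    return pic, digits

# Hypothetical line whose presubscribed carrier code is 0222.
print(select_carrier("1010288 1 216 555 0100", pic="0222"))  # uses carrier 0288
print(select_carrier("1 216 555 0100", pic="0222"))          # uses the PIC
```

Either way the call crosses the same tandem trunks, which is why a fault there shows up no matter which carrier is chosen.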
Multiple trunk groups and overflow trunks
As we have seen, trunks are arranged as groups, and each group has members. When a call is routed to a particular group, hunting dynamically determines the member to be used for the call. In our examples so far, we have shown only a single trunk group for each route.
Often, multiple trunk groups are present between two locations. In this case, there is a hunt for an available trunk group, and then hunting within the members of that group. Note, this is not a "concentrating" function, just a "grouping" function. This arrangement is shown in Figure 6. If all trunks are busy there may be an "overflow" or "alternate" trunk group. In some cases, those alternate paths may be indirect routes that go through a third switch. Or, frequently, a two-way trunk group will be the overflow route when none of the one-way trunk groups are available.
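A two-level hunt with an overflow route might look like the sketch below. The group names and the representation are assumptions for illustration; the groups are simply listed in order of preference, with the overflow group last, and hunting within each group is bottom up.

```python
from typing import List, Optional, Tuple

def route_call(groups: List[Tuple[str, List[bool]]]) -> Optional[Tuple[str, int]]:
    """Pick the first group with an idle member, then hunt bottom up inside it.

    groups -- (name, busy flags) pairs in order of preference, overflow last.
    Returns (group name, 1-based member number), or None if everything is busy.
    """
    for name, busy in groups:
        for i, in_use in enumerate(busy):
            if not in_use:
                return name, i + 1
    return None

# Hypothetical snapshot: both one-way groups are full, so the call overflows.
groups = [
    ("one-way group 1", [True, True, True]),
    ("one-way group 2", [True, True]),
    ("two-way overflow", [True, False, False]),
]
print(route_call(groups))  # selects member 2 of the overflow group
```

A fault on an overflow route shows up only when the preferred groups are full, so it tends to appear only around the busy hour.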
Figure 6 - Hunting across multiple trunk groups
More on Routing Tables
As mentioned earlier, once a switch determines the destination of a call, it consults a routing table to determine where to route the call. This can be relevant for several reasons. In cases where a new area code or exchange has been added to the network, your local switch may not have an entry for this type of call to this destination.
Separate routing tables exist for Voice calls, 56 kbps Circuit Switched Data (CSD) calls, and 64 kbps Circuit Switched Data calls. The fact that the routing tables for data calls are not the same as those for Voice calls, and the fact that Circuit Switched Data calls are a small percentage of calls over the network, implies that data calls are more prone to network problems. And, those problems will be less likely to be noticed by routine network surveillance by the Telcos.
Note that in some cases the trunks for these different types of calls may be the same, but not necessarily. Digital 64 kbps trunks can handle all three types of calls. However, each type of call is routed according to separate routing tables. Therefore, 56 kbps CSD calls may take one route, while 64 kbps channels are reserved for 64 kbps CSD calls. In cases where older robbed bit facilities are still in place, these trunks can handle voice and 56 kbps CSD, but not CSD at the higher rate.
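The idea of one routing table per call type, combined with trunks of differing capability, can be pictured as a pair of lookup tables. Everything in this sketch is made up for illustration: the trunk group names, the table contents, and the helper function. "TG-RB" stands in for an older robbed bit facility.

```python
from typing import Dict, List, Set

# What each hypothetical trunk group is able to carry.
CAPABILITY: Dict[str, Set[str]] = {
    "TG-RB": {"voice", "csd56"},            # robbed bit: no 64 kbps data
    "TG-64": {"voice", "csd56", "csd64"},   # clear 64 kbps channels
}

# One routing table entry per call type for a single destination.
ROUTING: Dict[str, List[str]] = {
    "voice": ["TG-RB", "TG-64"],
    "csd56": ["TG-RB"],
    "csd64": ["TG-64"],
}

def usable_routes(call_type: str, routing: Dict[str, List[str]] = ROUTING) -> List[str]:
    """Trunk groups listed for this call type that can actually carry it."""
    return [g for g in routing.get(call_type, []) if call_type in CAPABILITY[g]]

print(usable_routes("voice"))
print(usable_routes("csd64"))
# A mistaken table entry that sends 64 kbps data to the robbed bit group:
print(usable_routes("csd64", {"csd64": ["TG-RB"]}))  # no usable route
```

The last line shows how voice can work perfectly to a destination while 64 kbps data calls to the same destination fail: the error lives in one table only.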
TYPICAL NETWORK PROBLEMS
The following symptoms are typical of network-related problems:
Circuit Switched Data calls (e.g. codecs)
Clean data in one direction. Dropouts or corrupted data in the other direction (intermittent loss of, or no, codec "lock" on one end).
Having your own data looped back to you. This is surprisingly common, and is caused by some piece of gear accidentally being left in a diagnostic mode. In some cases, codec A will receive its own audio back, while codec B receives A's audio. Or, in other cases, codec A will receive its own audio back while codec B receives its own audio back. (If either side gets a mix of both audio sources, the problem is not a network problem).
Corrupted data in both directions (neither codec locks).
Calls fail to complete, and the ISDN cause code indicates "no route to destination" or "incompatible bearer cap" (we've seen other cause codes as well).
Calls fail to complete and the call sits at "proceeding".
Voice calls (including modems)
Audio in one direction only.
Distorted, very loud, or very soft, audio in one direction.
Echo, singing, howling on one or both ends.
Poor hybrid performance/leakage.
Unusually low modem connect speeds.
Calls fail to complete. Busy, fast busy (reorder), silence, or an intercept message, is received.
Note that some of these symptoms can be caused by a bad local line, or equipment, so you need to eliminate those causes first. The key to a network problem is not so much the symptoms, but the fact that they have certain additional characteristics that can only be explained by the nature of the network. For example, the problem may occur only when A calls B, but never when B calls A.
Don't forget that you must do multiple tests. If you dial from A to B and the call fails, and then try dialing from B to A and it works, you have proved very little. However, if you find that 7 of 15 calls dialing from A to B fail, while 15 out of 15 calls from B to A succeed, then you have a very important clue.
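To put a number on why one call each way proves little while 7 of 15 versus 0 of 15 is a strong clue, one option is Fisher's exact test. This is a standard statistical technique that we are bringing in for illustration; it is not part of the procedure described in this article. It asks: if direction made no difference, how likely is it that the failures would fall this lopsidedly by chance? The function name is hypothetical.

```python
from math import comb

def fisher_one_sided(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """Chance that at least fail_a of all observed failures land in direction A,
    assuming the direction of the call has no effect on the outcome."""
    total_fail = fail_a + fail_b
    total = n_a + n_b
    p = 0.0
    for k in range(fail_a, min(total_fail, n_a) + 1):
        p += comb(n_a, k) * comb(n_b, total_fail - k) / comb(total, total_fail)
    return p

# One call each way, A to B failed and B to A worked: a coin flip.
print(fisher_one_sided(1, 1, 0, 1))

# 7 of 15 failed from A to B, 0 of 15 failed from B to A.
print(fisher_one_sided(7, 15, 0, 15))
```

The single-call comparison comes out at 0.5, which is no evidence at all. The 15-call comparison comes out at roughly 0.003, so chance is a very unlikely explanation.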
Eliminate the easy stuff first
Test the line and equipment. If you are using ISDN, you have the ability to dial from one channel to the other. Do so on each end, using a mode that requires both B channels. It's fairly easy to eliminate both the codec and the line. Then do end-to-end tests using the same mode used for the local test (your codec might have a problem specific to a certain mode).
If the problem is on a POTS line, see if the same problem occurs on another line from the same CO switch. Also determine if the problem occurs dialing between two lines on the same switch (if so, it's not a network problem).
Does the problem only occur when dialing in one of the two directions? If so, it's likely to be a network problem.
If the problem is occurring only with long distance calls, try some local calls. Or, vice versa. If the problem is limited to only one of these types of calls, you probably have a network problem.
If the problem happens only on long distance, try placing the problem calls with several 101xxxx access codes. If the problem occurs only with a certain carrier, contact them, explain the problem, and work with them to solve it (see below). Not all carriers can handle CSD calls, so we suggest the following for troubleshooting those calls: 1010222, 1010288, and 1010333.
If the problem occurs only with in-bound long distance calls (from multiple sites), and changing the carrier used (at the far end) to place the calls makes no difference, then the problem is in the local network at the end with problems receiving calls. This would be a problem with the path between the Access Tandem and the CO.
If the problem occurs only on out-bound long distance calls (to multiple sites), and changing the carrier used to place the calls makes no difference, then the problem is in the local network at the end placing the calls. This would be a problem with the path between the CO and the Access Tandem.
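The elimination steps above amount to a small decision tree, sketched below. The function name, argument names, and wording of the results are our own shorthand for the steps in this section; treat the output as a pointer for where to look next, not a diagnosis.

```python
def likely_location(fails_within_same_switch: bool,
                    fails_local: bool,
                    fails_long_distance: bool,
                    only_one_carrier: bool = False,
                    ld_direction: str = "both") -> str:
    """Map test observations to the most likely place to look.

    ld_direction -- "inbound", "outbound" or "both": which long distance
                    calls show the problem at the site being tested.
    """
    if fails_within_same_switch:
        return "line, equipment or CO switch (not a network problem)"
    if fails_long_distance and not fails_local:
        if only_one_carrier:
            return "that long distance carrier's network"
        if ld_direction == "inbound":
            return "local network, Access Tandem to CO, at the receiving end"
        if ld_direction == "outbound":
            return "local network, CO to Access Tandem, at the originating end"
        return "long distance path; test each direction separately"
    if fails_local and not fails_long_distance:
        return "network problem on the local call path"
    return "inconclusive; make more test calls"

# Outbound long distance fails with every carrier tried; local calls are fine.
print(likely_location(False, False, True, only_one_carrier=False,
                      ld_direction="outbound"))
```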
Working with the phone company to solve your problem
If you have followed our advice, you have already eliminated the local line, the equipment, and all sources of problems other than the network. And you have done many test calls, and noted one or all of the following attributes that indicate a network problem:
Problem limited to local or long distance, but not both.
Probability of the problem occurring varies significantly between incoming versus outgoing calls.
Probability of the problem occurring varies significantly between CSD calls versus voice calls.
Probability of the problem occurring varies significantly between CSD calls at 56 kbps vs. 64 kbps.
Probability of the problem occurring varies significantly depending on the time of day.
Now it's time to contact the phone company (local dial tone provider or long distance company). Remember, "Everybody blames the phone company". The old adage that "you will catch more flies with honey than with vinegar" applies here. You will probably need to be quite insistent at some point in the process. But, being pleasant, and keeping a good sense of humor, will help keep the Telco interested in solving your problem.
If they try to close out the "trouble ticket" before the problem is solved, insist that they do a conference call with the manufacturer or manufacturer's representative before doing so.
Working with long distance carriers. Here your task is reasonably straightforward. Nearly all of their problems are network related, so they won't show as much disbelief in your claims as the local Telcos do. Generally, all you need to tell them is that: "the problem only occurs when I use your network. If I dial with 101xxxx I don't have this problem" and they will start investigating.
You will need to be prepared to place calls until the problem occurs. If the problem only occurs on inbound calls, be sure to have someone standing by elsewhere who can work with you. The process is that the tester sets up a "trace" to capture information about calls from the originating number. They will want you to keep placing calls until the problem occurs. S/he will then examine the routing information for that call and then tell you to hang up and dial again. After 3-6 "bad" calls s/he should be able to notice what is common about the failed calls. At that point, the next step is usually to "busy out" the trunk group in question, to prevent calls from using the affected trunks. You should then have no additional failures. Make sure to test this based on your previous investigation (e.g. if only 1 of 10 calls failed in your tests, then you'd better make 20 calls just to be sure; on the other hand, if 12 of 15 calls were failing, then you only need to make 4 or 5 successful calls to know the problem is solved). Once the problem trunk group is found, they will leave it busied out and fix it later.
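The rule of thumb for how many confirmation calls to make follows from simple arithmetic. If the fault were still present and each call failed independently at the rate you measured earlier (an assumption; real failures depend on traffic), the chance of seeing n good calls in a row by luck is (1 - failure rate) raised to the power n. The helper names below are hypothetical.

```python
def false_all_clear(fail_rate: float, good_calls: int) -> float:
    """Chance of good_calls consecutive successes if the fault is still there."""
    return (1.0 - fail_rate) ** good_calls

def calls_needed(fail_rate: float, max_risk: float) -> int:
    """Smallest run of good calls that pushes the false all-clear risk
    down to max_risk or below."""
    if not 0.0 < fail_rate <= 1.0 or not 0.0 < max_risk < 1.0:
        raise ValueError("fail_rate must be in (0, 1] and max_risk in (0, 1)")
    n, risk = 0, 1.0
    while risk > max_risk:
        n += 1
        risk *= 1.0 - fail_rate
    return n

print(false_all_clear(0.1, 20))   # 1 in 10 failing, 20 good calls
print(false_all_clear(0.8, 5))    # 12 in 15 failing, 5 good calls
print(calls_needed(0.1, 0.10))    # calls needed for at most a 10% risk
```

With a 1-in-10 failure rate, 20 clean calls still leave roughly a 12% chance that you were just lucky, so more calls do no harm. With 12 of 15 failing, five clean calls leave almost no doubt.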
Working with local dial tone providers. Generally speaking, if you have done your homework, there is no reason for the Telco to dispatch a Tech to your site. In fact, this is likely to slow getting the problem resolved. However, understand that you now probably know more about the telephone network than the person who enters the repair tickets. Explaining in detail the troubleshooting you have done (and even the nature of the problem) is likely to be futile. Your best bet is to give a very brief description of the problem, and politely insist that they have this ticket referred to the "trunking group" because it is a "network problem" and to have someone call you back. When they do call you back, be sure to inquire if they are with the trunking group. Then explain to them in detail what the problem is, the testing you have done and the results thereof, and the fact you have been unable to find any explanation other than a network problem.
In some cases, local procedure will insist that a Tech be dispatched. You should make a "friendly" attempt to explain that this is wasting their money and your time, but don't make too big an issue of this. However, when the Tech arrives, stay with him/her and be friendly and helpful as they work. Offer coffee (or pizza if it's lunch time). Explain that you are sorry that this is probably a waste of their time, as you believe it is a network problem. Offer to show them the air studio(s) - chances are they'll be interested to see your station(s). Be sure to mention all of the stations at your location, as the Tech may be a listener. And be sure to give him/her T-shirts, coffee mugs, or whatever you have before s/he leaves. Make this person your ally and be sure to find out how to get back in touch later. The goal is for this person to think you are ok, and understand that the problem you are working together to solve is having an effect on the listeners. It is also good to build up a "blue collar" rapport with the Tech. Let him/her know that the boss is breathing down your neck (particularly if things have already dragged on for some time).
In the meanwhile, pay attention to what the Tech is doing. If the problem really is a network problem, they may be unable to find any "problems with the line" without your assistance. Don't let him or her depart at this point (that is one reason you have to stay with the Tech). Now explain to the Tech how to create the problem and do so. This may mean having the far end call into the Tech's test equipment, or dialing long distance, etc. Or, you may have to demonstrate the problem using your equipment.
The goal is to convince this person that there is a problem and to have him/her call to the trunking group to get it fixed (not bad as s/he gets to sit around drinking coffee chatting with you and seeing the radio station while waiting for this to happen).
Once you have someone from the trunking group involved, explain to them the tests you have done, and the results. If s/he tries to blame your equipment explain why this can't be. Then ask if s/he has an alternative explanation. If there is a Telco Tech at your site, s/he should be able to verify that they saw the results you describe.
At this point, the procedure is identical to what we described above for working with long distance carriers. The person from the trunking group will set up a trace and you will demonstrate the problem (see above) so they can determine where the problem lies.
Case 1 - Long Distance Access Problems
The following case history is from a codec user in Phoenix. Customer reported problems placing Circuit Switched Data calls to a particular site. This site was a long distance call. We made some test calls and rapidly collected the following information on the symptoms:
Problem occurred with outbound long distance calls only. Local and inbound calls worked 100% of the time. We were able to duplicate the same pattern calling to a third site in a third city.
The problem occurred nearly 100% of the time. Out of 20 test calls made (at roughly 10 AM to 1 PM, plus a few calls the morning before) 19 showed the problem.
Each of the codecs obtained "lock" (frame) and received its own audio back. Careful examination showed no mixing of audio from one site to the other (the network has no ability to mix two data streams together, so this proved the problem was not due to improper mix minuses).
The same results were obtained with three different long distance carriers.
Voice calls to the third location were unaffected (voice calls to the site of the original destination were not attempted).
We concluded that some network element had been left in a diagnostic mode (bi-lateral loopback). The fact that only long distance calls were involved, combined with the fact that the problem occurred with multiple long distance carriers, indicated that the problem was with the trunks to the Access Tandem (refer to Figure 5). We recommended that the customer contact the local phone company, request that the matter be referred to their trunking group, and state that this was "a tandem trunk problem and involved data calls only".
After over 24 hours of wrangling, the customer was able to talk to someone in the trunking group. This person was skeptical, so the customer conferenced on one of our Support Engineers (by this time it was 5:30 PM). A few calls were attempted, but none failed. The Telco Tech was very skeptical and wanted to close out the ticket. Our tech called me in. I verified that the Telco had a trace set up and immediately had the customer repeatedly dial a number at our office in hopes of having the problem occur again.
I explained to the Telco that there was no known way that each of the codecs would get their own audio back were it not for something in the network that was in loopback mode. He gave the usual reply that "if there were something wrong in the network we'd get thousands of complaints" and added "this would show up on our trouble logs".
I asked if they had separate trunk groups for voice and data. The reply was no: all trunks are 64 kbps capable. However, I reminded him that there might be a subset of trunks used for data. He denied this, until I reminded him that without looking at the routing tables it was impossible to tell. I persisted, and asked him to explain how these symptoms could occur. He could not. At this point, the customer had made over 20 calls and the problem had not occurred. I was beginning to think we'd need to resume the next day, as apparently traffic had diminished to the point the problem was not going to occur.
Fortunately, the problem began occurring again. The Telco traced several calls and was seeing good and bad calls going over the same trunk groups (we both knew that it was unlikely a single trunk would be in loopback, more likely it would be at least 24 channels - e.g. a T1).
The Telco was becoming rather defensive by now. I suggested the customer contact another station in Phoenix to see if the problem occurred dialing from there. We did, and that location was not experiencing the problem. However, we were seeing it on about 60% of calls from the original site at that point.
The Telco determined that the second customer's line was served out of a different CO from the customer having the problems. However, he insisted that they shared the same tandem trunks and tandem switch. After asking a number of questions, we learned that the customer's CO did not have any tandem trunks, rather it had interoffice trunks to another CO, that then routes calls to the tandem trunks. The configuration is shown in figure 7.
I told him to trace the test calls on these interoffice trunks, and the problem was narrowed down in about 4 calls to a single T1 that carried 24 members of a much larger trunk group. "Busying out" that T1 caused the problem to cease, proving that this was where the problem lay.
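The narrowing-down step is essentially a set comparison: find the T1 that appears on the bad calls but on none of the good ones. The sketch below assumes, for illustration, that trunk members are numbered consecutively with 24 to a T1 and that every call on the faulty T1 fails (as it would with a loopback). The trace records are made up, and the function name is hypothetical.

```python
from typing import Iterable, Set, Tuple

def suspect_spans(trace: Iterable[Tuple[int, bool]],
                  channels_per_span: int = 24) -> Set[int]:
    """Return the T1 spans seen on failed calls and never on good calls.

    trace -- (trunk member number, call_failed) pairs from the traced calls.
    """
    bad: Set[int] = set()
    good: Set[int] = set()
    for member, failed in trace:
        span = (member - 1) // channels_per_span + 1
        (bad if failed else good).add(span)
    return bad - good

# Hypothetical trace of four test calls over a large trunk group.
calls = [(30, True), (3, False), (41, True), (60, False)]
print(suspect_spans(calls))  # only the second T1 (members 25-48) is suspect
```

Busying out the suspect span and re-running the test calls then confirms or refutes the guess.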
We learned that because the trunk group was substantially larger than the 24 channels of this T1, the statistics on "trunk group occupancy" had not revealed anything due to the small percentage of trunks involved, and hence did not show up on the Telco's trouble logs. Score 1 for persistence/common sense and 0 for the Telco surveillance systems.
Case 2 - International Dialing Problems
This customer is part of a large international news organization. From one of their offices, they were unable to connect using Circuit Switched Data calls to codecs in any other country. The network would return a cause 65 (bearer capability not implemented) when the calls failed. Incoming calls, both international and domestic, were fine, however. The odd thing was that the problem occurred with all three long distance carriers tried. Since their International long distance rates from the USA are substantially cheaper than from overseas, this was costing quite a bit of money in addition to creating logistical problems.
The customer contacted the local Telco who insisted that they had to send a Tech to the site. The Tech came and made some calls and left without informing the customer. The customer contacted the Telco and informed them the problem was still present, and was told that the Tech had reported "no problems making International calls". At our request, he inquired if their Tech had made voice calls or Circuit Switched Data calls - The Telco was unable to determine what type of calls had been made by their Tech.
We informed the customer that he should report the problem to the long distance carrier and set up a conference call between the local Telco, the long distance carrier, and our support engineer, since getting this group of people together would be required for resolution.
However, the local Telco insisted on dispatching a Tech to the site again. This time the customer observed the Tech make some calls, and conferenced us on with the Tech. When asked if these were voice or data calls, the Tech was unsure (had not been trained on a new model of the ISDN test set) but he thought they were data calls. Ever suspicious, we asked him to call into a codec that was configured to accept data calls only. Tech said that these calls "did not go through".
We helped the Tech determine how to make a data call with his test set. As expected, domestic data calls went through without problems, however International call attempts returned a cause 65 - At last, the Telco witnessed the problem.
The Tech then conferenced us with someone from the local Telco's trunking group, and we also conferenced on the long distance carrier.
The problem was with the local Telco. For some inexplicable reason, the routing of calls to the long distance carriers is different for International calls. The Tech from the long distance company immediately informed the local trunk Tech over what trunks leaving the tandem these calls should be routed, so he could update the routing tables. The local trunk guy took so long getting this accomplished, that the long distance Tech put in a complaint to his supervisor. However, we got it working that evening.
Case 3 - Voice talent can't make local calls.
This customer, a professional voiceover artist, called us because he could no longer place calls to one of his clients, a local radio station. The station was only about 20 miles away, but since he was located in the New York metro area, that represented over an hour's commute. The Telco had recently added a new area code to his part of New Jersey.
First we verified that his line was ok, and that long distance was ok. On a hunch, we tried dialing this local call with 1+ area code and the call worked. This workaround was sufficient to save this guy the commute. He contacted the phone company and complained that he did not appreciate having to pay long distance on a local call, and they resolved the problem within a few days.
Fortunately, network problems are rarer than line problems, which is a good thing as they are harder to resolve. The key to success is to "probe" the situation thoroughly, and analyze the results carefully. Never forget that hunt groups act probabilistically, and that results will vary depending on network traffic. Stick to your guns, keep your brain in high gear, and the frustration of tracking the problem will be replaced by the warm feeling of success.