Network performance measurements for NASA’s Earth Observation System

Download Network performance measurements for NASA’s Earth Observation System

Post on 25-Jun-2016




2 download


Network performance measurements for NASA�s Earth Observation System The EOS project makes extensive use of networks to support data acquisition, data production, and data…


Network performance measurements for NASA�s Earth Observation System The EOS project makes extensive use of networks to support data acquisition, data production, and data distribu- tion. Many of these functions impose requirements on the networks, including throughput and availability. In order to NASA Earth Science (ES) missions [1] typically involve a satellite in low Earth orbit (LEO) that hosts a small number of remote-sensing instru- ments, all of which collect data pertaining to a eserved. * Corresponding author. E-mail address: (A. Ger- main). Computer Networks 46 (20 1389-1286/$ - see front matter � 2004 Elsevier B.V. All rights r verify that these requirements are being met, and be pro-active in recognizing problems, NASA conducts on-going per- formance measurements. The purpose of this paper is to examine techniques used by NASA to measure the perform- ance of the networks used by EOSDIS (EOS Data and Information System) and to indicate how this performance information is used. � 2004 Elsevier B.V. All rights reserved. Keywords: Network measurement; Network performance; Data distribution 1. Introduction Joe Loiacono a, Andy Germain b,*, Jeff Smith c a Computer Sciences Corporation, NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA b Swales Aerospace, Inc., NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA c NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA Available online 14 July 2004 Abstract NASA�s Earth Observation System (EOS) Project studies all aspects of planet Earth from space, including climate change, and ocean, ice, land, and vegetation characteristics. It consists of about 20 satellite missions over a period of about a decade. Extensive collaboration is used, both with other US agencies (e.g., National Oceanic and Atmospheric Administration, NOAA; United States Geological Survey, USGS; Department of Defense, DoD), and international agencies (e.g., European Space Agency, ESA; Japan Aerospace Exploration Agency, JAXA), to improve cost effective- ness and obtain otherwise unavailable data. Scientific researchers are located at research institutions worldwide, prima- rily government research facilities and research universities. doi:10.1016/j.comnet.2004.06.017 04) 299–320 single scientific theme. For example, the Aura mis- sion [2], scheduled for launch in 2004, is an atmos- pheric research mission that will host four instruments. LEO satellites circle the Earth at about 200–500 miles high, completing an orbit in approximately 90–100 min. Once or twice each or- bit, when the satellite is in contact with one of the ground stations, data collected by the instruments is downlinked to the ground. After some initial processing, the data products are sent to one of the Distributed Active Archive Centers (DAACs). These DAACs then perform higher-level process- ing, to generate data products more useful to researchers. There are seven DAACs, located throughout the US, which store data products from ES missions and which support interactive and interoperable retrieval and distribution of these data products. Table 1 presents a list of the DAACs, their locations, and the type of informa- and international networks to reach international partners. Figs. 1 and 2 show the sites serviced by EOSDIS within the US, and internationally, respectively. These sites include, of course, NASA centers involved in Earth Science, other Federal agency partners, international agency partners, and the DAACs. At NASA Goddard Space Flight Center (GSFC) we have developed an EOSDIS network performance measurement system, called ‘‘EN- SIGHT’’ (EOS Networks Statistics and Informa- tion Gathering HTML-based visualization Tool) [5]. In this paper we address network monitoring of data transfer to and from the DAACs. Trans- mission of data to support instrument control and satellite downlink from spacecraft are both outside the scope of this paper. Two classes of measurements are taken: passive measurements and active measurements. Passive measurements atory arch C ce Da O ervati (EDC Fair ce Fli Labo 300 J. Loiacono et al. / Computer Networks 46 (2004) 299–320 tion stored at each one. The purpose of EOSDIS is to provide scientific and other users access to data from NASA�s Earth Science Enterprise. These users are located throughout the world. Several types of networks make up EOSDIS, including NASA Integrated Services Network (NISN) [3], various research net- works (such as the Internet2 Abilene network [4]), Table 1 Distributed active archive centers Name Location Physical Oceanography Jet Propulsion Labor Pasadena, CA Atmospheric Sciences Data Center NASA Langley Rese Hampton, VA Snow and Ice National Snow and I (NSIDC), Boulder, C Land Processes Earth Resources Obs (EROS) Data Center South Dakota Alaska Synthetic Aperature Radar (SAR) Facility (ASF) University of Alaska, Goddard Space Flight Center (GSFC) Earth Sciences (GES) NASA Goddard Spa Greenbelt, Maryland Oak Ridge National Laboratory (ORNL) DAAC* Oak Ridge National Oak Ridge, TN * This DAAC is not serviced by EOSDIS networks. consist of data about actual ‘‘user’’ flows, collected from operational network elements, generally routers. Passive metrics are not intended to add significant flows to the network, although the re- sults are generally acquired ‘‘in-band.’’ Active measurements, on the other hand, do intentionally add traffic to the network, to measure the re- sponse. Thus passive measurements generally re- Type of data , Oceanic processes, air–sea interactions enter, Radiation budget, clouds, aerosols, tropospheric chemistry ta Center Snow and ice, cryosphere, climate on Systems ), Sioux Falls, Land-related data banks, AL SAR data, sea ice, polar processes, geophysics ght Center, Upper atmosphere, atmospheric dynamics, global land biosphere, global precipitation, ocean color ratory, Biogeochemical and ecological data useful for studying environmental processes er Ne J. Loiacono et al. / Comput flect user operations, while active measurements show a snapshot of the network capabilities avail- able beyond the level of usage at the time. Passive metrics include the amount of data flowing over various time periods, for specific cir- cuits, interfaces, protocols, sources and destina- tions. Other metrics can relate to the number of various types of error conditions. Active metrics focus on throughput capability, and also include Fig. 1. Domes Fig. 2. Internati tworks 46 (2004) 299–320 301 round-trip time (RTT), packet-loss percentage, and the number of hops in the route. They are measured end-to-end, from one end system to an- other, with little knowledge of the network in be- tween (unless intermediate nodes along the path also participate in the active testing). At the top level of the ENSIGHT performance- measurement system, there are the active-data-col- lection and passive-data-collection subsystems, tic sites. onal sites. er Ne which make and collect the measurements. The re- sults are sent to a database subsystem for storage. A set of graphing programs extracts data from the database, and produces graphical displays. In addition to graphs showing either just active or just passive results, it has been found useful to combine related active and passive measurements into ‘‘integrated’’ graphs. The graphs are inte- grated into Web pages, and sent to the Web server; the Web pages are accessed through a Web proxy server. The ENSIGHT system is hosted on an i686 dual-processor-based computer with 60 Gigabyte disk space running Red Hat Linux 7.0. With the exception of the Oracle database engine, the ENSIGHT system has been developed largely using open-source software components. All data- base accesses however are standard SQL and an open-source database (e.g., mySQL) could easily be used instead. The system is generally driven by Perl scripts, which are often executed through the use of cron jobs. Some of the open-source products include Perl v5.6.1, SourceForge�s Net::SNMP package [6], Lincoln D. Stein�s GD::Graph package [7], the DBD and DBI data- base interface packages [8] by Tim Bunce, Mark Fullmer�s Flow Tools [9], and the Apache Web server [10]. Graphs of active measurements, passive meas- urements, and integrated measurements can all be accessed from the ENSIGHT Web site at Passive measurements that involve proprietary IP addresses are provided on a separate Web server and are accessible only to authorized EOSDIS networking personnel. A comprehensive view of network performance can be obtained from the combination of these tech- niques. The different types of results available from ENSIGHT are described in more detail in the remainder of this paper. Section 2 describes the passive-measurement subsystem. Section 3 addresses active measure- ment, while Section 4 addresses integrated meas- urement. In all three of these sections we present specific techniques used for performance measure- ment and discuss the types of information that can be gleaned from these measurements. Section 5 302 J. Loiacono et al. / Comput concludes the paper. 2. Passive measurements Passive measurements are derived from infor- mation stored on network devices and are either polled from the database collector periodically, or pushed to the collector from the devices them- selves. An overview of the passive-measurement subsystem is shown in Fig. 3. The passive measure- ments fall into two categories: Simple Network Management Protocol (SNMP) objects data and NetFlow [11] flow-statistics data. Each of these categories is discussed below. 2.1. Object measurements This data primarily consists of SNMP interface input and output byte counters. The gathered data is manipulated into graphs that depict interface load over time. This is of course a very standard report, e.g., Multi Router Traffic Grapher (MRTG) [12], and we have modeled the collection and round-robin storage after two very capable open-source toolsets; the MRTG and the Round- Robin Database [13] developed by Tobi Oetiker. This capability was redeveloped in order to store this data in the Oracle database and have it readily accessible for future tool developments or en- hancements. Data can be collected for any available SNMP object. Currently measurements such as interface utilization, router CPU usage, and router memory availability are tracked. The toolset provides a Web-based user interface for adding or modifying tracked objects. A sample is shown in Fig. 4. This figure lists attributes of a particular SNMP object. Several attributes are of interest. The col- lection interval indicates how often the device should be queried for this object; the graphing interval indicates how often a new graph should be created from the collected data; the graphing partner indicates whether a second object should be graphed on the same graph and what it is (this is useful for comparing inbound and outbound interface utilization). The collection offset permits objects to be col- lected at different times in order to spread out the load on the collector host resources; same with tworks 46 (2004) 299–320 the graphing offset. Otherwise objects are collected er Ne J. Loiacono et al. / Comput every collection interval from 12:00 midnight. The maximum value is useful for preventing spikes that would dominate the graph. The OID type indi- cates whether the object is tracked as a counter or a gauge. Counter objects are reduced to gauge readings upon input. The archive method of ‘‘NORMAL’’ will pro- duce 4 MRTG-like graphs showing data collected every 5 min, and then averaged over longer peri- ods of time (half-hour, 2 h, 24 h). Fig. 5 shows a sample of the 5 min graph, while Fig. 6 shows averages calculated over the maximum 24 h time period. Fig. 4. Managing Fig. 3. Passive-measureme tworks 46 (2004) 299–320 303 Note that Fig. 6 indicates a gradual increase in daily utilization and that it is butting up against the 300 Megabits per second ATM VBR PVC (asynchronous transfer mode variable bit rate per- manent virtual channel) limit for this circuit (also seen in Fig. 5). A graph like this becomes useful for capacity planning, indicating the potential need to increase the ATM circuit contract. 2.2. Flow measurements The flow-measurements portion of the passive- measurements capability collects and graphs data an object. nt system overview. er Ne 304 J. Loiacono et al. / Comput pertaining to specified flows through the EOSDIS system. This information is useful for tracking par- ticular contributors to overall network usage. The capability is provided to track flows from as spe- cific as host/port to host/port to as large as an aggregate of data transported from one DAAC to another. In addition, daily reports are generated which identify the individual components and their contribution to the aggregate flows. Finally a Web-based interface permits the user to interro- gate the stored NetFlow data in a variety of ways. This capability is heavily dependent on the excellent Flow Tools suite developed by Mark Fullmer. Flow Tools captures NetFlow data ex- ported from devices throughout the network and Fig. 5. Five minu Fig. 6. Twenty four tworks 46 (2004) 299–320 stores this data for several months at a time. The length of time NetFlow data is archived is merely dependent on available storage and the quantity of data exported from the devices. The Flow Tools suite offers a myriad of ways of interrogating the stored NetFlow data. A flow is identified by its two endpoints. In order to begin tracking a particular flow the end- points must first be defined and assigned to the flow. Fig. 7 shows the Web page provided for defining an endpoint. In Fig. 7 a Network/Mask:Port type of end- point is defined. Other types of endpoints that can be defined include Network/Mask, Interface, Interface:Port, and Autonomous System. Once te averages. hour averages. of specific flows. The graph looks very much like a standard interface-utilization graph but is limited to a specific flow. Fig. 12, another rate graph, tracks the delivery of data from the EDC DAAC to the University of Maryland. Fig. 13 shows a typical daily report. The daily J. Loiacono et al. / Computer Ne both endpoints of a flow have been defined (some endpoints can be re-used) a flow can now be de- fined (Fig. 8). The defined flow in Fig. 9 is intended to collect flow data from the Goddard Space Flight Center DAAC in Maryland to the EROS Data Center (EDC) DAAC in Sioux Falls, SD. Note that the capability to exclude certain information (in this case port 5500 which actually is the port for the ac- tive performance testing (PT) measurements de- scribed in Section 3) is provided by using a minus sign. The Flow Status provides several op- tions including Inactive, Collect Only, Collect and Graph, or Daily Report. Once the data is collected for a flow both quan- tity and rate graphs are produced, four of each, similar to the Object tracking graphs described above. Flow data may pass through several net- work devices, and the user can use discretion to pick the appropriate device from which to collect the data. An �All Devices� option is under develop- Fig. 7. Entering a new flow endpoint. ment, which would permit, for example, collecting Fig. 8. Creating a new flow. �all data initiating from multiple devices through- out the system but destined to a particular external host.� Fig. 10 provides an example of a flow-quantity graph. Here the graph shows daily quantities for the flow defined above. In this case, the GSFC DAAC is flowing over 2 Terabytes of data per day to the EDC DAAC. Fig. 11 shows the EN- SIGHT system�s ability to look at utilization rates Fig. 9. A defined flow. tworks 46 (2004) 299–320 305 report is generated once a day and breaks out the specific contributors to the flow that is being collected and for which daily quantities and rates are graphed. In the figure, the hyperlink (under- lined) provides a quick link to the flow quantity and rate graphs. The option to look at another day is provided as well. The sustained rate over the 24 h period for this flow was 268 Megabits per second. Note the 2 Terabyte daily transfer is between two specific hosts. Fig. 14 shows the Web page used to build cus- tom flow reports. This page is primarily a front- end to the Flow Tools capability customized for the EOSDIS network. It exploits the great er Ne 306 J. Loiacono et al. / Comput reporting flexibility of the Flow Tools suite. This tool permits a network analyst to closely examine Fig. 10. Flow qua Fig. 11. Flow rate, 5 Fig. 12. Flow rate to Univ tworks 46 (2004) 299–320 any aspect of the flow of data through the network over any time period for which the NetFlow data ntity yearly. min interval. ersity of Maryland. er Ne J. Loiacono et al. / Comput remains stored. Options exist to examine a view as broad as all flows between two interfaces to (for Fig. 13. Daily Fig. 14. Building a cus tworks 46 (2004) 299–320 307 example) a very narrow examination of individual flows (with time values) between two host: ports Report. tom flow report. over a specified interface for 3 s. An option is pro- vided to report all hosts with DNS resolved names, as well as the option to sort the resulting report on a particular column. Specific data can be excluded by preceding any form input with a minus sign (�). The report is created quickly owing to the effi- ciency of the underlying Flow Tools. A sample report is shown in Fig. 15. This report captured all of the active-test measurements pass- ing through the ecs_larc router between 10:00:00 and 11:00:00 on September 24, 2003. The Custom Flowgraph tool provides the net- work engineer with the ability to visually analyze network traffic behavior. Fig. 16 shows the results of a query that examined a 1-h period, looking at FTP transfers (TCP port 20.) This particular tool was used to demonstrate that the EOSDIS odic samples of traffic on the LAN that is attached to the collector device are collected and graphed. The impact on EOSDIS networks created by col- lecting NetFlow data from the current four net- work devices (to increase soon to nine devices) is minimal as is seen in Fig. 17. 2.3. Live monitoring The SNMP object-data-tracking capability and the flow-measurement capabilities have been merged to provide ENSIGHT�s live-monitoring capability. This function provides the user with a �live� look at an interface together with the flows that are passing through that interface during the monitoring period. The user selects an interface for monitoring via a Web page. The ENSIGHT 308 J. Loiacono et al. / Computer Networks 46 (2004) 299–320 network was not the cause of recent FTP file trans- fer problems when a flowgraph was generated showing both FTP transfers and active-perform- ance-measurement traffic during the same time period. It was clear that the active TCP measure- ments were able to use all of the available band- width, but the FTP transfers were otherwise constrained. Of course exporting NetFlow data continuously from multiple devices is not free. The capability to track the network impact caused by the NetFlow data export is also provided with ENSIGHT. Peri- Fig. 15. Custom software begins pooling the SNMP interface- byte-count object every 5 s (this is the minimum rate at which network devices tend to update their SNMP counters.) A report (see Fig. 18) is pro- duced and updated automatically every 5 s while the user is viewing it. There are three components to this report. The first is the typical interface-utilization graph but which is now presenting data at a much finer gran- ularity than the 5 min used in the SNMP object- tracking graphs discussed above. The second section is created from live NetFlow data and flow report. er Ne J. Loiacono et al. / Comput shows host-to-host activity. The third component is like the second but provides additional port Fig. 16. Custom Fig. 17. Custom tworks 46 (2004) 299–320 309 information. The information in the second and third sections is derived from the live collection flowgraph. flow report. er Ne 310 J. Loiacono et al. / Comput of NetFlow data and is posted to the updating Web page as it becomes available. Because NetFlow data is not exported from the routers until the flow expires, the information in Fig. 18. Live interfa tworks 46 (2004) 299–320 the flow sections trails in time that of the utiliza- tion graph. Thus, in Fig. 18, we see that the utili- zation graph shows an average of approximately 85 Megabits per second, while the total of flow ce monitoring. data acquired so far is only 71 Megabits. To cor- rect this, the page is left to update for several min- utes after the live utilization tracking is ended. As a result, in terms of rates, after a few minutes both sections are in much closer harmony. The third section, which includes port informa- tion, has a Flow Rate and an Overall Rate column. The flow rate is that rate achieved during the time that the flow existed. The overall rate represents the total data through the flow, which may be a shorter period, divided by the total period for which utilization data was collected. An examina- tion of the data in the third section illuminates how EOS computers use multiple, parallel TCP streams to accomplish large-scale data transfers quickly. 3.1. Overview of source nodes and active-measure- ment tests The source nodes actively drive the tests; testing to each destination is initiated hourly by a cron job. The test times are scheduled throughout each hour in order to avoid overlapping with other tests from the same source, and also with other tests to the same destination from other sources. In the nominal case (there are many variations), the first test step is a traceroute to the destination. From this traceroute a number of hops will be de- rived––the last hop to respond will be counted if the destination is not reached. Also, the RTT will be derived, provided the intended destination is reached. This will be used as the RTT if the ping test does not provide one. Next is the standalone ping test. In this test, 100 J. Loiacono et al. / Computer Networks 46 (2004) 299–320 311 3. Active measurements The active-measurement system currently uti- lizes about 30 source nodes, and 100 sink nodes. The sink nodes generally are passive; they run servers, which respond to requests from clients at the source nodes. The source nodes clients are in- voked by cron jobs. Figs. 1 and 2, presented earlier in the paper, show the participating sites. Fig. 19 is an overview of the active-measurement system. Fig. 19. Active-measureme pings are sent to the destination, and the RTT and the number not returned is extracted. If obtained, this RTT will be used in preference to the one found in the traceroute. The loss from this ping test (if under 100%) will be extracted as packet loss if there is no concurrent ping test. Next is the throughput test, and on some nodes another ‘‘concurrent’’ ping test. The throughput test is designed to last for 30 s, so the concurrent ping test is used on nodes which are enabled to nt system overview. er Ne send 100 pings in this period. If so, the packet loss is extracted from the concurrent ping test. The throughput test is one of four types, depend- ing on characteristics of both the source anddestina- tion nodes. Iperf [14] (available from the National Lab for Advanced Network Research (NLANR [15])) and nuttcp [16] (NASA developed) are pre- ferred, due to their rich set of features. An older pro- gram tcpwatch [17] (alsoNASAdeveloped) has been mostly phased out, but is still in use on some sys- tems. Note that tcpwatch and iperf are compatible (i.e., the client for one works with the server for the other, except for their default port numbers) and get similar results.Whichever of these three pro- grams (all derived from ttcp) is used, the tests are set up to keep the TCP send buffer full for 30 s. Alternatively, if none of the above programs can be used, ftp is the fallback throughput test. In this case a file size is chosen to target the trans- fer time to about 30 s, although this can change as the network is upgraded. Accordingly, no concur- rent ping test is used with ftp throughput tests–– because it cannot be assured that the pings are actually concurrent! Note that these tests all use the source ma- chine�s standard TCP stack, and thus because of TCP�s congestion-control mechanism will tend not to allow the network to be swamped by test packets. This would not be true of UDP tests, which could easily be configured to exceed the net- work�s capacity. But TCP will tend to share the network more or less equally among all TCP streams, so other users will still get a share of the available throughput. This, and the short (30 s per hour) test duration, has enabled these tests to continue in an ongoing way along with user oper- ational flows. The impact on users is negligible. All the data extracted from these tests is col- lected on the source machine in a ‘‘results’’ file, which is sent on an hourly basis to a ‘‘collector node’’, where it is ingested into the database. These results are then plotted and displayed on the Web site, as described in Section 3.4. In case of the inability to send the data, it is retained at the source node, and new results appended to the results file. An attempt to send the data is made hourly; the accumulated data is deleted only after 312 J. Loiacono et al. / Comput it is successfully sent. 3.2. Sink nodes Ideally, a sink node will include the capability to respond to pings, traceroutes, and one of the four throughput programs listed above. In most cases the throughput server is configured to be available continuously, but an effort is in progress to make the server available on a schedule corre- sponding to the source-node schedule. 3.3. Navigating the active-measurement Web pages The active-measurementWeb pages and displays are organized on a ‘‘destination’’ node basis. There are several groupings of destinations that can be ac- cessed and used for individual site selection. Maps. The US or international maps can be se- lected, which then shows a map of all of the sites tested in those categories. Clicking on an individ- ual site on these maps will lead to the ‘‘destination page’’ for that site. If multiple systems are tested at the same location, clicking on that site will link to a page with a list of systems tested at that site. That page will then link to the individual destina- tion pages. EOS Mission sites. Selection of one of the EOS Missions will link to a map showing the sites tested which participate in that mission. Fig. 20 is a sam- ple mission map for the ‘‘Terra’’ mission. On the map is a tiny graph for each site, located in the approximate geographical map location for that site. These graphs show the daily min–max range for the throughput testing to that site for the last week, the median for each day, and the require- ment against which the throughput is compared, if any. If there is more than one requirement, only the highest one is shown. If there is more than one source testing to a site, only one source is shown. This will normally be the source corresponding to the requirement. The border color of these tiny graphs, which are updated hourly, is used to indi- cate the status of the performance relative to the requirement, according to the chart that appears on the page. In this way, the status of all nodes associated with a mission can be evaluated in a short glance. From these mission pages, clicking on any of tworks 46 (2004) 299–320 the tiny graphs will link to the ‘‘destination page’’ er Ne J. Loiacono et al. / Comput for that site. This can also be achieved by clicking on the name of the site from the list at the left or bottom of the page. Links to other categories of sites can be found at the top of the page. Network sites. Two categories of networks can be selected from the Active-Measurements page: EMSnet, and Research Networks. EMSnet is the EOS internal production network. This link is available only to authorized users, and is intended as a network-monitoring tool. It uses tiny graphs to show the status of all the nodes tested on the EMS network. The ‘‘Research Networks’’ (RENs) page lists sites at hubs of various Research Net- works that are participating in these tests. Organization sites. These are sites grouped organizationally––sites belonging to participants Fig. 20. Terra mission tworks 46 (2004) 299–320 313 in the Committee on Earth Observing Satellites (CEOS) [18], the DAACs, sites participating in the Earth Science Technology Office (ESTO) [19] Computational Technologies Project [20], and a large number of nodes located at NASA GSFC. Other. Finally, ‘‘Other’’ is a list of a few sites that don�t fit any of the above categories. 3.4. Destination-page display Selecting a destination, either from a map page or a list, links to a destination page. Fig. 21 is a sample destination map for Pennsylvania State University. The destination page mainly consists of up to 16 small graphs, showing the results of measurements ––clickable map. er Ne 314 J. Loiacono et al. / Comput to that site. Note that all tests shown on graphs on destination pages are TO that site. The graphs are arranged in a 4·4 array, and can all fit on most screens at the same time. The leftmost column con- tains ‘‘Thruput’’ graphs. To the right of those are ‘‘Loss’’ graphs, next are ‘‘Hops’’ graphs, and on the right are ‘‘RTT’’ graphs. The top row of these graphs shows individual results for each of the four parameters for the cur- rent week: today plus the previous seven full days. The second row is based on the same data, but shows the median for each hour of the day (in Greenwich Mean Time (GMT)). In this way time-of-day sensitivities can be easily seen, and the variability between individual tests is some- what smoothed out. The requirements (if any) are shown on the throughput graphs as dashed lines. The third row shows a longer period (four Fig. 21. Destination page: P tworks 46 (2004) 299–320 full months plus the current month), displaying daily median values. The fourth row shows weekly medians for the current year plus two full years in the past. The line colors on these graphs indicate the var- ious source nodes that test to that destination. Source colors are consistent throughout all desti- nations (source colors have been chosen with the intent of generating readable graphs, although with the various permutations of sources and des- tinations, some graphs have some similar colors). Clicking on any of the smaller graphs links to a larger version of the same graph. The source key is not shown on the small graphs, but can be found on the large graphs. Some additional information is also shown on the destination pages. Typically included are a description of the node, and the cur- rent status of the throughput versus the require- enn State University. er Ne ments. Below the graphs is an indication of the route used from the various sources to the destina- tion. Throughput graphs. The throughput graphs are based on the throughput program run, either iperf, nuttcp, ftp, or tcpwatch. Iperf and nuttcp are capa- ble of running multiple parallel TCP streams. This option is often used when the throughput of a sin- gle stream is limited not by the network, but in- stead by the maximum TCP window size of one of the end systems (or sometimes by a proxy fire- wall). If so, the throughput shown on the graphs is the sum of all the parallel streams. This condi- tion is noted in the footer section of the destination page. Hops graphs. The hops graph is based on the re- sults of a traceroute. The intent is to be able to see when route changes occur. (Note: some route changes have the same number of hops. This will not be seen as readily, but perhaps the RTT graph will show this occurrence.) The graph shows the hop count for the last node which responds to the traceroute––the ‘‘***’’ responses are ignored. The important note here is that this may or may not be the actual destination. This presentation was favored over always showing the maximum number of hops (30) when the destination did not respond, based on the fact that many cam- puses block ICMP inside their campus. In this case the hops count will indicate the number of hops to the campus edge, and will be useful in determining route changes from the source node to the campus edge. So there is no clear indication on this graph whether the end node is or is not reached. Loss graphs. There are two methods for deter- mining loss. The preferred method is to send 100 pings, and see how many fail to get back. There are two variants of the ping tests. The preferred method is to send the 100 pings concurrently with the throughput test (‘‘concurrent pings’’). In this case, the packet loss shown is for a period of sig- nificant network load. However, note that the throughput tests last 30 s. To send 100 pings in 30 s requires a ping spacing of 0.3 s. Many systems do not permit this close spacing (1 s is generally the default minimum) without intervention of the sys- admin. On those systems where the sysadmin has J. Loiacono et al. / Comput chosen not to enable this spacing, the loss is deter- mined from 100 pings prior to the throughput test (‘‘standalone ping’’). This thus shows the loss for a network with unknown load. If pings are blocked, either at the source or des- tination, an alternative method is used that looks at the difference in the number of TCP packets re- transmitted by the source node before and after the throughput test (obtained by netstat––s). It then calculates the percentage this represents of the total packets sent during the throughput test. Note that this alternative method has many draw- backs, and the results are subject to various errors. However, the thinking is that the data has been collected; it might mean something. For example, step changes probably do indicate that something has changed. The first problem with this alternative method is that it measures not packet loss, but packet re- transmission. While these are certainly related, a single packet lost may result in the retransmission of all following packets that had already been sent (unless sack is in use). So there is an unknown mul- tiplier function applied due to this factor. The next problem is that the retransmissions counted this way include all retransmissions for the source node, not just for the tcp session(s) to the destination node. So if the source node is re- transmitting many packets to unrelated destina- tions, this count will be applied to the loss calculated here. This factor makes it essential not to attribute individual cases of high loss to the net- work between the source and destination tested here. In other cases, where the source node is send- ing out a high volume of data to many nodes (example: the GSFC GES DAAC at this writing is sending out about 250 Megabits per second, more or less continuously), a relatively small per- centage of retransmission to unrelated nodes can overwhelm a low-rate flow to a test node. The third problem is that the number of good packets is estimated, not calculated. The through- put rate and time is known from the throughput test, so the total data flow is known accurately. But the number of packets this represents is esti- mated using the maximum transmission unit (MTU), as reported by iperf. This could be correct, or there could have been more good packets sent–– tworks 46 (2004) 299–320 315 MTU is a maximum size, not guaranteed to be the size of every packet. Anyway, the loss rate is then calculated as the packets retransmitted divided by the sum of good packets and retransmitted pack- ets. This third factor is probably not as big a source of error as the other factors. For all the above reasons, this calculation of the percentage of retransmitted packets is not highly reliable. RTT graphs. There are two methods used for determining RTT. The preferred method is to use the median value from the 100 pings. Even if the concurrent pings are used for the loss graph, the RTT is derived from standalone pings. This is be- cause it has been found that in network-limited cases, if using concurrent pings, the pings will get queued at the bottleneck node behind packets from the throughput test. This will distort the failure conditions, and responding appropriately. But they seem less sensitive to network degrada- tion; it will take longer to recognize a higher packet-loss rate, or circuit-speed reduction, as long as some capability remains. It is hoped that the integrated measurements (see Section 4), when completed, will offer improved recognition and notification of these partial failure conditions. For this function, the recent performance from a source to a destination is compared to the expected or required value, and a status (‘‘Good’’, ‘‘Ade- quate’’, ‘‘Poor’’, etc.) is assigned. If this status changes, automated e-mail notifications are sent to appropriate parties. Another use of this system is to assist trouble- shooting of the problems observed. One character- istic of these tests, which contributes to the ability 316 J. Loiacono et al. / Computer Networks 46 (2004) 299–320 RTT value observed. If the pings are blocked, but the traceroute does indeed reach the destina- tion node (yes, there are a few of these cases), the lowest of the three RTTs from the traceroute is used. 3.5. Troubleshooting using active-measurements results One intended use of this system is automated event notification. It has been observed that net- work operation centers are quite good at detecting Fig. 22. Throughput to troubleshoot problems, is the overlapping of test sources and sinks. Most sources test to several sinks, and most sinks are tested from at least two sources. So if most or all tests from a given source experience a performance change at about the same time, attention is directed near that source. Conversely, if performance changes are observed from multiple sources to a single sink, the change is attributed to the vicinity of the sink. Sometimes there are multiple sinks in relative proximity. This can allow estimation of the location of the prob- lem. If both sinks are similarly affected, the change by GMT hour. er Ne J. Loiacono et al. / Comput is inferred to be in a common element, while if only one is affected, the change is more local to that sink. We conclude this section with some examples to show how inferences can be derived from actual graphs produced by the ENSIGHT system. Example 1. This example involves diurnal vari- ation. Examination of the ‘‘hourly’’ graphs of throughput (Fig. 22) and packet loss (Fig. 23) shows that high packet loss corresponds to low Fig. 23. Packet loss Fig. 24. Round tworks 46 (2004) 299–320 317 throughput at certain hours of the day, and that all sources are similarly affected. The inference is that there is a highly congested circuit along the route during US East Coast business hours. Example 2. This example involves a destination problem. In Fig. 24 we see a noisy RTT develop from all four sources to Boston University (BU) on 30 September 2003. As the four sources are spread widely around the US, the inference is that the problem is near BU. Comparison of these by GMT hour. trip time. results with another node near BU (perhaps the Massachusetts Institute of Technology (MIT)) would enable further pinpointing of the problem. near the specific source. ments. One main question asked of this perform- ance-measurement system is: are these circuits performing up to their specifications. (This ques- tion cannot usually be asked about shared net- works, because we typically have no information regarding their specifications, or the usage of these circuits by others. In these cases, the performance- measurement system focuses on whether the EOS requirements are being met.) It is clear that neither active nor passive meas- urements, by themselves, are sufficient to evaluate the circuits. The passive measurements only show flow if there is user demand. So if there is no user demand for some period (perhaps due to a prob- lem with the user equipment or software), the pas- sive measurements will show low utilization. But the circuit may very well be fine. The active meas- Fig. 25. Packet loss. 318 J. Loiacono et al. / Computer Networks 46 (2004) 299–320 4. Integrated displays In those cases where private circuits are used, the circuits are purchased to meet the require- Example 3. The last example involves a source problem. In Fig. 25 we see a high packet loss from one source to BU, while the other sources have low packet loss. The inference is that the problem is Fig. 26. Sample inte urements, conversely, compete for bandwidth with the user flows. So during periods of high user flow, the active measurements will be lower. The solu- tion is to combine the active and passive measure- ments. The intent is to add the active and passive flows for the same period. In principle, the sum should indicate the total capacity of the network at the time. Fig. 26 shows a typical integrated graph. There are several factors and difficulties in com- bining the active and passive results that should be described. One problem is that ideally, the active grated graph. ments against performance data, and forecast re- quired upgrades. Operators of individual system provides a comprehensive view of relevant performance parameters. The ENSIGHT Web site er Ne and passive measurements should be taken over the same time periods. Currently, however, the ac- tive throughput tests run for 30 s. The user flows typically last much longer. If so, the flow rate is averaged over their duration. So adjustments are made to the data before adding the passive to the active results. The first is due to recognition that the measurements are being made at different levels of the protocol stack. The active tests measure TCP payload––they are at the application protocol layer. But the user flows measure the length of the entire IP packet. So the passive measurements are ‘‘discounted’’ by 3% as an approximation for the layer 3 and 4 pro- tocol overhead. Next, it is noted that the data flows of the active tests are included in the passive measurements. So this effect is subtracted from the passive values. Finally, an ‘‘interference effect’’ is estimated and adjusted for. In this case, if a user flow lasts longer than the iperf test, it will only compete for bandwidth with the iperf test for a short percent- age of its duration. Thus its average throughput rate will be substantially unaffected by the iperf test. However, its rate DURING the iperf test WILL be affected by the iperf test, and thus be sig- nificantly lower. For example, if the user flow con- sumes the full bandwidth (100%) of a circuit for a long period, it may get only 50% during the iperf test, if both flows are a single stream––the iperf test would get the other 50% during its 30 s lifetime. The user flow�s pro-rated flow rate would still be close to 100%, since it is averaged over the full duration of the flow. If the user flow�s 100% was added to iperf�s 50%, the total would be 150% of the circuit bandwidth––an unrealistic value. So an attempt is made to estimate the interference of the iperf test on the user flow. This adjustment is made to the iperf value, to allow the user-flow portion of the graph to accurately reflect the ob- served flow. Although it is recognized that this method has inherent inaccuracies, it is hoped that it is still use- ful. The example graph in Fig. 26 shows a much flatter total result (as expected on a private circuit) than either the user flow or iperf would. It is clear that the iperf values drop when the user flow in- J. Loiacono et al. / Comput creases, and vice versa. provides several options for visualizing the per- formance data that is collected. This flexibility enables users to view data in the context that is most meaningful to them, e.g., in the context of a specific mission. In this paper we have provided insight into how the Web site can be used and the information that can be gleaned from the statistics that are collected. Finally, we are continually striving to improve the ENSIGHT tool by incorporating new features, new ways of presenting the data, and new tech- niques for collecting the data. Acknowledgment This work was partially supported by the Na- tional Aeronautics and Space Administration under contract number NAS5-01909 to Swales Aerospace, Inc. References [1] The Earth Observing System Project Science Office. Avail- able from: . [2] Aura website. Available from: . [3] NASA Integrated Services Network. Available from: . [4] Abilene Backbone Network. Available from: . [5] ENSIGHT Performance Measurements. Available from: . [6] Net-SNMP Project. Available from: . [7]––Interface to GD Graphics Library. Available from: . [8] Perl DBI. Available from: . [9] Flow-tools. Available from: . [10] Apache http Server Project. Available from: . [11] Cisco White Paper: NetFlow Services and Applications. 1978. He has been employed with CSC since 1981, performing software engineering with a recent focus on network engineer- ing and management. In an initial assignment to NASA he performed work on satellite orbit determination systems. He has supported NASA wide-area networking since 1998. Re- cently he has supported the Earth Observing System networks with an emphasis on network performance monitoring. Andy Germain is a network engineer and current task manager for the Swales Aerospace, Inc. EOS Network Engineering task at NASA Goddard Space Flight Center. He began his ca- reer in electronic communications with 320 J. Loiacono et al. / Computer Networks 46 (2004) 299–320 Available from: . [12] MRTG: The Multi Router Traffic Grapher. Available from: . [13] RRDtool––Round Robin Database Tool. Available from: . [14] Iperf Version 1.7.0. Available from: . [15] NLANR. Available from: . [16] ttcp/nttcp/nuttcp/iperf versions. Available from: . [17] TCPWatch. Available from: . [18] Committee on Earth Observation Satellites home page. Available from: . [19] ESTO home page. Available from: . [20] NASA�s Computational Technologies Project home page. Available from: . Joe Loiacono is a senior engineer with the Computer Sciences Corporation (CSC) at NASA Goddard Space Flight Center. Along with Andy Ger- main and Jeff Smith, he is one of the primary designers and developers of the ENSIGHT performance-measure- ment system. He received a B.A. in Mathematics from the University of Maryland in 1976 and a M.A. from the University of California, San Diego, in Internet infrastructure. Before his NCS assignment he worked for over 6 years on NASA�s Earth Science Data and Information System (ESDIS) Project at NASA Goddard Space Flight Center as the ESDIS Science-Data Network Manager and International Network Liaison. ufacturing TVs and stereo compo- nents, and then for NCR Corporation, developing microcomputer and communications hardware and software. He has been working as a NASA contractor since 1986, initially on the Central Data Handling Facility for the Upper Atmosphere Research Satellite with Science Systems and Applications, Inc., and since 1992 on the network for the Earth Observation System with Swales Aerospace, Inc. Jeff Smith is currently working on de- tail as a NASA representative to the Department of Homeland Security�s National Communication System (NCS, He is working on a proposal to investigate the feasibility of an Internet Prioriti- zation Service focusing on VoIP tech- nologies with respect to the convergence of the Public Switched Telephone Network (PSTN) onto the a ham radio license at age 13. He graduated from Rensselaer Polytech- nic Institute, in Troy, NY, with a B.S. in Electrical Engineering in 1973. He first worked for GTE Sylvania, man- Network performance measurements for NASA rsquo s Earth Observation System Introduction Passive measurements Object measurements Flow measurements Live monitoring Active measurements Overview of source nodes and active-measurement tests Sink nodes Navigating the active-measurement Web pages Destination-page display Troubleshooting using active-measurements results Integrated displays Acknowledgment References