Optimising TCP/IP connectivity:
Author: Oscar Hellström
Supervisor: Carl Magnus Olsson
Bachelor's Thesis, Software Engineering and Management, June 2007
IT University of Gothenburg
Erlang Training and Consulting Ltd.
Abstract
With the increased use of network-enabled applications and server-hosted software systems, scalability with respect to network connectivity is becoming an increasingly important subject. The programming language Erlang has previously been shown to be a suitable choice for creating highly available, scalable and robust telecoms systems. In this exploratory study we investigate how to optimise an Erlang system for maximum TCP/IP connectivity in terms of operating system, tuning of the operating system TCP stack and tuning of the Erlang Runtime System. The study shows how a series of benchmarks is used to evaluate the impact of these factors and how to determine the best settings for deploying and configuring an Erlang application. We conclude that the choice of operating system and the use of kernel poll both have a major impact on the scalability of the benchmarked systems.
Keywords:
Erlang, TCP/IP, Scalability
Erlang is a general-purpose programming language with built-in support for concurrency, distribution and fault tolerance [14]. The language was designed for the development of large concurrent soft real-time systems, with a focus on the telecommunication industry. Previous studies [28,29] have shown that Erlang is suitable for building robust telecoms systems, due to its high scalability, fault tolerance and soft real-time features. Today, Erlang is used for a variety of applications, many of them network-intensive. Examples of such applications are an XMPP/Jabber [31,32] server [6], an HTTP server [13] and a web framework [17]. Since network communication usually depends on the operating system (OS), it is not enough to know that Erlang scales well; the combination of the Erlang Runtime System (ERTS), the OS and the configuration of these must also scale well.
Common to most of today's network-enabled applications is the use of TCP or UDP as the transport layer in network communications. The TCP protocol is usually implemented by the OS on which the application is running. It is however possible to replace the OS TCP implementation with alternatives such as a TCP stack implemented in Erlang [30] or one implemented in Standard ML [16]. Apart from reimplementing the TCP/IP stack, it is also possible to control how an application uses the OS TCP stack through various configurable options. These settings, together with the choice of OS, affect the performance of the software system with respect to the maximum number of connections, the total throughput and the throughput per socket.
Erlang Training and Consulting develops several applications using Erlang, many of which are both concurrent and network-intensive. This paper uses Erlang Training and Consulting as an initial case study for evaluating best practices for configuring and optimising the deployment of such systems to reach maximum scalability in the form of network connectivity.
Related work
As mentioned above, an implementation of a TCP stack in Erlang exists and
is discussed in A High Performance Erlang Tcp/Ip
Stack [30].
This particular implementation was however designed to achieve fault-tolerance
through TCP connection migration rather than performance or scalability.
The study does however briefly address how scheduling is affected by a large number
of sockets, both using the OS native TCP stack and the developed TCP stack.
BEEN [21] and Tsung [5] are two similar tools for automated benchmarking and analysis. Both provide a framework for distributed benchmarking, together with an execution framework. BEEN is implemented in Java and uses the Java Virtual Machine (JVM) as a platform-independent execution environment, while Tsung uses Erlang and the ERTS as its execution platform.
Liedtke et al. describe how stochastic benchmarking is used in OS research [25]. Even though Erlang is a programming language and not an OS, the ERTS has features usually associated with an OS, such as processes, scheduling and distribution. This paper collects data from benchmarks using a stochastic user simulation model.
Besides the previously mentioned studies, Erlang has also been studied in several other ways, among them in terms of code reuse [26] and the performance impact of specific implementations of certain parts of the standard libraries [19]. This paper aims to complement these studies by evaluating how existing network-intensive Erlang applications are affected, in terms of network scalability, by different deployment configurations.
Research approach
The aim of this paper is to study how different deployment configurations of network-intensive Erlang systems affect connectivity. The goal of the study is to find best practices for deploying and configuring these systems, with a focus on network scalability. To do this, a number of different deployment environments and configurations will be evaluated through benchmarking. Several parameters will be manipulated between these benchmarks, and the results will be collected and presented through tables and graphs.
This section will describe the approach taken in this study and the main applications and tools used.
Traffic profiles
When looking at TCP optimisations there are a few different traffic profiles to
consider.
These traffic profiles have been given four rough characteristics: short
lived and long lived connections, and fast and slow
clients.
A long lived connection is a connection which is maintained for a long period of
time, while a short lived connection is closed shortly after it was
established.
A fast client is a client which sends a large amount of data over a short
period of time, while a slow client sends small amounts of data over a long
period of time.
An HTTP client would be described as a short lived, potentially fast client
while an XMPP/Jabber client usually is a long lived and slow client.
Test tool
Tsung [5] is a distributed load testing tool which was primarily
developed to load test XMPP servers but now also supports several other
protocols.
Tsung uses a stochastic model for user simulation, with user event
distribution based on a Poisson process.
This model is further discussed in Traffic Model and Performance
Evaluation of Web Servers [27].
User actions are described through different scenarios, which specify what the
user will do and also how many of the users will choose that particular
scenario.
The configuration also specifies how often a new user will connect to the
service and for how long new users will continue to connect.
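The arrival model can be illustrated with a minimal Python sketch. It is not Tsung's implementation; it only shows the standard property that a Poisson process has exponentially distributed gaps between arrivals. The function name, the rate and the duration are assumptions chosen for illustration.

```python
import random

def poisson_arrivals(rate_per_second, duration_seconds, seed=None):
    """Return arrival times (in seconds) drawn from a Poisson process.

    The gaps between consecutive arrivals are exponentially distributed
    with mean 1 / rate_per_second.
    """
    rng = random.Random(seed)
    arrivals = []
    now = rng.expovariate(rate_per_second)
    while now < duration_seconds:
        arrivals.append(now)
        now += rng.expovariate(rate_per_second)
    return arrivals

# Illustrative values only: 50 new users per second for 10 seconds.
times = poisson_arrivals(rate_per_second=50, duration_seconds=10, seed=1)
print(len(times), "simulated users arrived")
```

On average the sketch produces rate × duration arrivals, but individual runs vary, which is what makes the load stochastic rather than perfectly regular.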
Test subject
We are interested in finding the maximum number of concurrently connected
clients, not in network throughput.
Therefore, the focus of the benchmarks will be on slow clients that use
long lived connections to the server.
Ejabberd [6] is a distributed and fault tolerant Jabber/XMPP
server implemented mostly in Erlang.
Ejabberd is a network-intensive system, which can be expected to have
several thousands of users connected concurrently during normal operation.
The connected users are however not expected to generate any large amount of
traffic, and have therefore the characteristics of long lived connections and
slow clients.
Ejabberd version 1.1.3 will be deployed using Erlang R11B-4 as the main test subject in this study. The ERTS and Ejabberd will be compiled from source and deployed on each OS.
Factors
The first factor to be considered is the OS the application is executed on. The other factors likely to have an impact on the performance of the application will be evaluated through benchmarks on each operating system included in this study.
This section will describe the factors that influence the result. Hypotheses concerning the results of altering these factors will also be stated here.
Factor 1 - Operating system
The choice of OS is very likely to affect the overall performance
of the Erlang application in several ways.
Benchmarks by von Leitner [22], for example, suggest that
FreeBSD [8] scales best when allocating
sockets in an
application.
Von Leitner also shows that Linux 2.6 [9], FreeBSD and
NetBSD [10] have the same latency while establishing a large number
of connections.
These benchmarks are at the time of writing considered old and address
outdated versions of the operating systems, even if some remarks about more
recent OS versions have been added to the results.
Due to the lack of recent data, it is hard to make any hypothesis about the
performance benefits of running a particular OS.
The choice of operating systems to evaluate has been based on several factors, such as available commercial support, other benchmarks [22] and current OS knowledge. The most popular implementation of Erlang is distributed as open source and supports a wide range of operating systems. Commercial support is however only offered for a subset of these [3]. The benchmark will be performed on the following operating systems:
These operating systems are all available as open source.
It was our intention to evaluate FreeBSD 6.2, but the benchmarks could not be completed. The ERTS always terminated with a segmentation fault after some time under heavy load. The reason for these crashes has not been located, and further investigation of this would not fit within the scope of this paper.
Not only the choice of operating system but also variables such as other running services, the file systems used, swap space and scheduling, which are configurable in most operating systems, will affect the performance of the running application. Therefore minimal installations, where such an option is given by the installer, will be used, together with the default configuration of run-levels.
Factor 2 - Kernel poll
Kernel poll is a name for several techniques which replace the traditional
way of checking for
data on a collection of file descriptors with a more efficient and scalable
mechanism.
The use of kernel poll is likely to affect the performance of the system when
a large number, i.e. several thousands, of file descriptors are being used.
Each OS supports at least one implementation of the technique, and although the implementations differ they are based on the same theory. Linux 2.6 currently implements epoll [24], most BSD variants implement kqueue [23] and Solaris implements /dev/poll [2]. There exist patches for the Linux kernel which add support for the Solaris /dev/poll interface, but these have been superseded by epoll. In fact, the /dev/poll interface is marked obsolete and has been replaced by Event Ports [15] in Solaris 10. Event Ports are however currently not supported by the ERTS, while the /dev/poll interface is.
Hypothesis 1: The use of kernel poll will affect the number of concurrent connections positively. This is because the use of kernel poll will decrease the CPU time spent while reading from several thousands of file descriptors.
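As an illustration of the mechanism, and not of how the ERTS uses it, the following Python sketch relies on the standard `selectors` module, whose `DefaultSelector` resolves to epoll, kqueue or /dev/poll where the platform provides them. The function name `poll_once` and the label `client-1` are assumptions made for the example.

```python
import selectors
import socket

def poll_once():
    """Register one socket with the platform's best poller and read from it."""
    # DefaultSelector resolves to the most efficient mechanism available:
    # epoll on Linux, kqueue on BSD, /dev/poll on Solaris, else poll/select.
    selector = selectors.DefaultSelector()

    # A connected socket pair stands in for one client connection.
    server_side, client_side = socket.socketpair()
    received = []
    try:
        selector.register(server_side, selectors.EVENT_READ, data="client-1")
        client_side.sendall(b"hello")

        # One call reports every registered descriptor that is readable.
        for key, _events in selector.select(timeout=1.0):
            received.append((key.data, key.fileobj.recv(1024)))
    finally:
        selector.close()
        server_side.close()
        client_side.close()
    return received

print("polling mechanism:", selectors.DefaultSelector.__name__)
print(poll_once())
```

The point of the design is that descriptors are registered once and the kernel reports only those that are ready, instead of the application handing over the full set of descriptors on every call.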
Factor 3 - Asynchronous threads
Asynchronous threads in the ERTS are a way of preventing blocking calls from
stalling the scheduling of Erlang processes while the calling thread is
blocked.
The blocking calls are done in separate OS threads, allowing the
emulator to schedule the Erlang processes that are not doing the blocking calls.
If there is no asynchronous thread available, the job request will be queued
until an asynchronous thread becomes idle.
Hypothesis 2: The use of asynchronous threads will increase the number of possible concurrent connections, since the virtual machine will not be blocked waiting for I/O.
Hypothesis 3: The use of too many asynchronous threads will decrease the number of possible concurrent connections, since more time will be spent on context switching by the OS.
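The queueing behaviour of a fixed-size thread pool can be sketched in Python as an analogy. This is not the ERTS implementation; `blocking_call` is a hypothetical stand-in for a blocking I/O operation, and the pool size and job count are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_call(job_id, seconds=0.1):
    """Stand-in for a blocking I/O operation (hypothetical workload)."""
    time.sleep(seconds)
    return job_id

def run_jobs(pool_size, job_count):
    """Run job_count blocking calls on a pool of pool_size threads.

    Returns the job results and the elapsed wall-clock time in seconds.
    """
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Jobs beyond pool_size wait in a queue until a thread becomes idle.
        results = list(pool.map(blocking_call, range(job_count)))
    return results, time.monotonic() - started

results, elapsed = run_jobs(pool_size=2, job_count=4)
print(results, f"{elapsed:.2f} s")
```

With two threads and four jobs, the last two jobs cannot start until the first two finish, so the run takes roughly two job durations. The thread submitting the jobs is never blocked by an individual call, which mirrors the role the text describes for the emulator.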
Factor 4 - TCP window size
The TCP window size controls how much data can be put
on the network before the sender requires an acknowledgement of the data.
Window scaling, as described in RFC 1323 [20], is a way to use a larger
window size to improve performance on high bandwidth networks.
Wechta, Eberlein and Halsall [33] claim that large
windows help utilise long, high bandwidth networks, which have a high
bandwidth-delay product, but do not help in a local area network (LAN)
environment, since there is no need to have many packets in transit.
It is possible to set the maximum window size, by means of send and
receive buffers for the TCP socket.
These values are however also controlled by the application setting up the
socket.
While the application can specify the preferred buffer sizes when setting up a
socket, the minimum and maximum sizes configured by the OS usually cannot be
overridden.
In some operating systems it is also possible to configure the TCP stack to always negotiate
window scaling, which would allow, but not enforce, large TCP windows.
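The interplay between the application's request and the limits set by the OS can be observed with the standard socket options `SO_SNDBUF` and `SO_RCVBUF`. The Python sketch below is only an illustration; the helper name `request_buffers` is an assumption, and the values reported depend on the OS and its configuration.

```python
import socket

def request_buffers(send_bytes, recv_bytes):
    """Ask the OS for specific buffer sizes and report what it granted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # The application states its preference...
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_bytes)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_bytes)
        # ...and the OS decides what the socket actually gets.
        granted_send = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        granted_recv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_send, granted_recv

print(request_buffers(8192, 8192))
```

The granted values need not equal the requested ones. Linux, for example, reports twice the requested size to account for bookkeeping overhead, and requests above the configured maximum are clamped or, on some systems, rejected.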
Hypothesis 4: Window scaling will not affect the number of possible concurrently open connections. XMPP will not create large amounts of traffic and this option will thus not have much impact on the system. Furthermore, the test environment is connected to a small, fast and reliable LAN, which has been shown not to benefit from large TCP windows [33].
Hypothesis 5: Setting a small window size will potentially decrease the amount of memory used by the application. Since the application cannot override the maximum buffer size set by the OS, this will decrease memory consumption but will not affect the connectivity of the application.
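The argument that large windows only pay off on paths with a high bandwidth-delay product can be made concrete with a small calculation. The sketch below assumes round-trip times of 0.5 ms for a LAN and 100 ms for a long-distance link; both are illustrative assumptions, not measurements from this study. The 65535-byte limit follows from the 16-bit window field in the TCP header.

```python
MAX_UNSCALED_WINDOW = 65535  # largest window expressible in TCP's 16-bit field

def bandwidth_delay_product(bits_per_second, rtt_microseconds):
    """Bytes that must be in flight to keep a link fully utilised."""
    return bits_per_second * rtt_microseconds // 8_000_000

def needs_window_scaling(bits_per_second, rtt_microseconds):
    """True when the link cannot be filled with an unscaled window."""
    in_flight = bandwidth_delay_product(bits_per_second, rtt_microseconds)
    return in_flight > MAX_UNSCALED_WINDOW

# 100 Mbit/s as in the test LAN; both round-trip times are assumed values.
lan_bytes = bandwidth_delay_product(100_000_000, 500)       # 0.5 ms RTT
wan_bytes = bandwidth_delay_product(100_000_000, 100_000)   # 100 ms RTT
print(lan_bytes, needs_window_scaling(100_000_000, 500))
print(wan_bytes, needs_window_scaling(100_000_000, 100_000))
```

Under these assumptions the LAN needs only about 6 KB in flight, well within an unscaled window, while the long link needs about 1.25 MB and therefore benefits from window scaling.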
Test environment
This section describes the test subject and its configuration, together with the configuration of the test tool and environment.
Test subject configuration
The test subject, see Section 2.3, has been set up on rather
low-end hardware.
The Ejabberd node is running on an Intel® Celeron® CPU running at 2.40 GHz with 512 MB of RAM.
The machines are connected to a switched 100Mbit LAN by a SiS900 Fast Ethernet
Controller.
Tsung deployment
The Tsung cluster is running on two machines, identical to the test subject.
The two machines are connected to the test subject through a switched LAN and
apply load to a single machine by adding 50 clients per second.
Figure 1 shows a diagram of the deployment.
The clients choose between different scenarios: 20 percent of the clients are
inactive, sending single chat messages with large gaps; 60 percent are idle,
just staying connected and receiving presence updates from their contacts; and 20
percent are active clients, chatting with other online clients.
Figure 2 shows a part of the inactive scenario
used in the study.
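Separately from the Tsung configuration itself, the scenario mix described above can be sketched as a weighted random choice per simulated user. This Python sketch is an illustration only; the names `SCENARIO_WEIGHTS`, `assign_scenarios` and `scenario_counts` are assumptions and do not correspond to Tsung's configuration format.

```python
import random

# Scenario shares as described in the text: 20% inactive, 60% idle, 20% active.
SCENARIO_WEIGHTS = {"inactive": 20, "idle": 60, "active": 20}

def assign_scenarios(user_count, seed=None):
    """Pick a scenario for each simulated user according to the weights."""
    rng = random.Random(seed)
    names = list(SCENARIO_WEIGHTS)
    weights = list(SCENARIO_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=user_count)

def scenario_counts(assignments):
    """Count how many users ended up in each scenario."""
    return {name: assignments.count(name) for name in SCENARIO_WEIGHTS}

# 50 clients per second for one minute gives 3000 simulated users.
print(scenario_counts(assign_scenarios(50 * 60, seed=1)))
```

Because each user chooses independently, the observed shares fluctuate around 20/60/20 rather than matching them exactly.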
File systems
The file systems used are the default choice when installing the various
operating systems.
No tuning of hard drive access has been made after installation.
The file systems will affect the test subject's ability to log efficiently.
Disk storage is also used for offline message storage.
The file systems used:
This section describes the results from the benchmarks. The results have been collected using Tsung's logs. Observations of CPU time, and of system versus user space CPU, made through the tools supplied with each operating system, have also been recorded where applicable.
Common observations
The overall user connection rate is very much the same on all operating
systems and settings since this is controlled by the Tsung scenarios.
A general connection count curve is shown in Figure 3 and a
general error rate curve is shown in Figure 4.
The error rate curve plots all errors a client can experience throughout a
scenario.
The curves look more or less the same for all benchmarks, with the difference
being the slope of the curve and the number of time-outs in the error rate.
Without optimisations
Table 1 shows the number of concurrent connections with the
OS and ERTS without optimisations.
The ERTS has been given options to allow for more processes, more
ETS tables and more ports.
The OS has been configured to allow more file descriptors (and sockets in some
cases) than the default settings.
These settings are not really optimisations, but they need to be changed to
allow the Erlang process to allocate as many resources as it needs.
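The per-process file descriptor limit mentioned above can be inspected on Unix-like systems through the standard resource limits interface. The Python sketch below only shows how such a limit is read and compared with a target connection count; the helper names and the `reserved` allowance are assumptions for illustration, and the sketch does not reproduce the settings used in the benchmarks.

```python
import resource

def file_descriptor_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def can_hold(connections, reserved=64):
    """Check whether the soft limit leaves room for `connections` sockets.

    `reserved` is an assumed allowance for log files, pipes and similar.
    """
    soft, _hard = file_descriptor_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= connections + reserved

print(file_descriptor_limits(), can_hold(20000))
```

Each TCP connection consumes one file descriptor, so a default soft limit in the low thousands would cap the benchmark long before CPU or memory became the bottleneck.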
Kernel poll
Table 2 shows the number of concurrent connections when
kernel poll, but no other options, is used.
The CPU time is affected by the use of the kernel poll feature. Most notably, on SuSE and NetBSD, system CPU utilisation drops dramatically compared to the benchmarks without optimisations. On Solaris, the drop in system CPU is not as large, since the system was never CPU-bound without kernel poll either.
Without kernel poll activated on SuSE and NetBSD, the total CPU utilisation is very close to 100% when the peak of concurrent connections is reached. The high percentage of system CPU leaves very little to user space and it is therefore hard to record whether the user space CPU is also affected by kernel poll.
Asynchronous threads
Four different settings for asynchronous threads have been evaluated in the
benchmarks: 50 threads in the pool without kernel poll, and 25, 50 and 75
threads in the pool with kernel poll activated.
Table 4 shows the number of concurrent connections with
asynchronous threads without kernel poll while Table 5
shows the number of concurrent connections with asynchronous threads and kernel
poll.
TCP window size
The benchmark was executed with both a large, 6291456 bytes, and small window
size, 8192 bytes.
The results of the benchmarks using the different window sizes are shown in
Table 6.
Error symptoms
From a client's perspective, all systems behave very similarly when their respective
limits are reached for each setting.
Connection attempts made to the test subject time out, while no reset
connections are observed.
On SuSE and NetBSD, the test subject behaves almost identically. When the benchmark is run without kernel poll, the system is clearly CPU bound, hitting 82% system CPU and 100% total CPU usage. Using kernel poll changes this: the ERTS process never reaches more than around 50% CPU, of which 3% is system CPU, and the system is no longer CPU bound. Instead the available RAM is exhausted and the system starts swapping.
On Solaris, the system is never CPU bound; instead it is always memory bound. The total CPU usage rarely exceeds 65% during the benchmark. Using kernel poll does not change this, but it does decrease system CPU slightly.
We can see that there is a significant difference in the number of concurrent connections in the different benchmarks. There is also a clear difference between the operating systems, where SuSE always performs the best. The NetBSD and Solaris results are similar without any optimisations, while NetBSD performs better than Solaris when kernel poll is used. The differences in results for the asynchronous threads and TCP window sizes are quite small compared to the increase in performance gained from kernel poll. It could also be argued that these small differences are due to disturbances during the benchmarks.
Choice of operating system
The results of the benchmarks show that the choice of operating system
has an impact on the scalability of network-intensive Erlang systems.
The best case SuSE benchmark compared to the best case Solaris benchmark shows
that SuSE is able to maintain 31% more connections than Solaris.
Comparing the best case SuSE to the best case NetBSD shows a difference of
14%. Finally, comparing the best results from NetBSD to the best results from
Solaris shows a difference of 15%.
The performance of SuSE is highly dependent on the Linux kernel, and it is
therefore possible that other distributions of Linux perform equally well.
It is also possible that fine-tuning the operating systems even further could improve the results. It is however hard to predict how significant this improvement would be, as this would be dependent on the researcher's in-depth knowledge of operating systems in general and also the particular operating systems being studied.
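The three percentages quoted in this section can be checked against each other: if SuSE maintains 14% more connections than NetBSD and NetBSD 15% more than Solaris, the two ratios should multiply to roughly the reported SuSE over Solaris figure. The short Python check below uses only the percentages from the text; the helper name `chained_advantage` is an assumption.

```python
def chained_advantage(*relative_gains):
    """Combine successive 'x% more than' figures into one overall gain."""
    ratio = 1.0
    for gain in relative_gains:
        ratio *= 1.0 + gain
    return ratio - 1.0

# SuSE over NetBSD (14%) chained with NetBSD over Solaris (15%).
suse_over_solaris = chained_advantage(0.14, 0.15)
print(f"{suse_over_solaris:.1%}")
```

The result of about 31% agrees with the reported best case difference between SuSE and Solaris, so the three figures are mutually consistent.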
ERTS optimisations
Adjusting ERTS settings produces the same effect on SuSE and NetBSD.
The major factor here is the use of kernel poll.
On Solaris, this is however not true.
Solaris was not CPU-bound without kernel poll activated,
which was the case with SuSE and NetBSD. Instead, when running Solaris, the system
was always memory bound.
This indicates that Solaris has a more efficient implementation of the
traditional polling mechanism than SuSE and NetBSD.
The relatively small impact of kernel poll on Solaris might also be due to the relatively low number of connected users in this benchmark. The performance gain from using an efficient polling mechanism should increase with the number of connections. Also, the /dev/poll interface has been deprecated in favour of Event Ports [15]. Support for the new Solaris Event Ports is planned to be implemented some time in the future [4], which could have an impact on the performance on Solaris.
Asynchronous threads
The use of asynchronous threads did not show any impact on results of the
benchmarks in this study, even though the operations in the test subject are
very I/O-intensive.
The use of asynchronous threads is however likely to have an impact on
performance if a linked-in driver were to utilise the thread pool more
extensively.
One way to test this would be to use a specialised test subject,
implemented to benefit from asynchronous threads.
TCP optimisations
Tuning of the TCP stack has very little impact on performance for an application
where most clients are passive.
No exhaustive measurement of network throughput has been documented for this
application, but the amount of data sent by the clients is too small to benefit
much from this kind of optimisation.
Another test subject, e.g. one serving fast clients such as an HTTP or FTP
server, is more likely to benefit from this kind of optimisation.
Furthermore, the test environment did not allow for any thorough evaluation of different traffic profiles in combination with network characteristics. To complement this study, deeper investigations of network performance and optimisations for certain networks could be done. For instance, doing the same benchmarks on unreliable links could show large differences in the impact of any TCP optimisations. This is however not specific to Erlang systems.
Benchmark environment
It would be impossible to create a completely fair benchmarking environment,
since certain aspects of the setup will benefit one or the other test subject.
Pre-compiled software is optimised for the architecture it was compiled
for, and the level of optimisation can also vary depending on the compiler and
the options used.
Pre-compiled OS kernels are often optimised for a subset of the instructions
available on modern platforms, to allow them to run on a number of platforms.
This potentially makes them perform suboptimally on any platform different enough
from the one compiled for.
These issues are however the same for all kernels in these benchmarks, and are part
of the reason why custom kernels have been used in all cases.
Not just kernels, but also programs and libraries come pre-compiled for all systems benchmarked. The way these packages are compiled and linked is known to affect performance. Von Leitner [22], for instance, shows how static linking can halve the latency of certain operations.
Hardware and benchmark results
The fact that the benchmarks were executed on rather low-end hardware is
likely to have had an impact on the results.
In particular, the improved polling mechanisms are designed to scale very well
and there are likely to have been larger differences when polling on more than
20000 file descriptors.
Also, when kernel polling is used, the available RAM is exhausted.
More memory would potentially avoid swapping which is very likely to increase
performance.
Symmetric multiprocessing (SMP) support is a relatively new feature in the ERTS, which is supported on most of the operating systems Erlang supports. The benchmarks in this paper were performed on a single processor system, without support for hyper-threading. With the increased availability of multi-core and hyper-threading enabled CPUs it would also be interesting to benchmark SMP systems.
Conclusions
In this study we wanted to evaluate best practices for deploying network-intensive Erlang applications. We have shown that in these benchmarks, the choice of OS and some ERTS configurations have a considerable impact on performance. Considering the results of the benchmarks, we can argue that for low-end systems it is favourable to use SuSE 9.3 to reach maximum scalability. SuSE is at best able to maintain 14% more connections than NetBSD and 31% more connections than Solaris.
On SuSE and NetBSD the polling mechanism used has a large impact on the CPU usage, while it does not have such a large impact on Solaris. The fact that Solaris is not CPU bound without the use of kernel poll indicates that Solaris has a more efficient implementation of the traditional polling mechanisms. Furthermore, this shows that the use of kernel poll has a positive effect on network scalability, but the actual impact is dependent on the particular implementation of kernel poll. Looking at the CPU usage with and without kernel poll, see Section 5.3.1, we can conclude that on SuSE and NetBSD, kernel poll is crucial to the scalability of network-intensive Erlang systems.
Adjusting TCP stack settings does not show any significant impact on the results of the benchmarks in this study. TCP optimisations are likely to make more sense in larger networks, which might have poor links or be crowded, but this falls outside the scope of this study. Performance optimisations in the TCP protocol have been discussed in several other studies, among them Impact of topology and choice of TCP window size on the performance of switched LANs [33] and Performance Impacts of Multi-Scaling in Wide-Area TCP/IP Traffic [18].
Furthermore, a contribution is made with this paper by examining the validity of the five hypotheses. Kernel poll significantly decreases CPU utilisation in the application due to a scalable way of reading from several thousand file descriptors. Even though the impact is not as large on Solaris, it is still notable. This validates Hypothesis 1, which was quite expected based on previous studies and reports [23,1].
The use of asynchronous threads does not increase the number of concurrently connected users notably, which invalidates Hypothesis 2. This is probably a result of the fact that the test subject does not send or receive large amounts of data and is therefore not blocked for any significant period of time while doing I/O. Also, there are probably not enough CPU intensive operations that could be scheduled while waiting for I/O since most operations are not particularly CPU heavy.
Using many asynchronous threads cannot be shown to have an effect on the number of concurrently connected users, which invalidates Hypothesis 3. One likely reason why we cannot see any impact of the thread pool is that the asynchronous thread pool is not heavily used by the I/O driver. Therefore all threads are sleeping and no context switching in the OS is necessary.
No impact on performance can be seen from the configuration of TCP window sizes, which validates both Hypothesis 4 and 5. The amount of data sent and received by the clients is likely too small to have an impact on network performance, particularly on the reliable LAN used in the test environment.
Acknowledgements
This work has been sponsored by Erlang Training and Consulting Ltd.