How we fine-tuned HAProxy to achieve 2,000,000 concurrent SSL connections

If you look closely at the screenshot above, you will find two important pieces of information:

  1. This machine has 2.38 million TCP connections established, and
  2. The amount of RAM being used is around 48 Gigabytes.

Pretty impressive, right? What would be even more impressive is if someone provided the configuration pieces and the tunings required to achieve this kind of scale on a single HAProxy machine. Well, that's exactly what I'll do in this post ;)

This is the final part of the multi-part series on load testing HAProxy. If you have the time, I would recommend reading the first two parts of the series first. They will help you get a hang of the kernel-level tunings required on all the machines in this setup.

Load Testing HAProxy (Part 1)

Load testing? HAProxy? If all this sounds Greek to you, don't worry. I will provide inline links to read up on what… medium.com

Load Testing HAProxy (Part 2)

This is the second part of the 3 part series on performance testing of the famous TCP load balancer and reverse proxy… medium.com

There are a lot of small components that helped us bring the whole setup together and achieve these numbers.

Before telling you the final HAProxy configuration we used (if you're really impatient, you can scroll to the bottom), I want to build up to it by walking you through our thinking.

What we wanted to test

The component we wanted to test was HAProxy version 1.6. We are using it in production right now on 4 core, 30 Gig machines. However, all of that connectivity is non-SSL.

We wanted to test two things out of this exercise:

  1. The CPU percentage increase when we shift the entire load from non-SSL to SSL connections. CPU usage should definitely rise, owing to the longer 5-way handshake and then the packet encryption.
  2. Second, we wanted to test the limits of our current production setup in terms of the number of requests and the maximum number of concurrent connections that can be supported before performance starts to degrade.

We needed the first part because of a major feature rollout in full swing that requires communication over SSL. We needed the second part so that we could reduce the amount of hardware dedicated to HAProxy machines in production.

The components involved

  • Multiple client machines to stress the HAProxy.
  • A single HAProxy version 1.6 machine in various configurations

    * 4 core, 30 Gig

    * 16 core, 30 Gig

    * 16 core, 64 Gig

  • Backend servers that would help support all these concurrent connections.

HTTP and MQTT

If you have read the first article in this series, you should know that our entire infrastructure supports two protocols:

  • HTTP and
  • MQTT.

In our stack, we don't use HTTP 2.0 and hence don't have the functionality of persistent connections over HTTP. So in production, the maximum number of TCP connections we see on a single HAProxy machine is somewhere around (2 * 150k) (inbound + outbound). Although the number of concurrent connections is rather low, the number of requests per second is quite high.

MQTT, on the other hand, is a completely different mode of communication. It offers great quality-of-service parameters and persistent connectivity as well, so continuous two-way communication can happen over an MQTT channel. As for HAProxy supporting MQTT (underlying TCP) connections, we see somewhere around 600-700k TCP connections at peak hour on a single machine.

We wanted to do a load test that would give us precise results for both HTTP- and MQTT-based connections.

There are a lot of tools that help us load test an HTTP server easily, and many of them provide advanced functionality such as summarized results, converting text-based results into graphs, and so on. However, we could not find any stress-testing tool for MQTT. We do have a tool we developed ourselves, but it was not stable enough to sustain this kind of load in the time frame we had.

So we decided to go with load-testing clients for HTTP and simulate the MQTT setup using the same ;) Interesting, right?

Well, keep reading.

The initial setup

This is going to be a long post, as I will be providing a lot of details that I think would be genuinely helpful to anyone doing similar load testing or fine-tuning.

  • Initially, we took a 16 core, 30 Gig machine for setting up HAProxy. We did not go with our current production setup because we thought the CPU hit from SSL termination on the HAProxy end would be tremendous.
  • For the server end, we went with a simple NodeJs server that replies with pong on receiving a ping request.
  • As for the client, we started out with Apache Bench. The reason we settled on ab was that it is a well-known and stable tool for testing HTTP endpoints, and it also provides beautiful summarized results, which would help us a lot.

The ab tool provides a lot of interesting parameters that we used for our load test, such as:

  • -c, concurrency Specifies the number of concurrent requests that would hit the server.
  • -n, no. of requests As the name suggests, specifies the total number of requests for the current load run.
  • -p, POST file Contains the body of the POST request (if that is what you want to test).

If you look at these parameters closely, you will find that a lot of permutations are possible by tweaking all three. A sample ab request would look like this:

ab -S -p post_smaller.txt -T application/json -q -n 100000 -c 3000 http://test.haproxy.in:80/ping

A sample result of such a request looks like this:

The numbers we were interested in were:

  • 99 percentile latency.
  • Time per request.
  • No. of failed requests.
  • Requests per second.

The biggest problem with ab is that it does not provide a parameter to control the number of requests per second. We had to tweak the concurrency level to get our desired requests per second, and this led to a lot of trial and error.
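For what it's worth, that tuning loop can be scripted. Here is a small shell sketch (not from our original runs; the concurrency values are illustrative) that sweeps -c and pulls the achieved rate out of ab's summary:

for c in 500 1000 2000 3000 4000; do
  echo "concurrency: $c"
  # ab prints an overall "Requests per second:" line in its summary
  ab -S -q -n 100000 -c $c -p post_smaller.txt -T application/json \
    http://test.haproxy.in:80/ping | grep "Requests per second"
done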

The almighty graph

We could not just perform multiple load runs at random and keep collecting results, because that would not give us any meaningful information. We had to perform these tests in some specific way to get meaningful results out of them. So we followed this graph:

This graph states that up to a certain point, if we keep increasing the number of requests, the latency will remain almost the same. However, beyond a certain tipping point, the latency will start to rise exponentially. It is this tipping point, for a machine or for a configuration, that we intended to measure.

Ganglia

Before providing some test results, I would like to mention Ganglia.

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.

Look at the following screenshot of one of our machines to get an idea of what Ganglia is and what sort of information it provides about the underlying machine.

Pretty interesting, eh?

Moving on, we constantly monitored Ganglia for our HAProxy machine to track some important things; a quick way to spot-check the same numbers from a shell is sketched after this list.

  1. TCP established This tells us the total number of TCP connections established on the system. NOTE: this is the sum of inbound as well as outbound connections.
  2. Packets sent and received We wanted to see the total number of TCP packets being sent and received by our HAProxy machine.
  3. Bytes sent and received This shows us the total data sent and received by the machine.
  4. Memory The amount of RAM being used over time.
  5. Network The network bandwidth consumption due to the packets being sent over the wire.
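If you don't have Ganglia set up, most of these counters can be spot-checked straight from the shell with standard Linux tools (a rough sketch; interface names and output formats vary by system):

ss -s                # summary stats, including established TCP connection counts
cat /proc/net/dev    # per-interface packet and byte counters
free -g              # RAM usage in gigabytes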

Following are the known limits, found via previous tests and production numbers, that we wanted to achieve via our load test:

700k TCP established connections,

50k packets sent, 60k packets received,

10-15 MB of bytes sent and received,

14-15 Gig of memory at peak,

7 MB of network.

ALL these values are on a per second basis

HAProxy Nbproc

Initially, when we began load testing HAProxy, we found that with SSL the CPU was being hit quite early on, yet the requests per second were very low. On investigating with the top command, we found that HAProxy was using only 1 core, while we had 15 more cores to spare.

About 10 minutes of googling led us to an interesting setting in HAProxy that allows it to use multiple cores.

It's called nbproc, and to better understand what it is and how to set it, check out this article:

http://blog.onefellow.com/post/82478335338/haproxy-mapping-process-to-cpu-core-for-maximum
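For a flavour of what this looks like, here is a minimal sketch of the relevant global-section lines (illustrative values, not our actual file): nbproc forks several HAProxy processes, and cpu-map pins each process to a specific core.

global
    nbproc 4
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3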

Tuning this setting formed the base of our load-testing strategy moving forward, because the ability of HAProxy to use multiple cores gave us the power to form multiple combinations for our load-testing suite.

Load Testing with AB

When we started out on our load-testing journey, we were not clear on the things we should be measuring and what we needed to achieve.

Initially we had only one goal in mind, and that was to find the tipping point purely by varying all the parameters mentioned below.

I maintained a table of the results of the various load tests we ran. All in all, I performed over 500 test runs to get to the ultimate result. As you can clearly see, there are a lot of moving parts to each and every test.

Single Client issues

We started seeing that the client was becoming a bottleneck as we kept increasing our requests per second. Apache Bench uses a single core, and from the documentation it is evident that it does not provide any feature for using multiple cores.

To run multiple clients efficiently, we found an interesting Linux utility called parallel (GNU Parallel). As the name suggests, it helps you run multiple commands in parallel and utilizes multiple cores. Exactly what we wanted.

Have a look at a sample command that runs multiple clients using parallel.

cat hosts.txt | parallel 'ab -S -p post_smaller.txt -T application/json -n 100000 -c 3000 {}'

$ cat hosts.txt
http://test.haproxy.in:80/ping
http://test.haproxy.in:80/ping
http://test.haproxy.in:80/ping

The above command would run 3 ab clients hitting the same URL. This helped us remove the client side bottleneck.

The Sleep and Times parameter

We talked about some parameters in Ganglia that we wanted to track. Let's discuss them one by one.

  1. packets sent and received This can be simulated by sending some data as part of the POST request. This would also help us generate some network traffic, as well as the bytes sent and received portions in Ganglia.
  2. tcp_established This is something that took us a long, long time to actually simulate in our scenario. Imagine if a single ping request took about a second; it would take about 700k requests per second to reach our tcp_established milestone.

    Now this number might seem easier to achieve on production, but it was impossible to generate in our scenario.

What did we do, you might ask? We introduced a sleep parameter in our POST call that specifies the number of milliseconds the server needs to sleep before sending out a response. This would simulate a long-running request on production. So now, say we have a sleep of about 20 minutes (yep), it would take only around 583 requests per second to reach the 700k mark.

Additionally, we also introduced another parameter in our POST calls to the HAProxy: the times parameter. It specified the number of times the server should write a response on the TCP connection before terminating it. This helped us simulate even more data transferred over the wire.
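These two parameters travelled as HTTP headers in our setup; you can see them later in the Vegeta commands (-header="sleep:30000", -header="times:2"), and the backend code at the end of this post reads headers["sleep"] and headers["times"]. A hand-rolled request against such a server might look like this (illustrative only, against the same test endpoint used throughout):

curl -X POST http://test.haproxy.in:80/ping \
  -H "Content-Type: application/json" \
  -H "sleep: 30000" \
  -H "times: 2" \
  --data @post_smaller.txt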

Issues with apache bench

Although we got a lot of results with apache bench, we also faced a lot of issues along the way. I won't be mentioning all of them here, as they are not important for this post, and I'll be introducing another client shortly.

We were pretty content with the numbers we were getting out of apache bench, but at one point, generating the required number of TCP connections just became impossible. Somehow apache bench was not handling the sleep parameter we had introduced properly, and was not scaling for us.

Although running multiple ab clients on a single machine was sorted out by the parallel utility, running this setup across multiple client machines was still a pain for us. I had not heard of the pdsh utility by then and was practically stuck.

Also, we were not focusing on timeouts either. There is a default set of timeouts on HAProxy, the ab client and the server, and we had completely ignored these. We figured out a lot of things along the way and got much more organized about how to go about testing.

We used to talk about the tipping point graph but we deviated a lot from it as time went on. Meaningful results, however, could only be found by focusing on that.

With apache bench, a point came where the number of TCP connections was not increasing. We had around 40-45 clients running on 5-6 different client boxes but were not able to achieve the scale we wanted. Theoretically, the number of TCP connections should have jumped as we increased the sleep time, but it wasn't working for us.

Enter Vegeta

I was searching for other load-testing tools that might be more scalable and functionally better than apache bench when I came across Vegeta.

From my personal experience, Vegeta is extremely scalable and provides much better functionality than apache bench. In our load test, a single Vegeta client was able to produce a level of throughput equivalent to 15 apache bench clients.

Moving forward, I will be providing load test results measured using Vegeta itself.

Load Testing with Vegeta

First, have a look at the command that we used to run a single Vegeta client. Interestingly, the command to put load on the backend servers is called attack :p

echo "POST //test.haproxy.in:443/ping" | vegeta -cpus=32 attack -duration=10m -header="sleep:30000" -body=post_smaller.txt -rate=2000 -workers=500 | tee reports.bin | vegeta report

Just love the parameters provided by Vegeta. Let’s have a look at some of these below.

  1. -cpus=32 Specifies the number of cores to be used by this client. We had to expand our client machines to 32 core, 64 Gig because of the amount of load to be generated. If you look closely above, the rate isn't much. But it becomes difficult to sustain such a load when a lot of connections are in a sleep state from the server end.
  2. -duration=10m I guess this is self explanatory. If you don’t specify any duration, the test will run forever.
  3. -rate=2000 The number of requests per second.

So as you can see above, we reached a hefty 32k requests per second on a mere 4 core machine. If you remember the tipping point graph, you will be able to notice it clearly enough above. So the tipping point in this case is 31.5k non-SSL requests.

Have a look at some more results from the load test.

16k SSL connections is also not bad at all. Please note that at this point in our load testing journey, we had to start from scratch because we had adopted a new client and it was giving us way better results than ab. So we had to do a lot of stuff again.

An increase in the number of cores led to an increase in the number of requests per second that the machine can take before the CPU limit is hit.

We found that there wasn't a substantial increase in the number of requests per second when going from 8 to 16 cores. Also, if we finally decided to go with an 8 core machine in production, we would never allocate all of the cores to HAProxy, or to any other process for that matter. So we decided to perform some tests with 6 cores as well, to see whether the numbers were acceptable.

Not bad.

Introducing the sleep

We were pretty satisfied with our load test results so far. However, they did not simulate the real production scenario. That only happened once we introduced a sleep time, which had been absent from our tests until now.

echo "POST //test.haproxy.in:443/ping" | vegeta -cpus=32 attack -duration=10m -header="sleep:1000" -body=post_smaller.txt-rate=2000 -workers=500 | tee reports.bin | vegeta report

So a sleep time of 1000 milliseconds would lead to the server sleeping for x milliseconds, where 0 < x < 1000 and x is selected randomly. So on average, the above load test will give a latency of ≥ 500 ms.

The numbers in the last cell represent

TCP established, Packets Rec, Packets Sent

respectively. As you can clearly see, the max requests per second that the 6 core machine can support has decreased from 20k to 8k. Clearly, the sleep has its impact, and that impact is the increase in the number of TCP connections established. This is however nowhere near the 700k mark that we set out to achieve.

Milestone #1

How do we increase the number of TCP connections? Simple: we keep increasing the sleep time, and they should rise. We kept playing around with the sleep time and stopped at 60 seconds, which would mean an average latency of around 30 seconds.
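As a back-of-the-envelope sanity check (my own, not from the original analysis): by Little's law, the number of concurrent connections is roughly the request rate times the average connection lifetime. With 8k requests per second and an average sleep of 30 seconds:

8000 req/s * 30 s ≈ 240k one-way connections ≈ 480k counting inbound plus outbound

which is the right ballpark for the numbers that follow (fewer in practice, since half the calls were timing out, as you'll see shortly).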

There is an interesting result parameter that Vegeta provides: the percentage of successful requests. We saw that with the above sleep time, only 50% of the calls were succeeding. See the results below.

We achieved a whopping 400k TCP established connections with 8k requests per second and a 60000 ms sleep time. The R in 60000R means Random.

The first real discovery we made was that Vegeta has a default call timeout of 30 seconds, which explained why 50% of our calls were failing. So we increased it to about 70s for our further tests and kept varying it as and when the need arose.
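Vegeta exposes this as the -timeout flag on attack (it shows up again as -timeout=2h in the pdsh script near the end of this post). Bumping it looks like this (rate and duration illustrative):

echo "POST https://test.haproxy.in:443/ping" | vegeta -cpus=32 attack -duration=10m -timeout=70s -header="sleep:60000" -body=post_smaller.txt -rate=2000 -workers=500 | tee reports.bin | vegeta report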

We hit the 700k mark easily after tweaking the timeout value on the client end. The only problem was that these numbers were not consistent; they were just peaks. The system would hit a peak of 600k or 700k but did not stay there for very long.

However, we wanted something like this:

This shows a steady state where 780k connections are maintained. If you look closely at the stats above, the number of requests per second is very high. In production, however, we have a much lower number of requests (around 300) on a single HAProxy machine.

We were sure that if we drastically reduced the number of HAProxies we have in production (to around 30, which would mean 30 * 300 ~ 9k connections per second), we would hit the machine limits on the number of TCP connections first, and not on the CPU.

So we decided to aim for 900 requests per second, 30 MB/s of network, and 2.1 million TCP established connections. We agreed on these numbers as they would be 3 times our production load on a single HAProxy.

Plus, till now we had settled on 6 cores being used by HAProxy. We wanted to test with only 3 cores, because this is what would be easiest for us to roll out on our production machines (our production machines, as mentioned before, are 4 core, 30 Gig, so rolling out changes with nbproc = 3 would be easiest).

REMEMBER: the machine we had at this point in time was a 16 core, 30 Gig machine, with 3 cores allocated to HAProxy.

Milestone #2

Now that we had the max limits on requests per second that different variations of the machine configuration could support, we only had one task left, as mentioned above.

Achieve 3X the production load, which is:

  • 900 requests per second
  • 2.1 million TCP established and
  • 30 MB/s network.

We got stuck yet again, as the TCP established count was hitting a hard wall at 220k. No matter the number of client machines or the sleep time, the number of TCP connections seemed stuck there.

Let's look at some calculations. 220k TCP established connections at 900 requests per second gives 110,000 / 900 ~= 120 seconds. I took 110k because the 220k connections include both incoming and outgoing; it's two-way.

Our suspicion that 2 minutes was a limit somewhere in the system was confirmed when we introduced logs on the HAProxy side. We could see 120000 ms as the total time for a lot of connections in the logs.

Mar 23 13:24:24 localhost haproxy[53750]: 172.168.0.232:48380 [23/Mar/2017:13:22:22.686] api~ api-backend/http31 39/0/2062/-1/122101 -1 0 - - SD-- 1714/1714/1678/35/0 0/0 {0,"",""} "POST /ping HTTP/1.1"
The 122101 at the end of the timer fields is the total session time in ms, right around the 2 minute mark. See the HAProxy documentation for the meanings of all these values.

On investigating further, we found out that NodeJs has a default request timeout of 2 minutes. Voila!

how to modify the nodejs request default timeout time?

I was using nodejs request, the default timeout of nodejs http is 120000 ms, but it is not enough for me, while my… stackoverflow.com

HTTP | Node.js v7.8.0 Documentation

The HTTP interfaces in Node.js are designed to support many features of the protocol which have been traditionally… nodejs.org
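The server-side fix boils down to raising that timeout. In Node's http module it is the server.timeout property; the backend code at the end of this post sets it to an hour. A minimal sketch:

var http = require('http');

var server = http.createServer(function(req, res) {
  res.end('pong');
});

// the default is 120000 ms (2 minutes); raise it so that
// long-sleeping requests are not killed by the server itself
server.timeout = 3600000;
server.listen(8282);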

But our happiness was apparently short-lived. At 1.3 million, the HAProxy connection count suddenly dropped to 0 and started increasing again. We soon checked dmesg, which gave us some useful kernel-level information about our HAProxy process.

Basically, the HAProxy process had run out of memory. So we decided to increase the machine RAM, and we shifted to a 16 core, 64 Gig machine with nbproc = 3. Thanks to this change, we were able to reach 2.4 million connections.
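If you suspect the same failure mode, the kernel's OOM killer leaves loud traces in the kernel log. A generic way to look for them (not our exact output):

dmesg | grep -i -E "out of memory|oom"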

Backend Code

Following is the backend server code that was being used. We had also used statsd in the server code to get consolidated numbers on the requests per second being received.

var http = require('http');
var createStatsd = require('uber-statsd-client');
qs = require('querystring');

var sdc = createStatsd({
  host: '172.168.0.134',
  port: 8125
});

var argv = process.argv;
var port = argv[2];

function randomIntInc(low, high) {
  return Math.floor(Math.random() * (high - low + 1) + low);
}

function sendResponse(res, times, old_sleep) {
  res.write('pong');
  if (times == 0) {
    res.end();
  } else {
    // wait a random interval before writing the next response
    sleep = randomIntInc(0, old_sleep + 1);
    setTimeout(sendResponse, sleep, res, times - 1, old_sleep);
  }
}

var server = http.createServer(function(req, res) {
  headers = req.headers;
  old_sleep = parseInt(headers["sleep"]);
  times = headers["times"];
  // assumption: the handler kicks off sendResponse after an initial
  // random sleep, matching the sleep/times semantics described above
  sleep = randomIntInc(0, old_sleep + 1);
  setTimeout(sendResponse, sleep, res, times, old_sleep);
});

server.timeout = 3600000;
server.listen(port);

We also had a small script to run multiple backend servers. We had 8 machines with 10 backend servers EACH (yeah!). We literally took the idea of infinite clients and infinite backend servers seriously for this load test.

counter=0
while [ $counter -le 9 ]
do
  port=$((8282+$counter))
  nodejs /opt/local/share/test-tools/HikeCLI/nodeclient/httpserver.js $port &
  echo "Server created on port " $port
  ((counter++))
done

echo "Created all servers"

Client Code

As for the client, there is a limitation of around 63k TCP connections per IP. If you are not sure about this concept, please refer to my previous article in this series.
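That per-IP cap comes from the ephemeral port range: a single source IP can open only one outbound connection per local port to a given destination. You can inspect the range, and widen it towards the ~63k usable ports, via sysctl (defaults vary by distro; this is a generic example, not our exact tuning):

cat /proc/sys/net/ipv4/ip_local_port_range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"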

So in order to achieve 2.4 million connections (two-sided, which is 1.2 million from the client machines), we needed somewhere around 20 machines. It's a real pain to run the Vegeta command on all 20 machines one by one, and even if you found a way to do that using something like csshx, you would still need something to combine the results from all the Vegeta clients.

Check out the script below.

result_file=$1

declare -a machines=("172.168.0.138" "172.168.0.141" "172.168.0.142" "172.168.0.18" "172.168.0.5" "172.168.0.122" "172.168.0.123" "172.168.0.124" "172.168.0.232" "172.168.0.244" "172.168.0.170" "172.168.0.179" "172.168.0.59" "172.168.0.68" "172.168.0.137" "172.168.0.155" "172.168.0.154" "172.168.0.45" "172.168.0.136" "172.168.0.143")

bins=""
commas=""

for i in "${machines[@]}"; do
  bins=$bins","$i".bin"
  commas=$commas","$i
done

bins=${bins:1}
commas=${commas:1}

pdsh -b -w "$commas" 'echo "POST http://test.haproxy.in:80/ping" | /home/sachinm/.linuxbrew/bin/vegeta -cpus=32 attack -connections=1000000 -header="sleep:20" -header="times:2" -body=post_smaller.txt -timeout=2h -rate=3000 -workers=500 > ' $result_file

for i in "${machines[@]}"; do
  scp sachinm@$i:/home/sachinm/$result_file $i.bin
done

vegeta report -inputs="$bins"

Along the way we got to know about this utility called pdsh, which lets you run a command concurrently on multiple remote machines. Additionally, Vegeta allows us to combine multiple result files into one report, and that's really all we wanted.

HAProxy Configuration

This is probably what you came here looking for. Below is the HAProxy config we used in our load test runs. The most important parts are the nbproc setting and the maxconn setting. The maxconn setting lets us specify the maximum number of TCP connections that HAProxy can support overall (one way).
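The full config was embedded in the original post; as a stand-in, here is a minimal sketch of the settings this series has been discussing (nbproc/cpu-map, a 2 million maxconn, SSL termination, and timeouts generous enough for the sleeping requests). The bind line, certificate path and server addresses are placeholders, not our production values:

global
    maxconn 2000000
    nbproc 3
    cpu-map 1 1
    cpu-map 2 2
    cpu-map 3 3
    tune.ssl.default-dh-param 2048

defaults
    mode http
    timeout connect 5s
    timeout client 2h
    timeout server 2h

frontend api
    bind *:443 ssl crt /etc/ssl/private/haproxy.pem   # placeholder cert path
    maxconn 2000000
    default_backend api-backend

backend api-backend
    # server names run from http30 up to http83, as noted below
    server http30 172.168.0.201:8282
    server http31 172.168.0.201:8283
    # ... and so on up to http83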

Changing the maxconn setting leads to an increase in the HAProxy process's ulimit. Take a look below:

The max open files limit has increased to 4 million because HAProxy's max connections is set at 2 million, and each proxied connection needs two file descriptors, one for the client side and one for the backend side. Neat!
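To verify what the process actually got, the limits file under /proc is the place to look (a generic check; with nbproc > 1, any one of the haproxy pids will do):

cat /proc/$(pidof -s haproxy)/limits | grep -i "open files"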

Check the article below for a whole lot of HAProxy optimisations that you can and should do to achieve the kind of stats we achieved.

Use HAProxy to load balance 300k concurrent tcp socket connections: Port Exhaustion, Keep-alive and…

I'm trying to build up a push system recently. To increase the scalability of the system, the best practice is to make… www.linangran.com

The backend server list goes on from http30 all the way to http83 :p

That's all for now, folks. If you've made it this far, I'm truly amazed :)

A special shout out to Dheeraj Kumar Sidana who helped us all the way through this and without whose help we would not have been able to reach any meaningful results. :)

Do let me know how this blog post helped you. Also, please recommend (❤) and spread the love as much as possible for this post if you think this might be useful for someone.