Does this prove a network bandwidth bottleneck?
- by Yuji Tomita
I had incorrectly assumed that my internal ApacheBench (AB) testing meant my server could handle 1k concurrency at 3k hits per second.
My theory at the moment is that the network is the bottleneck: the server can't send enough data fast enough.
External testing from blitz.io at 1k concurrency shows my hits/s capping off at around 180, with pages taking longer and longer to respond because the server can only return 180 responses per second.
I've served a blank file from nginx and benched it: it scales 1:1 with concurrency.
Now to rule out IO / memcached bottlenecks (nginx normally pulls from memcached), I serve up a static version of the cached page from the filesystem.
The results are very similar to my original test; I'm capped at around 180 RPS.
Splitting the HTML page in half gives me double the RPS, so it's definitely limited by the size of the page.
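That behavior is consistent with a fixed outbound byte budget: at a roughly constant bandwidth B, the achievable rate is approximately

    RPS ≈ B / response size

so halving the response size roughly doubles the RPS at the same bandwidth.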
If I internally ApacheBench from the local server, I get consistent results of around 4k RPS on both the Full Page and the Half Page, at high transfer rates.
Transfer rate: 62586.14 [Kbytes/sec] received
If I run AB from an external server, I get around 180 RPS, the same as the blitz.io results.
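For reference, the internal and external runs were along these lines (a sketch; the URL, hostname, and request counts here are placeholders rather than my exact commands):

    # run on the server itself
    ab -n 10000 -c 1000 http://localhost/cached_page.html
    # run from an external box
    ab -n 10000 -c 1000 http://my-server.example.com/cached_page.html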
How do I know it's not intentional throttling?
If I benchmark from multiple external servers, all of the results are poor, which leads me to believe the problem is in MY server's outbound traffic, not a download-speed issue with my benchmarking servers / blitz.io.
So I'm back to my conclusion that my server can't send data fast enough.
Am I right? Are there other ways to interpret this data? Is the solution/optimization to set up multiple load-balanced servers, each serving 180 hits per second?
I'm quite new to server optimization, so I'd appreciate any help confirming my interpretation of this data.
Outbound traffic
Here's more information about the outbound bandwidth: The network graph shows a maximum output of 16 Mb/s: 16 megabits per second. Doesn't sound like much at all.
Following a suggestion about throttling, I looked into this and found that Linode has a 50 Mbit/s cap (which I'm not even close to hitting, apparently). I had it raised to 100 Mbit/s.
Since Linode caps my traffic and I'm not even hitting the cap, does this mean my server should indeed be capable of outputting up to 100 Mbit/s, but is limited by some other internal bottleneck? I just don't understand how networks at this scale work: can they literally send data as fast as they can read it from the HDD? Is the network pipe that big?
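As a rough cross-check (treating the AB transfer rate above as representative of a full-page response, which is an assumption on my part):

    62586 KB/s ÷ ~4000 req/s ≈ 15-16 KB per response
    180 req/s × ~15.6 KB ≈ 2.8 MB/s ≈ 22-23 Mbit/s

That's the same order of magnitude as the ~16 Mb/s peak on the graph, and still well under the 50/100 Mbit cap.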
In conclusion
1: Based on the above, I'm thinking I can definitely raise my 180 RPS by adding an nginx load balancer in front of multiple nginx servers, each serving roughly 180 RPS behind the LB.
2: If Linode has a 50/100 Mbit cap that I'm not hitting at all, there must be something I can do to reach that limit with my single-server setup. If I can read and transmit data fast enough locally, and Linode even bothers to impose a 50/100 Mbit cap, there must be some internal bottleneck keeping me from those caps that I'm not sure how to detect. Correct?
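One way I can think of to separate a raw network limit from a limit inside my stack is to push traffic with something other than nginx, e.g. iperf (a sketch; the external hostname is a placeholder and I haven't run this yet):

    # on an external test box
    iperf -s
    # on my Linode, pushing outbound traffic for 30 seconds
    iperf -c external-test-box.example.com -t 30

If iperf can get close to the 50/100 Mbit cap while nginx tops out at the ~16-23 Mbit/s seen above, the bottleneck is presumably above the raw network layer.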
I realize the question is huge and vague now, but I'm not sure how to condense it. Any input on any of the conclusions I've made is appreciated.