10gige - Developer IT

Linux/bsd tcp load balancing with 10 gigabit ethernet

- by user37899

Okay, I've been looking at layer 4 load balancing solutions for 10 gigabit links. I need the following properties:

- Works at 10 Gigabit Ethernet speeds.
- Supports long-lived TCP connections, up to 1 million live TCP connections.
- Balancer is not involved in the return path.
- Fault tolerant, with TCP session failover.
- Low latency and good throughput.
- Can be scripted.
- Either a software or hardware solution.

Can it be done? Is anyone doing this?


vyatta Server Reboots by itself

- by Fernando

I have an issue regarding some hardware; maybe you can help me.

First, I set up a Supermicro Superserver SYS-5016I-NTF with an Intel Xeon X3470 and 4 GB of RAM, with a Hotlava Tambora 64G4 card (Intel 82599EB chipset, 4x10G SFP+ ports). I installed Vyatta Community Edition 6.3 and used it as a router making BGP connections with 2 operators. There is no load at all, and temperatures are in the normal range. The issue is that it reboots by itself at random. Not very often, once every few days, but that is unacceptable for production purposes.

So I tried different hardware and installed Vyatta Community Edition 6.3 on a Dell PowerEdge 2950 with a Xeon(R) E5345 @ 2.33GHz and 4 GB of RAM, using the same Vyatta configuration as the Supermicro server and the same Hotlava card model (I bought two of them). I get reboots with this equipment as well, at the same frequency as above.

I have checked syslog: there are no strange entries until the boot process starts being logged, so it seems the server reboots suddenly. I have installed the latest driver for the chipset of the Hotlava card. The servers are placed in a datacenter with UPS.

So, finally, there are two things in common between both servers:

- The Hotlava card. Has anyone had issues with this card, or with the chipset? Could it be this card?
- Vyatta 6.3 Community Edition. I don't think this is the problem; it is a regular Debian with packages to glue together different services.

Or maybe there is something I am missing. Any ideas or suggestions? Thank you very much. Fernando


10 GigE interfaces limits single connection throughput to 1 Gb on a ProCurve 4208vl

- by wazoox

The setup is as follows: 3 Linux servers with Intel CX4 10 GigE controllers and an X-Serve with a Myricom 10 GigE CX4 controller are connected to a ProCurve 4208vl switch, with a myriad of other machines connected through good ol' 1000Base-T. The interfaces are actually set up as 10 Gig, according to both the switch monitoring interface and the servers (ethtool, etc.). However, a single connection between two 10 GigE equipped machines through the switch is limited to exactly 1 Gb/s.

If I connect two of the 10 GigE machines directly with a CX4 cable, netperf reports the link bandwidth as 9000 Mb/s, and NFS achieves about 550 MB/s transfers. But when I'm using the switch, the connection tops out at 950 Mb/s through netperf and 110 MB/s with NFS. When I open several connections from 3 of the machines to the 4th, I get 350 MB/s of NFS transfer speed. So each individual 10 GigE port can actually reach much more than 1 Gb/s, but individual connections are strictly limited to 1 Gb/s.

Conclusion: the 10 GigE connection through the switch behaves exactly like a trunk of ten 1 Gb connections. That doesn't make any sense to me, unless HP planned these ports only for cascading switches or strictly for many-clients-to-single-server connections. Unfortunately this is NOT the envisioned setup; we need big throughput from machine to machine.

Is this a not-so-known (or carefully hidden...) limitation of this type of switch? Should I suggest seppuku to the HP representative? Does anyone have any idea how to enable proper behaviour? I upgraded for a hefty price from bonded 1 Gb links to 10 GigE and see exactly ZERO gain! That's absolutely unacceptable.
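As a quick sanity check on the figures quoted in the post, the sketch below converts the netperf numbers from megabits to megabytes per second and compares them with the NFS results. It assumes decimal units (8 Mb/s = 1 MB/s) and ignores protocol overhead; the helper name `mbit_to_mbyte` is made up for illustration.

```python
def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbit_per_s / 8


# Figures quoted in the post.
direct_netperf_mbit = 9000    # direct CX4 cable, netperf
switched_netperf_mbit = 950   # through the switch, netperf
switched_nfs_mbyte = 110      # through the switch, single NFS transfer
aggregate_nfs_mbyte = 350     # three clients to one server, through the switch

direct_ceiling = mbit_to_mbyte(direct_netperf_mbit)      # MB/s on the direct link
switched_ceiling = mbit_to_mbyte(switched_netperf_mbit)  # MB/s through the switch

# A single NFS stream sits just under the per-connection ceiling seen by netperf.
single_stream_fraction = switched_nfs_mbyte / switched_ceiling

# The aggregate of several streams, converted back to Mb/s, is well above 1 Gb/s.
aggregate_mbit = aggregate_nfs_mbyte * 8

print(f"switched ceiling: {switched_ceiling} MB/s, "
      f"single NFS stream uses {single_stream_fraction:.0%} of it")
print(f"aggregate NFS: {aggregate_mbit} Mb/s")
```

Read this way, the single-stream NFS figure is consistent with a roughly 1 Gb/s per-connection cap, while the aggregate figure shows the port itself carries more than that.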


CISCO WS-C4948-10GE SFP+?

- by Brian Lovett

I have a pair of CISCO WS-C4948-10GE's that we need to connect to a new switch that has SFP+ and QSFP ports. Is there an X2 module that supports this? If so, can someone name the part number that will work? I have found some information, but want to make sure I have the correct part number for our exact switches.

Per the discussion in the comments, I believe I have a better understanding of things now. Would I be correct in saying that I need these SR modules on the Cisco side:

http://www.ebay.com/itm/Cisco-original-used-X2-10GB-SR-V02-/281228948970?pt=LH_DefaultDomain_0&hash=item417a8d35ea

Then, on the switch with SFP+ ports, I can pick up an SR SFP+ transceiver like this:

http://www.advantageoptics.com/SFP-10G-SR_lp.html?gclid=CKP4s-G27b4CFXQiMgodLD8AQA

And finally, an SR cable such as this:

http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=1551

Am I on the right track here?


Which is the fastest way to move 1Petabyte from one storage to a new one?

- by marc.riera

First of all, thanks for reading, and sorry for asking something related to my job. I understand that this is something I should solve by myself, but as you will see it's a bit difficult.

A small description:

Now:

- Storage = 1 PB using DDN S2A9900 storage for the OSTs, 4 OSS, 10 GigE network (Lustre 1.6)
- 100 compute nodes with 2x InfiniBand
- 1 InfiniBand switch with 36 ports

After:

- Storage = previous storage + another 1 PB using DDN S2A 990 or LSI E5400 (still to decide) (Lustre 2.0)
- 8 OSS, 10 GigE network
- 100 compute nodes with 2x InfiniBand

Previous experience: we transferred 120 TB in less than 3 days using the following command:

    tar -C /old --record-size 2048 -b 2048 -cf - dir | tar -C /new --record-size 2048 -b 2048 -xvf - 2>&1 | tee /tmp/dir.log

So, big problem here: using big mathematical equations, I conclude that we are going to need 1 month to transfer the data from the old storage to the new one. During this time the researchers will need to step back, and I'm personally not happy with this.

I'm telling you that we have InfiniBand connections because I think there may be a chance to use them, with 18 compute nodes (18 * 2 IB = 36 ports) transferring the data from one storage to the other. I'm trying to figure out whether the IB switch will handle all the traffic, but short of it burning up, it should go faster than using 10 GigE.

Also, having Lustre 1.6 and 2.0 agents on the same server works quite well, so there is no need to go through 1.8 to upgrade the metadata servers in two steps.

Any ideas? Many thanks.

Note 1: Zoredache, we can divide it into two blocks, (A) 600 TB and (B) 400 TB. The idea is to move (A) to the new storage, which is Lustre 2.0 formatted, then format the space where (A) was with Lustre 2.0, move (B) to this Lustre 2.0 block, and extend it with the space where (B) was. This way we will end up with (A) and (B) on separate filesystems, with 1 PB each.
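The "1 month" estimate can be reproduced with simple arithmetic. The sketch below extrapolates linearly from the earlier 120 TB transfer, assuming decimal units (1 PB = 1000 TB) and that the same sustained rate holds for the full data set; since the earlier run took "less than 3 days", the result is an upper bound under those assumptions. The function name `days_to_transfer` is illustrative only.

```python
SECONDS_PER_DAY = 86_400


def days_to_transfer(total_tb: float, sample_tb: float, sample_days: float) -> float:
    """Extrapolate transfer time linearly from a previously measured run."""
    rate_tb_per_day = sample_tb / sample_days
    return total_tb / rate_tb_per_day


# Measured: 120 TB in (at most) 3 days with the tar pipeline.
rate_tb_per_day = 120 / 3
rate_mb_per_s = rate_tb_per_day * 1_000_000 / SECONDS_PER_DAY  # TB/day -> MB/s

full_days = days_to_transfer(1000, 120, 3)     # the whole 1 PB
block_a_days = days_to_transfer(600, 120, 3)   # block (A), 600 TB
block_b_days = days_to_transfer(400, 120, 3)   # block (B), 400 TB

print(f"sustained rate: {rate_tb_per_day:.0f} TB/day (~{rate_mb_per_s:.0f} MB/s)")
print(f"1 PB: {full_days:.0f} days; (A): {block_a_days:.0f} days; (B): {block_b_days:.0f} days")
```

At that rate the sustained throughput is well below what a single 10 GigE link can carry, which suggests the earlier run was not limited by raw link speed alone.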


How to capture strings using * or ? with groups in python regular expressions

- by user1334085

When the regular expression has a capturing group followed by "*" or "?", there is no value captured. If you use "+" instead for the same string, you can see the capture. I need to be able to capture the same value using "?".

    >>> str1='This string has 29 characters'
    >>> re.search(r'(\d+)*', str1).group(0)
    ''
    >>> re.search(r'(\d+)*', str1).group(1)
    >>>
    >>> re.search(r'(\d+)+', str1).group(0)
    '29'
    >>> re.search(r'(\d+)+', str1).group(1)
    '29'

A more specific question is added below for clarity: I have str1 and str2 below, and I want to use just one regexp which will match both. In the case of str1, I also want to be able to capture the number of QSFP ports.

    >>> str1='''4 48 48-port and 6 QSFP 10GigE Linecard 7548S-LC'''
    >>> str2='''4 48 48-port 10GigE Linecard 7548S-LC'''
    >>>

When I do not use a metacharacter, the capture works:

    >>> re.search(r'^4\s+48\s+.*(?:(\d+)\s+QSFP).*-LC', str1, re.I|re.M).group(1)
    '6'
    >>>

It works even when I use "+" to indicate one or more occurrences:

    >>> re.search(r'^4\s+48\s+.*(?:(\d+)\s+QSFP)+.*-LC', str1, re.I|re.M).group(1)
    '6'
    >>>

But when I use "?" to match 0 or 1 occurrences, the capture fails even for str1:

    >>> re.search(r'^4\s+48\s+.*(?:(\d+)\s+QSFP)?.*-LC', str1, re.I|re.M).group(1)
    >>>
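One way to read the behaviour in the question: "*" and "?" both allow zero repetitions, so the optional group is satisfied by matching nothing. In the first example the search succeeds with an empty match at position 0; in the linecard example the greedy `.*` in front of the optional group swallows the text the group would need. The sketch below is one possible fix, not the only one: it moves a lazy `.*?` inside the optional group so the group is tried before the rest of the line is consumed. The names `LINECARD` and `qsfp_ports` are made up for illustration.

```python
import re

str1 = 'This string has 29 characters'

# '*' permits zero repetitions, so the search succeeds immediately with an
# empty match at position 0 and never reaches the digits.
m = re.search(r'(\d+)*', str1)
empty_match = m.group(0)      # the empty string
empty_capture = m.group(1)    # None: the group did not participate

# Consuming the leading non-digits first lets the optional group see the digits.
simple_capture = re.search(r'\D*(\d+)?', str1).group(1)

line1 = '4 48 48-port and 6 QSFP 10GigE Linecard 7548S-LC'
line2 = '4 48 48-port 10GigE Linecard 7548S-LC'

# The lazy '.*?' lives inside the optional group, so the engine first tries to
# find "<digits> QSFP" and only falls back to skipping the group if that fails.
LINECARD = re.compile(r'^4\s+48\s+(?:.*?(\d+)\s+QSFP)?.*-LC', re.I | re.M)


def qsfp_ports(line):
    """Return the QSFP port count as a string, or None if absent or no match."""
    match = LINECARD.search(line)
    return match.group(1) if match else None


print(repr(simple_capture), repr(qsfp_ports(line1)), repr(qsfp_ports(line2)))
```

With this pattern both lines still match, and only the line that mentions QSFP ports yields a captured value.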


VNIC - New feature of AK8 - Working with VNICs

- by Steve Tunstall

One of the important new features of the AK8 code is the ability to use multiple IP addresses on the same physical network port. This feature is called VNICs, or Virtual NICs. This allows us to no longer "burn" a whole port in a cluster when one cluster peer owns a network port. Traditionally, we have had to leave Net0 empty on controller 2, because it was used for managing controller 1. Vice versa for Net1 on controller 1. Then, if you had data going over 10GigE ports, you probably only had half of your ports running at any given time, and the partner 10GigE port on the other controller just sat there, doing nothing, unless the first controller went down. What a waste. Those days are over.

I want to thank and give a big shout-out to our good partner, OnX Enterprise Solutions, for allowing me to come into their lab and play around with their 7320 to do this demo. They let me make a big mess of their lab for the day as I played around with VNICs. If you're looking for a partner who knows Oracle well and can also piece together a solution from multiple vendors to get you what you need, OnX is a good choice. If you would like to talk to your local OnX rep, you can contact Scott Gill at [email protected] and he can point you in the right direction for your area.

Here we go: Here is what your Datalinks window looks like BEFORE you upgrade to AK8. Here's what the same screen looks like after you upgrade. See the new box?

So here is my current network setup. I have my 4 physical interfaces set up, each with an IP address. If I ping them, no problems. So I can ping 180, 181, 251, and 252. However, if I try to ping 240, it does not work, as the 240 address is not being used by any of these interfaces, right? Let's change that.

Here, I'm going to make a new Datalink by clicking the Datalink "Plus sign" button. I will check the VNIC box and tell it to use igb2, even though another interface is already using it. Now, I will create a new Interface, and choose "v_dl2" for its datalink. My new network screen looks like this.

A few things to take note of here. First, when I click the "igb2" device, it only highlights dl2 and int2. It does not highlight v_dl2 or v_int2. I think it should, but OK, it looks like VNICs don't highlight when you click the device. Second, note how the underscore character in v_dl2 and v_int2 does not seem to show on this screen. You can see it plainly if you go in and edit them, but from here it looks like a space instead of an underscore. Just a cosmetic bug, but something to be aware of.

Now, if I click the VNIC datalink "v_dl2", on the other hand, it DOES highlight the device it belongs to, as it should. Seen here: Note that it did not, however, highlight int2 with it, even though int2 is connected to igb2. That's because we clicked v_dl2, which int2 has nothing to do with. So I'm OK with that.

So let's try pinging 240 now. Of course, it works great. So I now make another VNIC, and call it v_dl3 using igb3, and v_int3 with an address of 241. I then set up three shares, using ports 251, 240, and 241. Remember that IP 251 and 240 are both using the same physical port of igb2, and IP 241 is using port igb3.

Next, I copy a folder full of stuff over to all three shares at the same time. I have analytics going so I can see the traffic. My top chart is showing the logical interfaces, and the bottom chart is showing the physical ports. Sure enough, look at the igb2 and vnic1 interfaces. They equal the traffic going over the igb2 physical port on the second chart. VNIC2, on the other hand, gets igb3 all to itself.

This would work the same way with 10Gig or Infiniband ports. You can now have multiple IP addresses and even completely different subnets sharing the same physical ports. You may need to make route table entries for that. This allows us to use all of the ports you paid for with no more waste. Very, very cool.

One small "bug" I found when doing this. It's really not a bug; it was designed to do this when VNICs were not around. But now that we have VNIC capability, they should probably change this. I've alerted the engineering team about this and they're looking into it, so perhaps it will be fixed in a later code release. Here it is.

Remember when we made the new VNIC datalink, I specifically said to click on the "Plus Sign" button to create it? I don't always do that. I really like to use the drag-and-drop method to create my datalinks in the network screen. HOWEVER, if you were to do that for building a VNIC, it will mess you up a little. Watch this.

Here, I'm dragging igb3 over to make a new datalink. igb3 is already being used by dl3, but I'm going to make this a VNIC, so who cares, right? Well, the ZFSSA does not KNOW you are going to make it a VNIC, now does it? So... it works as designed and REMOVES the igb3 device from the current dl3 datalink in the background. See how it's now missing? At the same time, the dl3 datalink choice is missing from my list of possible VNICs for me to choose from! Hey! I wanted to pick dl3. Why isn't it on the list? Well, it can't be on this list because dl3 no longer has a device associated with it. Bummer for you. When you click cancel, the device is still missing from dl3.

The fix is easy. Just edit dl3 by clicking the pencil button, do absolutely nothing, and click "Apply". The device will magically come back. Now, make the VNIC datalink by clicking the "Plus Sign" button. Sure enough, once you check the VNIC box, dl3 is a valid choice. No problem.

That's it for now. Have fun with VNICs.


New Write Flash SSDs and more disk trays

- by Steve Tunstall

In case you haven't heard, the write SSDs for the ZFSSA have been updated. Much faster now for the same price. Sweet. The new write-flash SSDs have a new part number, 7105026, so make sure you order the right ones. It's important to note that you MUST be on code level 2011.1.4.0 or higher to use these. They have increased in IOPS from 6,000 to 11,000, and increased throughput from 200MB/s to 350MB/s.

Also, you can now add six SAS HBAs (up from 4) to the 7420, allowing one to have three SAS channels with 12 disk trays each, for a new total of 36 disk trays. With 3TB drives, that's 2.5 Petabytes. Is that enough for you?

Make sure you add new cards to the correct slots. I've talked about this before, but here is the handy-dandy matrix again so you don't have to go find it. Remember the rules: You can have 6 of any one kind of card (like six 10GigE cards), but you only really get 8 slots, since you have two SAS cards no matter what. If you want more than 12 disk trays, you need two more SAS cards, so think about expansion later, too. In fact, if you are going to have two different speeds of drives, in other words you want to mix 15K and 7,200 RPM drives in the same system, I would highly recommend two different SAS channels. So I would want four SAS cards in that system, no matter how many trays you have.
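The capacity and speed-up figures above can be checked with a few lines of arithmetic. The sketch below is only a rough check: the post does not say how many drives a tray holds, so `DRIVES_PER_TRAY = 24` is an assumption, and the result is raw capacity in decimal units before any RAID or spare overhead. The helper `percent_gain` is illustrative.

```python
SAS_CHANNELS = 3          # from the post: three SAS channels with six HBAs
TRAYS_PER_CHANNEL = 12    # from the post
DRIVES_PER_TRAY = 24      # ASSUMPTION: not stated in the post
DRIVE_TB = 3              # from the post: 3TB drives

total_trays = SAS_CHANNELS * TRAYS_PER_CHANNEL
raw_tb = total_trays * DRIVES_PER_TRAY * DRIVE_TB
raw_pb = raw_tb / 1000    # decimal petabytes, raw capacity


def percent_gain(old: float, new: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100


iops_gain = percent_gain(6_000, 11_000)   # write-flash SSD IOPS
throughput_gain = percent_gain(200, 350)  # write-flash SSD MB/s

print(f"{total_trays} trays, about {raw_pb:.2f} PB raw")
print(f"IOPS +{iops_gain:.0f}%, throughput +{throughput_gain:.0f}%")
```

Under the 24-drive assumption, the raw total lands close to the 2.5 Petabytes quoted in the post.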


Creating basic, redundant gigE or IB storage network for Xen?

- by StaringSkyward

With only a modest budget, I want to move my 4 Xen servers over to network storage, either NFS or iSCSI, which will be determined based on how well it performs when we test it (we need good throughput and it must continue to work through link and switch failure tests). We may add another couple of Xen servers at some point when this is done. I don't know much about the design and operation of storage networks, so would really appreciate some hints from those with experience.

The budget is around $3,800 excluding the storage appliance. I am currently thinking these are my options to remain on budget:

1) Go for used InfiniBand hardware and aim for 10 Gb performance.
2) Stick with gigabit Ethernet and buy some new switches (Cisco or ProCurve) to create a storage-only Ethernet LAN. Upgrade to 10GigE later, but try to use hardware capable of it where possible to reduce upgrade costs.

I have seen used, warrantied InfiniBand switches at reasonable prices (presumably because big companies are converging on 10 Gbit Ethernet?) and the promise of cheap 10 Gb is attractive. I know nothing about IB, so here come the questions:

- Can I buy 2 switches and have multiple HBAs in my Xen and storage nodes to get redundancy and increased performance without complexity or expensive management software costs? If so, can you point me to some examples?
- Do NFS and iSCSI work just the same regardless?
- Is IB a sensible choice, or could/should I use Ethernet or FC on the same budget? I'm keen not to get boxed into a corner for future upgrades, however.

For the storage I am likely to build a storage server using NexentaStor, with the intention that I can later add more disks and SSDs, and add another server to provide a failover option at the storage level. An HP LeftHand starter SAN is also under consideration. Thanks in advance.


openstack, bridging, netfilter and dnat

- by Craig Sanders

In a recent upgrade (from OpenStack Diablo on Ubuntu Lucid to OpenStack Essex on Ubuntu Precise), we found that DNS packets were frequently (almost always) dropped on the bridge interface (br100). For our compute-node hosts, that's a Mellanox MT26428 using the mlx4_en driver module.

We've found two workarounds for this:

- Use an old Lucid kernel (e.g. 2.6.32-41-generic). This causes other problems, in particular the lack of cgroups and the old version of the kvm and kvm_amd modules (we suspect the kvm module version is the source of a bug we're seeing where occasionally a VM will use 100% CPU). We've been running with this for the last few months, but can't stay here forever.
- With the newer Ubuntu Precise kernels (3.2.x), we've found that if we use sysctl to disable netfilter on the bridge (see sysctl settings below), DNS starts working perfectly again. We thought this was the solution to our problem until we realised that turning off netfilter on the bridge interface will, of course, mean that the DNAT rule to redirect VM requests for the nova-api-metadata server (i.e. redirect packets destined for 169.254.169.254:80 to compute-node's-IP:8775) will be completely bypassed.

Long story short: with 3.x kernels, we can have reliable networking and a broken metadata service, or we can have broken networking and a metadata service that would work fine if there were any VMs to service. We haven't yet found a way to have both.

Has anyone seen this problem or anything like it before? Got a fix, or a pointer in the right direction?

Our suspicion is that it's specific to the Mellanox driver, but we're not sure of that (we've tried several different versions of the mlx4_en driver, starting with the version built in to the 3.2.x kernels all the way up to the latest 1.5.8.3 driver from the Mellanox web site. The mlx4_en driver in the 3.5.x kernel from Quantal doesn't work at all).

BTW, our compute nodes have Supermicro H8DGT motherboards with a built-in Mellanox NIC:

    02:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

We're not using the other two NICs in the system; only the Mellanox and the IPMI card are connected.

Bridge netfilter sysctl settings:

    net.bridge.bridge-nf-call-arptables = 0
    net.bridge.bridge-nf-call-iptables = 0
    net.bridge.bridge-nf-call-ip6tables = 0

Since discovering this bridge-nf sysctl workaround, we've found a few pages on the net recommending exactly this (including OpenStack's latest network troubleshooting page and a Launchpad bug report that linked to this blog post, which has a great description of the problem and the solution). It's easier to find stuff when you know what to search for :), but we haven't found anything on the DNAT issue that it causes.


Oracle NoSQL Database Exceeds 1 Million Mixed YCSB Ops/Sec

- by Charles Lamb

We ran a set of YCSB performance tests on Oracle NoSQL Database using SSD cards and Intel Xeon E5-2690 CPUs with the goal of achieving 1M mixed ops/sec on a 95% read / 5% update workload. We used the standard YCSB parameters: 13 byte keys and 1KB data size (1,102 bytes after serialization). The maximum database size was 2 billion records, or approximately 2 TB of data. We sized the shards to ensure that this was not an "in-memory" test (i.e. the data portion of the B-Trees did not fit into memory). All updates were durable and used the "simple majority" replica ack policy, effectively 'committing to the network'. All read operations used the Consistency.NONE_REQUIRED parameter, allowing reads to be performed on any replica.

In the past we have achieved 100K ops/sec using SSD cards on a single shard cluster (replication factor 3), so for this test we used 10 shards on 15 Storage Nodes, with each SN carrying 2 Rep Nodes and each RN assigned to its own SSD card. After correcting a scaling problem in YCSB, we blew past the 1M ops/sec mark with 8 shards and proceeded to hit 1.2M ops/sec with 10 shards.

Hardware Configuration

We used 15 servers, each configured with two 335 GB SSD cards. We did not have homogeneous CPUs across all 15 servers available to us, so 12 of the 15 were Xeon E5-2690, 2.9 GHz, 2 sockets, 32 threads, 193 GB RAM, and the other 3 were Xeon E5-2680, 2.7 GHz, 2 sockets, 32 threads, 193 GB RAM. There might have been some upside in having all 15 machines configured with the faster CPU, but since CPU was not the limiting factor we don't believe the improvement would be significant.

The client machines were Xeon X5670, 2.93 GHz, 2 sockets, 24 threads, 96 GB RAM. Although the clients had 96 GB of RAM, neither the NoSQL Database nor the YCSB clients require anywhere near that amount of memory, and the test could have just as easily been run with much less. Networking was all 10GigE.

YCSB Scaling Problem

We made three modifications to the YCSB benchmark. The first was to allow the test to accommodate more than 2 billion records (effectively ints vs. longs). To keep the key size constant, we changed the code to use base 32 for the user ids.

The second change involved the way we run the YCSB client in order to make the test itself horizontally scalable. The basic problem has to do with the way the YCSB test creates its Zipfian distribution of keys, which is intended to model "real" loads by generating clusters of key collisions. Unfortunately, the percentage of collisions on the most contentious keys remains the same even as the number of keys in the database increases. As we scale up the load, the number of collisions on those keys increases as well, eventually exceeding the capacity of the single server used for a given key. This is not a workload that is realistic or amenable to horizontal scaling. YCSB does provide alternate key distribution algorithms, so this is not a shortcoming of YCSB in general.

We decided that a better model would be for the key collisions to be limited to a given YCSB client process. That way, as additional YCSB client processes (i.e. additional load) are added, they each maintain the same number of collisions they encounter themselves, but do not increase the number of collisions on a single key in the entire store. We added client processes proportionally to the number of records in the database (and therefore the number of shards). This change to the use of YCSB better models a use case where new groups of users are likely to access either just their own entries, or entries within their own subgroups, rather than all users showing the same interest in a single global collection of keys. If an application finds every user having the same likelihood of wanting to modify a single global key, that application has no real hope of getting horizontal scaling.

Finally, we used read/modify/write (also known as "Compare And Set") style updates during the mixed phase. This uses versioned operations to make sure that no updates are lost. This mode of operation provides better application behavior than the way we have typically run YCSB in the past, and is only practical at scale because we eliminated the shared key collision hotspots. It is also a more realistic testing scenario. To reiterate, all updates used a simple majority replica ack policy, making them durable.

Scalability Results

In the table below, the "KVS Size" column is the number of records, followed in parentheses by the number of shards and the replication factor. Hence, the first row indicates 400m total records in the NoSQL Database (KV Store), 2 shards, and a replication factor of 3. The "Clients" column indicates the number of YCSB client processes. "Threads" is the number of threads per process, followed in parentheses by the total number of threads. Hence, 90 threads per YCSB process for a total of 360 threads. The client processes were distributed across 10 client machines.

Shards  KVS Size (records)  Clients  Threads (mixed)  Overall Throughput (ops/sec)  Read Latency av/95%/99% (ms)  Write Latency av/95%/99% (ms)
2       400m (2x3)          4        90 (360)         302,152                       0.76/1/3                      3.08/8/35
4       800m (4x3)          8        90 (720)         558,569                       0.79/1/4                      3.82/16/45
8       1600m (8x3)         16       90 (1440)        1,028,868                     0.85/2/5                      4.29/21/51
10      2000m (10x3)        20       90 (1800)        1,244,550                     0.88/2/6                      4.47/23/53
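To see how close to linear the scaling is, the throughput figures from the table can be normalised per shard. The sketch below is not part of the original benchmark; it simply takes the published numbers and compares each configuration's per-shard throughput with the 2-shard baseline. The names `RESULTS` and `scaling_report` are made up for illustration.

```python
# (shards, client processes, threads per process, overall ops/sec), from the table.
RESULTS = [
    (2, 4, 90, 302_152),
    (4, 8, 90, 558_569),
    (8, 16, 90, 1_028_868),
    (10, 20, 90, 1_244_550),
]


def scaling_report(rows):
    """Per-shard throughput and efficiency relative to the first row."""
    base_shards, _, _, base_ops = rows[0]
    base_per_shard = base_ops / base_shards
    report = []
    for shards, clients, threads, ops in rows:
        per_shard = ops / shards
        report.append({
            "shards": shards,
            "total_threads": clients * threads,
            "ops_per_shard": per_shard,
            "efficiency": per_shard / base_per_shard,
        })
    return report


report = scaling_report(RESULTS)
for row in report:
    print(f"{row['shards']:>2} shards: {row['total_threads']:>4} threads, "
          f"{row['ops_per_shard']:>9,.0f} ops/sec per shard, "
          f"{row['efficiency']:.0%} of the 2-shard per-shard rate")
```

The total thread counts computed here match the parenthesised values in the "Threads" column, and per-shard throughput declines gradually as shards are added rather than collapsing.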
