ewwhite - Developer IT

Are VMWare ESXi 5 patches cumulative?

- by ewwhite

It seems basic, but there's confusion about the patching strategy needed to manually update standalone VMWare ESXi hosts. The VMWare vSphere blog attempts to explain this, but it's still not clear. From the blog: Say Patch01 includes updates for the following VIBs: "esxi-base", "driver10" and "driver 44". And then later Patch02 comes out with updates to "esxi-base", "driver20" and "driver 44". P2 is cumulative in that the "esxi-base" and "driver44" VIBs will include the updates in Patch01. However, it's important to note that Patch02 not include the "driver 10" VIB as that module was not updated. Many of my ESXi installations are standalone and do not make use of Update Manager. It is possible to update an individual host using the patches make available through the VMWare patch download portal. The process is quite simple, and that part makes sense. The bigger issue is determining what to actually download and install. In my case, I have a good number of HP-specific ESXi builds that incorporate sensors and management for HP ProLiant hardware. Let's say that those servers start at ESXi build #474610 from 9/2011. Looking at the patch portal screenshot below, there is a patch for ESXi update01, build #623860. There are also patches for builds #653509 and #702118. Coming from the old version of ESXi, what is the proper approach to bring the system fully up-to-date? Which patches are cumulative and which need to be applied sequentially? Perhaps the download size is the confusing factor, but is installing the newest build the right approach, or do I need to step back and patch incrementally?

Read the article

Disable reverse PTR check in Zimbra and force accept from invalid domains

- by ewwhite

I've moved an older Sendmail/Dovecot system to a Zimbra community edition system. I need to be able to receive messages from certain standalone Linux hosts that may not have valid A records or proper reverse DNS entries established (e.g. AT&T is the ISP or systems sitting on a consumer-level ISP). Establishing the reverse DNS or setting a SMARTHOST is not an option. The error I get in zimbra.log is: zimbra postfix/smtp[2200]: DB83B231B53: to=<root@host_name.baddomain.com>, relay=none, delay=0.07, delays=0.06/0/0/0, dsn=5.4.4, status=bounced (Host or domain name not found. Name service error for name=host_name.baddomain.com type=A: Host not found How can I override this? Is this more of a Postfix issue or is it Zimbra? edit - The problem seems to be with an underscore in the hostname of the server. So it's a problem with root@host_name.baddomain.com. Again, how can I override this in Zimbra?

Read the article

OpenSolaris / Nexenta problems with NetXen 4-port NIC card (ntxn driver)

- by ewwhite

Hello, I'm running NexentaStor Enterprise on an HP ProLiant DL180 G6 server. The onboard NIC interfaces surface as igb0 and igb1 and work well. However, I've added an HP NC375T 4-port network card using the NetXen 3031 chipset. This card should be handled by the ntxn driver in the SUNWntxn package, but that results in "ntxn0: failed to map doorbell" messages upon boot. The network interfaces don't show up. After some research, I found HP's driver package for the card. The release notes for the driver package state: This version of the Driver is supported only on Oracle Solaris 10 5/09 & 10/09. Oracle Solaris 10 5/09 & 10/09 contain an older version of NetXen P3 driver package called SUNWntxn. So, adding another version of NetXen P3 driver package using pkgadd command might result in conflicts with the NetXen driver binary & related files. Users are advised to uninstall native SUNWntxn driver package before installing the new package. The install completes, but I end up with a different set of errors in initializing the card. ifconfig ntxn0 plumb ifconfig: cannot open link "ntxn0": DLPI link does not exist dmesg output: Jan 29 07:20:17 ch-san2 ntxn: [ID 977263 kern.warning] WARNING: Memory not available Jan 29 07:20:17 ch-san2 ntxn: [ID 404858 kern.notice] NOTICE: ntxn0: Mac registration error Trying to manually create the device files: root@ch-san2:/volumes# add_drv -i "4040,100" ntxn ("ntxn") already in use as a driver or alias. Update the driver: root@ch-san2:/volumes# update_drv -f ntxn devfsadm: driver failed to attach: ntxn Warning: Driver (ntxn) successfully added to system but failed to attach Any ideas on how to get this driver working, or should I ditch the card and go with an Intel or something else?

Read the article

ZFS - destroying deduplicated zvol or data set stalls the server. How to recover?

- by ewwhite

I'm using Nexentastor on a secondary storage server running on an HP ProLiant DL180 G6 with 12 Midline (7200 RPM) SAS drives. The system has an E5620 CPU and 8GB RAM. There is no ZIL or L2ARC device. Last week, I created a 750GB sparse zvol with dedup and compression enabled to share via iSCSI to a VMWare ESX host. I then created a Windows 2008 file server image and copied ~300GB of user data to the VM. Once happy with the system, I moved the virtual machine to an NFS store on the same pool. Once up and running with my VMs on the NFS datastore, I decided to remove the original 750GB zvol. Doing so stalled the system. Access to the Nexenta web interface and NMC halted. I was eventually able to get to a raw shell. Most OS operations were fine, but the system was hanging on the zfs destroy -r vol1/filesystem command. Ugly. I found the following two OpenSolaris bugzilla entries and now understand that the machine will be bricked for an unknown period of time. It's been 14 hours, so I need a plan to be able to regain access to the server. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6924390 and http://bugs.opensolaris.org/bugdatabase/view_bug.do;jsessionid=593704962bcbe0743d82aa339988?bug_id=6924824 In the future, I'll probably take the advice given in one of the buzilla workarounds: Workaround Do not use dedupe, and do not attempt to destroy zvols that had dedupe enabled. Update: I had to force the system to power off. Upon reboot, the system stalls at Importing zfs filesystems. It's been that way for 2 hours now.

Read the article

What happens when the USB key or SD card I've installed VMware ESXi on fails?

- by ewwhite

An SD (SDHC) card installed in an HP ProLiant DL380p Gen8 server running VMware ESXi just failed. I encountered some rather ominous looking messages on the vCenter console and in the HP ProLiant ILO event log... Lost connectivity to the device ... backing the boot filesystem. As a result, host configuration changes will not be saved to persistent storage. Embedded Flash/SD-CARD: Error writing media 0, physical block 848880: Stack Exception. VMware advocates the use of USB and SD (SDHC) boot devices for ESXi. It was one of the main reasons the smaller footprint ESXi was developed (versus the older ESX). I've spent much time highlighting the differences between ESXi's installable and embedded modes to coworkers and clients. However, these failures do seem to happen. In this case, this is my third instance. Luckily, this is a vSphere cluster with SAN storage. What steps should be taken to remediate this failure?

Read the article

Is it necessary to burn-in RAM for server-class systems?

- by ewwhite

When using server-class systems with ECC RAM, is it necessary or even useful to burn-in the memory DIMMs prior to deployment? I've encountered an environment where all server RAM is placed through a lengthy multi-day burn-in/stress-tesing process. This has delayed system deployments on occasion and adds an extra step to the hardware lead-time. The server hardware is primarily Supermicro, so the RAM is sourced from a variety of vendors; not directly from the manufacturer like a Dell Poweredge or HP ProLiant. Is this process useful? In my past experience, I simply used vendor RAM out of the box. Isn't that what the POST memory tests are for? I've encountered and responded to ECC errors long before a DIMM actually failed. The ECC thresholds were usually the trigger for warranty placement. Do you burn your RAM in? If so, what method do you use to perform the tests? Has the burn-in process resulted in any additional platform stability? Has it identified any pre-deployment problems?

Read the article

Is it possible to shrink the size of an HP Smart Array logical drive?

- by ewwhite

I know extension is quite possible using the hpacucli utility, but is there an easy way to reduce the size of an existing logical drive (not array)? The controller is a P410i in a ProLiant DL360 G6 server. I'd like to reduce logicaldrive 1 from 72GB to 40GB. => ctrl all show config detail Smart Array P410i in Slot 0 (Embedded) Bus Interface: PCI Slot: 0 Serial Number: 5001438006FD9A50 Cache Serial Number: PAAVP9VYFB8Y RAID 6 (ADG) Status: Disabled Controller Status: OK Chassis Slot: Hardware Revision: Rev C Firmware Version: 3.66 Rebuild Priority: Medium Expand Priority: Medium Surface Scan Delay: 3 secs Surface Scan Mode: Idle Queue Depth: Automatic Monitor and Performance Delay: 60 min Elevator Sort: Enabled Degraded Performance Optimization: Disabled Inconsistency Repair Policy: Disabled Wait for Cache Room: Disabled Surface Analysis Inconsistency Notification: Disabled Post Prompt Timeout: 15 secs Cache Board Present: True Cache Status: OK Accelerator Ratio: 25% Read / 75% Write Drive Write Cache: Enabled Total Cache Size: 512 MB No-Battery Write Cache: Disabled Cache Backup Power Source: Batteries Battery/Capacitor Count: 1 Battery/Capacitor Status: OK SATA NCQ Supported: True Array: A Interface Type: SAS Unused Space: 412476 MB Status: OK Logical Drive: 1 Size: 72.0 GB Fault Tolerance: RAID 1+0 Heads: 255 Sectors Per Track: 32 Cylinders: 18504 Strip Size: 256 KB Status: OK Array Accelerator: Enabled Unique Identifier: 600508B1001C132E4BBDFAA6DAD13DA3 Disk Name: /dev/cciss/c0d0 Mount Points: /boot 196 MB, / 12.0 GB, /usr 8.0 GB, /var 4.0 GB, /tmp 2.0 GB OS Status: LOCKED Logical Drive Label: AE438D6A5001438006FD9A50BE0A Mirror Group 0: physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK) physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK) Mirror Group 1: physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 146 GB, OK) physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 146 GB, OK) SEP (Vendor ID PMCSIERA, Model SRC 8x6G) 250 Device Number: 250 Firmware Version: RevC WWID: 5001438006FD9A5F Vendor ID: PMCSIERA Model: SRC 8x6G

Read the article

Integrating HP Systems Insight Manager into an existing environment

- by ewwhite

I'm working with an environment that spans multiple data centers/sites and consists primarily of HP ProLiant servers (G5-G7) running Linux. The mix is 30% RHEL/CentOS, the rest are Gentoo :(. I also have a few dozen virtual machines running back-office and Windows servers on VMWare ESX hosts. I run OpenNMS to pull SNMP data from the various server nodes and networking devices. While OpenNMS works wonderfully for up/down, thresholds and notifications, it's native handling of traps is a little rough and the graphs are not particularly pretty. I use Orca/RRD graphs for performance trending and nice graphs. I'm tasked with inventorying the environment and wanted to come up with a clean way to organize server information. Since my environment is mostly HP, I've been playing with HP Systems Insight Manager as a way to extract server data and to deploy HP health/monitoring packages and firmware. The Gentoo systems eventually have to be converted to CentOS, so getting a quick assessment of what hardware is where would be great. Although I've read through a few hundred pages of HP manuals, I'm having a difficult time understanding how to get HP SIM to do what I want, though. My main problems are: I have about 40 subnets to deal with; 98% connected with private lines to facilities across the globe. I don't want to initiate an HP SIM discovery only to pull back every piece of intermediate networking hardware and equipment from all of the locations. I'd like this to focus on the servers. I have OpenNMS configured to accept traps. I don't want HP SIM to duplicate that effort. It seems like the built-in software deployment tool wants to overwrite the trapsink parameters for the systems it encounters during discovery. I have about 10 administrative username/password combinations in use across this infrastructure. Is there a more efficient way to get HP SIM to do the discovery or break discovery into manageable chunks? In terms of general workflow, do people typically install the HP Management Agents during the initial OS deployment (e.g. kickstart post script) or afterwards from HP SIM? Is HP SIM too thick/fat to be an inventory tool? I can't tell if it's meant to be used standalone or alongside other monitoring products. Since the majority of the systems I'm trying to track are those running Gentoo (in order to plan the move to CentOS), is there any way for HP SIM to extract system model information from them ( like dmidecode)? I have systems here where I may have an SSH key established, but not direct user or login access. Is there a way for me to import an SSH private/public key pair into HP SIM to reach out to the servers that can't accept standard credentials? There are a handful of sites where I have inconsistent access or have a double-NAT situation. I may be able to poke a server, but it may not be able to find its way back to the management system. Is there a workaround for this? The certificate configuration for HP SIM seems complicated. What is the preferred setup for trust between systems? I'd also appreciate any notes or recommendations to using this product. Or if there's a better way to do this, I'd like to know.

Read the article

Determining the health of a Cisco switch port?

- by ewwhite

I've been chasing a packet-loss and network stability issue for a handful of end-users on an internal network for the past few days... These issues surfaced recently, however, the location was struck by lightning six weeks ago. I was seeing 5-10% packet loss between a stack of four Cisco 2960's and several PC's and phones on the other side of a 77-meter run. The PC's were run inline with the phones over a trunked link. We were seeing dropped calls and interruptions in client-server applications and Microsoft Exchange connectivity. I tried the usual troubleshooting steps remotely, having a local technician do the following during breaks in user and production activity: change cables between the wall jack and device. change patch cables between the patch panel and switch port(s). try different switch ports within the 2960 stack. change end-user devices with known-good equipment (new phones, different PC's). clear switch port interface counters and monitor incrementing errors closely. (Pastebin output of sh int) Pored over the device logs and Observium RRD graphs. No link up/down issues from the switch side. change power strips on the end-user side. test cable runs from the Cisco 2960 using test cable-diagnostics tdr int Gi4/0/9 (clean)* test cable runs with a Tripp-Lite cable tester. (clean) run diagnostics on the switch stack members. (clean) In the end, it took three changes of switch ports to find a stable solution. The only logical conclusion is that a few Cisco 2960 switch ports are bad or flaky... Not dead, but not consistent in behavior either. I'm not used to seeing individual ports die in this manner. What else can I test or check to determine if these devices are bad? Is it common for single ports to have problems, rather than a contiguous bank of ports? BTW - show cable-diagnostics tdr int Gi4/0/14 is very cool... Interface Speed Local pair Pair length Remote pair Pair status --------- ----- ---------- ------------------ ----------- -------------------- Gi4/0/14 1000M Pair A 79 +/- 0 meters Pair B Normal Pair B 75 +/- 0 meters Pair A Normal Pair C 77 +/- 0 meters Pair D Normal Pair D 79 +/- 0 meters Pair C Normal

Read the article

ZFS - destroying deduplicated zvol or data set stalls the server. How to recover?

- by ewwhite

I'm using Nexentastor on a secondary storage server running on an HP ProLiant DL180 G6 with 12 Midline (7200 RPM) SAS drives. The system has an E5620 CPU and 8GB RAM. There is no ZIL or L2ARC device. Last week, I created a 750GB sparse zvol with dedup and compression enabled to share via iSCSI to a VMWare ESX host. I then created a Windows 2008 file server image and copied ~300GB of user data to the VM. Once happy with the system, I moved the virtual machine to an NFS store on the same pool. Once up and running with my VMs on the NFS datastore, I decided to remove the original 750GB zvol. Doing so stalled the system. Access to the Nexenta web interface and NMC halted. I was eventually able to get to a raw shell. Most OS operations were fine, but the system was hanging on the zfs destroy -r vol1/filesystem command. Ugly. I found the following two OpenSolaris bugzilla entries and now understand that the machine will be bricked for an unknown period of time. It's been 14 hours, so I need a plan to be able to regain access to the server. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6924390 and http://bugs.opensolaris.org/bugdatabase/view_bug.do;jsessionid=593704962bcbe0743d82aa339988?bug_id=6924824 In the future, I'll probably take the advice given in one of the buzilla workarounds: Workaround Do not use dedupe, and do not attempt to destroy zvols that had dedupe enabled. Update: I had to force the system to power off. Upon reboot, the system stalls at Importing zfs filesystems. It's been that way for 2 hours now.

Read the article

ZFS - Impact of L2ARC cache device failure (Nexenta)

- by ewwhite

I have an HP ProLiant DL380 G7 server running as a NexentaStor storage unit. The server has 36GB RAM, 2 LSI 9211-8i SAS controllers (no SAS expanders), 2 SAS system drives, 12 SAS data drives, a hot-spare disk, an Intel X25-M L2ARC cache and a DDRdrive PCI ZIL accelerator. This system serves NFS to multiple VMWare hosts. I also have about 90-100GB of deduplicated data on the array. I've had two incidents where performance tanked suddenly, leaving the VM guests and Nexenta SSH/Web consoles inaccessible and requiring a full reboot of the array to restore functionality. In both cases, it was the Intel X-25M L2ARC SSD that failed or was "offlined". NexentaStor failed to alert me on the cache failure, however the general ZFS FMA alert was visible on the (unresponsive) console screen. The zpool status output showed: pool: vol1 state: ONLINE scan: scrub repaired 0 in 0h57m with 0 errors on Sat May 21 05:57:27 2011 config: NAME STATE READ WRITE CKSUM vol1 ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c8t5000C50031B94409d0 ONLINE 0 0 0 c9t5000C50031BBFE25d0 ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 c10t5000C50031D158FDd0 ONLINE 0 0 0 c11t5000C5002C823045d0 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 c12t5000C50031D91AD1d0 ONLINE 0 0 0 c2t5000C50031D911B9d0 ONLINE 0 0 0 mirror-3 ONLINE 0 0 0 c13t5000C50031BC293Dd0 ONLINE 0 0 0 c14t5000C50031BD208Dd0 ONLINE 0 0 0 mirror-4 ONLINE 0 0 0 c15t5000C50031BBF6F5d0 ONLINE 0 0 0 c16t5000C50031D8CFADd0 ONLINE 0 0 0 mirror-5 ONLINE 0 0 0 c17t5000C50031BC0E01d0 ONLINE 0 0 0 c18t5000C5002C7CCE41d0 ONLINE 0 0 0 logs c19t0d0 ONLINE 0 0 0 cache c6t5001517959467B45d0 FAULTED 2 542 0 too many errors spares c7t5000C50031CB43D9d0 AVAIL errors: No known data errors This did not trigger any alerts from within Nexenta. I was under the impression that an L2ARC failure would not impact the system. But in this case, it surely was the culprit. I've never seen any recommendations to RAID L2ARC. Removing the bad SSD entirely from the server got me back running, but I'm concerned about the impact of the device failure (and maybe the lack of notification from NexentaStor as well). Edit - What's the current best-choice SSD for L2ARC cache applications these days?

Read the article

Why does nmap ping scan over a VPN link return all hosts alive?

- by ewwhite

I'm curious as to why running an nmap -sP (ping scan) on a remote subnet linked via a Cisco site-to-site IPSec tunnel returns "host up" status for every IP in the range. [root@xt ~]# nmap -sP 192.168.108.* Starting Nmap 4.11 ( http://www.insecure.org/nmap/ ) at 2012-11-22 14:08 CST Host 192.168.108.0 appears to be up. Host 192.168.108.1 appears to be up. Host 192.168.108.2 appears to be up. Host 192.168.108.3 appears to be up. Host 192.168.108.4 appears to be up. Host 192.168.108.5 appears to be up. . . . Host 192.168.108.252 appears to be up. Host 192.168.108.253 appears to be up. Host 192.168.108.254 appears to be up. Host 192.168.108.255 appears to be up. Nmap finished: 256 IP addresses (256 hosts up) scanned in 14.830 seconds However, a ping of a known-down IP simply times out or doesn't return anything... [root@xt ~]# ping 192.168.108.201 PING 192.168.108.201 (192.168.108.201) 56(84) bytes of data. --- 192.168.108.201 ping statistics --- 144 packets transmitted, 0 received, 100% packet loss, time 143001ms Is there a more effective way to scan live devices connected in this manner?

Read the article

Nexenta/OpenSolaris filer kernel panic/crash

- by ewwhite

I've an x4540 Sun storage server running NexentaStor Enterprise. It's serving NFS over 10GbE CX4 for several VMWare vSphere hosts. There are 30 virtual machines running. For the past few weeks, I've had random crashes spaced 10-14 days apart. This system used to open OpenSolaris and was stable in that arrangement. The crashes trigger the automated system recovery feature on the hardware, forcing a hard system reset. Here's the output from mdb debugger: panic[cpu5]/thread=ffffff003fefbc60: Deadlock: cycle in blocking chain ffffff003fefb570 genunix:turnstile_block+795 () ffffff003fefb5d0 unix:mutex_vector_enter+261 () ffffff003fefb630 zfs:dbuf_find+5d () ffffff003fefb6c0 zfs:dbuf_hold_impl+59 () ffffff003fefb700 zfs:dbuf_hold+2e () ffffff003fefb780 zfs:dmu_buf_hold+8e () ffffff003fefb820 zfs:zap_lockdir+6d () ffffff003fefb8b0 zfs:zap_update+5b () ffffff003fefb930 zfs:zap_increment+9b () ffffff003fefb9b0 zfs:zap_increment_int+68 () ffffff003fefba10 zfs:do_userquota_update+8a () ffffff003fefba70 zfs:dmu_objset_do_userquota_updates+de () ffffff003fefbaf0 zfs:dsl_pool_sync+112 () ffffff003fefbba0 zfs:spa_sync+37b () ffffff003fefbc40 zfs:txg_sync_thread+247 () ffffff003fefbc50 unix:thread_start+8 () Any ideas what this means?

Read the article

ZFS SAS/SATA controller recommendations

- by ewwhite

I've been working with OpenSolaris and ZFS for 6 months, primarily on a Sun Fire x4540 and standard Dell and HP hardware. One downside to standard Perc and HP Smart Array controllers is that they do not have a true "passthrough" JBOD mode to present individual disks to ZFS. One can configure multiple RAID 0 arrays and get them working in ZFS, but it impacts hotswap capabilities (thus requiring a reboot upon disk failure/replacement). I'm curious as to what SAS/SATA controllers are recommended for home-brewed ZFS storage solutions. In addition, what effect does battery-backed write cache (BBWC) have in ZFS storage?

Read the article

Barracuda spam filter alternative - virtualization/appliance friendly?

- by ewwhite

I've sold and deployed Barracuda spam and web filters for years. I've always thought that the functionality was good (Barracuda Central, easy interface, effective filtering), but the hardware on the entry to midrange units is a weak point. They have single power supplies, no RAID and limited monitoring support. Personally, I think Barracuda would make a killing selling their software as a VMWare appliance, but I'm looking for something similar that I can deploy as a consultant, but will be easy for customers to manage. It should have support for server-grade hardware or the ability to be deployed as a virtual machine. Is there anything out there that's close?

Read the article

How can a Linux Administrator improve their shell scripting and automation skills?

- by ewwhite

In my organization, I work with a group of NOC staff, budding junior engineers and a handful of senior engineers; all with a focus on Linux. One interesting step in the way the company grows talent is that there's a path from the NOC to the senior engineering ranks. Viewing the talent pool as a relative newcomer, I see that there's a split in the skill sets that tends to grow over time... There are engineers who know one or several particular technologies well and are constantly immersed... e.g. MySQL, firewalls, SAN storage, load balancers... There are others who are generalists and can navigate multiple technologies. All learn enough Linux (commands, processes) to do what they need and use on a daily basis. A differentiating factor between some of the staff is how well they embrace scripting, automation and configuration management methodologies. For instance, we have two engineers who do the bulk of Amazon AWS CloudFormation work, and another who handles most of the Puppet infrastructure. Perhaps a quarter of the engineers are adept at BASH shell scripting. Looking at this in the context of the incredibly high demand for DevOps skills in the job market, I'm curious how other organizations foster the development of these skills and grow their internal talent. Scripting doesn't seem like a particularly-teachable concept. How does a sysadmin improve their shell scripting? Is there still a place for engineers who do not/cannot keep up in the DevOps paradigm? Are we simply to assume that some people will be left behind as these technologies evolve? Is that okay?

Read the article

VMWare - persistent "Host memory status" alarm in vSphere

- by ewwhite

I have a particular VMWare ESX 4.1 host that has a very persistent "Host memory status" alarm. This is running on an HP ProLiant DL360 G7 server. The HP ILO and System Management agents don't know any errors. If I clear the alarm in the vSphere client, it returns within a day. I've tried reseating DIMMs, however, the error does not indicate a problem with a specific module. There's another host in the cluster with an identical configuration. It's not exhibiting any issues. Any thoughts? This is touched on briefly on other forums (and here) with no clear resolution.

Read the article

RHEL 5/CentOS 5 - sshd becomes unresponsive

- by ewwhite

I have a number of CentOS 5.x and RHEL 5.x systems whose SSH daemons become unresponsive, preventing remote logins. The typical error from the connecting side is: $ ssh db1 db1 : ssh_exchange_identification: Connection closed by remote host Examining /var/log/messages after a forced reboot shows the following leading up to the restart: Dec 10 10:45:51 db1 sshd[14593]: fatal: Privilege separation user sshd does not exist Dec 10 10:46:02 db1 sshd[14595]: fatal: Privilege separation user sshd does not exist Dec 10 10:46:54 db1 sshd[14711]: fatal: Privilege separation user sshd does not exist Dec 10 10:47:38 db1 sshd[14730]: fatal: Privilege separation user sshd does not exist These systems use LDAP authentication and the nsswitch.conf file is configured to look at local "files" first. [root@db1 ~]# cat /etc/nsswitch.conf # # /etc/nsswitch.conf # passwd: files ldap shadow: files ldap group: files ldap hosts: files dns The Privilege-separated SSH user exists in the local password file. [root@db1 ~]# grep ssh /etc/passwd sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin Any ideas on what the root cause is? I did not see any Red Hat errata that covers this.

Read the article

Nested RDP and ILO/VMWare console sessions, latency and keystroke repetition

- by ewwhite

I'm working on a remote server installation entirely through ILO. Due to the software application and environment, my access is restricted to a Windows server that I must access through RDP. Going from that system to the target server is accomplished via HP ILO2 or ILO3. I'm trying to run a CentOS installation in an environment where I can't use a kickstart. I'm doing this via text mode, but the keystrokes are repeating randomly and it's difficult to select the proper installation options. For example: ks=http://all.yourbase.org/kickstart/ks.cfg ends up looking like: ks====httttttp://allll..yourbaseee.....org/kicksstart/ks.cccfg I'm doing this using Microsoft's native RDP client (on Mac and Windows). I've also noticed this before when running installations or doing remote work in nested sessions. Same for typing into a VMWare console in some cases. Is there a nice fix for this, or it it simply a function of the protocol(s)?

Read the article

Monitoring HP and Dell hardware in Gentoo.

- by ewwhite

I'm working in an environment that features a large number of Gentoo servers running on HP ProLiant and Dell PowerEdge equipment. While I've moved some of these systems to RedHat or CentOS for consistency, I'm still left with a good number of systems that will remain Gentoo. One of the issues I see with the Gentoo arrangement is lack of vendor-supported hardware monitoring. There doesn't seem to be an equivalent to the HP ProLiant Support Pack or Dell's agents for Gentoo. Is this simply something that you give up when using this distribution? How do you monitor hardware health and the like with Gentoo systems?

Read the article

Backup strategy for developer-focused Apple environments?

- by ewwhite

It's interesting to see the technological split between structured corporate environments and more developer-driven/startup environments. Some of the Microsoft technologies I take for granted (VSS, Folder Redirection, etc.) simply are not available when managing the increasing number of Apple laptops I see in DevOps shops. I'm interested in centralized and automated backup strategies for a group of 30-40 Apple laptops... How is this typically done safely and securely, assuming these are company-owned machines (versus BYOD)? While Apple has Time Machine, it's geared toward individual computer backups and doesn't seem to work reliably in a group setting. Another issue with these workstations is the presence of Vagrant/Virtual Box VMs on the developers' systems. Time Machine and virtual machines typically don't work well unless the VMs are excluded from the backup set. I'd like a push-based backup process with some flexible scheduling options. I know how to handle the backend storage, but I'm not sure on what needs to be presented to the client systems. Due to the nature of the data here, cloud-based backup may not be a viable option. Any suggestions about how you handle this in your environment would be appreciated. Edit: The virtual machine backups are no longer important. They can be excluded from the process and planning.

Read the article

Windows 2008 DHCP service fails - "...failed to see a directory server for authorization."

- by ewwhite

I have a small environment running Windows 2008 R2 where the DHCP service on the domain controller fails every two weeks. The most-visible error is Event ID 1059 and the Event Viewer message is: "The DHCP service failed to see a directory server for authorization." The setup features two domain controller and the usual services and roles (file, print, Exchange). Restarting the service fails for a variety of reasons. I've had the following messages at different times: "Not enough storage is available to complete this operation". "Unable to determine the DHCP Server version for the Server 192.168.x.x" "The DHCP service has detected that it is running on a DC and has no credentials configured for use with Dynamic DNS registrations initiated by the DHCP service." A reboot of the domain controller resolves the issue for ~2 weeks. The systems are virtualized and there are no network connectivity issues. Any ideas what's happening here?

Read the article

Reuse remote ssh connections and reduce command/session logging verbosity?

- by ewwhite

I have a number of systems that rely on application-level mirroring to a secondary server. The secondary server pulls data by means of a series of remote SSH commands executed on the primary. The application is a bit of a black box, and I may not be able to make modifications to the scripts that are used. My issue is that the logging in /var/log/secure is absolutely flooded with requests from the service user, admin. These commands occur many times per second and have a corresponding impact on logs. They rely on passphrase-less key exchange. The OS involved is EL5 and EL6. Example below. Is there any way to reduce the amount of logging from these actions. (By user? By source?) Is there a cleaner way for the developers to perform these ssh executions without spawning so many sessions? Seems inefficient. Can I reuse the existing connections? Example log output: Jul 24 19:08:54 Cantaloupe sshd[46367]: pam_unix(sshd:session): session closed for user admin Jul 24 19:08:54 Cantaloupe sshd[46446]: Accepted publickey for admin from 172.30.27.32 port 33526 ssh2 Jul 24 19:08:54 Cantaloupe sshd[46446]: pam_unix(sshd:session): session opened for user admin by (uid=0) Jul 24 19:08:54 Cantaloupe sshd[46446]: pam_unix(sshd:session): session closed for user admin Jul 24 19:08:54 Cantaloupe sshd[46475]: Accepted publickey for admin from 172.30.27.32 port 33527 ssh2 Jul 24 19:08:54 Cantaloupe sshd[46475]: pam_unix(sshd:session): session opened for user admin by (uid=0) Jul 24 19:08:54 Cantaloupe sshd[46475]: pam_unix(sshd:session): session closed for user admin Jul 24 19:08:54 Cantaloupe sshd[46504]: Accepted publickey for admin from 172.30.27.32 port 33528 ssh2 Jul 24 19:08:54 Cantaloupe sshd[46504]: pam_unix(sshd:session): session opened for user admin by (uid=0) Jul 24 19:08:54 Cantaloupe sshd[46504]: pam_unix(sshd:session): session closed for user admin Jul 24 19:08:54 Cantaloupe sshd[46583]: Accepted publickey for admin from 172.30.27.32 port 33529 ssh2 Jul 24 19:08:54 Cantaloupe sshd[46583]: pam_unix(sshd:session): session opened for user admin by (uid=0) Jul 24 19:08:54 Cantaloupe sshd[46583]: pam_unix(sshd:session): session closed for user admin Jul 24 19:08:54 Cantaloupe sshd[46612]: Accepted publickey for admin from 172.30.27.32 port 33530 ssh2 Jul 24 19:08:54 Cantaloupe sshd[46612]: pam_unix(sshd:session): session opened for user admin by (uid=0) Jul 24 19:08:54 Cantaloupe sshd[46612]: pam_unix(sshd:session): session closed for user admin Jul 24 19:08:55 Cantaloupe sshd[46641]: Accepted publickey for admin from 172.30.27.32 port 33531 ssh2 Jul 24 19:08:55 Cantaloupe sshd[46641]: pam_unix(sshd:session): session opened for user admin by (uid=0) Jul 24 19:08:55 Cantaloupe sshd[46641]: pam_unix(sshd:session): session closed for user admin Jul 24 19:08:55 Cantaloupe sshd[46720]: Accepted publickey for admin from 172.30.27.32 port 33532 ssh2 Jul 24 19:08:55 Cantaloupe sshd[46720]: pam_unix(sshd:session): session opened for user admin by (uid=0) Jul 24 19:08:55 Cantaloupe sshd[46720]: pam_unix(sshd:session): session closed for user admin Jul 24 19:08:55 Cantaloupe sshd[46749]: Accepted publickey for admin from 172.30.27.32 port 33533 ssh2 Jul 24 19:08:55 Cantaloupe sshd[46749]: pam_unix(sshd:session): session opened for user admin by (uid=0) Jul 24 19:08:55 Cantaloupe sshd[46749]: pam_unix(sshd:session): session closed for user admin Jul 24 19:08:55 Cantaloupe sshd[46778]: Accepted publickey for admin from 172.30.27.32 port 33534 ssh2 Jul 24 19:08:55 Cantaloupe sshd[46778]: pam_unix(sshd:session): session opened for user admin by (uid=0) Jul 24 19:08:55 Cantaloupe sshd[46778]: pam_unix(sshd:session): session closed for user admin Jul 24 19:08:55 Cantaloupe sshd[46857]: Accepted publickey for admin from 172.30.27.32 port 33535 ssh2

Read the article

HP ProLiant Smart Array "lock up" code 0x11

- by ewwhite

I've a ProLiant DL580 G7 server that experienced a storage subsystem failure during production. The system appeared available and responded to pings, but all I/O access stalled (the system load must have been 100+). The ASR did not trigger at the specified watchdog timeout. I had to force a reboot from the ILO. During POST, I received the following error: A controller failure event occurred prior to this power-up. (Previous lock up code = 0x11) I haven't pulled the ADU report yet, but I'm curious as to what this error actually means. I was not responsible for the the installation, but can see that the firmware is very old. But if there's anything else I should know about the error, I'd like to know for the post-mortem report. edit - I should add that the server had 95 days of uptime prior to the lock up.

Read the article

Nested RDP and ILO sessions, latency and keystroke repetition.

- by ewwhite

I'm working on a remote server installation entirely through ILO. Due to the software application and environment, my access is restricted to a Windows server that I must access through RDP. Going from that system to the target server is accomplished via HP ILO3. I'm trying to run a CentOS installation in an environment where I can't use a kickstart. I'm doing this via text mode, but the keystrokes are repeating randomly and it's difficult to select the proper installation options. I'm doing this using Microsoft's native RDP client (on Mac and Windows). I've also noticed this before when running installations or doing remote work in nested sessions. Is there a nice fix for this, or it it simply a function of the protocol?

Search Results

Search found 41 results on 2 pages for 'ewwhite'.

Page 1/2 | 1 2 | Next Page >

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

- by ewwhite

1 2 | Next Page >