ACL Madness
On my current project, our customer makes use of DataDomain storage to provide nearline backups of system data. Backups to these devices are primarily done through tools like Veeam (for ESX-hosted virtual machines and associated application data) and NetBackup.
However, a small percentage of the tenants my customer hosts are "self backup" tenants. In this case, "self backup" means that, rather than leveraging the enterprise backup frameworks, the tenants are given an NFS or CIFS share directly off of the DataDomain that: A) is closest to their system(s); and, B) has the most free space to accommodate their data.
"Self backup" tenants that use NFS shares are minimally problematic. Most of the backup problems come from the fact that DataDomains weren't really designed for multi-tenancy. Things like quota controls are fairly lacking. So, it's possible for a tennants of a shared DataDomain to screw each other over by either soaking up all of the device's bandwidth or soaking up all the space.
Still, those problems aside, providing NFS service to tenants is fairly straight-forward. You create a directory, you go into the NFS share-export interface, create the share and access controls and call it a day. CIFS shares, on the other hand...
While we'd assumed that providing CIFS service would be on a par with providing NFS service, it's proven to be otherwise. While the DataDomains provide an NTFS-style ACL capability in their filesystem, it hasn't proven to work quite as one might expect.
The interface for creating shares allows you to set share-level access controls based on both calling-host as well as assigned users and/or groups. One would reasonably assume that this would mean that the correct way to set up a share is to export it with appropriate client-allow lists and user/group-allow lists and that the shares would be set with appropriate filesystem permissions automagically. This isn't exactly how it's turned out to work.
What we've discovered is that you pretty much have to set the shares up as being universally accessible from all CIFS clients and that you grant global "full control" access to the top-level share-folder. Normally, this would be a nightmare, but, once created, you can lock the shares down. You just have to manage the NTFS attributes from a Windows-based host. Basically, you create the share, present it to a Windows-based administrative host, then use the Windows folder security tools to modify the permissions on the share (e.g., remove all the "Everyone" rights, then manually assign appropriate ownerships and POSIX groups to the folder, and set up the correct DACLs).
From an engineering perspective, it means that you have to document the hell out of things and try your best to train the ops folks on how to do things The Right Way™. Then, with frequent turnovers in Operations and other "shit happens" kind of things, you have to go back and periodically audit configurations for correctness and repair the brokenness that has crept in.
Unfortunately, one of the biggest sources of brokenness that creeps in is broken permissions structures. When doing the initial folder-setup, it's absolutely critical that the person setting up the folder remembers to click the "Replace all child object permissions with inheritable permissions from this object" checkbox (accessed by clicking on the "Change Permissions" button within the "Advanced Security Settings" section for the folder). Failure to do so means that each folder, subfolder and file created (by tenants) in the share has its own, tenant-created permissions structure. What this results in is a share whose permissions are not easily maintainable by the array-operators. Ultimately, it results in trouble tickets opened by tenants whose applications and/or operational folks eventually break access for themselves.
Once those tickets come in, there's not much that can be easily done if the person who "owns" the share has left the organization. If you find yourself needing to fix such a situation, you need to either involve DataDomain's support staff to fix it (assuming your environment is reachable via a WebEx-type of support session) or get someone to slip you instructions on how to access the array's "Engineering Mode".
Since accessing this mode isn't well documented, and I use this site as a personal reminder of how to do things, I'm going to put the procedures here.
Please note: use of engineering mode allows you to do major amounts of damage to your data with a frightening degree of ease and rapidity. Don't try to access engineering mode unless you're fully prepared to have to re-install your DataDomain - inclusive of destroying what's left of the data on it.
Accessing Engineering Mode:
SSH to the DataDomain.
Login with an account that has system administrator privileges (this may be one of the default accounts your array was installed with, a local account you've set up for the purpose, or an Active Directory-managed account that has been placed into an Active Directory security-group that has been granted the system administrator role on the DataDomain).
Get the array's serial number. The easiest way to do this is to type `system show serialno` at the default command prompt.
Access SE mode by typing `priv set se`. You will be prompted for a password - the password is the serial number from the prior step.
At this point, your command prompt will change to "SE@<ARRAYNAME>" where "<ARRAYNAME>" will be the nodename of your DataDomain. Once you're in SE mode, the following command-sequence will allow you to access the engineering mode's BASH shell:
Type "fi st"
Type "df"
Type <CTRL>-C three times
Type "shell-escape"
At this point, a warning banner will come up to remind you of the jeopardy you've put your configuration in. The prompt will also change to include a warning. This is DataDomain's way of reminding you, at every step, the danger of the access-level you've entered.
Once you've gotten the engineering BASH shell, you have pretty much unfettered access to the guts of the DataDomain. The BASH shell is pretty much the same as you'd encounter on a stock Linux system. Most of the GNU utilities you're used to using will be there and will work the same way they do on Linux. You won't have man pages, so, if you forget flags to a given shell command, look them up on a Linux host that has the man pages installed. In addition to the standard Linux commands will be some DataDomain-specific commands. For the purposes of fixing your NTFS ACL mess, you'll be wanting to use the "dd_xcacls" command:
Use "dd_xcacls -O '[DomainObject]' [Object]" to set the Ownership of an object. For example, to set the ownership attribute to your AD domain account, issue the command "dd_xcacls -O 'MDOMAIN\MYUSER' /data/col1/backup/ShareName".
Use "dd_xcacls -G '[DomainObject]' [Object]" to set the POSIX group of an object. For example, to set the POSIX group attribute to your AD domain group, issue the command "dd_xcacls -O 'MDOMAIN\MYUSER' /data/col1/backup/ShareName".
Use "dd_xcacls -D '[ActiveDirectorySID]:[Setting]/[ScopeMask]/[RightsList]' [OBJECT]" to set the POSIX group of an object. For example, to give "Full Control" rights to your domain account, issue the command "dd_xcacls -D 'MDOMAIN\MYUSER:ALLOW/4/FullControl' /data/col1/backup/ShareName".
A couple of notes apply to the immediately preceding:
While the "dd_xcacls" command can notionally set rights-inheritance, I've discovered that this isn't 100% reliable in the DDOS 5.1.x family. It will likely be necessary that once you've placed the desired DACLs on the filesystem objects, you'll need to use a Windows system to set/force inheritance onto objects lower in the filesystem hierarchy.
When you set a DACL with "dd_xcacls -D", it replaces whatever DACLs are in place. Any permissions previously on the filesystem object will be removed. If you want more than one user/group DACL applied to the filesystem-object, you need to apply them all at once. Use the ";" token to separate DACLs within the quoted text-argument to the "-D" flag.
Because you'll need to fix all of your permissions, one at a time, from this mode, you'll want to use the Linux `find` command to power your use of the "dd_xcacls" command. On a normal Linux system, when dealing with filesystems that have spaces in directory or file object-names, you'd do something like `find [DIRECTORY] -print0 | xargs -0 [ACTION]` to more efficiently handle this. However, that doesn't seem to work exactly like on a generic Linux system, at least not on the DDOS 5.x systems I've used. Instead, you'll need to use a `find [Directory] -exec [dd_xcacls command-string] {} \;`. This is very slow and resource intensive. On a directory structure with thousands of files, this can take hours to run. Further, because of how resource-intensive using this method is, you won't be able to run more than one such job at a time. Attempting to do so will result in SEGFAULTs - and the more you attempt to run concurrently, the more frequent the SEGFAULTs will be. These SEGFAULTs will cause individual "dd_xcacls" iterations to fail, potentially leaving random filesystem objects permissions unmodified.
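Pulling the pieces together, a remediation pass over a share ends up looking something like this (a sketch only - the share path, domain and account/group names are placeholders, and the DACL string should contain whatever combination of grants you actually need, all in one shot):

# Re-set ownership, POSIX group and DACLs, one filesystem object at a time.
find /data/col1/backup/ShareName -exec dd_xcacls -O 'MDOMAIN\ADMINUSER' {} \;
find /data/col1/backup/ShareName -exec dd_xcacls -G 'MDOMAIN\ADMINGROUP' {} \;
# All DACLs for an object have to be applied in a single -D invocation:
find /data/col1/backup/ShareName -exec dd_xcacls -D 'MDOMAIN\ADMINGROUP:ALLOW/4/FullControl;MDOMAIN\TENANTGROUP:ALLOW/4/FullControl' {} \;

Run these passes one at a time - as noted above, trying to run more than one concurrently invites SEGFAULTs and silently-skipped objects.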
LVM Online Relayout
Prior to coming to the Linux world, most of my complex, software-based storage taskings were performed under the Veritas Storage Foundation framework. In recent years, working primarily in virtualized environments, most storage tasks are done "behind the scenes" - either at the storage array level or within the context of VMware. Up until today, I had no cause to worry about converting filesystems from using one underlying RAID-type to another.
Today, someone wanted to know, "how do I convert from a three-disk RAID-0 set to a six-disk RAID-10 set?" Under Storage Foundation, this is just an online relayout operation - converting from a simple volume to a layered volume. Until I dug into it, I wasn't aware that LVM was capable of layered volumes, let alone online conversion from one volume-type to another.
At first, I thought I was going to have to tell the person (since Storage Foundation wasn't an option for them), "create your RAID-0 sets with `mdadm` and then layer RAID-1 on top of those MD-sets with LVM". Turns out, you can do it in LVM (I spun up a VM in our lab and worked through it).
Basically the procedure assumes that you'd previously:
Attached your first set of disks/LUNs to your host
Used the usual LVM tools to create your volumegroup and LVM objects (in my testing scenario, I set up a three-disk RAID-0 with a 64KB stripe-width)
Created and mounted your filesystem.
Gone about your business.
Having done the above, your underlying LVM configuration will look something like:
# vgdisplay AppVG
  --- Volume group ---
  VG Name               AppVG
  System ID
  Format                lvm2
  Metadata Areas        3
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               0
  Max PV                0
  Cur PV                3
  Act PV                3
  VG Size               444.00 MB
  PE Size               4.00 MB
  Total PE              111
  Alloc PE / Size       102 / 408.00 MB
  Free  PE / Size       9 / 36.00 MB
  VG UUID               raOK8i-b0r5-zlcG-TEqE-uCcl-VM3L-RelQgX
# lvdisplay /dev/AppVG/AppVol
  --- Logical volume ---
  LV Name                /dev/AppVG/AppVol
  VG Name                AppVG
  LV UUID                6QuQSv-rklG-pPv6-Tq6I-TuI0-N50T-UdQ4lu
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                408.00 MB
  Current LE             102
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     768
  Block device           253:7
Take special note that there are free PEs available in the volumegroup. In order for the eventual relayout to work, you have to leave space in the volume group for LVM to do its reorganizing magic. I've found that a 10% set-aside has been safe in testing scenarios - possibly even overly generous. In a large, production configuration, that set-aside may not be enough.
When you're ready to do the conversion from RAID-0, add a second set of identically-sized disks to the system. Format the new devices and use `vgextend` to add the new disks to the volumegroup.
Note: Realistically, so long as you increase the number of available blocks in the volumegroup by at least 100%, it likely doesn't matter whether you add the same number/composition of disks to the volumegroup. Differences in mirror compositions will mostly be a performance rather than an allowed-configuration issue.
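For example, if the new disks showed up as /dev/sdd through /dev/sdf (device names here are purely illustrative), the preparation would look something like:

pvcreate /dev/sdd /dev/sde /dev/sdf        # initialize the new disks as LVM physical volumes
vgextend AppVG /dev/sdd /dev/sde /dev/sdf  # add them to the existing volumegroup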
Once the volumegroup has been sufficiently-grown, use the command `lvconvert -m 1 /dev/<VolGroupName>/<VolName>` to change the RAID-0 set to a RAID-10 set. The `lvconvert` works with the filesystem mounted and in operation - technically, there's no requirement to take an outage window to do the operation. As the `lvconvert` runs, it will generate progress information similar to the following:
AppVG/AppVol: Converted: 0.0%
AppVG/AppVol: Converted: 55.9%
AppVG/AppVol: Converted: 100.0%
Larger volumes will take a longer period of time to convert (activity on the volume will also increase the time required for conversion). Output is generated at regular intervals. The longer the operation takes, the more lines of status output that will be generated.
Once the conversion has completed, you can verify that your RAID-0 set is now a RAID-10 set with the `lvdisplay` tool:
lvdisplay /dev/AppVG/AppVol
  --- Logical volume ---
  LV Name                /dev/AppVG/AppVol
  VG Name                AppVG
  LV UUID                6QuQSv-rklG-pPv6-Tq6I-TuI0-N50T-UdQ4lu
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                408.00 MB
  Current LE             102
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     768
  Block device           253:7
The addition of the "Mirrored Volumes" line indicates that the logical volume is now a mirrored RAID-set.
No-POST Reboots
One of the things that's always sorta bugged me about Linux was that reboots generally required a full reset of the system. That is, if you did an `init 6`, the default behavior caused your system to drop back down to the boot PROM and go through its POST routines. On a virtualized system, this is mostly an inconvenience, as virtual hardware POST is fairly speedy. However, when you're running on physical hardware, it can be a genuine hardship (the HP-based servers I typically work with can take upwards of ten to fifteen minutes to run their POST routines).
At any rate, a few months back I was just dinking around online and found a nifty method for doing a quick, no-BIOS reboot:
# BOOTOPTS=`cat /proc/cmdline` ; KERNEL=`uname -r` ; \
  kexec -l /boot/vmlinuz-${KERNEL} --initrd=/boot/initrd-"${KERNEL}".img \
  --append="${BOOTOPTS}" ; reboot
Basically, the above:
Reads your /proc/cmdline file to get the boot arguments of your most recent bootup and stuffs the value into the BOOTOPTS environment variable
Grabs your system's currently running kernel release (in case you've got multiple kernels installed and want to boot back into the current one) and stuffs the value into the KERNEL environment variable
Calls the `kexec` command (a nifty utility for directly booting into a new kernel), leveraging the previously-set environment variables to tell `kexec` what to do
Finishes with a reboot to the `kexec`-defined kernel (and options)
I tested it on a physical server that normally takes about 15 minutes to reboot (10+ minutes of POST routines) and it sped the reboot up by over 66%. On a VM, it only saves maybe a minute (though that will depend on your VM's configuration settings).
Why So Big
Recently, while working on getting a software suite ready for deployment, I had to find space in our certification testing environment (where our security guys scan hosts/apps and decide what needs to be fixed for them to be safe to deploy). Our CTA environment is unfortunately tight on resources. The particular app I have to get certified wants 16GB of RAM to run in but will accept as little as 12GB (less than that and the installer utility aborts).
When I went to submit my server (VM, actually) requirements to our CTA team so they could prep me an appropriate install host, they freaked. "Why does it take so much memory" was the cry. So, I dug through the application stack.
The application includes an embedded Oracle instance that wants to reserve about 9GB for its SGA and other set-asides. It's going on a 64bit RedHat server and RedHat typically wants 1GB of memory to function acceptably (can go down to half that, but you won't normally be terribly happy). That accounted for 10GB of the 12GB minimum the vendor was recommending.
Unfortunately, the non-Oracle components of the application stack didn't seem to have a single file that described memory set asides. It looked like it was spinning up two Java processes with an aggregate heap size of about 1GB.
Added to the prior totals, the aggregated heap sizes put me at about 11GB of the vendor-specified 12GB. That still left an unaccounted for 1GB. Now, it could have been the vendor was requesting 12GB because it was a "nice round number" or they could have been adding some slop to their equations to give the app a bit more wiggle-room.
I could have left it there, but decided, "well, the stack is running, let's see how much it really uses". So, I fired up top. Noticed that the Oracle DB ran under one userid and that the rest of the app-stack ran under a different one. I set top to look only at the userid used by the rest of the app-stack. The output was too long to fit on one screen and I was too lazy to want to add up the RSS numbers, myself. Figured since top wasn't a good avenue, I might be able to use ps (since the command supports the Berkeley-style output options).
Time to hit the man pages...
After digging through the man pages and a bit of cheating (Google is your friend) I found the invocation of ps that I wanted:
`ps -u <appuser> -U <appuser> -orss=`.
Horse that to a nice `awk '{ sum += $1 } END { print sum }'` and I had a quick method of divining how much resident memory the application was actually eating up. What I found was that the app-stack had 52 processes (!) that had about 1.7GB of resident memory tied up. Mystery solved.
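For reference, the whole check boils down to a one-liner along these lines (a sketch - substitute the real application userid for the placeholder appuser; note that ps reports RSS in kilobytes):

ps -u appuser -U appuser -o rss= \
  | awk '{ sum += $1; n++ } END { printf "%d processes, %.1f GB resident\n", n, sum/1024/1024 }'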
Why Google's Two-Factor Authentication Is Junk
To be fair, I understand the goals that Google was trying to achieve. And, they're starting down a good path. However, there are some serious flaws (as I see it) with how they've decided to treat services that don't support two-factor authentication.
Google advertises that you can set per-service passwords for each application. That is to say, if you use third-party mail clients such as Thunderbird, third-party calendaring clients such as Lightning, and third-party chat clients such as Trillian, you can set up a password for each service. Conceivably, one could set one password for IMAP, one password for SMTP, one password for iCAL and yet another password for GoogleTalk. However, Google doesn't actually sandbox the passwords. By "sandbox" I mean restrict a given password to a specific protocol. If I generate four passwords with the intention of using each password once for each service - as Google's per-application passwords would logically be inferred to work - one actually weakens the security of the associated services. Instead of each service having its own password, each of the four, generated passwords can be used with any of the four targeted services. Instead of having one guessable password, there are now four guessable passwords.
Google's "per-application" passwords do not allow you to set your password strings. You have to use their password generator. While I can give Google credit for generating 16-character passwords, the strength of the generated passwords is abysmally low. Google's generated passwords are comprised solely of lower case characters. When you go to a "how strong is my password site", Google's generated passwords are ridiculously easy. The Google password is rated at "very weak" - a massive cracking array would take 14 years to break it. By contrast, the password I used on my work systems, last December, is estimated take the best part of 16,000 centuries. For the record, my work password from last year is two characters shorter than the ones Google generates.
So, what you end up with is X number of services that are each authenticatable against with X number of incredibly weak passwords.
All in all, I'd have to rate Google's efforts, at this point, pretty damned close to #FAIL: you have all the inconvenience of two-factor authentication and you actually broaden your attack surfaces if you use anything that's not HTTP/HTTPS based.
Resources:
GRC Password Cracker Estimator
PasswordMeter Password Strength Checker
Finding Patterns
I know that most of my posts are of the nature "if you're trying to accomplish 'X', here's a way that you can do it". Unfortunately, this time, my post is more of a "I've got yet-to-be-solved problems going on". So, there's no magic-bullet fix currently available for those with similar problems who found this article. That said, if you are suffering similar problems, know that you're not alone. Hopefully that's some small consolation, and what follows may even help you in investigating your own problem.
Oh: if you have suggestions, I'm all ears. Given the nature of my configuration, there's not much in the way of useful information I've yet found via Google. Any help would be much appreciated...
The shop I work for uses Symantec's Veritas NetBackup product to perform backups of physical servers. As part of our effort to make more of the infrastructure tools we use more enterprise-friendly, I opted to leverage NetBackup 7.1's NetBackup Access Control (NBAC) subsystem. On its own, it provides fine-grained rights-delegation and role-based access control. Horse it to Active Directory and you're able to roll out a global backup system with centralized authentication and rights-management. That is, you have all that when things work.
For the past couple months, we've been having issues with one of the dozen NetBackup domains we've deployed into our global enterprise. When I first began trougleshooting the NBAC issues, the authentication and/or authorization failures had always been associated with a corruption of LikeWise's sqlite cachedb files. At the time the issues first cropped up, these corruptions always seemed to coincide with DRS moving the NBU master server from one ESX host to another. It seemed like, when under sufficiently heavy load - the kind of load that would trigger a DRS event - LikeWise didn't respond well to having the system paused and moved. Probably something to do with the sudden apparent time-jump that happens when a VM is paused for the last parts of the DRS action. My "solution" to the problem was to disable automated-relocation for my VM.
This seemed to stabilize things. LikeWise was no longer getting corrupted and it seemed like I'd been able to stabilize NBAC's authentication and authorization issues. Well, they stabilized for a few weeks.
Unfortunately, the issues have begun to manifest themselves, again, in recent weeks. We've now had enough errors that some patterns are starting to emerge. Basically, it looks like something is horribly bogging the system down around the time that the nbazd crashes are happening. I'd located all the instances of nbazd crashing from its log files ("ACI" events are logged to the /usr/openv/netbackup/logs/nbazd logfiles), and then began to try correlating them with system load shown by the host's sadc collections. I found two things: 1) I probably need to increase my sample frequency - it's currently at the default 10-minute interval - if I want to more-thoroughly pin down and/or profile the events; 2) when the crashes have happened within a minute or two of an sadc poll, I've found that the corresponding poll was either delayed by a few seconds to a couple minutes or was completely missing. So, something is causing the server to grind to a standstill and nbazd is a casualty of it.
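Bumping the sample frequency is a quick change, if it comes to that. On a stock RedHat sysstat install, the collection interval is driven by cron (a sketch - the sa1 path varies by release and architecture, so check what your /etc/cron.d/sysstat actually references):

# /etc/cron.d/sysstat - change the default ten-minute collection to two-minute samples
*/2 * * * * root /usr/lib64/sa/sa1 1 1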
For the sake of thoroughness (and what's likely to have matched on a Google-search and brought you here), what I've found in our logs are messages similar to the following:
/usr/openv/netbackup/logs/nbazd/vxazd.log
07/28/2012 05:11:48 AM VxSS-vxazd ERROR V-18-3004 Error encountered during ACI repository operation.
07/28/2012 05:11:48 AM VxSS-vxazd ERROR V-18-3078 Fatal error encountered. (txn.c:964)
07/28/2012 05:11:48 AM VxSS-vxazd LOG V-18-4204 Server is stopped.
07/30/2012 01:13:31 PM VxSS-vxazd LOG V-18-4201 Server is starting.
/usr/openv/netbackup/logs/nbazd/debug/vxazd_debug.log
07/28/2012 05:11:48 AM Unable to set transaction mode. error = (-1)
07/28/2012 05:11:48 AM SQL error S1000 -- [Sybase][ODBC Driver][SQL Anywhere] Connection was terminated
07/28/2012 05:11:48 AM Database fatal error in transaction, error (-1)
07/30/2012 01:13:31 PM _authd_config.c(205) Conf file path: /usr/openv/netbackup/sec/az/bin/VRTSaz.conf
07/30/2012 01:22:40 PM _authd_config.c(205) Conf file path: /usr/openv/netbackup/sec/az/bin/VRTSaz.conf
Our NBU master servers are hosted on virtual machines. It's a supported configuration and adds a lot of flexibility and resiliency to the overall enterprise-design. It also means that I have some additional metrics available to me to check. Unfortunately, when I checked those metrics, while I saw utilization spikes on the VM, those spikes corresponded to healthy operations of the VM. There weren't any major spikes (or troughs) during the grind-downs. So, to ESX, the VM appeared to be healthy.
At any rate, I've requested our ESX folks see if there might be anything going on on the physical systems hosting my VM that aren't showing up in my VM's individual statistics. I'd previously had to disable automated DRS actions to keep LikeWise from eating itself - those DRS actions wouldn't have been happening had the hosting ESX system not been experiencing loading issues - perhaps whatever was causing those DRS actions is still afflicting this VM.
I've also tagged one of our senior NBU operators to start picking through NBU's logs. I've asked him to look to see if there are any jobs (or combinations of jobs) that are always running during the bog-downs. If it's a scheduling issue (i.e., we're to blame for our problems), we can always reschedule jobs to exert less loading or we can scale up the VM's memory and/or CPU reservations to accommodate such problem jobs.
For now, it's a waiting-game. At least there's an investigation path, now. It's all in finding the patterns.
#Active Directory#authentication#authorization#ESX#NBAC#nbazd#NetBackup#NetBackup 7.x#RedHat#RHEL#RHEL 5#VMware
When VMs Go *POOF!*
Like many organizations, the one I work for is following the pathway to the cloud. Right now, this consists of moving towards an Infrastructure As A Service model. This model heavily leverages virtualization. As we've been pushing towards this model, we've been doing the whole physical to virtual dance, wherever possible.
In many cases, this has worked well. However, like many large IT organizations, we have found some assets on our datacenter floors that have been running for years on both outdated hardware and outdated operating systems (and, in some cases, "years" means "in excess of a decade"). Worse, the people that set up these systems are often no longer employed by our organization. Worse still, the vendors of the software in use long-ago discontinued either the support of the software version in use or no longer maintain any version of the software.
One such customer was migrated last year. They weren't happy about it at the time, but, they made it work. Because we weren't just going to image their existing out-of-date OS into the pristine, new environment, we set them up with a virtualized operating environment that was as close to what they'd had as was possible. Sadly, what they'd had was a 32bit RedHat Linux server that had been built 12 years previously. Our current offerings only go back as far as RedHat 5, so that's what we built them. While our default build is 64bit, we offer 32bit builds - unfortunately, the customer never requested a 32bit environment. The customer did their level best to massage the new environment into running their old software. Much of this was done by tar'ing up directories on the old system and blowing them onto their new VM. They'd been running on this massaged system for nearly six months.
If you noticed that 32-to-64bit OS-change, you probably know where this is heading...
Unfortunately, as is inevitable, we had to take a service outage across the entire virtualized environment. We don't yet have the capability in place to live migrate VMs from one data center to another for such windows. Even if we had, the nature of this particular outage (installation of new drivers into each VM) was such that we had to reboot the customer's VM, any way.
We figured we were good to go as we had 30+ days of nightly backups of each impacted customer VM. Sadly, this particular customer, after doing the previously-described massaging of their systems, had never bothered to reboot their system. It wasn't (yet) in our procedures to do a precautionary pre-modification reboot of the customers' VMs. The maintenance work was done and the customer's VM rebooted. Unfortunately, the system didn't come back from the reboot. Even more unfortunately, the thirty-ish backup images we had for the system were similarly unbootable and, worse, unrescuable.
Eventually, we tracked down the customer to inform them of the situation and to find out if the system was actually critical (the VM had been offline for nearly a full week by the time we located the system owner, but no angry "where the hell is my system" calls had been received or tickets opened). We were a bit surprised to find that this was, somehow, a critical system. We'd been able to access the broken VM to a sufficient degree to determine that it hadn't been rebooted in nearly six months, that its owner hadn't logged in in nearly that same amount of time, but that the on-disk application data appeared to be intact (the filesystems they were on were mountable without errors). So, we were able to offer the customer the option of building them a new VM and helping them migrate their application data off the old VM to the new VM.
We'd figured we'd just restore data from the old VM's backups to a directory tree on the new VM. The customer, however, wanted the original disks back as mounted disks. So, we had to change plans.
The VMs we build make use of the Linux Volume Manager software to manage storage allocation. Each customer system is built off of standard templates. Thus, each VM's default/root volume groups all share the same group name. Trying to import another host's root volume group onto a system that already has a volume group of the same name tends to be problematic in Linux. That said, it's possible to massage (there's that word again) things to allow you to do it.
The customer's old VM wasn't bootable to a point where we could simply FTP/SCP its /etc/lvm/backup/RootVG file off. The security settings on our virtualization environment also meant we couldn't cut and paste from the virtual console into a text file on one of our management hosts. Fortunately, our backup system does support file-level/granular restores. So, we pulled the file from the most recent successful backup.
Once you have an /etc/lvm/backup/<VGNAME> file available, you can effect a name change on a volume group relatively easily. Basically, you:
Create your new VM
Copy the old VM's /etc/lvm/backup/<VGNAME> into the new VM's /etc/lvm/backup directory (with a new name)
Edit that file, changing the old volume group's object names to ones that don't collide with the currently-booted root volume group's (really, you only need to change the name of the root volume group object - the volumes can be left as-is)
Connect the old VM's virtual disk to the new VM
Perform a rescan of the VM's scsi bus so it sees the old VM's virtual disk
Do a `vgcfgrestore -f /etc/lvm/backup/<VGNAME> <NewVGname>`
Do a `pvscan`
Do a `vgscan`
Do a `vgchange -ay <NewVGname>`
Mount up the renamed volume group's volumes
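Strung together on the new VM, the recovery pass looks roughly like the following (a sketch only - the volume group, volume, mountpoint and SCSI-host names are placeholders, and it assumes the old VM's LVM backup file has been copied in as /etc/lvm/backup/OldRootVG with the VG name inside it edited to OldRootVG):

echo "- - -" > /sys/class/scsi_host/host0/scan        # rescan so the old VM's attached disk shows up
vgcfgrestore -f /etc/lvm/backup/OldRootVG OldRootVG   # lay the edited metadata back onto the old disk
pvscan
vgscan
vgchange -ay OldRootVG                                # activate the renamed volume group
mkdir -p /mnt/oldvm
mount /dev/OldRootVG/AppVol /mnt/oldvm                # mount the recovered volume(s)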
As a side note (to answer "why didn't you just do a `boot linux rescue` and change things that way"): our security rules prevent us from keeping bootable ISOs (etc.) available to our production environment. Therefore, we can't use any of the more "normal" methods for diagnosing or recovering a lost system. Security rules trump diagnostics and/or recoverability needs. Dems da breaks.
Asymmetrical NIC-bonding
Currently, the organization I am working for is making the transition from 1Gbps networking infrastructure to 10Gbps infrastructure. The initial goal had been to first migrate all of the high network-IO servers that were using trunked 1Gbps interfaces to using 10Gbps Active/Passive configurations.
Given the current high per-port cost of 10Gbps networking, it was requested that a way be found to not waste 10Gbps ports. Having a 10Gbps port sitting idle "in case" the active port became unavailable was seen as financially wasteful. As a result, we opted to pursue the use of asymmetrical A/P bonds that used our new 10Gbps links for the active/primary path and reused our 1Gbps infrastructure for the passive/failover path.
Setting up bonding on Linux can be fairly trivial. However, when you start to do asymmetrical bonding, you want to ensure that your fastest paths are also your active paths. This requires some additional configuration of the bonded pairs beyond just the basic declaration of the bond memberships.
In a basic bonding setup, you'll have three primary files in the /etc/sysconfig/network-scripts directory: ifcfg-ethX, ifcfg-ethY and ifcfg-bondZ. The ifcfg-ethX and ifcfg-ethY files are basically identical but for their DEVICE and HWADDR parameters. At their most basic, they'll each look (roughly) like:
DEVICE=ethN
HWADDR=AA:BB:CC:DD:EE:FF
ONBOOT=yes
BOOTPROTO=none
MASTER=bondZ
SLAVE=yes
And the (basic) ifcfg-bondZ file will look like:
DEVICE=bondZ
ONBOOT=yes
BOOTPROTO=static
NETMASK=XXX.XXX.XXX.XXX
IPADDR=WWW.XXX.YYY.ZZZ
MASTER=yes
BONDING_OPTS="mode=1 miimon=100"
This type of configuration may produce the results you're looking for, but it's not guaranteed to. If you want to absolutely ensure that your faster NIC will be selected as the primary NIC (and that the bond will fail back to that NIC in the event that the faster NIC goes offline and then back online), you need to be a bit more explicit with your ifcfg-bondZ file. To do this, you'll mostly want to modify your BONDING_OPTS directive. I also tend to add some BONDING_SLAVEn directives, but that might be overkill. Your new ifcfg-bondZ file that forces the fastest path will look like:
DEVICE=bondZ
ONBOOT=yes
BOOTPROTO=static
NETMASK=XXX.XXX.XXX.XXX
IPADDR=WWW.XXX.YYY.ZZZ
MASTER=yes
BONDING_OPTS="mode=1 miimon=100 primary=ethX primary_reselect=1"
The primary= tells the bonding driver to set the ethX device as primary when the bonding-group first onlines. The primary_reselect= tells it to use an interface-reselection policy of "better".
Note: The default policy is "0". This policy simply says "return to the interface declared as primary". I choose to override with policy "1" as a hedge against the primary interface coming back in some kind of degraded state (while most of our 10Gbps media is 10Gbps-only, some of the newer ones are 100/1000/10000). I only want to fail back to the 10Gbps interface if it's still running at 10Gbps and hasn't, for some reason, negotiated down to some slower speed.
When using the more explicit bonding configuration, the resultant configuration will resemble something like:
Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: ethX (primary_reselect better)
Currently Active Slave: ethX
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: ethX
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: AA:BB:CC:DD:EE:FF

Slave Interface: ethY
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: FF:EE:DD:CC:BB:AA
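For reference, the summary above is what you get from inspecting the bonding driver's status file for the bonded interface:

cat /proc/net/bonding/bondZ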
ACL Madness
In our environment, security is a concern. So, many programs and directories that you might take as being available to you on a "standard" Linux system will give you a "permission denied" on one of our systems. Traditionally, you might work around this by changing the group-ownership and permissions on the object to allow a subset of the system users the expected level of access to those files or directories. And, this will work great if all the people that need access share a common group. If they don't, you have to explore other avenues. One of these avenues is the extended permissions attributes found in Linux's ACL handling (gonna ignore SELinux, here, mostly because: A) I'm lazy; B) I freaking hate SELinux - probably because of "A"; and, C) ACLs are a feature that is available across different implementations of UNIX and Linux, and is thus more portable than SELinux or other vendor-specific extended security attribute mechanisms). Even better, as a POSIX filesystem extension, you can use it on things like NFS shares and have your ACLs work across systems and *N*X platforms (assuming everyone implements the POSIX ACLs the same way).
And, yes, I know things like `sudo` can be used to delegate access, but that's fraught with its own concerns as well, not least of which is its lack of utility for users that aren't logging in with interactive shells.
Say you have a file `/usr/local/sbin/killmenow` that is currently set to mode 700 and you want to give access to it to members of the ca_opers and md_admins groups. You can do something like:
# setfacl -m g:ca_opers:r-x /usr/local/sbin/killmenow
# setfacl -m g:md_admins:r-x /usr/local/sbin/killmenow
Now, members of both the ca_opers and md_admins groups can run this program.
All is well and good until someone asks, "what files have had their ACLs modified to allow this?" and you (or others on your team) have gone crazy with the setfacl command. With standard file permissions, you can just use the `find` command to locate files that have a specific permissions-setting. With ACLs, `find` is mostly going to let you down. So, what to do? Fortunately, `setfacl`'s partner-command, `getfacl`, can come to your rescue. Doing something like:
# getfacl --skip-base -R <directory> 2> /dev/null | sed -n 's/^# file: //p'
Will walk the directory structure, from <directory> downward, giving you a list of files with ACLs added to them. Once you've identified such-modified files, you can then run `getfacl` against them, individually, to show what the current ACLs are.
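If you want the full ACL listing for each flagged object in a single pass, a small loop over that output does the trick (a sketch - /usr/local is just an example starting point, and --absolute-names keeps the reported paths usable regardless of your working directory):

getfacl --skip-base --absolute-names -R /usr/local 2>/dev/null \
  | sed -n 's/^# file: //p' \
  | while read -r obj; do
      getfacl "$obj"    # dump the full ACL for each object that has one
    done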
FaceBook Security Fail
Ok, this is a bit of a departure from my usual, work-related posting. That said, security fits into the rubric of "more serious postings", even if that security is related to social networks. Who knows: maybe I'll decide to move it, later. At any rate...
I'm a big fan of social networking. In addition to my various blogs (on Posterous, BlogSpot, Tumblr, etc.), I make heavy use of services like FaceBook, Plus and Twitter (and have "personal" and "work" personnae set up on each). On some services (FaceBook and LiveJournal) I run my accounts fairly locked-down; on others (most everything else), I run things wide-open. In either case, I'm not exactly the most censored of individuals. The way I look at it, if someone's going to use my postings against me, I may as well make it easy for them to do it up front, rather than put myself in a position where I'm invested in an individual or an organization only to have my posting history subsequently sabotage that. It's a pick-your-poison kind of thing.
That said, not everyone is quite as laissez-faire about their online sharing activities. So, in general, I prefer to keep my stuff at least as locked down as the members of my sharing community do. That way, there's a lower likelihood that my activities will accidentally compromise someone else.
Today, a FaceBook friend of mine was trying to sort through the myriad security settings available to her so that she could create a "close friends only" kind of profile. She'd thought that she'd gotten things pretty locked down, until an unexpected personal message revealed to her that she had information leakage, somewhere, in her FaceBook usage. I was trying to help her ID it.
While my friend had fairly restrictively locked down her profile, she wasn't aware that certain actions could compromise those settings. Specifically, she wasn't aware that if she posted a comment to a public (or at least more open) thread, that others would be able to "see her". She'd assumed that if she set all of her security buttons-n-dials to "friends only" that anything she did would be kept friends only. With FaceBook, that's mostly the case. However, if you comment on a thread started by another friend, then everyone who is able to see that thread can "see" (aspects of) you, as well. Thus, if a friend starts a thread and has the permissions set to public and you comment on it, the entire Internet can see that you have some kind of FaceBook presence, even if they don't have permission to view your profile/timeline.
In attempting to illustrate this, I took a screen capture of a post that had been set to public. I'd done so using the post of someone I thought was a shared friend (when I'd clicked on the poster's profile, both myself and my security-conscious friend appeared to show up in the poster's "mutual friends" list). It turns out I was mistaken in that thought.
When I'd posted the screen shot to my security-conscious friend's wall, I tagged the original poster in that wall post. My security-conscious friend had set her wall to "friends only". When she informed me that the public-poster was not a mutual "friend" but a "friend of a friend", I'd made the supposition that the tagging of the public-posting friend would be moot. After all: what kind of security model would allow me to override my security-conscious friend's wall security settings with something so simple as a tag-event? Turns out, FaceBook's security model would. To me, that would fall into the general heading of a "broken security model".
Oh well, now to figure out how to rattle some cages in FaceBook's site usability group to get them to fix that.
AD Integration Woes
For a UNIX guy, I'm a big fan of leveraging Active Directory for centralized system and user management purposes. For me, one of the big qualifications for any application/device/etc. to be able to refer to itself as "enterprise" is that it has to be able to pull user authentication information via Active Directory. I don't care whether it does it the way that Winbind-y things do, or if they just pull data via Kerberos or LDAP. All I care is that I can offload my user-management to a centralized source.
In general, this is a good thing. However, like many things that are "in general, a good thing", it's not always a good thing. If you're in an enterprise that's evolving its Active Directory infrastructure, tying things to AD means that you're impacted when AD changes or breaks. Someone decides, "hey, we need to reorganize the directory tree" and stuff can break. Someone decides, "hey, we need to upgrade our AD servers from 2003 to 2008" and stuff can break.
Recently, I started getting complaints from the users of our storage resource management (SRM) system that they couldn't login any more. It'd been nearly a year since I'd set it up, so sorting it out was an exercise in trying to remember what the hell I did ...and then Googling.
The application, itself, runs on a Windows platform. The login module that it uses for centralized authentication advertises itself as "Active Directory". In truth, the login module is a hybrid LDAP/Kerberos software module. Even though it's a "Windows" application, they actually use the TomCat Java server for the web UI (and associated login management components). TomCat is what uses Kerberos for authentication data.
Sometime in recent months, someone had upgraded Active Directory. Users that had been using the software before the AD-upgrade were able to authenticate, just fine. Users that had tried to start using the software after the AD-upgrade couldn't get in. Turns out that, when Active Directory had gotten upgraded, the encryption algorithms had gotten changed. Ironically, I didn't find the answer to my problem in any of the forums related to the application: I found it in the forums for another application that used TomCat's Kerberos components.
To start the troubleshooting process, one needs to first modify TomCat's bsclogin.conf file. Normally, this file is only used to tell TomCat where to find the Kerberos configuration file. However, if you modify your bsclogin.conf file and add the directive "debug=true" to it like so:
com.sun.security.auth.module.Krb5LoginModule required debug=true;
Enhanced user-login debugging messages are then enabled. Once this is added and Tomcat is restarted, login-related messages will start showing up in your ${TOMCATHOME}/logs/stdout.log file. What started showing up in mine were messages like:
[Krb5LoginModule] user entered username: [email protected]
Acquire TGT using AS Exchange
[Krb5LoginModule] authentication failed KDC has no support for encryption type (14)
With this error message in hand, I was able to find out that TomCat's Kerberos modules were using the wrong encryption routines to get data out of Active Directory. The fix was to update my Kerberos initialization file and add the following two lines to my [libdefaults] stanza (I just added it right after the dns_lookup_realm line and before the next stanza of directives):
default_tkt_enctypes = rc4-hmac
default_tgs_enctypes = rc4-hmac
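In context, the stanza ends up looking something like the following (a sketch - the realm and the neighboring directives are placeholders standing in for whatever your existing [libdefaults] section already contains):

[libdefaults]
 default_realm = EXAMPLE.COM
 dns_lookup_realm = false
 default_tkt_enctypes = rc4-hmac
 default_tgs_enctypes = rc4-hmac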
Making this change (and restarting TomCat) resulted in the failing users suddenly being able to get in.
I'd normally rag on my Windows guys for this, but, the Windows guys aren't exactly used to providing AD-related service notifications to anyone but Windows users. This application, while (currently) running on a Windows platform, isn't something that's traditionally thought of as AD-reliant. Factor in that true, native AD modules pretty much just auto-adjust to such changes, and it didn't occur to them to notify us of the changes.
Oh well. AD-integration is a learning experience for everyone involved, I suppose.
Cross-Platform Sharing Pains
All in all, the developers of Linux seem to have done a pretty good job of ensuring that Linux is able to integrate with other systems. Chief among this integration has been in the realm of sharing files between Linux and Windows hosts. Overall, the CIFS support has been pretty damned good and has allowed Linux to lead in this area compared to other, proprietary *N*X OSes.
That said, Microsoft (and others) seem to like to create a moving target for the Open Source community to aim at. If you have Windows 2008-based fileservers in your operations and are trying to get your Linux hosts to mount shares coming off these systems, you may have run into issues with doing so. This is especially so if your Windows share-servers are set up with high security settings and you're trying to use service names to reference those share servers (i.e., the Windows-based fileserver may have a name like "dcawfs0035n" but you might have an alias like "repository5").
Normally, when mounting a CIFS fileserver by its real name, you'll do something like:
# mount -t cifs "//dcawfs0035n/LXsoftware" /mnt/NAS -o domain=mydomain,user=myuser
And, assuming the credentials you supply are correct, the URI is valid and the place you're attempting to mount to exists and isn't busy, you'll end up with that CIFS share mounted on your Linux host. However, if you try to mount it via an alias (e.g., a CNAME in DNS):
# mount -t cifs "//repository5/LXsoftware" /mnt/NAS -o domain=mydomain,user=myuser
You'll get prompted for your password - as per normal - but, instead of being rewarded with the joy of your CIFS share mounted to your Linux host, you'll get an error similar to the following:
mount error 5 = Input/output error
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)
Had you fat-fingered your password, you'd have gotten a "mount error 13" (permission denied), instead. The error above results from strict name checking being performed on the share-mount attempt. Because you've attempted to use an alias to connect, the name-checking fails and you get the denial shown above. You can verify that this is the underlying cause by re-attempting the mount with the fileserver's real name. If that succeeds where the alias failed, you'll know where to go, next.
The Microsoft-published solution is found in KB281308. To summarize, you'll need to:
Have admin and login rights on the share-server
Login to the share-server
Fire up regedit
Navigate to "HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters"
Create a new DWORD parameter named "DisableStrictNameChecking"
Set its value to "1"
Reboot the fileserver
Retry your CIFS mount attempt.
At this point, your CIFS mount should succeed.
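If you'd rather script the change than click through regedit, the equivalent can be done from an elevated command prompt on the share-server (a sketch of the same registry edit; the reboot is still required afterward):

reg add "HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters" /v DisableStrictNameChecking /t REG_DWORD /d 1 /f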
Interestingly, if you've ever tried to connect to this share from another Windows host not in the share-server's domain (e.g., from a host on a different Windows domain that doesn't have cross-realm trusts set up, or a standalone Windows client), you will probably have experienced connection errors, as well. Typical error messages being something on the order of "account is not allowed to logon from this resource" or just generally refusing to accept what you know to be a good set of credentials.
#CIFS#Linux#mount error#registry edits#RHEL#RHEL 5#RHEL 6#Windows#Windows Security Settings#workarounds
Quick-n-Dirty Yum Repo via HTTP
Recently, we had a bit of a SNAFU in the year-end renewal of our RedHat support. As a result, all of the RHN accounts tied to the previous contract lost access to RHN's software download repositories. This meant that things like being able to yum-down RPMs on rhn_register'ed systems no longer worked, and we couldn't log into RHN and do a manual download, either.
Fortunately, because we're on moderately decent terms with RedHat and they know that the contract eventually will get sorted out, they were willing to help us get through our current access issues. Moving RHN accounts from one permanent contract to another, after first associating them with some kind of temporary entitlement is a paperwork-hassle for all parties involved and is apt to get your account(s) mis-associated down the road. Since all parties knew that this was a temporary problem but needed an immediate fix, our RedHat representative furnished us with the requisite physical media necessary to get us through till the contracts could get sorted out.
Knowing that I wasn't the only one that might need the software and that I might need to execute some burndown-rebuilds on a particular project I was working on, I wanted to make it easy to pull packages to my test systems. We're in an ESX environment, so, I spun up a small VM (only 1GB of virtual RAM, 1GHz of virtual CPU, a couple Gigabytes of virtual disk for a root volume and about 20GB of virtual disk to stage software onto and build an RPM repository on) to act as a yum repository server.
After spinning this basic VM, I had to sort out what to do as far as getting that physical media turned into a repo. I'm not a big fan of copying CDs as a stream of discrete files (been burned, too many times, by over-the-wire corruption, permission issues and the like). So, I took the DVD and made an ISO from it. I then took that ISO and scp'ed it up to the VM.
Once I had the ISO file copied up to the VM, did a quick mount of it (`mount -t iso9660 -o loop,ro /tmp/RHEL5.7_x86_64.iso /mnt/DVD` for those of you playing at home). Once I mounted it, I did a quick copy of its contents to the filesystem I'd set aside for it. I'm kind of a fan of cpio for things like this, so I cd'ed into the root of the mounted ISO and did a `find . -print | cpio -pmd /RepoDir` to create a good copy of my ISO data into a "real" filesystem (note, you'll want to make sure you do a `umask 022` first to ensure that the permission structures from the mounted ISO get copied, intact, along with the files, themselves).
With all of the DVD's files copied to the repo-server and into a writeable filesystem, it's necessary to create all the repo structures and references to support use by yum. Our standard build doesn't include the createrepo tool, so, first I had to locate its RPM in the repo filesystem and then install it onto my repo-server. Doing a quick `find . -name "*createrepo*rpm"` while cd'ed into the repo filesystem turned up the path to the requisite RPM. I then did an `rpm -Uh [PathFromFind]` to install the createrepo tool's RPM files.
The createrepo tool is a nifty little tool. You just cd into the root of the directory where you copied your media to, do a `createrepo .`, and it scans the directory structures to find all the RPMs and XMLs and other files and creates the requisite data structures and pointers that allow yum to know how to pull the appropriate RPMs from the filesystem.
Once that's done, if all you care about is local access to the RPMs, you can create a basic .repo file in /etc/yum.repos.d that uses a "baseurl=file:///Path/To/Media" directive in it.
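For that local-only case, the .repo file might look something like this (a sketch - the repo ID and name are arbitrary, and /RepoDir is the staging filesystem described above):

[local-media]
name=RHEL 5.7 local media
baseurl=file:///RepoDir
enabled=1
gpgcheck=0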
In my case, I wanted to make my new repo available to other hosts at the lab. The easiest way to make the repo available over the network is to do so via HTTP. Our standard build doesn't include the standard RedHat HTTP server, by default. So, I manually installed the requisite RPMs from the repo's filesystem. I modified the stock /etc/httpd/conf/httpd.conf and added the following stanzas to it:
Alias /Repo/ "/RepoDir/" <Directory "/RepoDir"> Options Indexes MultiViews AllowOverride None Order allow,deny Allow from all </Directory>
[Note: this is probably a looser configuration than I'd have in place if I was making this a permanent solution, but this was just meant as a quick-n-dirty workaround for a temporary problem.]
I made sure to do a `chkconfig httpd on` and then did a `service httpd start` to activate the web server. I then took my web browser and made sure that the repo filesystem's contents were visible via web client. They weren't: I forgot that our standard build has port 80 blocked by default. I did the requisite juju to add an exception to iptables for port 80 and all was good to go.
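The "juju" amounted to something along these lines (a sketch assuming the stock RHEL 5 firewall chain name - adjust to match however your iptables ruleset is actually laid out):

iptables -I RH-Firewall-1-INPUT -m state --state NEW -p tcp --dport 80 -j ACCEPT
service iptables save    # persist the exception across reboots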
With my RPMs (etc.) now visible via HTTP, I logged into the VM that I was actually needing to install RPMs to via yum. I escalated privileges to root and created an /etc/yum.repos.d/LAB.repo file that looked similar to the following:
[lab-repo]
name=RHEL 5.7
baseurl=http://repovm.domain.name/Repo
enabled=1
gpgcheck=0
I did a quick cleanup of the consuming VM's yum repo information with a `yum clean all` and then verified that my consuming VM was able to properly see the repo's data by doing a `yum list`. All was good to go. Depending on how temporary this actually ends up being, I'll go back and make my consuming VM's .repo file a bit more "complete" and more properly lay out the repo-server's filesystem and HTTP config.
Who Cut Off My `finger`
Overall, I'm not trying to make this, my "more serious blog" a dumping-ground for rants. So, please forgive me this rant and please feel free to skip this post....
I've been using UNIX and similar systems for a long time, now. So, I'm kind of set in my ways in the things I do on systems and the tools I expect to be there. When someone capriciously removes a useful tool, I get a touch upset.
`finger` is one of those useful tools. Sadly, because people have, in the mists of time, misconfigured finger, security folks now like to either simply disable it or remove it altogether. Fine. Whatever: I appreciate that there might be security concerns. However, if you're going to remove a given command, at least make sure you're accomplishing something useful for the penalty you make system users pay in frustration and lost time. If you decide to remove the finger command, then you should probably also make it so I can't get the same, damned information via:
• `id`
• `who`
• `whoami`
• `last`
• `getent passwd <USERID>`
• (etc.)
If I can run all of those commands, I've still got all the ability to get the data you're trying to hide by removing `finger`. So, what have you accomplished other than to piss me off and make it so I have to get data via other avenues? Seriously: "WTF"?
Why the hell is it that, when someone reads a "security best practice", they go ahead and blindly implement something without bothering to ask the next, logical questions: "does doing this, by itself, achieve my security goal," "is the potential negative impact on system users more than balanced-out by increased system security," "is there a better way to do this and achieve my security goals" and "what goal am I actually achieving by taking this measure." If you don't ask these questions (and have good, strong answers to each), you probably shouldn't be following these "best practices."
Text
NetBackup with Active Directory Authentication on UNIX Systems
While the specific hosts that I used for this exercise were all RedHat-based, the approach should work on any UNIX platform onto which both NetBackup 6.0/6.5/7.0/7.1 and Likewise Open can be installed.
I'm a big fan of leveraging centralized-authentication services wherever possible. It makes life in a multi-host environment - particularly where hosts can number from the dozens to the thousands - a lot easier when you only have to remember one or two passwords. It's even more valuable in modern security environments where policies require frequent password changes (or, if you've ever been through the whole "we've had a security incident, all the passwords on all of the systems and applications need to be changed, immediately" exercise). Over the years, I've used things like NIS, NIS+, LDAP, Kerberos and Active Directory to do my centralized authentication. If your primary platforms are UNIX-based, NIS, NIS+, LDAP and Kerberos have traditionally been relatively straight-forward to set up and use.
I use the caveat of "relatively" because, particularly in the early implementations of each service, things weren't always dead-simple. Right now, we seem to be mid-way through the "easiness" life-cycle of using Active Directory as a centralized authentication source for UNIX operating systems and UNIX-hosted applications. Linux and OSX seem to be leading the charge in the OS space for ease of integration via native tools. There are also a number of third-party vendors out there who provide commercial and free solutions to do it for you. In our enterprise, we chose LikeWise because, at the time, it was the only free option that also worked reasonably well with our large and complex Active Directory implementation. Unfortunately, not all of the makers of software that runs on UNIX hosts seem to have been keeping up on the whole "AD-integration within UNIX operating environments" front.
My latest pain in the ass, in this arena, is Veritas NetBackup. While Symantec likes to tout the value of NetBackup Access Control (NBAC) in a multi-administrator environment - particularly one where different administrators may have radically different NetBackup skill sets or other differentiating factors - setting it up in a mixed-platform environment is kind of sucktackular. While modern UNIX systems have the PAM framework to make writing an application's authentication layer relatively trivial, Symantec seems to still be stuck in the pre-PAM era. NBAC's group-lookup components appear to rely on direct consultation of a server's locally-maintained group files rather than simply calling the host OS's authentication frameworks.
When I discovered this problem, I opened a support case with Symantec. Unfortunately, their response was "set up a Windows-based authentication broker". My NetBackup environment is almost entirely RedHat-based (actually, unless/until we implement BareMetal Restore (BMR) or other backup modules that require specific OSes be added into the mix, it is entirely RedHat-based). The idea of having to build a Windows server just to act as an authentication broker struck me as a rather stupid way to go about things. It adds yet another server to my environment and, unless I cluster that server, it introduces a single point of failure into an otherwise fairly resilient NetBackup design. I'd designed my NetBackup environment with a virtualized master server (with DRS and SRM supporting it) and multiple media servers for both throughput and redundancy.
We already use LikeWise Open to provide AD-based user and group management services for our Linux and Solaris hosts. When I was first running NetBackup through my engineering process, using the old Java auth.conf method for login management worked like a champ. The Java auth.conf-based system just assumes that any users trying to access the Java UI are users that are managed through /etc/passwd. All you have to do is add the requisite user/rights entries into the auth.conf file and Java treats AD-provided users the same as it treats locally-managed users. Because of this, I suspected that I could work around Symantec's authorization coding lameness.
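For illustration, the auth.conf entries boil down to something like the following (a sketch of the usual /usr/openv/java/auth.conf format - "jdoe" is a made-up AD account, and your stock file's defaults may differ):

# /usr/openv/java/auth.conf (illustrative sketch)
root ADMIN=ALL JBP=ALL
# "jdoe" is a hypothetical AD-provided user; LikeWise makes it look local
jdoe ADMIN=ALL JBP=ALL
# everyone else only gets basic backup/archive/restore rights
* ADMIN=JBP JBP=ENDUSER+BU+ARC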
After a bit of playing around with NBAC, I discovered that, so long as the UNIX group I wanted to map rights to existed in /etc/group, NBAC would see it as a valid, mappable "UNIX PWD" group. I tested by seeing whether it would at least let me map the UNIX "wheel" group to one of the NBAC privilege groups. By contrast, a group that I could look up via getent but that didn't exist in /etc/group would be reported by NBAC as an invalid group. Having verified that a group's presence in /etc/group allowed NBAC to use it, I proceeded to use getent to copy my NetBackup-related groups out of Active Directory and into my /etc/group file (all you have to do is a quick `getent group [GROUPNAME] >> /etc/group` and you've populated your /etc/group file).
Unfortunately, I didn't quite have the full groups picture. When I logged in using my AD credentials, I didn't have any of the expected mapped privileges. I remembered that I'd explicitly emptied the userids from the group entries I'd added to my /etc/group file (I'd actually sed'ed the getent output to do it ...can't remember why, at this point - probably just anticipating the issue of not including userids in the /etc/group file entries). So, I logged out of the Java UI and reran my getent's - this time leaving the userids in place. I logged back into the Java UI and this time I had my mapped privileges. Eureka.
Still, I wasn't quite done. I knew that, if I was going to roll this solution into production, I'd have to cron-out a job to keep the /etc/group file up to date with changing AD group memberships. I had also noticed, while fiddling with those group entries, that only my userid was on the group line and not every member of the group. Ultimately, I tracked it down to LikeWise not doing full group enumeration by default. So, I was going to have to force LikeWise to enumerate the group's membership before running my getent's.
I proceeded to dig around in /opt/likewise/bin for likely candidates for forcing the enumeration. After trying several lw*group* commands, I found that doing a `lw-find-group-by-name [ADGROUP]` did the trick. Once that was run, my getent's produced fully-populated entries in my /etc/group file. I was then able to map rights to various AD groups and wrote a cron script to take care of keeping my /etc/group file in sync with Active Directory.
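The cron script itself is nothing fancy - something along these lines (a sketch: the group names are placeholders, and it assumes "files" is consulted ahead of LikeWise in nsswitch.conf, which is why the local entry gets removed before the getent lookup):

#!/bin/sh
# Refresh AD-backed group entries in /etc/group via LikeWise Open.
# "nbu_admins" and "nbu_operators" are placeholder AD group names.
SYNC_GROUPS="nbu_admins nbu_operators"

for GRP in $SYNC_GROUPS; do
    # Force LikeWise to fully enumerate the group's membership
    /opt/likewise/bin/lw-find-group-by-name "$GRP" > /dev/null 2>&1

    # Drop the (possibly stale) local copy so getent resolves via LikeWise/AD
    cp -p /etc/group /etc/group.presync
    sed -i "/^${GRP}:/d" /etc/group

    if ENTRY=$(getent group "$GRP"); then
        echo "$ENTRY" >> /etc/group
    else
        cp -p /etc/group.presync /etc/group   # lookup failed; put it back
    fi
done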
In other words, I was able to get NBAC to work with Active Directory in an all-RedHat environment, with no need to set up a Windows server just to be an authentication broker. Overall, I was able to create a much lighter-weight, more portable solution.
Text
CLARiiON Report Data Verification
Earlier this year, the organization I work for decided to put into production an enterprise-oriented storage resource management (SRM) system. The tool we bought is actually pretty cool. We install collectors into each of our major data centers and they pull storage utilization data off of all of our storage arrays, SAN switches and storage clients (you know: the Windows and UNIX boxes that use up all that array-based storage). Then, all those collectors pump out the collected data to a reporting server at our main data center. The reporting server is capable of producing all kinds of nifty/pretty reports: configuration snapshots, performance reports, trending reports, utilization profiles, etc.
As cool as all this is, you have the essential problem of "how do I know that the data in all those pretty reports is actually accurate?" Ten or fifteen years ago, when array-based storage was fairly new and storage was still the realm of systems administrators with coding skills, you'd ask your nearest scruffy misanthrope, "could you verify the numbers on this report?", and get an answer back within a few hours (and then within minutes each subsequent time you asked). Unfortunately, in the modern, GUI-driven world, asking your storage guys to verify numbers can be like pulling teeth. Many modern storage guys aren't really coders and frequently don't know the quick and easy way to get you hard numbers out of the devices they manage. In some cases, you may watch them cut and paste from the individual arrays' management web UIs into something like Microsoft Calculator. So, you'll have to wait and, oftentimes, you'll have to continually prod them for the data because it's such a pain in the ass for them to produce.
With our SRM rollout, I found myself in just such a situation. Fortunately, I've been doing UNIX system administration for the better part of 20 years and, therefore, am rather familiar with scripting. I frequently wish I was able to code in better reporting languages, but I just don't have the time to keep my "real" coding skills up to par. I'm also not terribly patient. So, after waiting a couple of weeks for our storage guys to get me the numbers I'd asked for, I said to myself, "screw it: there's gotta be a quicker/better way."
In the case of our CLARiiONs, that better way was to use the NaviCLI (or, these days, the NaviSECCLI). This is a tool set that has been around a looooooong time, in one form or another, and has been available for pretty much any OS that you might attach to a CLARiiON as a storage client. These days, it's a TCP/IP-based commandline tool - prior to NaviCLI, you either had platform-specific tools (IRIX circa 1997 had a CLI-based tool that did queries through the SCSI bus to the array) or you logged directly into the array's RS232 port and used its onboard tools (hopefully, you had a terminal or terminal program that allowed you to capture output) ...but I digress.
If you own EMC equipment, you've hopefully got maintenance contracts that give you rights to download tools and utilities from the EMC support site. NaviCLI is one such tool. Once you install it, you have a nifty little command-line tool that you can wrap inside of scripts. You can create these scripts to handle both provisioning tasks and reporting tasks. My use, in this case, was reporting.
The SRM we bought came with a number of canned reports - including ones for CLARiiON devices. Unfortunately, the numbers we were getting from our SRM were indicating that we only had about 77TiB on one of our arrays when the EMC order sheets said we should have had about 102TiB. That's a bit of a discrepancy. I was able to wrap some NaviCLI commands into a couple of scripts - one that reported on RAID-group capacity and one that reported physical and logical disk capacities - and verify that the 77TiB was sort of right and that the 102TiB was also sorta right. [ed.: please note that these scripts are meant to be illustrative of what you can do, not something you'd want as the nexus of your enterprise-reporting strategy; they're slow to run, particularly on larger arrays.] The group-capacity script basically just spits out two numbers - total raw capacity and total capacity allocatable to clients (without reporting on how much of either is already allocated to clients). The disk-capacity script reports how the disks are organized (e.g., RAID1, RAID5, Spare, etc.) - printing the total number of disks in each configuration category and how much raw capacity that represents. Basically, the SRM tool was reporting the maximum number of blocks that were configured into RAID groups, not the total raw physical blocks in the array that we'd thought it was supposed to report.
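To give a flavor of the RAID-group script (this isn't the actual script - the storage-processor hostname is made up, and the getrg field names shown are what I'd expect from recent FLARE releases, so double-check them against your array's output):

#!/bin/sh
# Sum raw and client-allocatable capacity across all RAID groups.
# Assumes naviseccli is installed and credentials are supplied via a
# Navisphere security file or -user/-password/-scope options.
SP=spa-array01.domain.name   # hypothetical storage-processor hostname

naviseccli -h "$SP" getrg | awk -F: '
    /Raw Capacity \(Blocks\)/     { raw += $2 }
    /Logical Capacity \(Blocks\)/ { lgc += $2 }
    END {
        # capacities are reported in 512-byte blocks
        printf "Raw: %.1f TiB   Allocatable: %.1f TiB\n",
               raw * 512 / 2^40, lgc * 512 / 2^40
    }'
# A similar loop over `naviseccli -h "$SP" getdisk` output yields the
# per-disk breakdown (RAID type, hot spares, unbound disks, etc.).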
Having these numbers in hand allowed us to tear apart the SRM's database queries and tables so that we could see what information it was grabbing, how it was storing/organizing it and how to improve on the vendor-supplied standard reports. Mostly, it consisted of changing the titles of some existing fields and adding some fields to the final report.
Yeah, all of this begs the question "what was the value of buying an SRM when you had to reverse-engineer it to make the presented data meaningful?" To be honest, "I dunno." I guess, at the very least, we bought a framework through which we could put together pretty reports and ones that were more specifically meaningful to us (though, to be honest, I'm a little surprised that we're the only customers of the SRM vendor to have found the canned-reports to be "sadly lacking"). It also gave me an opportunity to give our storage guys a better idea of the powerful tools they had available to them if only they were willing to dabble at the command line (even on Windows).
Still, the vendor did provide a technical resource to help us get things sorted out faster than we might have done without that assistance. So, I guess that's something?
#array#CLARiiON#EMC#enterprise management#enterprise storage#reporting#SAN#scripting#SRM#Storage#Storage Resource Management#utilization#verification
Text
Show Me the Boot Info!
For crusty old systems administrators (such as yours truly), the modern Linux boot sequence can be a touch annoying. I mean, the graphical boot system is pretty and all, but I absolutely hate having to continually click on buttons just to see the boot details. And, while I know that some Linux distributions give you the option of viewing the boot details by either disabling the graphical boot system completely (i.e., nuking the "rhgb" option from your grub.conf's kernel line) or switching to an alternate virtual console configured to show boot messages, that's just kind of a suck solution. Besides, if your default Linux build is like the one my company uses, you don't even have the alternate VCs as an option.
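(For reference, the heavy-handed route is just dropping "rhgb" from the kernel line in grub.conf - the kernel version and root device below are made up:)

# /boot/grub/grub.conf - kernel line with the graphical boot disabled
kernel /vmlinuz-2.6.18-274.el5 ro root=/dev/VolGroup00/LogVol00 quiet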
Now, this is a RedHat-centric blog, since that's what we use at my place of work (we've a few devices that use embedded SuSE, but I probably void the service agreement any time I directly access the shell on those!). So, my "solution" is going to be expressed in terms of RedHat (and, by extension, CentOS, Scientific Linux, Fedora and a few others). For many things in RedHat, they give you nifty files in /etc/sysconfig that allow you to customize behaviors. So, I'd made the silly assumption that there'd be an /etc/sysconfig/rhgb type of file. No such luck. So, I dug around in the init scripts (grep -li is great for this, by the way) to see if there were any mentions of `rhgb`. There were. Well, there was mention of rhgb-client in /etc/init.d/functions.
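The digging itself is a one-liner - something along the lines of:

# case-insensitive listing of init scripts that mention rhgb
grep -li rhgb /etc/init.d/*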
Unfortunately, even though our standard build seems to include manual pages for every installed component, I couldn't find a manual page for rhgb-client (or an infodoc, for that matter). The best I was able to find was a /usr/share/doc/rhgb-${VERSION}/HOW_IT_WORKS file (I'm assuming that ${VERSION} is consistent with the version of the RHGB RPM installed - it seemed to be). While an interesting read, it's not exactly the most exhaustive document I've ever come across - about what you'd expect from a typical README file, I guess. Still, it didn't document what arguments, if any, rhgb-client would take.
Not wanting to do anything too calamitous, I called `rhgb-client --help` as a non-privileged user. I was gladdened to see that it didn't give me one of those annoying "you must be root to run this command" errors. It also gave some usage details:
rhgb-client --help
Usage: rhgb-client [OPTION...]
  -u, --update=STRING     Update a service's status
  -d, --details=STRING    Show the details page (yes/no).
  -p, --ping              See if the server is alive
  -q, --quit              Tells the server to quit
  -s, --sysinit           Inform the server that we've finished rc.sysinit

Help options:
  -?, --help              Show this help message
      --usage             Display brief usage message
I'd hoped that, since /etc/init.d/functions had shown an "--update" argument, rhgb-client might take other arguments (and, correctly, assumed one would be "--help"). So, armed with the above, I updated my /etc/init.d/functions script to add "--details=yes" to the rhgb-client call and rebooted. Lo and behold: I get the graphical boot session but get to see all the detailed boot messages, too! Hurrah.
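In case it helps anyone else, the edit amounts to something like the following (a hypothetical sketch - the function name and guard logic in /etc/init.d/functions vary a bit between releases, so adapt to what's actually in your copy):

# /etc/init.d/functions (sketch; the exact body varies by release)
update_boot_stage() {
    if [ -x /usr/bin/rhgb-client ] && /usr/bin/rhgb-client --ping ; then
        /usr/bin/rhgb-client --update="$1" --details=yes
    fi
    return 0
}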
Still, it seemed odd that, since the RHGB components are (sorta) configurable, there wasn't a file in /etc/sysconfig to set the requisite options. I hate having to hack config files that are likely to get overwritten the next time the associated RPM gets updated. I also figure that I can't be the only person out there who wants both the graphical boot system and the details. So, why haven't the RHGB maintainers fixed this (and, yes, I realize that Linux is a community thing and I'm free to contribute fixes to it - I'd just hoped that someone like RedHat or SuSE would have had enough complaints from commercial UNIX converts to have already done it for me)? Oh well, one of these days, I suppose.