Monitoring with Graphite
A little over a year ago we decided to collaborate with Mat Schaffer on a series of DevOps-focused screencasts. We wanted to produce something high quality (to the best of our abilities) but underestimated the level of commitment required. We had several videos in the pipeline, but only had time to finish production on this one. This is actually the second time we have tried to do a screencast, which is why the video was branded as DevOpsCast 2.0. Maybe the third time will be the charm ;)
Previously we charged a small fee for the video ($2.99). Since we don't plan on keeping up with the series, we decided to make the video free. If you have any questions please feel free to drop me an email.
Video Description:
Need better insight into your system? Then you'll want to watch this.
You'll learn how to do things like:
Set up graphite, statsd and collectd
Instrument your code for metrics
Analyze your system and customer behavior
Avoid common data collection issues
Once you're done you'll have everything you need. We even provide all the demo code (github.com/forty9ten/monitoring-with-graphite) to get you started.
Omnibus Installer Tutorial
Often, there is a bit of confusion around what Omnibus does. Omnibus is a fantastic tool for creating "full-stack" or "vendor-everything" installers for multiple platforms (including Windows). It's what Opscode/Chef uses to package Chef itself, but you don't need to be using Chef to take advantage of Omnibus.
There are many use cases for an Omnibus installer. In general, it simplifies the installation of any software by including all of the dependencies for that piece of software. This allows it to be installed on a target system regardless of whether its dependencies are already in place. The work of resolving dependencies is done at build time rather than at install time. This means that once the installer has been built, it can be installed in a consistent manner across platforms.
This post is also available on GitHub for forking and a better reading experience.
About this tutorial
In this tutorial we will be using the Omnibus installer to package a simple Ruby application with some Gem dependencies.
Typically, an Omnibus project is created using the Omnibus project code generator:
$ bin/omnibus project foo
For this tutorial, I have simplified the process by providing a project template. We will also be using Vagrant to start up the virtual machine, so you will need to install it before getting started.
If you have any questions or run into any problems, feel free to drop me an email: [email protected] or hit me up on Twitter @aaronfeng.
Note: I created this Vagrant template because Omnibus 3.0.0+ requires additional tools (such as Test Kitchen and Berkshelf). These tools are intended to make Omnibus better, but they also increase the barrier to entry for Omnibus if you are not already familiar with them.
Set up the template
First, we need to clone the project template. Make sure you clone the repository into a directory named after your project. Notice that I have prefixed my project's name (awesome) with omnibus. This is an Omnibus project convention:
$ git clone https://github.com/forty9ten/omnibus-skeleton.git omnibus-awesome
Set up the virtual machine
Now we need to start the CentOS virtual machine and set up our Omnibus environment. It will take a few minutes to download the base virtual machine (depending on your internet connection), but this is a one-time thing:
$ vagrant up
$ vagrant ssh
From this point forward we will be executing commands inside the virtual machine.
Now, after you SSH into the virtual machine, move into the previously cloned directory:
$ cd omnibus-awesome
Next, we will need to change all the project files and references to match the project name. setup_project.sh will handle it all for you. This script is not part of Omnibus; it is a convenience script that only needs to be run once.
$ ./bin/setup_project.sh awesome
Now we need to install Omnibus dependencies:
$ bundle install --binstubs
Before we go any further, let's do a test build to make sure we have everything set up correctly.
$ bin/omnibus build project awesome
An RPM will be generated inside the pkg directory (because we're on CentOS). If the build finished successfully, we can continue.
Working with Omnibus
Use your editor to open config/projects/awesome.rb. Notice that the project file has the same name as the project you set up in the step above.
First, update the maintainer and homepage attributes. You can also change the install_path to another directory. This is the directory where the installer will install the software on the target machine.
Note: If you decide to change the install_path, you will need to run ./bin/setup_project.sh awesome again in order to set up the new path with the correct permissions. For this tutorial you can just leave it as /opt/awesome for now.
The Omnibus installer will build everything inside the value of install_path and then create an RPM mirroring that directory. Technically, this isn't exactly how it works behind the scenes, but you can think of it this way.
Around line 15 you will see dependency 'awesome'. This tells the Omnibus installer to depend on the software definition file of the same name inside the config/software directory (without the .rb extension).
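To see the pieces together, here is a rough sketch of what config/projects/awesome.rb might look like after these edits. The maintainer and homepage values are placeholders, and the file generated for you may contain additional settings:
# config/projects/awesome.rb -- illustrative sketch, not the generated file verbatim
name         "awesome"
maintainer   "Your Name <you@example.com>"   # placeholder
homepage     "https://example.com"           # placeholder

install_path "/opt/awesome"   # where the installer puts the software on the target machine

# software definition in config/software/awesome.rb
dependency   "awesome"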
Now let's edit config/software/awesome.rb in order to provide installation instructions. The contents of the file should look something like this:
name "awesome" default_version "1.0.0" # remove these if you don't need ruby dependency "ruby" dependency "rubygems" build do end
We will need to keep the ruby and rubygems dependencies since we are deploying a Ruby application. You might be wondering where the dependency definitions come from. By default, Omnibus will look inside the config/software directory. However, if it can't find a definition there, it will look in the omnibus-software repository.
It doesn't actually go out to the Internet to retrieve those files during the build. That repository was cloned onto your virtual machine when you ran bundle install. If you look inside the Gemfile you will see it points to the omnibus-software repo on GitHub:
# Install omnibus software
gem 'omnibus', '~> 3.0'
gem 'omnibus-software', github: 'opscode/omnibus-software'
If you want to see the contents of those files on your machine, the location can be retrieved by using bundle show omnibus-software.
Those definitions are written by Opscode/Chef. You can include anything in that list for free. In addition, looking through these files is a quick way to learn the different things Omnibus can do.
Inside the build block is where we need to provide instructions on how to build our installer. Below is the demo application we will be including inside this installer:
https://github.com/forty9ten/omnibus-example-ruby-app
The application is trivial, but it demonstrates how to package up all its dependencies (Gems) inside the installer. Remember, we want to bundle all the program's dependencies into the installer so the application can run without needing to pull down any additional dependencies.
Below is a completed version of config/software/awesome.rb:
name "awesome" default_version "master" dependency "ruby" dependency "rubygems" dependency "bundler" source :git => "https://github.com/forty9ten/omnibus-example-ruby-app" build do # vendor the gems required by the app bundle "install --path vendor/bundle" # setup a script to start the app using correct ruby and bundler command "cat << EOF > #{install_dir}/bin/run.sh cd #{install_dir}/app #{install_dir}/embedded/bin/bundle exec \ #{install_dir}/embedded/bin/ruby \ #{install_dir}/app/money.rb cd - EOF" # make it executable command "chmod +x #{install_dir}/bin/run.sh" # move built app into install_dir command "cp -r #{project_dir} #{install_dir}/app" end
The default_version is the version of the application; it can be any string. In this case, it goes hand-in-hand with source. source is a way to tell Omnibus where the application lives. When source sees the :git symbol, it will clone the code from git. Since we set the default_version to master, it will clone the master branch of the code. The directory of the cloned code will match the name attribute (awesome). This is usually not important, but it is useful to know for debugging. If you want to check out a specific version, a tag name can be used as the value of default_version. There are other valid sources such as :url and :path.
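For example, pinning to a tag or switching to a tarball source might look like this (the tag, URL, and checksum are made up for illustration):
# pin to a tag instead of master (tag name is illustrative)
default_version "v1.0.0"
source :git => "https://github.com/forty9ten/omnibus-example-ruby-app"

# or fetch a tarball by URL (url and md5 are placeholders)
# default_version "1.0.0"
# source :url => "https://example.com/awesome-1.0.0.tar.gz",
#        :md5 => "0123456789abcdef0123456789abcdef"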
The bundler dependency is included because we use it to vendor all the application's Gems at build time. Internally Omnibus knows how to execute Bundler, which is why the bundle command is available. --path vendor/bundle tells bundler to download all the Gems specified by the Gemfile into vendor/bundle of the cloned Git repository.
Another thing that might seem magical is where the bundle command is executed from. The application directory specified by source is managed entirely by Omnibus, and the bundle command is executed within the context of source. This is nice because we are free to change the installer name without worrying about adjusting paths to keep everything aligned.
Next we will create a convenience script to run our application. Inside the project file (config/projects/awesome.rb) we specified the install_path (/opt/awesome), which points to the same location as install_dir. The bin directory will be provided by Omnibus, so we don't need to create it ahead of time. The script uses the bundle and ruby commands packaged by the installer.
project_dir is where our application is cloned to (managed by source); it actually lives in an Omnibus cache directory, /var/cache/omnibus/src/awesome. We need to cp it to the install_dir in order for it to be packaged with our installer. We will just copy it to a directory named app inside install_dir.
Now, let's build our installer:
$ bin/omnibus build project awesome
The Omnibus installer takes a while to build the first time, but it is smart enough to cache files in between builds. Subsequent builds should take less time.
If everything is set up correctly, you should see the new RPM inside the pkg directory. Notice the RPM name: awesome-0.0.0+20140508200804-1.el6.x86_64.rpm. Omnibus uses the git tag to version the RPM and defaults to 0.0.0 if no tag is found for the Omnibus project.
In case you are wondering what install_dir looks like after a successful build:
#> tree -L 3 /opt/awesome
/opt/awesome/
├── app
├── bin
│   └── run.sh
├── embedded
│   ├── bin
│   ├── include
│   ├── lib
│   ├── man
│   ├── share
│   └── ssl
└── version-manifest.txt
The output above only shows three levels deep, so most of the files are not displayed. Everything under /opt/awesome will be mirrored into the RPM.
The last step is to test the RPM. Before we do that, we need to delete the contents of /opt/awesome/ shown above, because we want to make sure the RPM has all the necessary files. If we don't delete it, the RPM will just install on top of it.
Luckily, Omnibus has a command to do just that:
$ ./bin/omnibus clean awesome --purge
During the Omnibus build process, intermediate files are cached to speed up builds. The clean command will remove all the cached files; --purge will also remove the contents of install_dir.
Install the RPM:
$ sudo rpm -ivh pkg/awesome-0.0.0....-1.el6.x86_64.rpm
Fill in the ... with your actual build timestamp. Once it has been installed, we need to verify that the program is still working:
$ /opt/awesome/bin/run.sh
The output should look something like this:
hello, how much money do I have? 10.00
going to sleep for 5
hello, how much money do I have? 10.00
going to sleep for 5
Yay! You have successfully created a full-stack installer with everything included. This application can be deployed without needing an internet connection or any extra external dependencies from the installed system.
Retrace the previous steps if you didn't get a working RPM. Make sure you uninstall the RPM before reinstalling it, using the command below:
$ sudo rpm -e awesome
Troubleshooting
You might encounter this error during your build process:
[fetcher:net::cacerts] Invalid MD5 for cacerts
This just means that the MD5 checksum for the CA certificates file has changed compared to the locally cached version. Most likely this problem will go away if you delete the Gemfile.lock and rerun bundle install.
If that doesn't fix your problem, you can override the cacerts MD5 checksum in your project. Create config/software/cacerts.rb with the contents of the file below:
https://raw.githubusercontent.com/opscode/omnibus-software/master/config/software/cacerts.rb
Update the MD5 value to the one the build is expecting, which should be shown below the error. This is not the recommended way of fixing the problem, but it might be needed if the upstream repository has not been corrected. This also shows that you can override any file in omnibus-software by creating a file with the same name inside your project.
Written by Aaron Feng
Aaron Feng is the founder of forty9ten LLC. He is a passionate Software Engineer with a special interest in cloud based infrastructure and DevOps. He has organized various tech groups since 2007, but is most well known for Philly Lambda. He is currently organizing DockerATL. Twitter: @aaronfeng Email: [email protected]
Docker Service Discovery
At forty9ten, we love Docker. If you are interested in exploring Docker, we would love to hear from you! The full source code and this post are also available on GitHub.
It is pretty common to have multiple applications communicate with each other over a network. You might have a database server that the application server retrieves data from. In the Docker world, you would create two containers: one for the database server and one for the application server.
In this tutorial we will demonstrate how to set up Docker containers so they can communicate with each other transparently via Links. Links (available since Docker 0.6.5) provide service discovery for Docker containers, so there's no need to hardcode IP addresses inside the application that runs in a container. When you link containers together, Docker provides the IP address of the destination container. Links can also be used as a security feature because, in order for containers to be linked, a name must be specified ahead of time.
I will demonstrate Docker Links via a demo RabbitMQ application. The topology will consist of three containers:
RabbitMQ Server
Message Publisher
Message Subscriber
We will be using Docker 0.7.2 via Vagrant. I have created a custom Vagrant box with Docker preinstalled. This box was created from Docker's official Vagrantfile.
Clone the demo repository and cd into it:
#> git clone https://github.com/forty9ten/docker-rabbitmq-example.git
Start up vagrant virtual machine and connect via ssh:
#> vagrant up && vagrant ssh
Make sure /vagrant is properly mounted; it should contain all the demo code from the cloned repository.
#> cd /vagrant && ls
client1.rb client2.rb Dockerfile Gemfile Vagrantfile
Build RabbitMQ Server
Let's start by building the RabbitMQ container. Docker can build it directly from Git:
#> docker build -rm -t rabbitmq github.com/forty9ten/docker-rabbitmq
Once the build is complete, you can verify it by listing all the available images:
#> docker images
REPOSITORY          TAG       IMAGE ID
rabbitmq            latest    09814f596e4f
ubuntu              12.04     8dbd9e392a96
The image IDs will be different but everything else should be very similar.
We passed -rm to delete all intermediate layers in order to keep the images more compact. Each RUN command creates a new layer; a layer is like a snapshot taken after a successful RUN command. This is why Docker image builds can be so fast: if anything fails after a RUN command, the next build continues from the previous known-good layer instead of starting over. In our case we only care about the final image, which is why we delete all intermediate layers.
Build the Clients
Next we will build the clients that will be communicating with RabbitMQ. The clients are included in this repository, so we will instruct Docker to build from the local directory:
#> docker build -rm -t rabbitmq_client .
Once again, we can verify by listing the images:
#> docker images
REPOSITORY          TAG       IMAGE ID
rabbitmq_client     latest    4ad59fa07dd1
…                   …         …
One thing to note is that both clients share the same Docker image since they are both very similar. When we start the container, we can choose which client we would like to run.
Run RabbitMQ Server
Once RabbitMQ has successfully been built, we can run it via command below:
#> docker run -name rabbitmq -h rabbitmq -p :49900:15672 rabbitmq
In order to allow other containers to communicate with the RabbitMQ server, we need to provide a linking name via the -name option. -h is used to specify the hostname of the container; RabbitMQ uses the hostname to name its log files. Both the log files and the name of the container are set to rabbitmq.
You should see the RabbitMQ logo, and the terminal window will block.
              RabbitMQ 3.2.2. Copyright (C) 2007-2013 GoPivotal, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit@rabbitmq.log
  ######  ##        /var/log/rabbitmq/rabbit@rabbitmq-sasl.log
  ##########
              Starting broker... completed with 6 plugins.
Since we are running RabbitMQ inside Vagrant with no graphical user interface, we need to use a browser on the host machine running Vagrant if we want to see the RabbitMQ admin interface. This is what the current network topology looks like:
Host -> Vagrant -> RabbitMQ
We need to do some port forwarding in order to make all three levels of indirection work. The included Vagrantfile already forwards port 49900 from the host to Vagrant. By default, the RabbitMQ admin interface runs on port 15672, so now we need to connect the Vagrant port to the RabbitMQ container. The -p option specifies port forwarding, separated by :, with the syntax INTERFACE:HOST_PORT:DESTINATION_PORT. We left the interface empty, which means it will bind to all interfaces. The port forwarding looks like this:
Host:49900 -> Vagrant:49900 -> RabbitMQ:15672
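For reference, the host-to-Vagrant half of that mapping comes from the included Vagrantfile; the relevant section looks roughly like this (an approximation; check the actual Vagrantfile in the repository):
# Vagrantfile (excerpt, approximate)
Vagrant.configure("2") do |config|
  # forward host port 49900 to guest port 49900
  config.vm.network :forwarded_port, guest: 49900, host: 49900
end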
If everything is set up correctly, we can browse the RabbitMQ admin interface from our host browser by going to localhost:49900 (login: guest/guest).
Run the Clients
We need to connect two more terminals to Vagrant since both apps output to stdout while they are running.
Let's start the publisher first (client1.rb)
#> docker run -i -t -link rabbitmq:rabbitmq rabbitmq_client client1.rb
Now the subscriber (client2.rb).
#> docker run -i -t -link rabbitmq:rabbitmq rabbitmq_client client2.rb
If everything goes well, you should see the subscriber output the message counter on the screen.
-link takes two values, NAME:ALIAS. NAME is the target container to link to, and ALIAS affects how the current container refers to the linked container. If you look at client1.rb, it uses environment variables to find RabbitMQ.
@conn = Bunny.new :host => ENV["RABBITMQ_PORT_5672_TCP_ADDR"], :port => ENV["RABBITMQ_PORT_5672_TCP_PORT"]
The IP address and port environment variables are passed in by Docker and are constructed from the link name and the port RabbitMQ exposes (via EXPOSE in RabbitMQ's Dockerfile). The environment variable names are prefixed with the link name, and the prefix can be changed via the alias.
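For reference, the linked client container ends up with environment variables along these lines (the IP address shown is made up); the last line is just a quick way to dump them from Ruby:
# Examples of the variables Docker injects into the linked container:
#
#   RABBITMQ_PORT=tcp://172.17.0.2:5672
#   RABBITMQ_PORT_5672_TCP=tcp://172.17.0.2:5672
#   RABBITMQ_PORT_5672_TCP_ADDR=172.17.0.2
#   RABBITMQ_PORT_5672_TCP_PORT=5672
#   RABBITMQ_PORT_5672_TCP_PROTO=tcp

# A quick way to inspect them from inside client1.rb or client2.rb:
ENV.select { |k, _| k.start_with?("RABBITMQ_") }.each { |k, v| puts "#{k}=#{v}" }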
Clean Up
The two RabbitMQ clients can be terminated by pressing Ctrl-C. However, RabbitMQ doesn't trap the signal, so we stop its container via docker stop.
#> docker stop rabbitmq
Caveat
Links can only be created once. If you try to run the container again (even when the container is stopped), you will see the following error:
Error: create: Conflict, The name rabbitmq is already assigned to 93f03f024728. You have to delete (or rename) that container to be able to assign rabbitmq to a container again.
Behind the scenes, docker run is composed of two operations: create and start. Since the container has already been created, all we need to do is start it:
#> docker start rabbitmq
Note
All containers can be run in the background without blocking the terminal. In this tutorial we blocked the terminal for debugging purposes. We can run in detached mode by passing -d to docker run.
Notice that the RabbitMQ clients are actually running CentOS images, while the RabbitMQ server is running Ubuntu. There is no requirement for which flavor of Linux the containers run; it doesn't even have to match the host OS.
Written by Aaron Feng
Latency Effect on High Scale ELB
“When dealing with large scale applications, any inefficiencies will amplify the tiniest of cracks in your application. Until you reach some level of scale, these issues usually go unnoticed. You need to carefully monitor the application, otherwise the end users’ experience can be negatively impacted. In this post, we will look at how increased ELB latency can cause dropped requests even when the backend servers seem healthy.”
Elastic Load Balancer (ELB) is Amazon Web Services' load balancer offering, which helps distribute load evenly across servers. It has been designed to handle high traffic (as long as the traffic isn't spiky). An ELB can grow and shrink as your traffic increases or decreases without you having to manage anything. It also integrates with CloudWatch (Amazon's monitoring service), which exposes important metrics that give you visibility into how your backend servers and the ELB are doing. These metrics are worth paying attention to, especially if you are running "at scale". Proper tuning is required to ensure you have enough backend server capacity so that the ELB is not forced to drop any requests.
One of the clients we are working with processes billions of requests per day and recently noticed that millions of requests were being dropped daily. We made sure all the backend servers behind the ELB were healthy and operating within capacity (in fact, CPU utilization was under 40% across the backend cluster).
The next step was to figure out whether the ELB itself was dropping requests. Since an ELB scales up or down based on the traffic pattern, there is a chance the ELB is under capacity during high-traffic hours. One way to diagnose the issue is to look at the HTTPCode_ELB_5XX and HTTPCode_Backend_5XX metrics in CloudWatch. HTTPCode_Backend_5XX counts the requests that the backend servers couldn't handle. HTTPCode_ELB_5XX counts the requests that were rejected at the ELB level, including any backend issues. To figure out how many errors were caused by the ELB itself, we can subtract the two (HTTPCode_ELB_5XX - HTTPCode_Backend_5XX). However, in reality, it's more complicated than that.
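This is not from the original investigation, but as a sketch of how you might pull those numbers with the Ruby AWS SDK (the aws-sdk-cloudwatch gem, which postdates this post); the load balancer name, region, and time window are placeholders:
# Sum both 5XX metrics over the last hour and estimate ELB-only errors.
require "aws-sdk-cloudwatch"

cw  = Aws::CloudWatch::Client.new(region: "us-east-1")
dim = [{ name: "LoadBalancerName", value: "my-elb" }]   # placeholder ELB name

def sum_metric(cw, metric, dim)
  resp = cw.get_metric_statistics(
    namespace:   "AWS/ELB",
    metric_name: metric,
    dimensions:  dim,
    start_time:  Time.now - 3600,   # last hour
    end_time:    Time.now,
    period:      300,               # 5-minute buckets
    statistics:  ["Sum"]
  )
  resp.datapoints.map(&:sum).reduce(0, :+)
end

elb_5xx     = sum_metric(cw, "HTTPCode_ELB_5XX", dim)
backend_5xx = sum_metric(cw, "HTTPCode_Backend_5XX", dim)

# Rough estimate of errors generated by the ELB itself (see the caveat in the next paragraph).
puts "ELB-only 5XX (approx): #{elb_5xx - backend_5xx}"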
Before the ELB decides to route requests to any backend servers, it tries to verify that the servers are healthy and able to process requests, based on how the ELB is configured. If, for some reason, the backend cluster is responding to the ELB with high latency, those failures are also counted as HTTPCode_ELB_5XX errors, so it is important to pay attention to the ELB Latency metric. Generally speaking, the quickest way to fix this issue is to increase backend cluster capacity if you see a correlation between increased Latency and HTTPCode_ELB_5XX. We were able to reduce the request errors at peak from hundreds of millions to less than one hundred just by tuning the number of backend instances.
Increasing capacity may not be a good long-term solution, since it might just hide the root issue while spending extra money unnecessarily. Monitoring the CPU is a good start, but every application has its own unique bottleneck. It is worth taking the time to look through your metrics and understand the profile of the application. I will discuss fixing the root cause in a future post.
Related sources:
Monitor Your Load Balancer Using Amazon CloudWatch
Best Practices in Evaluating Elastic Load Balancing
Written by Aaron Feng
Immutable AWS Deployment Pipeline
For the impatient readers, there’s a diagram below that shows the whole deployment pipeline. I would still suggest that you read the post for a deeper understanding.
Many organizations make the mistake of not leveraging the Amazon Machine Image (AMI) for AWS deployments. The most common deployment strategy is to provision new nodes from top to bottom as the nodes are being launched. Provisioning just-in-time can lead to slow and brittle deployment cycles. Running system updates, downloading packages and applying configuration can take a very long time. What’s worse is that this time is wasted for every machine you provision in AWS. I have seen machines take more than 30 minutes to become useful. If anything goes wrong during provisioning, the machine will not function as expected, which leads to brittle deploys. One way to solve these problems is to deploy via AMI.
The AMI deployment strategy has been perceived as an unmanageable manual process, and the bigger issue is how to update the running system. Those concerns are valid, but it doesn’t have to be that way. Bundling software into an Amazon Machine Image (AMI) is by far the most reliable and fastest way to deploy applications on Amazon Web Services (AWS). The unmanageable manual process can be eliminated with automation. If the system needs to be updated, a new AMI can be built and deployed side-by-side with the old nodes, initially receiving only a portion of the traffic. Once the new nodes have been proven to function correctly, the old nodes are decommissioned. This is typically referred to as a canary deployment, which also removes the need for any downtime during deployments.
The AMI is considered immutable because once it is built, the configuration will not be changed (from human intervention perspective). In order to release the next version of the software, a new AMI is built from a clean base, not from the previous version. In this post, I will provide a high level description of all the necessary components to build an “Immutable AMI Deployment Pipeline”. Below is a description of one way to build this deployment pipeline, but there are probably many ways to achieve the same outcome.
The Setup:
Source Control
In the context of a deployment pipeline, source control provides a way for developers to transmit code to a known location so the software can be built by the packager (the next step). The most important decision here is which branch of the software the deployment pipeline builds from.
Packager
This step pulls the bits from source control and packages up all of the software in an automated fashion. The easiest way is to inject a custom tool at the end of your Continuous Integration (CI) runs. I recommend using your distribution's package type to package the software. I usually like to use fpm (https://github.com/jordansissel/fpm) to build my packages; it is very flexible in terms of what it can build and is easy to get started with. All the heavy lifting of getting the application running should be done at this step. For example, if the application requires an upstart file, this tool should be able to construct one on the fly. Dependency management is another common task: if the app requires nginx, the tool should be able to "include" it. That can be as easy as depending on another package, or more complicated, such as running a Chef script. The most important part of this step is that it versions the package properly so it is clear what has been deployed. Depending on the complexity of the software, it might need some kind of metadata configuration file for this step to glue it all together. I typically use a yaml file that is versioned with the software to provide the hints.
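To make that concrete, here is a rough sketch (not the actual tool described above) of a packaging step that reads a small metadata file versioned with the app and shells out to fpm. The file name, keys, paths, and post-install script are all placeholders:
# Run at the end of a CI build: read package.yml, then build an RPM with fpm.
require "yaml"

meta = YAML.load_file("package.yml")   # e.g. { "name" => "awesome-service",
                                       #        "version" => "1.2.3",
                                       #        "depends" => ["nginx"] }

depends = meta.fetch("depends", []).flat_map { |d| ["--depends", d] }

system("fpm",
       "-s", "dir",                                    # package a directory
       "-t", "rpm",                                    # build an RPM
       "-n", meta["name"],
       "-v", meta["version"],
       "--after-install", "scripts/post_install.sh",   # e.g. drop in the upstart file
       *depends,
       "build/=/opt/#{meta['name']}") or abort("fpm failed")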
Artifact Repository
Once the artifact has been built, it needs to be stored in a location where it can be retrieved for installation. The artifact repository also serves as a catalog of all the software that has been built and released. If you are using RPM as your package type, it makes sense to store artifacts in a yum repository. However, it can be as simple as a file server.
AMI Provisioner
The AMI provisioner is a tool that provisions an instance, installs the necessary software, and then creates an AMI at the end of the run. I typically use Chef Solo (or the like) to provision the instance to a point where the target software package can be installed on top of it. For example, if you are running a Java application, Java will be installed via Chef before the target package. Once all the software has been installed, an AMI needs to be created. This can be done by using the AWS SDK. It is also possible to use an open source AMI creation tool such as Aminator (https://github.com/Netflix/aminator) if you choose not to roll your own. At the end of the run, it should have created an AMI named and versioned to clearly identify the software it contains.
This tool should create the AMI in the development AWS account and then grant access to the production account. I will talk more about this in the next step.
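As a rough sketch of that step (not the tooling described above), creating and sharing the AMI with the Ruby AWS SDK might look like the following; the instance ID, naming scheme, region, and account ID are placeholders:
# Bake an AMI from the freshly provisioned build instance, then share it
# with the production account.
require "aws-sdk-ec2"

ec2 = Aws::EC2::Client.new(region: "us-east-1")

resp = ec2.create_image(
  instance_id: "i-0123456789abcdef0",   # build instance (placeholder)
  name:        "awesome-service-1.2.3-#{Time.now.strftime('%Y%m%d%H%M%S')}",
  description: "awesome-service 1.2.3 baked by the AMI provisioner"
)
image_id = resp.image_id

# Wait until the AMI is available, then grant launch permission to production.
ec2.wait_until(:image_available, image_ids: [image_id])
ec2.modify_image_attribute(
  image_id: image_id,
  launch_permission: { add: [{ user_id: "123456789012" }] }   # production account ID (placeholder)
)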
AWS EC2 Environments
Before the software can be released to production, it should be tested. A separate AWS account is recommended in order to provide isolation from production nodes. The previous step should have produced an AMI that is available to two AWS accounts (production and development). Depending on the organization's structure, the developers might only have access to the development AWS account, since you might not want everyone to be able to mess around in production. This typically applies to larger organizations with a separate team for the production environment. Regardless of your organization's structure, two AWS accounts should be used for complete isolation. This way, developers have complete freedom to experiment in the development account.
Deployment Orchestrator
We need a tool to launch the AMI in the development and production accounts. I also recommend launching all your services inside an Auto Scaling Group (ASG), even if you don’t plan to scale up and down. There are many cases where nodes might be terminated unexpectedly; using an ASG ensures any terminated instances are replaced automatically.
Hopefully your software is designed to distribute load across multiple nodes. Once the AMI has been deployed inside an ASG, it can easily be scaled up and the old software scaled down. This will be another tool that interfaces with the AWS SDK to create ASGs in an automated fashion. The tool should also properly name and tag the ASG so it is clear what software has been deployed. If CloudWatch (or some other alerting system) is used, you should set up the proper alerting at this step.
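A minimal sketch of what that orchestration step could look like with the Ruby AWS SDK (the aws-sdk-autoscaling gem, which postdates this post); the AMI ID, instance type, sizes, zones, and names are placeholders:
# Launch the freshly baked AMI inside a new, clearly named Auto Scaling Group.
require "aws-sdk-autoscaling"

asg = Aws::AutoScaling::Client.new(region: "us-east-1")

lc_name  = "awesome-service-1.2.3"
asg_name = "awesome-service-1.2.3"

asg.create_launch_configuration(
  launch_configuration_name: lc_name,
  image_id:      "ami-0123456789abcdef0",   # AMI produced by the previous step (placeholder)
  instance_type: "m3.medium"
)

asg.create_auto_scaling_group(
  auto_scaling_group_name:   asg_name,
  launch_configuration_name: lc_name,
  min_size: 2,
  max_size: 10,
  availability_zones: ["us-east-1a", "us-east-1b"],
  tags: [{ key: "version", value: "1.2.3", propagate_at_launch: true }]
)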
It is also common to inject some environment-specific configuration at this step. AWS provides an easy way to run arbitrary scripts when an instance boots, called user data. If you have a lot of environments or need to change configuration at runtime, this might not be the best option; it is better for the application to have a way to retrieve configuration dynamically. This can be done by calling some external service before the application is fully booted. The application then monitors the external service in order to pick up new configuration values. The configuration service is most likely not required to get the deployment pipeline going.
If you do not wish to build something custom, Asgard (https://github.com/Netflix/asgard) may be used at this step. Asgard may be too opinionated since it was designed to deploy Netflix services, but it is worth checking out as a possible solution.
The Complete Pipeline
Once all the pieces are in place and glued together, you have a complete pipeline. Any developer (or anyone, if you have a fancy UI) should be able to deploy a version of the software to production with minimal effort and risk. In order for new software to get out the door, new nodes must be provisioned. This provides an easy way to roll back to the previous version and avoids any manual cruft that might have accumulated over time. Any manual manipulation of the nodes will be wiped out in the following release; the only real way for those changes to stick is to include them as part of the pipeline. This is a high-level overview of how a deployment pipeline can be constructed. Some level of engineering is required, but it will enable you to deliver value to your end users quickly and in a robust fashion.
If you have any suggestion or questions, feel free to drop me an email: [email protected]
If you are interested in building a deployment pipeline like this for your organization please email: [email protected]. We can help.
Written by Aaron Feng
Avoiding Chef-Suck with Auto Scaling Groups
I have been instrumental in helping implement a cloud solution for a large client that is interested in hosting their new applications in AWS rather than in their data center. The client was already using Chef for all of their data center deployments, so it was only natural to leverage the same technology in AWS. We shared the same Chef server for both the AWS and data center infrastructure. All of the AWS nodes are inside a VPC with a VPN tunnel back to the data center. This was done so we wouldn’t have the overhead of managing another Chef server just for the AWS infrastructure; also, all of the software packages reside behind the firewall.
Below are some of the issues we encountered using Chef with the AWS setup:
Failed first Chef run
Failed Chef runs are inevitable. However, the first Chef run is the most important since it registers the node with the Chef server so it can be managed in future Chef runs. If, for whatever reason, the first Chef run fails, you are screwed: manual intervention will be required to join the node (I’ll give an example below).
In the data center, this is less of an issue. Since all the nodes are effectively “static”, you know when the first Chef run will happen (adding new machines, provisioning a new OS), so it is much easier to monitor for failed Chef runs. However, in AWS, auto scaling kicks in whenever it needs to, so it’s much harder to monitor. Since new nodes are being provisioned all the time (auto scaling), the likelihood of a failed first Chef run is much higher.
I know what you’re thinking: if the cookbooks are written properly, the Chef run wouldn’t fail. This is mostly true. But in AWS, IP addresses are recycled, and we were running inside a VPC, so the subnet IP range is much smaller. IP collisions happen all the time. When a node is terminated by the ASG, the Chef server doesn’t know about it, so the next time that same IP address appears, the Chef server will reject it. To combat this issue, we had to create additional tooling to make sure stale IP addresses are cleaned out of the Chef server.
Provision time
The Chef client isn’t the fastest program in the world, but that’s not really the true issue. To provision new nodes we run all the recipes and transfer all the necessary packages from our data center to the cloud. This can take some time, since provisioning from top to bottom is very common for us. You can tolerate this in the data center since it only really happens once. In AWS, the first Chef run can take up to 25 minutes to complete, and in the meantime our cluster is getting killed. There might be ways to optimize the provisioning time, but the bottom line is that it will always take more time than we would like.
False auto scale trigger
An ASG can be triggered in a variety of ways; for some of the applications, we use average CPU. Since the Chef run happens at a predictable interval (default 30 minutes), it uses up enough resources to trigger the ASG when the server load is near the threshold. This is an issue because after the Chef run the load goes back down, and the ASG will then terminate the newly provisioned node. We can tweak our ASG thresholds, but this became a hassle since every application has a different ASG profile.
So, what did we end up doing?
In order to combat the above issues, we ended up creating a simple tool for developers to create (bake) an Amazon Machine Image (AMI) on demand, Netflix-style. It runs Chef Solo behind the scenes, since we already have all the recipes. Most of the applications are Java, so we just created a common recipe. In many cases, the person deploying does not need to be technical or do any Chef work; it’s just one button push away.

The important takeaway is that we version our image (the AMI in this case) with the software. In general, we run one service per box, and each service can be clustered (horizontal scalability). The name of the image correlates to the service version baked in. In order to release a new version of the service, a new image has to be baked, and then we rotate it into the cluster. If there are no issues with the new service, the old cluster is taken down.

The really nice thing about this approach is that we have all the previous versions “frozen” in their exact state. If anyone needs to hunt down a bug, it will be very easy to replicate the exact production environment. Since everything is in an ASG, all the nodes will be launched in a predictable fashion and as fast as possible, since provisioning isn’t happening just in time. We no longer have to worry about failures during provisioning time.
An important thing to note is that when we build a new image, we do not build it from the previous version. It starts from a known clean base, which is usually pretty bare. This way, there’s no cruft accumulating over time.
WAIT, what about configuration management?
I know what you’re thinking: how do you manage configuration? How do you do service discovery? These two things were previously managed by Chef runs, and we no longer run the Chef client on our boxes. To achieve this, we moved both into the application tier. This was not done solely for the sake of removing Chef so we could build the AMI.
Very often we want to change our application configuration as quickly as possible without waiting for some predetermined period. We introduced a REST configuration service that the application calls out to before starting. If for some reason that service is down, the application bootstraps itself using a known configuration, typically injected via user data at launch time.
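As a rough illustration only (the endpoint, keys, and defaults below are hypothetical, not the actual service), the bootstrap logic amounts to something like this:
# Fetch configuration from the config service; fall back to known-good defaults.
require "net/http"
require "json"
require "uri"

DEFAULTS = { "queue_host" => "localhost", "pool_size" => 5 }  # in practice injected via user data

def load_config
  resp = Net::HTTP.get_response(URI("http://config.internal:8080/apps/awesome-service"))
  resp.is_a?(Net::HTTPSuccess) ? DEFAULTS.merge(JSON.parse(resp.body)) : DEFAULTS
rescue StandardError
  DEFAULTS   # config service unreachable: boot with the defaults
end

CONFIG = load_config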
Service discovery [1] is also moved to the application tier because everything is inside an ASG and nodes in the cluster are always appearing or disappearing. One service needs to know where another service is, so a REST service [2] is provided for nodes to discover each other. When a node gets launched, it registers itself (and unregisters during termination). It communicates with the discovery service periodically; otherwise, it is marked as unavailable. Before a service is launched, it retrieves the necessary info via the discovery service. The discovery service is a critical component: it runs in a cluster with replication to avoid a single point of failure. A Chef run would not save us here, since it doesn’t know when the cluster will be resized.
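To illustrate the register/heartbeat/unregister lifecycle, here is a hypothetical sketch against a generic discovery REST service; the endpoints, host names, and payloads are made up and are not Eureka’s actual API (see note [2] below):
require "net/http"
require "json"
require "uri"

DISCOVERY = URI("http://discovery.internal:8080")        # placeholder host
APP       = "awesome-service"
INSTANCE  = { "host" => "10.0.1.23", "port" => 8080 }    # placeholder instance info

def post_json(path, body)
  Net::HTTP.post(URI.join(DISCOVERY.to_s, path), body.to_json,
                 "Content-Type" => "application/json")
end

# Register on boot.
post_json("/services/#{APP}/instances", INSTANCE)

# Heartbeat periodically; missed heartbeats mark the instance unavailable.
Thread.new do
  loop do
    post_json("/services/#{APP}/instances/#{INSTANCE['host']}/heartbeat", {})
    sleep 30
  end
end

# Unregister on termination.
at_exit do
  Net::HTTP.start(DISCOVERY.host, DISCOVERY.port) do |http|
    http.delete("/services/#{APP}/instances/#{INSTANCE['host']}")
  end
end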
It takes a decent amount of work to get here, but it’s possible to take an incremental approach and evolve over time. We evolved the infrastructure while we were running in production. This approach results in your application being more robust, resilient and predictable. This system currently handles billions of requests per day, and the change has drastically sped up deployment and greatly reduced deployment-related issues.
I am interested in hearing about your experiences, if you use Auto Scaling Groups, Chef or are experiencing pain in the cloud I'd love to hear from you: [email protected].
[1] Service discovery is when a service (application) is dependent on another service to perform an operation. For example, you might have an aggregation service that needs to call out to multiple sources (services) in order to complete its task. This is very typical in an SOA environment where a discrete function has been broken up into separate services in order to leverage isolation and horizontal scalability. The simplest way to achieve this is to hard code the dependent services into a configuration file. This approach will not work when services can appear dynamically.
[2] We are using Eureka from Netflix for our service discovery, but there are a number of solutions out there (ZooKeeper, HAProxy, etc.) which I will discuss in a future post. Eureka offers a REST interface and a client SDK (if you are on Java) for easy consumption. The bottom line is that this is the most resilient solution we have found so far.
Written by Aaron Feng