Ylastic – AWS Account Advisor
Ylastic yesterday announced a new addition to their product; see the release below and go check it out here. Another great addition to one of my favourite tools.
AWS Account Advisor
Introducing the Ylastic AWS Account Advisor, a tool that will inspect your AWS environment and identify opportunities for optimizing your usage of AWS. In this initial release, the advisor can run ten different checks against your AWS account. The checks are broadly divided into four categories:
Cost Optimization - Opportunities for reducing costs by detecting unused volumes, elastic load balancers, elastic ip addresses and Route 53 zones. These checks will also display an estimated cost saving per month and per year from removing the unused resources.
Disaster Recovery - Check your ability to recover from system wide failures by detecting volumes that are in-use but not being backed up to snapshots. The advisor will also flag volumes that have snapshots older than several days, as that may be an indication that the backups are getting stale.
Fault Tolerance - Identifies situations that can impact your ability to recover from the failure of an EC2 availability zone, by checking if your elastic load balancers have distributed allocation of instances, as well as if you have instances distributed in more than one zone.
Security Audit - Secure access to your resources by detecting security groups that provide public access to sensitive ports or port ranges, as well as S3 buckets that can be listed by anonymous users across the internet.
The advisor is a feature available in the Ylastic Plus version and can be accessed from the Analyze section in the navigation menu. We are working on adding the ability to run the advisor on a schedule, as well as enhancing the advisor with additional checks based on customer feedback.
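For anyone curious what such a check looks like in practice, here is a rough sketch of the unused-resource idea using the modern AWS CLI. This is illustrative only, not how Ylastic implements it, and the region is an assumption.

# EBS volumes not attached to any instance ("available" = unused but still billed)
aws ec2 describe-volumes --region us-east-1 \
    --filters Name=status,Values=available \
    --query 'Volumes[].[VolumeId,Size,CreateTime]' --output table
# Elastic IPs; rows with an empty AssociationId are allocated but unused, and therefore charged
aws ec2 describe-addresses --region us-east-1 \
    --query 'Addresses[].[PublicIp,AssociationId]' --output table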
Top Cloud Companies
Hardware companies are on a downhill slope while server enterprises are on a steady uphill trek and startups continue to thrive in the IT industry. The question is: which of these cloud companies is on its way to the top? Based on the current enterprise trend, those with the most scalable platforms and easily integrated frameworks get the best leverage.
In terms of cloud infrastructure, Amazon.com seems to have the upper hand: it is almost impossible to mention a cloud provider without mentioning this cloud giant. Currently, Amazon has 60,000 users, including pharmaceutical companies, major banks and several other large companies. It has earned this status through the raw compute capability of Amazon Web Services (AWS), which is now considered one of the top supercomputers of this age.
Rackspace Hosting Cloud Servers also makes the top of the list in infrastructure computing. With almost 80,000 servers operating across the globe by the 3rd quarter of 2011, Rackspace shares Amazon's principle that "bare" cloud computing gives users more control over their business solutions. This holds true for Amazon and Rackspace, but would be very difficult for small players who cannot afford to invest in even 1,000 server hosts. In the meantime, Rackspace and Amazon dominate this scale-utility arena. Other utility service providers are also trying to make a mark, including Verizon (which recently acquired Terremark), AT&T (with Synaptic Compute), IBM and VMware.
There are companies that prefer to manage their business solutions on their own platform. This is very useful for companies with uniquely defined projects that use their own software. Salesforce.com, through Force.com, and Microsoft's Azure lead this cloud platform segment. Force.com offers a sort of templated programming for those who prefer object-oriented development that is already built in and programmed within the platform. Microsoft Azure, on the other hand, provides compute solutions through its .NET services. Google, alternatively, delivers its cloud platform through the software and infrastructure behind its search engine.
In the area of cloud storage, Amazon's Simple Storage Service tops the list for its ability to handle as many as 500,000 requests per second even during peak hours. EMC, VMware's parent company, also continues to post record-high revenues from its storage services.
12,233 Tweets a second - that would be Twitter
The results are in: People love to share the experience of watching the Super Bowl with millions of other viewers from around the world. Whether it’s cheering for your team, commenting on the halftime show or discussing the ads, it seems you have a lot to say about this annual tradition—and you’re using Twitter to say it.
Here’s a fun fact: in 2008, Twitter’s largest spike in Tweets per second (TPS) during the Super Bowl was just 27. Three years later, fans sent 4,064 TPS, which was the highest TPS for any sporting event at that time.
This year, the TPS peak was 12,233 Tweets. The spike took place in the final three minutes of the game, during which fans sent an average of 10,000 TPS. Madonna’s performance during halftime was a big hit, too—there was an average of 8,000 TPS sustained during her performance, with a peak of 10,245 Tweets.
In the image above, you can see a few more data points from last night, including how some of the hashtags performed that were displayed on-air in the ads. If you missed any of those ads, head to adscrimmage.twitter.com to check them out and vote for your favorite via Tweet.
http://blog.twitter.com/2012/02/post-bowl-twitter-analysis.html
Amazon S3 reduces storage costs
As you can tell from this recent post on Amazon S3 Growth for 2011, customers are uploading new objects to Amazon S3 at an incredible rate. Amazon says it continues to innovate on customers' behalf to drive down storage costs and pass along the resulting savings at every possible opportunity, and it is now introducing a price change.
With this price change, all Amazon S3 standard storage customers will see a significant reduction in their storage costs. For instance, if you store 50 TB of data on average you will see a 12% reduction in your storage costs, and if you store 500 TB of data on average you will see a 13.5% reduction in your storage costs.
Effective February 1, 2012, the following prices are in effect for Amazon S3 standard storage in the US Standard region:
Storage          Old (GB / Month)    New (GB / Month)
First 1 TB       $0.140              $0.125
Next 49 TB       $0.125              $0.110
Next 450 TB      $0.110              $0.095
Next 500 TB      $0.095              $0.090
Next 4000 TB     $0.080              $0.080 (no change)
Over 5000 TB     $0.055              $0.055 (no change)
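To sanity-check the quoted savings, here is a small sketch of my own (not from the AWS announcement) that walks the tiers above for a given usage; with 50 TB it reproduces the roughly 12% reduction mentioned earlier. It assumes 1 TB = 1024 GB.

usage_tb=50
awk -v tb="$usage_tb" 'BEGIN {
  split("1 49 450 500 4000 999999", tier);                 # tier sizes in TB
  split("0.140 0.125 0.110 0.095 0.080 0.055", oldp);      # old $/GB/month
  split("0.125 0.110 0.095 0.090 0.080 0.055", newp);      # new $/GB/month
  left = tb;
  for (i = 1; i <= 6 && left > 0; i++) {
    take = (left < tier[i]) ? left : tier[i];
    oldc += take * 1024 * oldp[i];
    newc += take * 1024 * newp[i];
    left -= take;
  }
  printf "old: $%.2f  new: $%.2f  saving: %.1f%%\n", oldc, newc, 100 * (oldc - newc) / oldc;
}'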
The prices for all of the other regions are listed on the Amazon S3 Pricing page. For the AWS GovCloud region, the new lower prices can be found on the AWS GovCloud Pricing page.
There's been a lot written lately about storage availability and prices. Amazon often talks about the benefits that AWS's scale and focus creates for its customers. Their ability to lower prices again now is an example of this principle at work.
It might be useful to remember that an added advantage of using a cloud storage service such as Amazon S3 over running your own on-premises storage is that the price reductions Amazon regularly rolls out apply not only to any new storage that you add but also to the existing storage that you already have. This can amount to considerable financial savings.
State of NoSQL in 2012
Here is a guest post by Siddharth Anand about how all things NoSQL are shaping up in 2012.
Preamble Ramble
If you’ve been working in the online (e.g. internet) space over the past 3 years, you are no stranger to terms like “the cloud” and “NoSQL”.
In 2007, Amazon published a paper on Dynamo. The paper detailed how Dynamo, employing a collection of techniques to solve several problems in fault-tolerance, provided a resilient solution to the on-line shopping cart problem. A few years go by while engineers at AWS toil in relative obscurity at standing up their public cloud.
It’s December 2008 and I am a member of Netflix’s Software Infrastructure team. We’ve just been told that there is something called the “CAP theorem” and because of it, we are to abandon our datacenter in hopes of leveraging Cloud Computing.
Huh?
A month into the investigation, we start wondering about our Oracle database. How are we going to move it into the cloud? That's when we are told that we are to abandon our RDBMS too. Instead, we are going to focus on "AP"-optimized systems. "AP", or high-availability, systems are our focus in 2008-2010 as Netflix launches its video streaming business. What's more available than TV, right? Hence, no downtime is allowed, no excuses!
Fast forward to the end of 2011: the past 3 years have been an amazing ride. I helped Netflix with two migrations in 2010: the first was from Netflix’s datacenter to AWS’ cloud, the second from Oracle to SimpleDB. 2011 was no less exciting: with a move from SimpleDB to Cassandra, Netflix was able to expand overseas to the UK and Ireland. Since Cassandra offers configurable routing, a cluster can be spread across multiple continents.
Fast forward to today: I’ve completed my first month at LinkedIn. I’ve spent this month getting familiar with the various NoSQL systems that LinkedIn has built in-house. These include Voldemort (another Dynamo-based system), Krati (a single-machine data store), Espresso (a new system being actively developed), etc… LinkedIn is now facing similar challenges to the Netflix of 3 years ago. Traffic is growing and both Oracle and the datacenter face potential obsolescence.
NoSQL and Cloud Computing to the rescue?
The State of NoSQL Today
Ignoring the datacenter vs. public cloud question for the time-being, what would I pick today regarding a NoSQL alternative to Oracle? Not many people get a chance to solve the same tough problem twice.
For one, there are still quite a few NoSQL alternatives in the market. Some are supported by startups (e.g. Cassandra, Riak, MongoDB, etc..), some are supported indirectly by companies (e.g. LinkedIn’s support of Voldemort, Facebook & StumbleUpon’s support of HBase), and some are supported directly by companies (e.g. AWS’s S3, SimpleDB, and DynamoDB, etc… ).
In making a decision, I’ll consider the following learnings:
Any system that you pick will require 24-7 operational support. If it is not hosted (e.g. by AWS), be prepared to hire a fleet of ops folks to support it yourself. If you don’t have the manpower, I recommend AWS’ DynamoDB
Just because the company got by with one big machine for Oracle, don’t be surprised if the equivalent NoSQL option results in 36 smaller machines. All complete solutions to fault-tolerance support “rebalancing”. Rebalancing speed is determined by data size of a shard. Hence, it’s better to keep the size per shard reasonable to minimize MTTR in times of disaster.
Understand the limitations of your choice:
MongoDB, at the time of this writing, has a global write lock. This means that only one write can proceed at a time on a node. If you require high write throughput, consider something else.
Cassandra (similar to other Dynamo systems) offers great primary key-based access operations (e.g. get, put, delete), but doesn't scale well for secondary-index lookups.
Cassandra, like some other systems, has a lot of tunables and a lot of internal processes. You are better off turning off some features (e.g. Anti-entropy repair, row cache, etc…) in production to safeguard consistent performance.
Many of the NoSQL vendors view the "battle of NoSQL" as akin to the RDBMS battle of the 80s, a winner-take-all battle. In the NoSQL world, it is by no means a winner-take-all battle. Distributed systems are about compromises.
A distributed system picks specific design elements in order to perform well at certain operations. These design elements comprise the DNA of the system. As a result, the system will perform poorly at other operations. In the rush to win mindshare, some NoSQL vendors have added features that don’t make sense for the DNA of the system.
I’ll cite an example here. Systems that shard data based on a primary key will do well when routed by that key. When routed by a secondary key, the system will need to “spray” a query across all shards. If one of the shards is experiencing high latency, the system will return either no results or incomplete (i.e. inconsistent) results. For this reason, it would make sense to store the secondary index on an unsharded (but replicated) system. This concept has been utilized internally at Netflix to support internal use-cases. Secondary indexes are stored in Lucene to point to data in Cassandra.
LinkedIn is following the same pattern in the design of its new system, Espresso. Secondary indexes will be served by Lucene. The secondary index will return rowids in the primary store.
This brings me to another observation. In reviewing the Voldemort code recently, I was impressed by the clarity and quality of the code. In core systems, whether distributed or not, code quality and clarity go a long way. Although Voldemort has been an open source project for years, a high degree of discipline has been maintained. I was also impressed by the simplicity of its contract: get(key), put(key, value), and delete(key). The implementors understood the DNA of the system and did not add functionality that is ill-suited to that DNA.
In a similar vein, Krati, Kafka, Zookeeper, and a few other notable open-source projects stick to clear design principles and simple contracts. As such, they become reusable infrastructure pieces that can be used to build the distributed system that you need. Hence, the system we end up building might be composed of several specialty component systems that can be independently tuned, in some ways similar to HBase. As a counter-example, to achieve predictable performance in Cassandra without a significant investment in tuning, it may be easier to turn off features. This is because multiple features in a single machine contend for resources; since each feature has a different DNA (or resource consumption profile), performance diagnosis and tuning can be a pain point.
Scaling to 14 million users with 3 people - Instagram
Blog post by Instagram about scaling to 14 million users with a team of 3 - link
What Powers Instagram: Hundreds of Instances, Dozens of Technologies
One of the questions we always get asked at meet-ups and conversations with other engineers is, “what’s your stack?” We thought it would be fun to give a sense of all the systems that power Instagram, at a high-level; you can look forward to more in-depth descriptions of some of these systems in the future. This is how our system has evolved in the just-over-1-year that we’ve been live, and while there are parts we’re always re-working, this is a glimpse of how a startup with a small engineering team can scale to our 14 million+ users in a little over a year. Our core principles when choosing a system are:
Keep it very simple
Don’t re-invent the wheel
Go with proven and solid technologies when you can
We’ll go from top to bottom:
OS / Hosting
We run Ubuntu Linux 11.04 ("Natty Narwhal") on Amazon EC2. We've found previous versions of Ubuntu had all sorts of unpredictable freezing episodes on EC2 under high traffic, but Natty has been solid. We've only got 3 engineers, and our needs are still evolving, so self-hosting isn't an option we've explored too deeply yet, though it is something we may revisit in the future given the unparalleled growth in usage.
Load Balancing
Every request to Instagram servers goes through load balancing machines; we used to run 2 nginx machines and DNS round-robin between them. The downside of this approach is the time it takes for DNS to update in case one of the machines needs to get decommissioned. Recently, we moved to using Amazon's Elastic Load Balancer, with 3 NGINX instances behind it that can be swapped in and out (and are automatically taken out of rotation if they fail a health check). We also terminate our SSL at the ELB level, which lessens the CPU load on nginx. We use Amazon's Route 53 for DNS, for which they've recently added a pretty good GUI tool in the AWS console.
Application Servers
Next up comes the application servers that handle our requests. We run Django on Amazon High-CPU Extra-Large machines, and as our usage grows we’ve gone from just a few of these machines to over 25 of them (luckily, this is one area that’s easy to horizontally scale as they are stateless). We’ve found that our particular work-load is very CPU-bound rather than memory-bound, so the High-CPU Extra-Large instance type provides the right balance of memory and CPU.
We use http://gunicorn.org/ as our WSGI server; we used to use mod_wsgi and Apache, but found Gunicorn was much easier to configure, and less CPU-intensive. To run commands on many instances at once (like deploying code), we use Fabric, which recently added a useful parallel mode so that deploys take a matter of seconds.
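For readers unfamiliar with Gunicorn, a typical invocation looks something like the following; the project name, worker count and port are placeholders (assuming a Django project with a standard WSGI module), not Instagram's actual settings.

pip install gunicorn
gunicorn myproject.wsgi:application --workers 8 --bind 127.0.0.1:8000
# deploys across many instances can then be pushed in parallel with Fabric, e.g. fab --parallel deploy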
Data storage
Most of our data (users, photo metadata, tags, etc) lives in PostgreSQL; we’ve previously written about how we shard across our different Postgres instances. Our main shard cluster involves 12 Quadruple Extra-Large memory instances (and twelve replicas in a different zone.)
We’ve found that Amazon’s network disk system (EBS) doesn’t support enough disk seeks per second, so having all of our working set in memory is extremely important. To get reasonable IO performance, we set up our EBS drives in a software RAID using mdadm.
As a quick tip, we’ve found that vmtouch is a fantastic tool for managing what data is in memory, especially when failing over from one machine to another where there is no active memory profile already. Here is the script we use to parse the output of a vmtouch run on one machine and print out the corresponding vmtouch command to run on another system to match its current memory status.
All of our PostgreSQL instances run in a master-replica setup using Streaming Replication, and we use EBS snapshotting to take frequent backups of our systems. We use XFS as our file system, which lets us freeze & unfreeze the RAID arrays when snapshotting, in order to guarantee a consistent snapshot (our original inspiration came from ec2-consistent-snapshot). To get streaming replication started, our favorite tool is repmgr by the folks at 2ndQuadrant.
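A minimal sketch of that freeze-and-snapshot idea (the volume ID and mount point are placeholders, the modern AWS CLI is shown rather than the ec2-consistent-snapshot tool mentioned above, and with a RAID array you would snapshot every member volume):

sudo xfs_freeze -f /mnt/pgdata                       # pause writes to the filesystem
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "pg backup $(date +%F)"
sudo xfs_freeze -u /mnt/pgdata                       # resume writes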
To connect to our databases from our app servers, we made a decision early on that had a huge impact on performance: using PgBouncer to pool our connections to PostgreSQL. We found Christophe Pettus's blog to be a great resource for Django, PostgreSQL and PgBouncer tips.
The photos themselves go straight to Amazon S3, which currently stores several terabytes of photo data for us. We use Amazon CloudFront as our CDN, which helps with image load times from users around the world (like in Japan, our second most-popular country).
We also use Redis extensively; it powers our main feed, our activity feed, our sessions system (here’s our Django session backend), and other related systems. All of Redis’ data needs to fit in memory, so we end up running several Quadruple Extra-Large Memory instances for Redis, too, and occasionally shard across a few Redis instances for any given subsystem. We run Redis in a master-replica setup, and have the replicas constantly saving the DB out to disk, and finally use EBS snapshots to backup those DB dumps (we found that dumping the DB on the master was too taxing). Since Redis allows writes to its replicas, it makes for very easy online failover to a new Redis machine, without requiring any downtime.
For our geo-search API, we used PostgreSQL for many months, but once our Media entries were sharded, we moved over to using Apache Solr. It has a simple JSON interface, so as far as our application is concerned, it's just another API to consume.
Finally, like any modern Web service, we use Memcached for caching, and currently have 6 Memcached instances, which we connect to using pylibmc & libmemcached. Amazon recently launched its ElastiCache service, but it's not any cheaper than running our own instances, so we haven't pushed ourselves to switch quite yet.
Task Queue & Push Notifications
When a user decides to share out an Instagram photo to Twitter or Facebook, or when we need to notify one of our Real-time subscribers of a new photo posted, we push that task into Gearman, a task queue system originally written at Danga. Doing it asynchronously through the task queue means that media uploads can finish quickly, while the ‘heavy lifting’ can run in the background. We have about 200 workers (all written in Python) consuming the task queue at any given time, split between the services we share to. We also do our feed fan-out in Gearman, so posting is as responsive for a new user as it is for a user with many followers.
For doing push notifications, the most cost-effective solution we found was https://github.com/samuraisam/pyapns, an open-source Twisted service that has handled over a billion push notifications for us, and has been rock-solid.
Monitoring
With 100+ instances, it's important to keep on top of what's going on across the board. We use Munin to graph metrics across all of our systems, and also to alert us if anything is outside of its normal range. We write a lot of custom Munin plugins, building on top of Python-Munin, to graph metrics that aren't system-level (for example, signups per minute, photos posted per second, etc). We use Pingdom for external monitoring of the service, and PagerDuty for handling notifications and incidents.
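A custom Munin plugin is just an executable that answers "config" and then prints values. The sketch below is hypothetical (the metric and the query are mine, not one of Instagram's plugins) but shows the general shape of such a plugin:

#!/bin/sh
if [ "$1" = "config" ]; then
    echo "graph_title Signups"
    echo "graph_vlabel signups"
    echo "signups.label signups"
    echo "signups.type DERIVE"
    echo "signups.min 0"
    exit 0
fi
# emit the current counter; Munin turns the DERIVE type into a rate
echo "signups.value $(psql -At -c 'SELECT count(*) FROM auth_user;' mydb)"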
For Python error reporting, we use Sentry, an awesome open-source Django app written by the folks at Disqus. At any given time, we can sign-on and see what errors are happening across our system, in real time.
You?
If this description of our systems interests you, or if you’re hopping up and down ready to tell us all the things you’d change in the system, we’d love to hear from you. We’re looking for a DevOps person to join us and help us tame our EC2 instance herd.
A diary of scaling on the fly
A couple of weeks ago I launched a new site at work for a big retail company in the UK. The site serves up around 100,000 products and is dynamic in that it changes the products displayed based on popularity. The site has the following setup:
Front end
Amazon Elastic Load Balancer
PHP site
Apache 2.2
Ubuntu 10.10
Amazon EC2 Medium instance
API layer
Ruby on Rails
Heroku running 2 dynos
Database Layer
Amazon RDS Small - single availability zone
Monitoring
IP Patrol running HTTP checks on the site availability, API and confirming specific data is being returned in the JSON.
Amazon Cloudwatch - monitoring RDS, ELB and EC2
Day One
So all is going well on the first day and the site is holding up well.
Day Two
Traffic doubles and, looking at the monitored response times, the application is slowing down. We are not exactly sure how much traffic we need to cope with, but the RDS instance is running at about 80%.
As with general traffic behavior, the busiest time of the day is between 7PM and 9PM.
Just to make sure and take no risks, I increase the RDS instance to the next size, which is Large. This is done with zero downtime: using RDS you snapshot the database and restore it at a bigger size, and once complete you simply log in to Heroku and change the RDS connection parameter to point to the new instance.
Great - we have more throughput, response times have come down and we have some calm. Yes, it's costing us more to run the site, but this is not the time to be scrimping.
Day Three
Traffic doubles again. Checking the CloudWatch stats for RDS, the CPU usage is around 70% and the response times of the app have increased, but I am fairly comfortable.
Day Four
8PM and I get an SMS alert notifying me the site is taking over 5 seconds to respond. Traffic has increased to 3 times that of the day before. Looking at the different monitoring points (ELB, EC2, RDS and the service monitoring in IP Patrol), the database is struggling under the load. The CPU is thrashing at 100%, which is not good, and I am getting nervous.
First step is to migrate the DB to an even larger instance - it's now going up to an XLarge. The migration goes smoothly and the response times return to normal. The CPU on the RDS instance is still running at about 70% and I am not sure how much more traffic we are going to get. Time to catch a breath and think about things in the morning; the site is now costing a fair amount of money to run.
Day Five
Get into work in the morning and put our heads together to work out a solution. First step is to work out what exactly is going on, as we cannot just keep throwing hardware at the problem. We do some digging around and come across a Heroku add-on called New Relic; this is a fantastic tool whose basic version is free, and you get a Pro trial for a period of time. New Relic is installed with a few lines of code and instantly starts collecting analytics on how our API and database are being used. It quickly becomes clear what the problem is: every time a visitor hits the site, the API determines which products to display and in which order. This involves doing a MySQL select on a table with 100,000 products, a join with another table that has 200,000 rows, and then a sort - a very heavy query! The data set is too large for MySQL to handle in memory, so each time it gets a request it writes a temporary table to process the data. This is causing the high CPU load. So how do we fix it?
Step 1 - change the tables so they have indexes - this speeds up the queries and reduces the load on RDS, but not by enough
Step 2 - implement the Varnish cache on the Heroku app. This does not work, however, as the API requests come from a PHP application and Varnish requires HTTP caching headers that our cURL calls do not set. But Varnish gets me thinking...
Step 3 - install Varnish on the PHP front-end Apache server. First of all I take an image of the front-end webserver, which is straightforward as it's running on an EBS volume. Next I start up a new instance from the image and install Varnish. The Varnish install is just a matter of running "apt-get install varnish". Now that it's installed, it's a case of configuring it to work in the intended manner by doing the following:
Configure Apache to listen on port 8080 instead of port 80
Configure Varnish to receive all HTTP connections on port 80 and then forward them port 8080 on Apache
Configure Varnish to tell it which elements to cache and which not to (a sketch of the port changes and a sample load test follows this list). The site has product details pages which use AJAX calls, which means you cannot cache them. Apart from that I go for a pretty aggressive caching policy and cache everything else. Next, check the site works and functions as normal, and then it's time to do some load testing. To give some confidence that the site will hold up, I use Apache Bench and hit the server first of all with a small volume to make sure it doesn't hit the RDS database; the cache seems to be working well. The next stage is to hit it with as much as possible: the site is now taking around 500 transactions a second and the RDS database is only getting hit when visiting the product details pages and when the cache purges and needs to re-query.
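Here is the hedged sketch referred to above. The file locations assume the stock Ubuntu/Debian packaging of Apache and Varnish of that era, and the hostname is a placeholder.

sudo sed -i 's/^Listen 80$/Listen 8080/' /etc/apache2/ports.conf     # move Apache off port 80
# the default virtual host in /etc/apache2/sites-enabled also needs its port changed to 8080
sudo tee /etc/varnish/default.vcl > /dev/null <<'EOF'
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}
EOF
# in /etc/default/varnish, change DAEMON_OPTS so varnishd listens on ":80" instead of ":6081"
sudo service apache2 restart && sudo service varnish restart
ab -n 10000 -c 100 http://test.example.com/                          # 10,000 requests, 100 concurrent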
But not so fast! Looking at top on the Ubuntu EC2 instance, memory usage is pretty high and I am not very comfortable. The biggest 32-bit instance on Amazon has 1.7GB of RAM and that just isn't enough to cope with Varnish and Apache, so it's time for a 64-bit Large. This involves rebuilding the whole server on a 64-bit instance, which does not take that long. The test server is now running again, so it's time for another round of load testing. This time round Apache Bench is reporting about 1000 transactions a second, and top is reporting around 4GB out of the available 7GB of RAM being used. Now for a longer soak test using Apache Bench just to provide some confidence: 1,000,000 transactions later and it is looking good.
Go live time - switch over the Elastic Load Balancer to use the new Varnish instance and off we go
The results are as follows
Site response time goes from 2-3 seconds down to 0.2 seconds as reported by IP Patrol
Cloud Watch Elastic Load Balancer EC2 response time from 1 second down to 0.1 second
RDS CPU usage on XLarge goes from 70% usage down to 10-20%
Varnish and Apache holding steady
Overall - amazing!
Day Six and Seven
Whole site and solution are holding steady. Response times and CPU usage are continuing at very good figures and New Relic is reporting a much happier database and application.
Day Eight
Now that I am happy the site is holding up, I reduce the Amazon RDS database to a Large and might even be able to go back down to a Small. The site is really fast and I am a lot less nervous.
Summary
Make sure you have good monitoring at every level of your application to identify bottlenecks
Test thoroughly before going live - I did do this to an extent but didn't appreciate the volume of traffic that would be coming my way
Caching is fantastic! If you have the right sort of site or application and can cache elements, do! It will save you time, money and stress
Useful Links
New Relic
Install guide for Varnish
IPPatrol 
There is no stopping the cloud
Good article on The Register about Cloud adoption with IT companies link
Open... and Shut: Cloud computing is big business, in part because companies are happy to shell out lots of cash to buy themselves time and development flexibility.
In this quest to displace the operations bottleneck that exists within enterprises, developers are taking on more of the operations role for themselves and to reduce this new burden have started a mad rush to run anything and everything in the cloud.
We've long talked about the need to outsource everything but one's core business, but it's developers who are doing this far more than any other group within the enterprise.
Now we have databases in the cloud, logging in the cloud, and even network monitoring, thanks to Boundary's new service, in the cloud.
In this way the rising generation of cloud developers has managed to have its cake (writing code) and eat it, too (without being overwhelmed by Operations-induced bureaucracy). This hasn't been without pain. As Forrester analyst James Staten points out, old-school infrastructure and operations professionals have a very different attitude toward public cloud computing than developers do. The former group is willing to accept the cloud, but wants it on their terms when it comes to security, control and other things.
But Staten points out this hope is in vain, given that developers just want to move forward writing code with minimal friction. As a result: "If you can’t meet [developers'] demands with a public cloud solution – forget about meeting their needs with a private cloud or other capabilities."
Developers want the agility the public cloud offers, and given that they're the kingmakers, as Redmonk analyst Stephen O'Grady highlights, resistance from operations is somewhat futile.
Nor is it all that helpful. After all, it's striking how similar the two groups are in terms of desired outcomes, as two recent developer surveys call out. The first survey (warning: PDF) of more than 700 IT professionals was commissioned by Puppet Labs, a server management start-up behind the hugely popular Puppet server configuration management tool. The second survey of over 4,000 IT professionals was sponsored by IBM. The two companies couldn't be more different, but the results are surprisingly similar.
Given that Puppet provides an IT automation tool, it's not surprising that 55 per cent of respondents identified Automation as the top expected benefit of embracing the DevOps approach, with 68 per cent ranking it in their top three responses. This is, after all, what they get paid to do.
But consider the IBM survey data. The IBM survey took a different slant on the topic, but found that business analytics is the most-adopted technology area among respondents, with a full 59 per cent embracing analytics to increase automation and 46 per cent citing it as a way to streamline processes.
It's very possible that these survey respondents reflect the same demographics. Puppet Labs is a new kid on the block, while IBM's hoary head has been around the industry for decades, but Puppet as a tool tends to be used by those with a sprawling infrastructure.
But that's kind of my point. The survey demographics skew toward traditional system administrators, yet these "old school" IT professionals want the same thing from their IT as "new school" cloud developers do: automation. They want more of their operational busy work taken care of for them so that they can focus on writing applications or other code.
Given this common goal, it's hard to see the public cloud not winning out over time. It may take IT more time to get comfortable with the public cloud than their more agile developer cousins, but it's going to happen for the reasons cited above and for a range of others. Private clouds may well prevail in the short term as enterprises seek a middle ground between slow paths to automation and fast paths to automation, but long term the public cloud will win.
Morgan Stanley Research pegs (warning: PDF) public cloud adoption as 50 per cent within the next three years. Given that enterprise IT and its developers have a common goal, I suspect the different time horizons ("now" versus "later") will iron themselves out and we'll see even more workloads moved to the public cloud than Morgan Stanley predicts. The need for development speed demands it.
Cloud Computing Now Makes It Easier and Cheaper to Innovate: Study - Forbes
Source - Forbes
Who doesn’t want innovation these days?  It’s the new mantra of organizations large and small as they attempt to navigate and get the upper hand in today’s hyper-competitive and unforgiving global economy.
But innovation is not cheap. It can be extremely risky, since a relatively small percentage of innovations actually deliver results in the end. The challenge is trying to figure out where to invest wisely, and which innovation is the potential winner. The natural reflex in the business world has been to avoid going overboard with innovation, since it means sinking considerable time and resources into ideas that don't get off the ground.
However, cloud computing technology may be clearing the way to turn formerly hidebound businesses into innovation factories.  That’s because it now offers a low-cost way to try and fail with new ideas. In essence, the price of failure has suddenly dropped through the floor. Failure has become an option.
A recent survey of 1,035 business and IT executives, along with 35 vendors, conducted by the London School of Economics and Accenture, has unearthed this new emerging role for cloud computing: as a platform for business innovation. Many people these days still see cloud within its information technology context, as a cheaper alternative for existing systems. But this may only be the first and most obvious benefit.
The study’s authors. Leslie Willcocks, Dr. Will Venters and Dr. Edgar Whitley — all of the London School of Economics and Political Science — identified three stages cloud computing moves into as it’s adopted by organizations:
1) Technology and operational changes. The one-for-one exchange of traditional applications and resources for those offered as services through the cloud, such as Software as a Service.
2) Business changes. Altering the way companies operate and serve customers, such as enabling faster service, faster time to market.
3) New ways of designing corporations themselves. “For especially forward-looking companies, cloud computing may provide a platform for radical innovation in business design—to the point where executives are actually provisioning and decommissioning parts of the business on an as-needed basis,” the study’s authors observe.
It’s in that third phase where things get really interesting. Cloud computing, the authors point out, enable companies to quickly acquire processing, storage or services as needs dictate. They can just as quickly shed those resources when a project is completed. As a result, companies with more advanced cloud sites are able to rapidly move through experimental or prototyping stages:
“Such a model supports “seed and grow” activities and faster prototyping of ideas. With traditional IT models, a decision to prototype a new system generally involves the procurement and installation of expensive hardware, with the associated checks and delays that conventional purchasing requires. Cloud provisioning, on the other hand, can be implemented rapidly and at low cost.”
An example, cited in the study, is an effort to innovate within processes and technologies related to sales support—for example, tracking contacts, managing and converting the sales pipeline, and generating revenue. Change would be difficult, if not impossible, for processes locked into traditional on-site IT systems. Consider the possibilities with cloud:
“The company could provision a combination of software as a service for sales, along with an enterprise system or financial management system. Sales personnel could have access to specialized sales support over the cloud. This ability to envision new combinations of cloud-based solutions and create new ways of performing end-to-end processes presents companies with new opportunities to be innovative in new-product development as well as in service and support.”
A couple of years back, Erik Brynjolfsson and Michael Schrage made similar observations about technology's promise to lower the costs and risk of innovation in an article in MIT Sloan Management Review. It's all about the power of online real-world simulations and samplings, which reduce the cost of testing new ideas to pennies. For example, with a Website, "companies can test out a new feature with a quick bit of programming and see how users respond. The change can then be replicated on billions of customer screens." This capability can be extended to supply chain management and customer relationship management systems as well.
Implementation of new ideas is blindingly fast, Brynjolfsson and Schrage stated. “When a company identifies a better process for screening new employees, the company can embed the process in its human-resource-management software and have thousands of locations implementing the new plan the next morning.” Brynjolfsson and Schrage also predicted that thanks to technology, many companies will shift from conducting two or three real-world experiments to 50 to 60 a year.
“Technology is transforming innovation at its core, allowing companies to test new ideas at speeds—and prices—that were unimaginable even a decade ago,” they said.  “Innovation initiatives that used to take months and megabucks to coordinate and launch can often be started in seconds for cents.”
We’ve already seen the impact of technology to shave tremendous time and costs in such areas as energy exploration and engineering. But now the ability to quickly test and deploy new innovations is available to all types of businesses. Add the ability to provision those workloads to on-demand cloud resources, and a huge weight — in cost and risk — has been lifted off innovation.
Elastic Load Balancer SSL Setup Guide - PEM encoded CSR
Amazon Elastic Load Balancers have the nice feature of being able to do SSL termination, but what does that mean? Usually when using SSL you have to configure each server with your SSL certificate installed to provide HTTPS. Using an Amazon ELB means that you only have to install your SSL certificate in one place and Amazon does everything for you. This also has the benefit of taking load off your servers by removing the overhead of SSL encryption.
Setting up SSL on an Amazon ELB is a bit different from configuring SSL in Apache and is a bit tricky, so I wanted to create a guide on how to do it. This guide assumes GoDaddy as the SSL provider, but it could be any SSL provider using a normal CSR process.
The reason this is different from a normal CSR process is that Amazon requires your keys to be PEM encoded instead of the format GoDaddy or other SSL providers normally generate.
I. Getting a GoDaddy SSL Certificate (Part I)
1. Purchase a GoDaddy Standard SSL Certificate
GoDaddy sells Standard SSL Certificates for anywhere from $12.99/year to $49.99/year.  I highly recommend you do a search for 'cheap ssl' on Google and see if there are any advertisements for discounted GoDaddy Standard SSL Certificates.  I was able to buy my certificate for $12.99/year this way.
2. After you complete your purchase, GoDaddy will give you a credit that you can trade for a certificate.  Make the trade and click on 'Manage Certificate' next to your new certificate.  This will bring you to a Credits control panel where we will click 'Request Certificate' next to your new certificate later on when we are ready to setup your certificate.
II. Creating a Certificate Signing Request (CSR)
Note: To create an SSL certificate, you must first generate and submit a Certificate Signing Request (CSR) to the Certification Authority (CA) (i.e. GoDaddy). The CSR contains your certificate-application information, including your public key. The CSR will also create your public/private key pair used for encrypting and decrypting secure transactions.
These instructions are based on generating the CSR on Ubuntu.
Steps to create a CSR:
1. Make a new directory to hold your project's SSL-related stuff.  It doesn't really matter where you put this, but I recommend putting this somewhere safe and should not be in your web directory.
mkdir ssl-cert
2. Move to your newly created folder
cd ssl-cert
3. Use OpenSSL to generate an RSA host key ('host.key') encrypted with triple DES, with a 2,048-bit key length (as required by GoDaddy). Triple DES applies DES three times and is more resistant to brute-force attacks because of its longer effective key length.
openssl genrsa -des3 -out host.key 2048
It will ask you for a pass phrase.  This should be a secret password.  Don't forget the pass phrase you set, as we will need it later.
4. Use OpenSSL to generate a new certificate signing request ('host.csr') using the host key we just created. This is what you'll be sending to GoDaddy to base your new SSL certificate on.
openssl req -new -key host.key -out host.csr
You will be prompted with a bunch of questions.  Answer all of them, except the last two 'extra' attributes are optional.  Here are example responses:
Country Name (2 letter code) [AU]:GB
State or Province Name (full name) [Some-State]:Warwickshire
Locality Name (eg, city) []:Leamington Spa
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Your Company Name
Organizational Unit Name (eg, section) []:secure.yourdomain.com
Common Name (eg, YOUR name) []:secure.yourdomain.com
Email Address []:[email protected]
Please enter the following ‘extra’ attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
It is very important you don't mistype anything here, as you can't change this information without buying a new SSL certificate. The 'Common Name' must be the hostname you are using for your domain; in this example the 'Organizational Unit Name' is set to the same value.
III. Getting a GoDaddy SSL Certificate (Part II)
1. Return to where we left off with GoDaddy.  You should have clicked 'Request Certificate' and see a form where you need to answer the following questions:
Where is your certificate going to be hosted? Third Party
Enter your Certificate Signing Request (CSR) below: [copy the text contents of 'host.csr' here]
Select your certificate issuing organization: GoDaddy
Is this certificate for Intel vPro? No
2. Verify everything and then click through to finish.  You should now be able to view and download your GoDaddy SSL certificate from GoDaddy from the 'Manage Certificates' section.
IV. Prepare the SSL Certificate for Amazon ELB
1. Download your new SSL certificate from GoDaddy's website into your 'ssl-cert' directory that we created in step I.  You will get two files from GoDaddy: 'secure.yourdomain.com.crt' and 'gd_bundle.crt'.  'secure.yourdomain.com.crt' is your new SSL certificate.  'gd_bundle.crt' contains the SSL issuing certificate chain back to the root SSL certificate.
2. Combine 'secure.yourdomain.com.crt' and 'host.key':
cat secure.yourdomain.com.crt host.key > host.pem
3. Remove pass phrase from the public key certificate (required by Amazon)
openssl rsa -in host.pem -out nopassphrase.pem
openssl x509 -in host.pem >>nopassphrase.pem
You will be asked for the pass phrase you set in step I.
4. Open 'nopassphrase.pem' in a text editor and delete the 'private key' section:
-----BEGIN RSA PRIVATE KEY-----
...
-----END RSA PRIVATE KEY-----
5. Combine 'gd_bundle.crt' and 'nopassphrase.pem':
cat nopassphrase.pem gd_bundle.crt > public.pem
'gd_bundle.crt' is a chain file that links your certificate to a original trusted host certificate that GoDaddy owns.
6. Remove pass phrase from the private key certificate (required by Amazon)
openssl rsa -in host.key -out private.key
You will be asked for the pass phrase you set in step I.
You might be asking yourself: What do all of these file extensions mean?  Well, here you go:
*.csr -- Certificate Signing Request used for submission to signing authorities that issue SSL certificates
*.crt -- Public key of a certificate (same as a *.pem file, but with different extension).  May include a chain of certificates back to the host certificate.  This is what you'll get from GoDaddy when you download a purchased certificate.
*.pem -- Public key of a certificate (same as a *.crt file, but with different extension).  May include a chain of certificates back to the host certificate.  This is what you'll get from GoDaddy when you download a purchased certificate. 
*.key -- Private key of a certificate
V. Configure on Amazon ELB
You are now ready to take the key files you have just generated and copy and paste their contents into your Elastic Load Balancer configuration. When applying SSL you need to create a new ELB. When setting up the load balancer you can keep your server config simple by having the ELB pass the decrypted HTTPS traffic to port 80 on your servers.
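If you prefer to script it, a hedged sketch with the modern AWS CLI looks like the following (the names, availability zone and the account ID inside the certificate ARN are placeholders):

aws iam upload-server-certificate --server-certificate-name mysite-cert \
    --certificate-body file://nopassphrase.pem \
    --certificate-chain file://gd_bundle.crt \
    --private-key file://private.key
aws elb create-load-balancer --load-balancer-name mysite-ssl \
    --availability-zones eu-west-1a \
    --listeners "Protocol=HTTPS,LoadBalancerPort=443,InstanceProtocol=HTTP,InstancePort=80,SSLCertificateId=arn:aws:iam::123456789012:server-certificate/mysite-cert"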
That's it - you now have HTTPS/SSL on your site.
$1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
Post on Ars Technica about a 30,000 core cluster running on Amazon EC2! - $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud.
Amazon EC2 and other cloud services are expanding the market for high-performance computing. Without access to a national lab or a supercomputer in your own data center, cloud computing lets businesses spin up temporary clusters at will and stop paying for them as soon as the computing needs are met.
A vendor called Cycle Computing is on a mission to demonstrate the potential of Amazon’s cloud by building increasingly large clusters on the Elastic Compute Cloud. Even with Amazon, building a cluster takes some work, but Cycle combines several technologies to ease the process and recently used them to create a 30,000-core cluster running CentOS Linux.
The cluster, announced publicly this week, was created for an unnamed “Top 5 Pharma” customer, and ran for about seven hours at the end of July at a peak cost of $1,279 per hour, including the fees to Amazon and Cycle Computing. The details are impressive: 3,809 compute instances, each with eight cores and 7GB of RAM, for a total of 30,472 cores, 26.7TB of RAM and 2PB (petabytes) of disk space. Security was ensured with HTTPS, SSH and 256-bit AES encryption, and the cluster ran across data centers in three Amazon regions in the United States and Europe. The cluster was dubbed “Nekomata.”
Spreading the cluster across multiple continents was done partly for disaster recovery purposes, and also to guarantee that 30,000 cores could be provisioned. “We thought it would improve our probability of success if we spread it out,” Cycle Computing’s Dave Powers, manager of product engineering, told Ars. “Nobody really knows how many instances you can get at any one time from any one [Amazon] region.”
Amazon offers its own special cluster compute instances, at a higher cost than regular-sized virtual machines. These cluster instances provide 10 Gigabit Ethernet networking along with greater CPU and memory, but they weren’t necessary to build the Cycle Computing cluster.
The pharmaceutical company’s job, related to molecular modeling, was “embarrassingly parallel” so a fast interconnect wasn’t crucial. To further reduce costs, Cycle took advantage of Amazon’s low-price “spot instances.” To manage the cluster, Cycle Computing used its own management software as well as the Condor High-Throughput Computing software and Chef, an open source systems integration framework.
Cycle demonstrated the power of the Amazon cloud earlier this year with a 10,000-core cluster built for a smaller pharma firm called Genentech. Now, 10,000 cores is a relatively easy task, says Powers. “We think we’ve mastered the small-scale environments,” he said. 30,000 cores isn’t the end game, either. Going forward, Cycle plans bigger, more complicated clusters, perhaps ones that will require Amazon’s special cluster compute instances.
The 30,000-core cluster may or may not be the biggest one run on EC2. Amazon isn’t saying.
“I can’t share specific customer details, but can tell you that we do have businesses of all sizes running large-scale, high-performance computing workloads on AWS [Amazon Web Services], including distributed clusters like the Cycle Computing 30,000 core cluster to tightly-coupled clusters often used for science and engineering applications such as computational fluid dynamics and molecular dynamics simulation,” an Amazon spokesperson told Ars.
Amazon itself actually built a supercomputer on its own cloud that made it onto the list of the world’s Top 500 supercomputers. With 7,000 cores, the Amazon cluster ranked number 232 in the world last November with speeds of 41.82 teraflops, falling to number 451 in June of this year. So far, Cycle Computing hasn’t run the Linpack benchmark to determine the speed of its clusters relative to Top 500 sites.
But Cycle’s work is impressive no matter how you measure it. The job performed for the unnamed pharma company “would take well over a week for them to run internally,” Powers says. In the end, the cluster performed the equivalent of 10.9 “compute years of work.”
The task of managing such large cloud-based clusters forced Cycle to step up its own game, with a new plug-in for Chef the company calls Grill.
“There is no way that any mere human could keep track of all of the moving parts on a cluster of this scale,” Cycle wrote in a blog post. “At Cycle, we’ve always been fans of extreme IT automation, but we needed to take this to the next level in order to monitor and manage every instance, volume, daemon, job, and so on in order for Nekomata to be an efficient 30,000 core tool instead of a big shiny on-demand paperweight.”
But problems did arise during the 30,000-core run.
“You can be sure that when you run at massive scale, you are bound to run into some unexpected gotchas,” Cycle notes. “In our case, one of the gotchas included such things as running out of file descriptors on the license server. In hindsight, we should have anticipated this would be an issue, but we didn’t find that in our prelaunch testing, because we didn’t test at full scale. We were able to quickly recover from this bump and keep moving along with the workload with minimal impact. The license server was able to keep up very nicely with this workload once we increased the number of file descriptors.”
Cycle also hit a speed bump related to volume and byte limits on Amazon’s Elastic Block Store volumes. But the company is already planning bigger and better things.
“We already have our next use-case identified and will be turning up the scale a bit more with the next run,” the company says. But ultimately, “it’s not about core counts or terabytes of RAM or petabytes of data. Rather, it’s about how we are helping to transform how science is done.”
Monitor Your Website For Free
If you need to monitor your website and do not want to pay for it, there are a few different services you can use for free. These will allow you to monitor one website for free with email alerts; if you want SMS alerts, however, you are going to have to pay.
Combine this with the Amazon free tier for the first year and you can have your own basic platform for nothing.
List of companies providing free monitoring
Service Up Time
Site Up Time
I have personally tried Service Up Time on the free option and it worked just fine.
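If you would rather roll your own basic check instead of (or alongside) these services, a minimal sketch is a cron job that curls the site and emails on failure. The URL and address are placeholders, and it assumes a working local mail setup with the `mail` utility installed.

#!/bin/sh
URL="https://www.example.com/"
if ! curl -fsS --max-time 10 -o /dev/null "$URL"; then
    echo "$URL failed its health check at $(date)" | mail -s "Site down: $URL" [email protected]
fi
# crontab entry to run it every five minutes:
# */5 * * * * /usr/local/bin/site-check.sh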
Ylastic - New Release
Ylastic released a new version of their great Amazon AWS management tool today. Here is what was released:
- Updated/reconfigured navigation menu to better organize items and to make space for upcoming new stuff.
- Monitoring for SQS queues.
- Monitoring for SNS topics.
- Added sparkline graphs for queues, topics, and ELB.
- You can view CloudWatch charts for queues, topics, and ELB from the resource page itself.
- New metrics page that lets you view CloudWatch charts for any resource that has CloudWatch metrics.
The best part of the release was the new layout which is a great improvement for the tool.
Ylastic - keep up the good work!
Building LAMP stack Architecture on Amazon AWS - Part 1 - The Basic Setup - FREE!
So, I want to write a series of guides on how to build a LAMP stack using Amazon AWS.
To start off, I will go through how to set up the most basic of stacks using EC2 on the free Amazon tier.
Step 1 - Login to the Amazon AWS console
First of all we need to start an instance. To do this, log in to the Amazon AWS console and go to the Elastic Compute Cloud (EC2) option - you can access it here: https://console.aws.amazon.com/ec2/home
Step 2 - Select the right region
Now, the region is the physical Amazon data centre where your instance will run from, and this should be nearest to where your customers are located. For this I will be using EU, which is actually located in Ireland. You can select the region on the left hand side.
Step 3 - Choosing your type of instance / operating system
Select the Launch Instance button and you will be presented with a list of Amazon preferred instances. I personally prefer to use Ubuntu instances when I need Linux, and these are not listed by default. To work out which Ubuntu AMI ID you need, go to the following page - http://cloud-images.ubuntu.com/releases/11.04/release/
You will now be presented with a list of AMI IDs. Your region and the type of instance you need determine which AMI ID you use. For this tutorial we will be using a 32-bit EBS instance in the EU region, which gives the AMI ID ami-359ea941.
What is even better is that Amazon will let you run 1 micro instance for free for a whole year!
The reason we want to use EBS-backed instances is that they are really easy to back up and image once you are using them. We will come to this in another post.
Step 4 - Starting your instance
Now take your chosen AMI ID ami-359ea941 and enter it into Amazon AWS
Select the "Next" button
On the "Instance Details Screen" select the micro option under "Instance Type" and select "Continue"
On the next screen leave all the options to default and select "Continue" and "Continue" on the next.
At this point you will need to create a key pair - this is so you can securely connect to your instance when it has started.
Enter the name you want for your key pair and select "Create & Download your key pair". Make sure you save your .pem file somewhere safe as you will need it to connect to any of your instances in the future.
On the Configure Firewall screen you select the default security group and select continue.
You will now be presented with your settings for review
Select the "Launch" button
It will now take a few minutes for the instance to start - you can monitor the progress on the dashboard here - https://console.aws.amazon.com/ec2/home?region=eu-west-1#s=Instances
Once your instance has finished starting it will move to the "Running" state on the dashboard.
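The same launch can also be scripted. Here is a hedged CLI equivalent of the steps above (the key pair name is a placeholder, and the modern AWS CLI is assumed rather than the console walk-through this guide uses):

aws ec2 run-instances --region eu-west-1 --image-id ami-359ea941 --instance-type t1.micro --key-name my-key-pair
aws ec2 describe-instances --region eu-west-1 --query 'Reservations[].Instances[].[InstanceId,State.Name,PublicDnsName]' --output table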
Step 5 - Allowing access to your instance
Now that you have your instance running, you need to configure the Security Groups to allow you access.
Select the "Security Groups" from the left hand side under the "Network & Security" section
Assuming that you want to access your instance via the web and also connect to it over SSH, you need to set up the default group in the following way.
Specifying a source of 0.0.0.0/0 allows anyone anywhere to connect on the specified port. You could restrict access, say for SSH, to the IP address of your office or home to improve security.
Port 80 is used for web traffic and port 22 allows you to SSH to your server.
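If you would rather script these rules than click through the console, something along these lines should do the same job (a sketch, assuming the AWS command line tools and the default security group; tighten the CIDR ranges where you can):

# Allow SSH (port 22) from anywhere - restrict the CIDR to your own IP if possible
aws ec2 authorize-security-group-ingress --region eu-west-1 --group-name default --protocol tcp --port 22 --cidr 0.0.0.0/0
# Allow web traffic (port 80) from anywhere
aws ec2 authorize-security-group-ingress --region eu-west-1 --group-name default --protocol tcp --port 80 --cidr 0.0.0.0/0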
Your instance is now accessible on the internet.
Step 6 - Setting up LAMP on your instance
Now you need to connect to your instance using SSH and the key pair you downloaded a little while ago
First of all you need to know the public DNS name of your server. Go back to the "Instances" tab in AWS and select the instance you just created; in the details at the bottom of the screen you will see the "Public DNS" setting. Copy the DNS entry.
On Mac/Linux/Unix open a terminal and type in:
chmod 600 <my-security.pem> - this sets the file permissions on the .pem file so that SSH will accept the key (SSH refuses private keys that are readable by other users)
To connect to your instance type the following:
ssh -i <my-security.pem> ubuntu@<public-DNS-name>
On Windows you will need to use something like PuTTY and also create a .ppk file from your .pem key; I have created a guide here
Step 7 - Install LAMP on your instance
Once logged in you need to type the following at the command line
sudo tasksel install lamp-server
You will be prompted to enter a MySQL root password during the install.
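Once the install finishes it is worth a quick sanity check that both services are up before moving on (a minimal check, assuming the default Ubuntu service names):

# Check Apache is running
sudo service apache2 status
# Check you can log in to MySQL with the password you just set
mysql -u root -p -e "SHOW DATABASES;"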
Step 8 - Test It
You should now be able to connect to the Apache server on your new EC2 instance by opening a browser and typing in your public DNS name; you should see the default Apache "It works!" page.
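You can also test it from a terminal with curl (the DNS name below is a placeholder - use your own Public DNS value):

curl -I http://ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com/
# Look for "HTTP/1.1 200 OK" and a "Server: Apache" header in the response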
And that's it! You have just created your first LAMP server on Amazon EC2... and it's free for the first year!
Below is some of the terminology I have used in this post to help you.
I will be doing subsequent posts on more advanced topics as this is very basic
Terminology
Instance - This is a server in the traditional hosting sense
AMI - Amazon Machine Image, these are prebuilt images of instances. Lots of people and companies build predefined images with packages pre-installed to save you time during setup
EBS - Elastic Block Store, this is basically like a USB hard drive whereby you can attach additional persistent storage to your instance. You can also run the whole instance on EBS, which means it's really easy to run backups/images of your instance
Key Pairs - Instead of using usernames and passwords to connect to your servers, Amazon uses public/private key pairs to control access to your resources
Security Groups - These are firewall rules that allow access to and between your Amazon AWS components
4 notes · View notes
cloudarch-blog · 13 years
Text
Create PPK file from PEM - Using Putty to connect to EC2
By default Amazon provides a PEM file to connect to AWS components. These .pem files will not work on Windows with the PuTTY client, which is the most widely used SSH client there. To be able to connect to Amazon we need to convert the .pem file to a .ppk file as follows:
Download puttygen from the following location - http://the.earth.li/~sgtatham/putty/latest/x86/puttygen.exe
Run the executable and the PuTTY Key Generator window pops up. To convert an Amazon key to a PuTTY key, use the menu option Conversions → Import key. Load the .pem file that you downloaded and press the Save private key button.
It will warn you about leaving the passphrase blank. That’s ok.
Save the file to the location that PuTTY has been configured to look in for its keys.
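If you happen to be on Linux or a Mac and just need a .ppk for later use on Windows, the command-line version of puttygen (from the putty-tools package) can do the same conversion - a sketch with hypothetical file names:

# Convert an Amazon .pem key into a PuTTY .ppk key
puttygen my-security.pem -o my-security.ppk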
Now you can connect to your EC2 instances using the .ppk file you just created.
0 notes
cloudarch-blog · 13 years
Text
Another Amazon Outage - Lightning Strike
Finding out all the services and sites I manage are down.
Sunday night UK time I received alerts from IP Patrol (the monitoring service I use) saying that multiple services were not responding. I opened up the laptop and logged in to Ylastic (the service I use for managing Amazon AWS) to see what was going on. First off I was receiving errors when trying to view my instances - is this a problem with Ylastic? Probably not, so I logged into the Amazon AWS console and also received errors... oh dear, this is not good. Next stop was the Amazon AWS status page - http://status.aws.amazon.com/ - which was reporting all services as operating normally. Lies, I tell you. So where could I find out what was really going on?
Twitter comes to the rescue
Logging in to Twitter and doing a search for #AWS or #EC2 confirmed that Amazon AWS was having major problems in the EU region. People were reporting that they were unable to log in to the management console, use the management APIs or access their servers. It took about another 45 minutes for Amazon to even say that they were looking into reported issues; this is a very long time when your business-critical site is down. In these situations Twitter is great, as you get a good idea of which parts are working and which are not. In particular, companies like Ylastic (@Ylastic) were giving great updates as they have a solid user base to report from.
So what happened? After an hour or so, updates started to appear on the AWS status dashboard saying that a power substation had been hit by lightning, causing some significant outages. Amazon still did not seem to know which parts were affected, or be able to tell its users what to do.
Getting my services back online
Eventually I was able to access the Amazon management console and APIs again and see the status of my systems. I was able to access most of my instances, and due to my N+1 setup (running a minimum of 2 of every type of server) web traffic was getting through - some instances could not be contacted but others could. However, all my RDS MySQL instances were running but reported that they had no data, so my application servers could not process any requests. This is pretty serious, as our entire customer data is held there. I have snapshots enabled on the RDS instances, which take a continuous backup plus a snapshot every night at midnight, so the data is held in three different places. Usually I would use a snapshot to restore the data, but these are linked to the running RDS instance and so could not be used. I had to resort to rolling back to the previous night's snapshot, which meant losing 18 hours worth of data! This is pretty horrific and very embarrassing in front of my customers. Amazon still had failed availability zones, so getting my RDS back online meant starting one in each AZ (three in total) and seeing which one worked. By 11pm on Sunday night I had restored services to the sites I manage.
All in all not a great evening, and it has made me nervous about using Amazon - but perhaps that is not a bad thing. It is easy to get complacent with services and systems after they have been working well for a while, and what matters is how you deal with these failures. It could have been a lot worse: I could have lost all the data for the service completely.
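For reference, the snapshot roll-back itself is scriptable once the APIs are reachable again - a sketch using the modern AWS command line tools, with hypothetical identifiers:

# Create a new RDS instance from a named snapshot (both identifiers are placeholders)
aws rds restore-db-instance-from-db-snapshot --region eu-west-1 --db-instance-identifier mydb-restored --db-snapshot-identifier mydb-snapshot-2011-08-06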
Lessons to learn – Offsite Backups, Disaster Recovery and Business Continuity
Off-site backups
Straight away I have implemented off-site data backups; this means automatically transferring the database backups to a system and location outside the Amazon AWS platform. In the event of Amazon completely going offline and losing everything, I could restore the service on a completely different hosting provider. It would take a while to build the systems, recover the data and switch DNS, but it could be done and would mean the service I manage could continue to operate without losing everything.
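The mechanics are straightforward - something along these lines, run nightly from cron on a machine outside AWS, would do it (a minimal sketch; the host name, user, password and paths are all hypothetical):

# Pull a compressed dump of all databases over SSH and keep a dated copy off-site
ssh ubuntu@my-app-server.example.com "mysqldump --all-databases -u backup -p'secret' | gzip" > /backups/offsite-mysql-$(date +%F).sql.gz
# Example crontab entry to run it at 2am every night:
# 0 2 * * * /usr/local/bin/offsite-backup.sh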
Disaster Recovery
Using off-site backups is a necessary first step, but restoring from them takes time and right now I am not sure exactly how long. The next step is to work out exactly how to implement Disaster Recovery. This will involve sourcing external hosting providers, then implementing, testing and writing up a process, so that a pre-determined sequence of events can be followed and the business knows how long recovery could take.
Business Continuity
Implementing a Disaster Recovery process is a must but activating this could take anything from 6 hours upwards depending on the complexity of the systems, size of the data to restore and reliance on DNS propagation. So to make recovery from a catastrophic failure easier and quicker it is possible to implement hot or cold standby systems in a separate hosting environment.
Hot failover – When a system or group of systems fails, service is automatically redirected to a different set of hardware. Effectively seamless to the end user.
Cold failover – A separate set of systems is on standby, or can be started easily, with data and configurations manually transferred over to them. This does cause downtime for users, but it should be minimal.
The downside to these is cost, both in the time to implement them and in having standby systems in place for the eventuality of your main set of hardware failing. Hot failover is a lot more costly to implement and operate, as you need real-time replication of data between your live and standby systems.
Geographical Load Balancing
There is also the option of geographical load balancing, which is similar to hot failover in the sense that real-time replication occurs, but both sets of hardware are used all the time. This is even more complicated and expensive to operate, but it is the best solution.
Next Steps
Implementing these systems will take time, and as I progress I will update the blog with how I have tackled the issues.
Summary
As of right now, Amazon AWS EU still has issues 48 hours after the initial event. This is particularly concerning given that it was only a few months ago that they had a major outage in the US. They will have to do some reputation management so that their users do not lose confidence in the service. It also gives more ammunition to the cloud haters.
Even though I thought I had taken all the necessary steps to safeguard the systems and data - including having the database data in three places, images of systems, and DNS ready to switch over with a 5 minute TTL - I had not planned for Amazon completely going down and losing my data.
Remember: do not become complacent, and always plan for the worst happening, because eventually it will.
0 notes
cloudarch-blog · 13 years
Text
Cloudflare – Magic caching and always on for free?
Cloudflare is getting a lot of good press lately about their service.
 So what does it do?
Cloudflare sits its Content Delivery Network (CDN) in front of your website. By doing this it performs a number of functions, including blocking security threats, keeping your website online and caching, to name just a few.
How do they do this?
Well, Cloudflare started out trying to be a security tool: when a nasty bot or crawler accessed your website, Cloudflare would step in and block it. But by doing this Cloudflare slowed down your site, which was not a good sales pitch. So the clever people at Cloudflare went about building a caching layer, which meant your website was no longer slowed down and was probably even made quicker. What started out with one aim in mind changed into something else that is maybe the best feature of the service. They even provide the basic service that covers all these features for free!
Too good to be true? Maybe.
I have been testing out Cloudflare on a couple of smaller-volume sites I manage to see how good the service is. One of these sites runs WordPress and the other runs SilverStripe, both of which are popular CMS website applications. I have mixed views on how it has gone so far, due to the following:
Always Up - One of the stated claims is keeping your site up even when your servers go down – this did not happen. I tested both sites by leaving Apache running but making it so that no pages were returned. My site went down! So zero marks for Cloudflare here. This could be because, by the nature of a CMS, the site is dynamically generated on page load, so maybe Cloudflare wasn't able to serve it from cache. I did install the WordPress plugin built by Cloudflare that was supposed to optimise the site, too.
Cloudflare Uptime – Last week I received a monitoring alert saying that one of the sites behind Cloudflare was down, and indeed it was. I accessed the site on direct.mysitename.com, which bypasses the Cloudflare CDN, and the site was working. Five minutes later the site was back online; it looked like Cloudflare had a blip with their service that took my site down. There were no reports on their site, and their status dashboard said all was OK?
The good bits
Dashboard
This is a nice feature – you are able to see page views, which gives you another source of analytics instead of just Google. Cloudflare analytics should be, and they say are, more accurate than Google Analytics; the reason being that GA relies on JavaScript to fire off stats, which can be problematic. When using Cloudflare all requests have to go through their servers, so they are not reliant on JavaScript.
Threat Detections
Without having to install some clever software and keep it updated for the latest exploits, Cloudflare detects and stops nasty traffic. It is an eye-opener when you see the figures for how many attacks have been stopped. I am really impressed by this.
Caching
Perhaps the best feature is the caching. All static assets (e.g. images, CSS, JS) are cached on the Cloudflare CDN, which means your server does not have to serve them. That means less load on your system and less bandwidth being used, which equals faster and cheaper for you. Cloudflare saved well over 50% of the bandwidth and requests to my server!
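You can see the caching working for yourself by inspecting the response headers on a static asset (a quick check with curl; the URL is hypothetical):

curl -I http://www.mysitename.com/images/logo.png
# Responses served through Cloudflare include a CF-RAY header, and cached assets
# typically show CF-Cache-Status: HIT once they have been requested a few times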
Switching your site to Cloudflare
First of all, visit their site at www.cloudflare.com and create an account. Next you type in your URL and Cloudflare will scan your current DNS settings; it will then display these and ask you to confirm all is correct. The last step is changing your name servers with your current hosting provider, and that is it. Sit back and wait. The last site I switched over to Cloudflare was hosted with GoDaddy and it took just over an hour to change over.
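To check when the name server change has actually taken effect, you can query public DNS for your domain (the domain below is a placeholder):

dig NS mysitename.com +short
# Once the change has propagated you should see Cloudflare name servers listed,
# which look something like xxx.ns.cloudflare.com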
Conclusions
I have been using Cloudflare for a few weeks now and I am not completely sold or confident with it yet. I will continue to use it on the lower volume sites I manage but will not be putting the higher volume sites over to Cloudflare just yet. I want to get some more confidence in the service and due to the couple of issues I have had this will take time.
Once you have switched to Cloudflare you are completely reliant on their availability: if they go down, you go down. Cloudflare has to be bulletproof, with no outages, issues or negative performance impacts, if it is going to succeed. It is a great idea and I hope they make it work; if they have no issues then they will get all my sites running through their service.
Cloudflare - www.cloudflare.com
0 notes