A maelstrom of network, systems, and software engineering.
Cisco Live 2017 - A new experience even for the 9th time
Cisco Live US (Las Vegas) has come and gone - quite honestly, this year was a very different experience for me. First, I didn't know I was going until 2 weeks before the conference - which led to many of the challenges I'll describe below. Second, my wife came to the conference on a social pass for the very first time (her first Cisco Live and her first trip to Las Vegas). Third, unlike my previous eight Cisco Live events (starting with "Party Like It's 1989" in 2009), I did not attend as a customer. I was not weighed down with a significant training, certification, and vendor meet-and-greet checklist that had no hope of getting done in the conference's 5 days.
Late but great - learning side of CLUS 2017
In previous years, I have registered as early as 9 months before the conference began. NetVet status secured. Early access to scheduling sessions. I never had to worry about not getting a session I wanted - instead, it was the usual multi-week drama in March or April of trying to select one of the 5 classes (all of which I absolutely wanted to attend) because each one had only a single offering, all of course on the same day and at the same time.
No - this year was quite the opposite. As any Cisco Live late scheduler knows, it doesn't take long for the popular sessions to fill up and for the seemingly futile "wait list" game to begin. Registering two weeks before the conference translates into playing that game for just about every session. So a different tactic for identifying and attending sessions was in order.
The short, short version (TL;DR, even though that tag is sooo 2016) - unlike previous years of shotgun selection across various topics (based on in-flight or near-term projects), I opted for laser focus on two subject areas (VXLAN fabrics and SD-Access, if you're curious). Since scheduling sessions met with frequent "session is full" roadblocks, nearly everything of interest ended up as a "favorite". The scheduler calendar view can't handle that many favorites, though - even if there aren't many actual scheduled sessions.
However, you can get all scheduled sessions and favorites together in one view by printing out your schedule. It comes out in agenda format instead of calendar style, but it is easily referenced - and is complete with room assignments. From there, the last task: rank the favorites so you know what your priorities are.
The first casualty of war is the battle plan
So, the plan of attack focused on one aspect of scheduling sessions that is communicated (by Cisco Live) but isn't really fully appreciated (by attendees): just because you scheduled the session doesn't guarantee you a seat. Many an attendee has been heard (or seen tweeting) complaining about not being given access to a session even though they had it scheduled. You see - 5 minutes BEFORE the session starts, anyone not scheduled but queued up will be allowed into the session. Once fire code room occupancy is reached, no session for you!
This is where my printout of favorites came into play - by knowing which of my favorite sessions were occurring and in which rooms, I could very easily scout each location ahead of time to get an idea of the room size and the interest in the session (queue size).
Now, regarding those complaints of being scheduled but not having a seat: in full disclosure, there were several sessions where the "wait list" line was admitted well before 5 minutes prior - leading to some legitimate grievances. However, for the most part, the 5 minute rule was honored - I know, because I was in many of those lines!
My strategy worked extremely well. With one exception, I got into every session for which I had registered or was marked as first favorite.
Phenomenal cosmic content, itty bitty little living space
Which brings me to that one exception and an area that the conference must simply do better at - room size selection as a function of the subject matter. I am relatively confident there is no reasonably accurate crystal ball which can properly anticipate subject interest (as a function of subject topic, attendees and other concurrently scheduled sessions) to then properly match the room size.
But - as Lee Corso is fond of saying - not so fast. The one, central (technical) theme at the conference (and arguably the most important) was "The Network. Intuitive" - around which a new platform of switching hardware and Software Defined Access was launched. Not surprisingly, there were sessions that covered multiple aspects of this new message and platform.
And, in an encouraging sign for Cisco leadership, every single session related to the new Catalyst 9K or to SDA applied to "X" (wireless, for example) was full (from what I heard of the ones I didn't try) - and understandably so, as there should be a large amount of interest in a new launch.
However, the "sold out" nature of those sessions needs a bit more context (such as when a substance or lifestyle behavior doubles your risk of cancer... from 1 in a billion to 2 in a billion). My limited statistics, personal experience noticed that each SDA/Cat9K session for which I was interested was allotted a smaller room size than other topics. As an example, and I'm horrible at head count estimates but, an overview of building VXLAN fabrics (2 yr old topic) was being held a room for 1000 people but was less than 50% attended... while the banner launch material for the new Cat9K and SD-Access were held in rooms for 150-200 people.
And, keep in mind, while the public announcement of the Catalyst 9000 was the week prior to Cisco Live, there were internal and partner launches prior to that. I’m sure the public session catalog couldn't say anything until the week before but otherwise it could not have been a surprise what the new launch and message at Cisco Live was going to be.
In short, the takeaway for the last-minute scheduler is this: the approach worked, and worked well, for me. Keep in mind there are high-demand subjects (like new launches at the conference) that can divert you, so just remain flexible - and that's where having many pre-prioritized "favorite" sessions in each time slot helps immensely.
The remains of the day
It was another record-setting year for the U.S. edition of Cisco Live - early estimates of 28,000 attendees, use of multiple Vegas venues (for better or worse) to support the expanded content and activities, etc. Despite my frantic run-up before the conference started, I had a much more laid-back experience attending this year - simply because I had a narrow session focus and an open (social) engagement agenda.
Everyone should keep in mind that any conference will have logistical challenges arise. The real measure of a conference hangs on the response during the event and, after the event, whether the issues were preventable and how they will be prevented next year.
The latter requires awareness (conference surveys and Cisco Live blog reviews), so make your opinion known! I can honestly tell you that Cisco is survey driven (to a fault sometimes!) and they do listen. Just look at Justin Cohen's blog about CLUS 2016 meals for further proof - the egg and cheese options this year hit the spot!
As for "live response" during the conference, the conference does a great job making important information known through social media - and for that, we have Kathleen Mudge (@KathleenMudge) and the Social Media team to thank!
For example, when lunch meals ran out on Thursday, many folks left (upset, of course) and went to find their own lunch on their own dime. Less than 15 minutes after event staff started turning people away from the lunch hall, the Social Media team was announcing that lunch vouchers were being issued - a $20 tweet right there.
Not sure when seating for the keynotes was opening up? Tweet the question to @CiscoLive to find out... or, if you follow them (with notifications turned on), they were pro-actively posting that information.
Or, received that awesome #DEVNET solar 8000 mAh charger? How about a reminder not to pack it in your luggage?
Better yet, unexpected punny banter with the team that just makes you laugh.
Wrap up
As I started off saying, this was my ninth straight Cisco Live. I am a Cisco Live champion/evangelist - from technology training for engineers at all stages of their careers, to learning about ecosystem products, to #DEVNET, to engaging fellow engineers - whether at meal time, at receptions, or in the Social Media Hub lounge.
While I think there are real growing challenges the conference is experiencing, it is worth the effort and the expense to get there. The experience is definitely worth it - even after 9 straight conferences.
Disclosure
These thoughts, observations, and opinions are mine and mine alone. No one asked me to write them or publish them. As I said, I am passionate about Cisco Live and love personally writing about it to help people enjoy it more fully. If you have trouble believing that, feel free to check out my previous blog posts about Cisco Live.
That being said, I am now a Cisco employee (Virtual Systems Engineer, Data Center) and am completely unaffiliated with the conference planning and execution. These are my words and not the words of Cisco.
Assuring your network will do what you think it will do
TL;DR
"Sweet! I'd love to be able to simulate a change on the whole network!"
"*All* paths from source IP to destination IP? Even ECMP?"
"Wait, they aren't running VMs/containers of each device to simulate all this?"
"All of this can run on site, with one virtual appliance for the modeling engine?"
Cloaking Device Disabled
Forward Networks came out of stealth as a Silicon Valley startup company on Monday, 14 November 2016. So, if you haven't heard of them, do not feel bad. Their initial press release, like many new product announcements, says a great deal and makes a number of claims. But, I didn't really know what to expect.
Their exit coincided with their presentation to the Network Field Day 13 event (http://techfieldday.com/event/nfd13). In that presentation, they provided very detailed demonstrations about the product - showing what you can do and how it can make your operations much more streamlined. Side note: if you love demonstrations and engineers interrogating engineers about product features, you are going to love the Forward Networks videos. This was a product in which practically every delegate had some interest.
Network Assurance
The Forward Platform is described as a network assurance product - but what does that mean? Even their website was a little vague on the matter. Simply put, they break the concept of network assurance down into two specific categories: correctness and performance.
The latter (performance), they did not focus on (today) although they did mention customers are definitely interested in how current performance impacts network operations. As an example, performance assurance would answer the question "With my current traffic loads, will my connectivity to the DR site be able to handle a failover event?"
The former category - correctness - is very much the focus of their initial product launch. Given the business intent, are all the network devices currently configured and operating to support that intent? For example, does web traffic from the Internet pass through my load balancers and firewalls to reach my web servers? Even if I take down one load balancer? Or firewall? Or both?
The most intuitive aspect of the software to me was defining the business intent. Business intent, as they talk about it, means posing questions to the Forward Platform. Back at the office, this really translates to test cases - such as monitoring rules you set up. The types of monitoring you would put into SolarWinds or Nagios, or the filters you would build in Wireshark, are the same concepts/actions you use to define intent.
Except, you can also validate prohibited traffic - no DHCP traffic should reach the web server. Web traffic should not reach the database server. Etc.
So, upon first deployment, you might have to allocate some time to defining as much (or as little) of your business intent as you want. For a few scenarios in the demonstrations, you will find that the web interface is very lightweight and responsive - the session video (https://www.youtube.com/watch?v=Zg0u9a4ZW7Q&t=11s) at Tech Field Day's YouTube channel has an excellent walkthrough of how to create these checks.
Some highlights: the pre-defined checks range from the ones we all want (traffic from client can reach server) to some you wouldn't necessarily think of (traffic flows on all links in an etherchannel). Those look like this in the web interface:
Worried that defining all those checks for your extremely large network could be daunting? Yes, Virginia - there is a Santa Claus, and there is an API for the Forward Platform.
Searching Your Network
Those intents and alerts that can be defined fall under the "Verify" aspect of the Forward Platform. Another aspect is Search - one that is immensely useful for your operations desk. Users complaining that they can't get to the internal portal? (Cue announcer voice) Reduce your "mean time to innocence" with Search (End announcer voice).
Seriously, you can very quickly determine whether the network has a role in the issue by:
Getting the latest snapshot from your network - which collects the current configuration and running state from each device
Using the search function to show the path between the users and the service
This capture from Brandon Heller's demonstration (https://youtu.be/__iaT7WQ41w?t=4m36s) really shows it all. With Google-like adaptive search terms, you quickly get the current, live paths involved for the source and destination you are investigating:
As we discovered during questioning, this newly released product currently understands and models the physical "underlay" transport, including VRFs - however, virtualization switches (e.g., VMware) and overlays such as VXLAN are coming soon. So, you might not see the complete "virtual last mile", as I call it, but you will definitely get to the physical server hosting the VM in the vSwitch case.
Modeling versus Simulation
Those two capabilities are exceptional technologies that really do help operations teams with extremely complex networks - especially ones that can't be simulated using virtual appliances in such environments as GNS3 or VIRL. As more overlays and SDN are introduced into the data center, even simulating a relevant subset of the network is becoming computationally difficult.
That's where the Forward Networks approach of modeling excels over the virtual appliance simulation approach. By reducing your network to a data structure that can be modeled effectively and efficiently, they can ingest your entire network across a range of vendors (Cisco, Juniper, and Arista) and devices (switches, routers, load balancers, firewalls, etc.) to then perform those search and verify functions.
Prediction
As a network architect, though, the prediction capabilities of the platform were the most exciting portions of the demonstration. Ever wonder if this configuration change is going to break anything directly? That's usually easy to know ahead of time in simple networks.
How about whether your change will interrupt service on a different part of the network? What about a change someone made that broke redundancy for some of your VLANs but not others?
Because the Forward Platform has the complete model of your network encoded into their data structures, they can also predict the behavior of your network under different configurations. Ultimately, they would love to be able to predict based on current conditions as well - but more on that later.
In my experience, most shops of any size have a change control process in place - one that, in the ideal ITIL sense, would correctly state the actual impact to the production environment so that the risk, value, and timing of the change could be intelligently discussed. Without a tool such as the Forward Platform, you can say what you *think* will happen based on what you might know - and maybe even leverage actual configurations to provide a sound foundation for that analysis.
How many times have we been burned by "wait - it shouldn't have done that?" or "who the heck put that configuration in there?"? In my opinion, this is where the serious value can be derived from this platform - the Forward Platform can pull the entire network configuration and run states on demand prior to simulating the change. There does not have to be a discrepancy between a stale configuration and what you are testing against.
A demonstration of the Forward Platform prediction software can be seen at this part of the product demonstration: https://www.youtube.com/watch?v=__iaT7WQ41w&feature=youtu.be&t=24m16s
As you might pick up, the more validation or unit tests you have in your network, the better the "Predict" functionality will work for you. As they (Forward Networks) point out - it's very much like a software development mindset: the more (meaningful) tests you embed, the more confidence you build. As I mentioned before, there's a great number of pre-defined checks but you must implement them to gain that confidence.
All is well but not perfect with Forward
Keep in mind - as I mentioned, this is modeling and not simulation. One of the first implications is that the platform will not protect you from bugs in the switch/router/firewall code. Their back-end validation farm runs through as many code versions from as many vendors as they can, and the impression given is that they run through them "all" to validate that their modeling engine correctly represents them. I'm not sure that is possible, given the resources that would be required to validate every protocol (routing, spanning-tree, etc.).
As I also mentioned previously, the current platform does not take into account performance aspects of the network - although they did emphasize (without committing) that several customers expressed strong interest in that capability. I for one will be keeping an eye out for their future developments. Imagine the equivalent of integrating information from SolarWinds into a modeling and prediction engine - packet drops, interface counters, flow information.
Take flow information - say you wanted to know how web traffic for a certain customer would distribute across 2 additional links? Or stress those links because the customer preferred their DR site replication traffic over those links? You see the potential?
Before anyone takes any of that out of context - that is me brainstorming what might be. Forward Networks did not mention those examples, suggest those examples, hint at those examples, or commit to those capabilities.
Wrapping It Up
So I'll finish with a strong recommendation that you check out each and every video from the Networking Field Day 13 event at http://youtube.com/techfieldday. The presentations, in my opinion, really were solid and informative. As I see it, the product seems strong coming out of the gate and their presentation certainly reinforced that impression.
Mea Culpa
The price of being too busy is that sometimes keys, lists, sticky notes, and blog posts get lost. I was cleaning up my folders today and ran across my completed blog post on Forward Networks that I wrote back in December 2016. So, please forgive any dated references or tenses that make it seem like I just saw these guys - I saw them back in November 2016.
Also reference my general disclaimer regarding my attendance - in short, no one asked me to write this, let alone what to say. GestaltIT/TechFieldDay provided travel accommodations to attend the event.
Programmability at Cisco Live 2016
Today I want to post my observations regarding the DevNet component of the Cisco Live 2016 conference.
Conference Strategy
In years past, I ate up every single breakout session and at least one technical seminar (extra cost). My philosophy was centered around the full exposure, massive brain infusion, and “drink from the fire hose” aspect of the conference. During those years, I was an IT architect for a university and needed to develop a strong sense of the best technologies (for that institution) as well as the direction those technologies were going.
This year (2016), I was in a new role at a new company (financial services). The technologies I oversee - High Performance Computing - require a substantial level of automation, monitoring, and programmatic analysis of complex system behavior. As such, I spent about equal time in breakout sessions focused on the usual network technologies and in the DevNet area presentations.
DevNet Zone
That DevNet time was primarily spent in two of the available formats: the theater/classroom areas and the workbenches. The one area I did not frequent much was the topical booths, where one could walk up and learn more about VIRL, for example. The theaters and classrooms were not terribly different in format from a breakout session - standard lecture format. Good, solid content and presenters.
By and large, though, I thought the workbenches were a wonderful experience. Every session I sat in on provided a hands-on experience with the technology presented. Many of the workbenches had 8 or so MacBooks preloaded with the environments which, thanks to the flexibility of virtualization, gave each session a clean environment for its own private lab. Most of the presenters provided session material via GitHub (https://github.com/CiscoDevNet) or another content sharing medium (I vaguely recall DropBox might have been used).
The programmability topics ran a pretty wide gamut - UCS Manager for servers, APIC-EM for enterprise networking, and Mantl for container/microservices infrastructure. There was also a three part session on "Introduction to Python Network Programming for Network Architects and Engineers" (DEVNET 1040, 1041, and 1042) that takes participants from Python basics to creating network connections to making REST API calls - with all the sample code in between.
My only disappointment in all the programmability focus is that much of it was fairly high level and the sessions were barely 45 minutes long. I wanted levels of depth and intensity similar to those available in the Networkers breakout sessions. Despite my desire for more, though, the theater and workbench sessions definitely had strong content for the target audience - strong network engineers without a systems or programming background.
But, to counter that, many of the sessions leveraged a few of the existing demonstrations that are freely available online at the DevNet Learning Labs (https://learninglabs.cisco.com). Additional, more advanced demo labs exist for users to continue developing their expertise.
Take Away
With my experiences in DevNet, the content from several breakout sessions, and some of the discussions with Cisco engineers, Cisco is definitely placing a LOT of focus on automation and network programmability. It is certainly laced all throughout their Digital Network Architecture (DNA). You can learn more about the fruits of what they are doing over at the DevNet website (https://developer.cisco.com/site/devnet/home/index.gsp).
I really hope the Cisco Live 2017 DevNet improves on three fronts:
Step up the depth of the content - some of the introductory material is great for network engineers looking for a helping hand into programming, but deep dives into the API of each platform, particularly in the workbench areas, would provide huge value. I did attend a solid exposé on the UCS XML API and APIC-EM, but more complicated uses of the APIs are in order. They may have been there and I didn't catch them because ...
Better scheduling - the workbenches were scheduled on the hour for the most part, practically every hour. This made it very difficult to attend both the breakout sessions from the normal "Networkers" portion of the conference and many of the DevNet sessions.
Technical Seminars - "bootcamps" on programming against an API. Maybe that advanced material I'm looking for can be wrapped into an 8-hour session to build a particular, simple application based on that API?
Disclaimers
I was a member of the Cisco Champions program for 2016. Being a member of the program, I was provided access to some NDA presentations and stumbled upon access to a suite at the Cisco Appreciation Event. Other than no lines getting into the event or getting food/drinks, there were no other benefits to such access.
I also tied for 3rd place in the Cisco Live 2016 blogger contest! The prize for the contest was front row seats at each of the keynote speeches for the conference. Other than a cool selfie with Chuck Robbins (while wearing a kilt!) and a great view, there were no other benefits to the prize. You can read about the contest at URL.
No one asked for any social media coverage nor did I commit to providing any for the NDA access or CAE suite. The blogger contest did require making quality blog posts about Cisco Live (duh!).
At the beginning of 2017 (in 4 short days!), I will be joining Cisco as a Virtual Systems Engineer (VSE). I am posting this article because I enjoyed the Cisco Live conference, the social media crowd I got to know, and the immense high-quality information that I learned (or at least was made aware of!). I was not asked to post this as a condition of employment or as a result of my employment.
Generating Maps of Your Traffic
Any network engineer is familiar with 'traceroute' - a network tracing tool that leverages ICMP or UDP to report the hop-by-hop path your (traceroute) traffic takes from your client to the specified destination address. In short - it sends packets to the destination address with the TTL initially set to 1, and each successive packet is sent with a larger TTL value. Each router that decrements a packet's TTL to zero drops that packet and returns an ICMP Time Exceeded message. Thus, each router in the path reports its existence to you.
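If you want to watch that TTL mechanism by hand, a rough sketch on a Linux host looks like this (the destination address is just a placeholder):

ping -c 1 -t 1 198.51.100.10    # TTL=1: the first router drops the probe and returns ICMP Time Exceeded
ping -c 1 -t 2 198.51.100.10    # TTL=2: the second router answers, and so on down the path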
Obviously, as you are trying to potentially troubleshoot connectivity between two hosts, this information can be helpful in determining where traffic is getting dropped or taking a right turn at Albuquerque instead of a left turn.
Except, many things can go wrong or provide an incomplete picture:
ICMP messages can be blocked by routers along the path
The path presented is the current singular path taken. If equal cost multipathing is in play, you may not see the other paths your traffic can take. Or worse, some of the probes take one path and other probes take another path.
If policy based routing is in play along your path, you will not see those various policy paths with simple traceroute.
And, of course, that path information from traceroute is "real time" - unless you have cobbled together your own script and database (a rough sketch of which follows this list), you aren't going to have historical information for those paths.
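As a rough sketch of that cobbling (the destination address and log path are placeholders), even a single cron entry gets you a crude history:

*/15 * * * * /usr/bin/traceroute -n 198.51.100.10 >> /var/log/traceroute-history.log 2>&1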
In the past few months, I've run across presentations and demos on a few different ways you can get a better idea of the various paths available and their changes over time. That latter aspect plays an important role in troubleshooting potential issues on your network.
Those who don't know history are destined to repeat it.
The first enhancement to the traceroute approach I encountered was NetBeez - a monitoring solution that I wrote about back in September (http://bit.ly/2cv2DEx). The approach is straightforward and campus-LAN focused - using virtual or Raspberry Pi (or similar hardware) agents across the network, you centrally schedule traceroute tasks from those agents to defined targets and store the results.
I was able to deploy a NetBeez agent at my home office and quickly understood some of the anomalous performance hits on my office internet connection, simply by having a historical record of all the traceroute data. Most of the time, the path information was complete because that traffic is not impeded by routers along the path to the destination.
A similar story would be true for a campus network environment where you wouldn't block locally originated/destined ICMP traffic. The NetBeez approach is a great, simple solution for straightforward network designs and paths.
But what happens when you have to traverse the Internet with all of its various providers, peering relations, and (seemingly) ad-hoc policies toward ICMP? You need a better approach to understand the complexity of paths possible both in the forward and reverse direction.
Once You Start Down the Multiple Paths...
Enter SolarWinds - which first previewed a network path discovery tool, something they labelled a "lab toy", called NetPath back in February 2016 (Tech Field Day 10). A month ago (Networking Field Day 13), Chris O'Brien returned to present the latest updates, now in the context of being an official feature available through the Network Performance Monitor product SolarWinds offers.
Chris points out that they quickly found simple ICMP probing was insufficient for the task at hand - so SolarWinds developed some proprietary techniques for crafting packets to help discover multiple paths to the destination. With a couple of patents pending under their belt, Chris was able to discuss more freely how they conduct their probing to build out the path information. The best part: it was a whiteboard discussion that you'll definitely want to check out here (http://bit.ly/2hvSZaQ). There is a wealth of wisdom in the mechanics of tracing your path that Chris relates from their experience developing the technology.
NetPath takes the simple traceroute approach with historical context to the next level by providing that multiple path information. The NetPath data is gathered by a SolarWinds provided Windows-based agent that you install in your network, at the various points where you would like to source the discovery (similar to the NetBeez approach). The probe data is stored in a SolarWinds SQL database for the historical trending/evolution of your traffic path.
The only location for that data is your local, on-site instance of SolarWinds. And, visually displayed, the path looks something like this:
SolarWinds doesn't stop there - path discovery is a neat tool but doesn't do you much good in a vacuum. The SolarWinds product suite, however, isn't in a vacuum - between NCM (network configuration manager), NPM (network performance monitor), and NTA (network traffic analyzer), SolarWinds has a very rich understanding of your network and can leverage those components to help determine potential root causes for changes in path behavior. The simplest example of this integration: correlating configuration changes with path behavior changes - namely, a redundant link disappears from your path because you shut down the interface.
Coming from a strong Linux background, SolarWinds hasn't been one of my favorite tools - for a few technical reasons - but the power of correlating/integrating the components is fairly apparent. Quite frankly, the ease of use for these NetPath related features was pretty compelling as well - something I can't say for other products in the SolarWinds suite.
The Point of No Return?
We've been conditioned for decades by our local networks to think that level of visibility is sufficient for understanding our traffic flows. To some extent, at the local level, that would be true. However, when we expand that approach to the Internet - traffic flows are a lot more complicated. Asymmetric routing is a way of life because the route availability and policies for each transit AS can force the return traffic along a different path.
Currently, SolarWinds doesn't do the reverse path - "because it's really hard" - but with that explanation, I feel like they are working on it even if they are not saying so or committing to it. There are other companies, such as ThousandEyes, that are discovering the reverse path so it is just a matter of research and development.
Wrapping it up
While humorously calling it "mean time to innocence", a few products I have seen recently are really advancing the state of monitoring and problem detection. Network outages and hardware breaks have been the low-hanging fruit for these tools for quite some time. With overlays, SD-WAN, and PBR, networks have long since taken on complexities that make troubleshooting subtle but user-impacting performance issues a manual, laborious effort for engineers.
It's encouraging to me, coming from a "bread and butter" campus network background, to see tools that are focusing on quickly isolating these issues - especially as network complexities, such as cloud adoption and transitions from MPLS to more SD-WAN based services, are unavoidable in even the most traditional networks going forward.
Efficient Resource Use Takes Interesting Turn at Scale
TL;DR
The first couple of sections set the backdrop for why the technology from DriveScale is really interesting. If you want to skip them, shame on you (but jump to the Composable Big Data Platform section).
Background
Throughout the history of computing, there has always been the constant back and forth between producers and consumers of computational resources. Build a resource with more horsepower and users will run more workloads or run bigger workloads. We've seen this trend play out in scientific computing and, in many respects, the enterprise.
In the enterprise, we have, historically, been plagued with large numbers of business critical, resource-light applications with conflicting software dependencies - end result of course was a large data center full of underutilized servers. The enterprise drove efficiencies by virtualization of services/applications, consolidation of databases onto fewer servers, and leveraging of dedicated, optimized NAS appliances - among other technologies.
In scientific computing, we come at computing from the other side of the coin! Generally, there have not been enough servers to get the work done in a timely fashion, so we throw more servers at it. Sometimes that derives from the complex nature of the calculation; other times it's because the data sets are somewhat large (<10TB today). The end result, of course, was Linux HPC clusters that aggregate large numbers of servers, using scheduling software to efficiently place the workloads.
Generally speaking, though, as Moore's Law has driven computational horsepower, servers eventually end up with more and more idle processor cycles because the other critical resources (networking and storage) have not been able to keep pace...
Enter Big Data
The last 3-5 years have seen a huge explosion of data growth - in both enterprise and scientific research. The rates have far exceeded what compute and networking can handle in order to effectively process it all. Thus, it is a very interesting development to me that enterprise and scientific computing have both reached big data scales at roughly the same time.
Even more interesting is that the lines between enterprise and scientific computing are blurring rapidly - both in mutual dependency (for example, medicine and hospitals) and in required skill sets (statistical analysis/machine learning and ... well, everything!). That, though, is a discussion for another time.
The primary notion behind big data is that the data sets have become too large to efficiently move to the compute resources for analysis. Additionally, those data sets are analyzed repeatedly - which would normally suggest a caching approach except for the scale involved - so you would have to build large, cost-prohibitive networks to support the old model.
So, in general, a new model arose for big data: move compute to the storage appliance. Generally speaking, throw 50 or so drives in a multi-socket server chassis and use novel techniques to distribute the analysis to the compute fronting those drives. Seems simple and problem solved, right?
Footnote: For those who want a deeper background, head over to Wikipedia for Big Data, Apache Hadoop, and Apache Spark.
Big Data parallels the HPC story
As with HPC Linux clusters in the 2000s, big data has experienced a similar adoption pattern. Initially, we have large data sets whose analysis is CPU-bound. We build a big data "cluster" tuned to that application, throw all of our data into it, and essentially dedicate a massive set of resources to a single problem.
As the framework shows value, more applications are adapted to the new paradigm - originally, this approach was used by Google for search engine results. Today, you can easily find big data solutions for system log analysis, medical record mining, and stock market analyses.
The important thing to note, though, is that each application has its own large data set and different ratio of computation to storage requirements. These infrastructure bundles can be sizable and expensive so how do we support each application? The answer, very early on was: a big data cluster for every application! (Cue the obligatory Oprah clip...)
Except... just because we have a new model does not mean the various advancements in compute, network, and storage are no longer relevant. Computational power is eventually going to surpass what you can store locally. Eventually, big data platforms are going to experience the same resource inefficiencies that fed into the virtualization movement in the enterprise and dynamic scheduling of scientific computing.
Furthermore, what if you are a service provider? Do you build dedicated big data clusters for each application a customer has? Or can you build a Big-Data-As-A-Service platform efficiently?
Composable Big Data Platform?
These are the sorts of problems that the folks at DriveScale are beginning to tackle. The team they've assembled, as my friend John White points out in his blog, has a strong pedigree in the IT industry. The fundamental approach they've adopted involves disaggregating the storage from the compute without introducing too high a performance penalty in doing so.
In the traditional approach, you must select among chassis and server platforms whose primary goal is to provide the best quantity of storage with the needed performance. Those platforms have to make trade-offs that limit the server configurations. Take the HPe Apollo 4500s, for example - you can get 68 drives with one server node but must sacrifice drive slots to get a second node. Both the HPe and Cisco platforms limit the amount of memory available to your servers in the interest of physical footprint.
The DriveScale approach starts with standard off-the-shelf JBOD storage shelves that are fronted by DriveScale controllers using 12Gbps SAS interconnects. The controllers are extremely efficient data movers, transferring bits between the JBOD drives and the mapped servers. The servers? Any server configuration to fit your computational need - Dell, Cisco, HPe, Supermicro - so long as it has 10GE capability. And, yes, like the servers - your choice of 10GE switch (there's an HCL, too) to connect the 10GE servers to the 10GE DriveScale controllers.
Here's what a typical rack looks like physically:
Logically, this is how it connects together within the rack:
Drilling down into the magic hardware sauce, namely the DriveScale appliance and controllers - the 1RU appliance has four DriveScale controllers in it. Each controller has dual 20GE uplinks to your servers. A pair of those controllers attach to a single dual controller (SAS 12Gbps based) JBOD array. Here's the logical drawing for that appliance:
Note: the NVMe SSDs on the front are not quite available yet (Nov 2016).
Flexibility sacrificing performance?
If this reminds you of other architectures, it should. It's not very different (in boxes and lines drawing) from NAS/SAN controller architectures. However, as opposed to those architectures, the data path for all the storage is not focused through a single pair of controllers. The platform scales horizontally, based on their presentation, to 16 controllers per rack. Additionally, the expansion shelves are not unique to the controller vendor - more about that below.
Concerned about the 10GE fabric for scaling out? There are a few details to calm your fears. First, you are sending all your traffic across a single, non-blocking Ethernet switch within the rack. There is no oversubscription between storage and server (by crossing your network core, e.g.). Second, there is nothing to stop you from going 25GE or 40GE in your server and network switching...
Which leads to the third "detail", the roadmap. They are driving toward 25/40/100GE based architectures for their appliance. The timing wasn't entirely clear to me - 25GE is certainly coming very soon, but I couldn't tell when 40/100GE was arriving. On that note, you should check out their video presentations or the event's main page (http://techfieldday.com/event/tfd12) for more information.
DriveScale is Software too!
Ultimately, the DriveScale approach is all about scaling out your capabilities using cost-effective components. It very much follows the HPC cluster model of using commodity compute hardware (that meets your needs) along with commodity storage hardware. Similarly to HPC, special middleware is needed to intelligently glue the components together to provide the ultimate services required. In HPC, that middleware is both the job scheduler and the parallel filesystem.
For DriveScale, the software not only provides the management layer to monitor hardware, map resources, etc., it manages distributing the load across the appliances, handling redundancy (presumably via erasure coding) if required, etc. Rather than try to reproduce the details, I will refer you to the 20-ish minute video on the capabilities and benefits of the software on the Tech Field Day YouTube channel. How they enable these capabilities is very interesting.
Wrap Up
One key component to these large environments that frequently gets overlooked is hardware life cycle management. In HPC, especially the large clusters at national laboratories, "upgrading" is rip and replace with a brand new resource (for good reason, by the way). In traditional big data platforms, the approach is the same - especially when it comes to HDFS. Operationally, HDFS doesn't easily rebalance data so incrementally adding space is not a great or easy option - which is why the best practice is to fully populate all drives in a node.
DriveScale, on the other hand, by physically separating the server from storage, permits you to refresh the server components completely independent of the storage as well as refreshing the storage separately as well. The "bridging" function of the DriveScale hardware (SAS to Ethernet) and software (logical mapping of drive to server) permit this extremely valuable flexibility.
Another key component at this scale: the supply chain! DriveScale simply sells the appliances and the software. You are responsible for buying the storage (JBODs) and servers. Given the scale of compute and storage, the DriveScale folks (rightly) do not want to insert themselves into the supply chain - it's not their expertise, and the mechanics of SCM would easily dwarf their primary mission.
In my view - this is a very good thing! How many times have you had a great platform/product you need for your business but had to tough out a new set of management interfaces, different support mechanisms (and quality of service), etc.? Here, you get to choose your servers, your drives, and your network.
My perspective
While the technology is very fascinating to me, I am most excited by the points I made in the wrap up. As an infrastructure engineer, operational and lifecycle issues are really important. You can't simply vMotion an application off one server in order to replace the server! When you have a critical application or service that many groups (or the entire business) rely on, you want to be able to maintain it, refresh it, and improve it without having to take it offline.
In my HPC past, I have used IBM GPFS (now called Spectrum Scale) to perform similar upgrades of storage appliances independent of the storage servers. So, even though my academic HPC life has been lived at "the small scale" (100+ TBs and 2,000 cores), I can easily appreciate the approach DriveScale is taking at much larger scales of storage.
Disclaimer
Please see my general disclaimer on all content I write on this blog. I saw the DriveScale presentations in person, as a delegate for Tech Field Day 12.
Container-based ThousandEyes Enterprise Agent Install on Fedora 24
I had the wonderful opportunity to be a Networking Field Day delegate back in August (http://techfieldday.com/event/nfd12/) where we saw the latest developments from ThousandEyes. If you don't know anything about them, you should really check out their presentations (http://techfieldday.com/appearance/thousandeyes-presents-at-networking-field-day-12/) which include demonstrations of their product. The technology was impressive enough that one of the delegates was begging them to take his money!
Their instructions on setting up the Docker based agent are really straightforward as most container deployments tend to be. Here's a partial screen shot of how to get to the downloads:
While I've been following Docker a great deal and have used it a little bit in lab environments, I've never deployed a container that had persistent storage (volumes) - let alone 3! Not to worry - the ThousandEyes portal provides a nice form to walk you through the process:
Simply plug in the container name, Docker version, and volume parent directory. With that information, they provide the exact Docker commands to pull the image and instantiate the container instance as shown - they bold the dynamic information you provide so you can visually see how that information is specified to Docker.
But, before I can do that, I must point out something that may affect you but definitely affects me...
Filesystem and Logical Volume Interlude
With Docker, you need to have space for the image repository, container instances, and persistent volumes. On Fedora, this data defaults to the FHS suggested location of /var/lib/docker/volumes. If you have the default filesystem layout from a standard Fedora install, you might not have to worry (provided you have space!).
However, as you might have read in a previous post (https://tmblr.co/ZdckEt2Ea---v), I use a different layout. Check this second post (https://tmblr.co/ZdckEt2Eb2ZUy) to read more about how I prepared my filesystem for the Docker images.
Back to our regularly scheduled program...
Houston, We Have a Problem
All the commands worked just fine - except the enterprise agent did not appear in the portal. Docker reported that the container kept restarting. Running the docker logs blog-container showed:
*** Running /etc/myinit.d/60-copy-dnsrootkey.sh...
cp: cannot stat '/var/lib/te-agent/dns-rootserver.key': Permission denied
/var/lib/te-agent is mapped to one of the persistent data volumes for the container. First thought - crap, what did I do wrong:
Double checked syntax for docker run.
Verified volume directories were created in /var/lib/docker/volumes/
Permissions looked good
Look at /var/log/messages - proceed to swear at systemd conspiracy
Look at journalctl output.
And this pops out of the noise:
Nov 08 09:58:24 minbari.docker.local audit[4310]: AVC avc: denied { write } for pid=4310 comm="cp" name="te-agent" dev="dm-0" ino=9224657 scontext=system_u:system_r:svirt_lxc_net_t:s0:c182,c898 tcontext=system_u:object_r:usr_t:s0 tclass=dir permissive=0
Yes, our old friend SELinux - except, I followed operating system standard procedures to ensure /var/lib/docker reset all of its SELinux contexts to the proper values.
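For anyone triaging a denial like that one, the SELinux audit tools will translate it into plainer language (assuming the audit and policycoreutils Python utilities are installed):

ausearch -m avc -ts recent | audit2why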
Docker and SELinux
Some extra details are in order - As mentioned, I'm using a fully updated Fedora 24 VM (date 2016-11-10) running the latest kernel at the time (4.8.6-201.fc24.x86_64). I am using the Docker packages that are provided by the Fedora Project. This means v1.10.3 (current available is v1.13). I haven't had time to test this cleanly but, from my research, the Docker latest repo does not change the story.
A quick Google search found this post from Project Atomic - http://www.projectatomic.io/blog/2015/06/using-volumes-with-docker-can-cause-problems-with-selinux/
Essentially, when the host is enforcing SELinux, the context that the container's processes run with (svirt_lxc_net_t) must be able to read/write the volume where it is mounted within the container. Generally speaking, svirt_lxc_net_t does not have permission to do that for an arbitrary host directory. So, the volume's mount point has to be relabeled so the container can access it.
Why do we use SELinux again? Because it's the right thing to do (TM).
Unfortunately, even though we followed the FHS and reset contexts, Red Hat SELinux policy hasn't been adapted for this issue. It might be because, as stated in that ProjectAtomic post, Docker accepted a patch way back in 1.7 to resolve this.
The solution is to append ':z' or ':Z' to the '-v' volume argument - lowercase 'z' relabels the content with a shared label (usable by multiple containers), while uppercase 'Z' applies a private, unshared label for this container only. The final 'docker run' command (based on the screenshot above) is:
docker run \
  --hostname='blog-container' \
  --memory=2g \
  --memory-swap=2g \
  --detach=true \
  --tty=true \
  --shm-size=512M \
  -e TEAGENT_ACCOUNT_TOKEN=21zpxbjzsrvknlcgkcoxp4w04ung2261 \
  -e TEAGENT_INET=4 \
  -v '/var/lib/docker/volumes/thousandeyes/blog-container/te-agent':/var/lib/te-agent:Z \
  -v '/var/lib/docker/volumes/thousandeyes/blog-container/te-browserbot':/var/lib/te-browserbot:Z \
  -v '/var/lib/docker/volumes/thousandeyes/blog-container/log/':/var/log/agent:Z \
  --cap-add=NET_ADMIN \
  --cap-add=SYS_ADMIN \
  --name 'blog-container' \
  --restart=unless-stopped \
  thousandeyes/enterprise-agent /sbin/my_init
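If you want to sanity check that the relabel actually happened, inspecting the volume directory on the host is a quick, illustrative test:

ls -dZ /var/lib/docker/volumes/thousandeyes/blog-container/te-agent

With ':Z', you should see a container-accessible type (svirt_sandbox_file_t on this vintage of policy, container_file_t on newer ones) plus a unique pair of MCS categories, rather than the type reported in the AVC denial above.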
ThousandEyes and Support
Remember? This post was about the ThousandEyes Enterprise Agent container install. Well, I just want to give a shout out to the ThousandEyes support chat - which I engaged half way through this effort. I merely clicked on the little "cartoon bubble" chat icon and stated my problem.
Luis engaged the chat and worked with me on it for about 30 minutes. While chatting, he downloaded the Fedora 24 installer and went to work on replicating the issue - which, of course, he couldn't. My luck, eh? I had to leave the live chat but he opened a support case with the chat history.
A little discovery yielded the fact that he was cheating - his Docker process was running unconfined. I don't have detailed information but I imagine he is running Docker "native" (latest from Docker's repo) and not the Fedora version.
But, for someone whose account was past its trial period, that was a level of support I was not expecting. Kudos, ThousandEyes!
Final wrap
The ':z' and ':Z' options both resolved my problem and my container is up and running without issue. I’ve set up some tests between the enterprise and cloud agents. The container is generating data and I’m looking forward to seeing how it works.
Another Linux Interlude - Provisioning logical volumes for Docker
The context for today's post involves extending my filesystem layout (partitioning scheme) to accommodate a large Docker repository.
If you've read my previous post about my partitioning philosophy here (https://broadcaststorm.tumblr.com/post/153092095993/a-brief-linux-interlude-partitions-logical), you'll know that I do things a bit different from the default installation layout.
Short version, my Fedora 24 uses the following filesystem layout:
/dev/sda1 : /boot (1GB)
/dev/sda2 : LVM partition for rootvg (rest of space)
rootLV : / (10GB)
varLV : /var (10GB)
tmpLV : /tmp (5GB)
homeLV : /home (20GB)
Since I'm looking to add space to support Docker images, I need to be worried about the /var/lib/docker directory tree. For one container image and instance, 10GB might be fine. As I indicate in that other blog post though, running out of space in /var isn't good. Let's go ahead and add a new logical volume (LV) so that we protect the rest of /var and can keep an eye on Docker space consumption.
Note: these instructions assume that you have not installed Docker yet.
Nothing up my sleeve
The general steps to create this new mount point are:
Create LV - lvcreate -L 10G -n dockerlv rootvg
Format filesystem (XFS) - mkfs.xfs /dev/rootvg/dockerlv
Update FSTAB with this line - /dev/rootvg/dockerlv /var/lib/docker xfs defaults 0 0
Create /var/lib/docker - mkdir -p /var/lib/docker
Mount the LV - mount /var/lib/docker
At this point, you have /var/lib/docker with 10GB of its own space live on your system. However, there are a couple of important finishing touches that lots of folks forget:
Change owner, group, and permission settings to match use case.
Restore the SELinux contexts for that directory tree.
The new filesystem on the new LV has a fresh directory tree, the top level of which defaults to root for the UID and GID with permissions of 0755/drwxr-xr-x. In this case, root UID and GID are correct, but the directory permissions should be 0700 (chmod 0700 /var/lib/docker) - only the root user should be able to read/write and access the directory.
A new filesystem defaults to having an unlabeled_t type context. The end result - everything created underneath that top-level will also be unlabeled_t. Fortunately, SELinux enabled systems have an easy fix for this: restorecon -vR /var/lib/docker
Secret filesystem mojo
Let's assume that you previously had a running Docker environment with a couple GB of images and container instances. You are getting tight on space and want to create a separate LV to permit for growth.
You stop Docker. You follow the steps above, as is. You start Docker. Docker reports a brand new, clean environment with no images, instances, or volumes.
Where did your existing images, containers, and volumes go? Nowhere. In my partitioning scheme above, they still live on the varLV logical volume - in a sub-directory of the top-level lib directory.
In UNIX/Linux, mounting a partition or logical volume performs some redirection magic (WAY beyond scope of this post) so that /var/lib/docker points to the new LV mount point. However, the /var filesystem still has the underlying filesystem structure that points to the original data.
To see this easily (with my partition scheme above), on a freshly booted system, console in as the root user (not via sudo). Unmount the /home filesystem, create a temporary file, and remount the /home filesystem. Unmount again and watch your temporary file re-appear.
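A minimal sketch of that experiment (run as root, with nothing actively using /home):

umount /home
touch /home/shadow-test     # lands in the /home directory on the root filesystem
mount /home                 # the home logical volume now covers that file
ls /home                    # shadow-test is nowhere to be seen
umount /home; ls /home      # ...and there it is again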
/var/lib/docker wrapup
I freely admit to not fully grasping the mechanics of Docker where the rubber meets the road (that is, where the pretty powerpoint presentations are converted to files and directories on disk).
That being said, I can discern the existence of black magic when I see the following:
# du -hs /var/lib/docker
2.4G    /var/lib/docker
# ls -lh /var/lib/docker/devicemapper/devicemapper
total 1.4G
-rw-------. 1 root root 100G Nov 12 10:33 data
-rw-------. 1 root root 2.0G Nov 12 10:33 metadata
For those with poor eyesight, a 2.4G directory tree has a 100GB file in it. Why that is I will one day investigate... but not today.
Because, the relevance here is this: A typical method of mounting the new LV in a temporary location, syncing the data, and remounting the new LV in its final location does not work.
You will not easily be able to migrate the environment without some significant work and solid understanding of the inner ways of Docker. From what I've read online, that appears to change every 6 months too :)
I strongly suggest simply blowing away your old Docker environment (docker rm and docker rmi all containers and images), stopping Docker, and deleting the contents of /var/lib/docker before mounting the new LV based /var/lib/docker.
A Brief Linux Interlude - Partitions, Logical Volumes, and Layouts
In this day of dynamic applications, rapid development cycles, and throw-away "OS instances" - be they VMs cloned from golden images or containers - we are often so focused on the forest that we fail to notice the various parts of the ecosystem that each play a role.
One particular issue I had today brought to mind some aspects of provisioning I haven't thought about in years, instinctively following a general pattern that has served me well.
Filesystem Layout
For those who mainly consume Linux as an end user, you may not be aware of something called the Filesystem Hierarchy Standard (FHS), which defines the directory structure for "Unix-like" Linux distributions. You can read a bit about it on Wikipedia (https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard). It originally started as a Linux community effort, is maintained by the Linux Foundation, but can be used by non-Linux operating systems as well.
The short of it: the directory tree you see in a Linux distribution isn't an accident or haphazard. There are specific directory paths for specific types/classes of data. Most major distributions conform to the standard - which is a good thing - as it permits developers to lay out software installations in predictable ways.
The standard also permits the OS to provide some default behaviors (such as SELinux contexts) that can be leveraged to help secure the system a bit better.
Disk Partitioning
A hierarchy is fine and must exist for the user and developer but... the system administrator who installs the operating system has the freedom to map that filesystem onto the underlying storage device(s) in any number of ways.
The first level - disk partitioning - has gotten very simple over the years. Linux systems rarely need more than 2 basic disk partitions on the primary disk: one for the /boot directory and one for the logical volume management subsystem. The separate /boot partition is required because the boot loader is fairly minimal, supporting only a few (sometimes just one!) filesystems.
The second partition is the LVM partition from which you build logical volumes to support all the other directory mount points for your operating system. For a primer on LVM, see Wikipedia (https://en.wikipedia.org/wiki/Logical_volume_management) or Fedora (https://docs.fedoraproject.org/en-US/Fedora/14/html/Storage_Administration_Guide/ch-lvm.html).
Logical Volumes/Mount Points
When you choose the default filesystem layout in a Fedora installation, you get that two physical partition approach for /boot and LVM as described above. Additionally, Fedora creates two logical volumes:
one for the root filesystem (/)
one for your swap space.
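On such a default install, the block device view ends up looking roughly like this (device names and sizes are illustrative, not from any particular machine):

$ lsblk /dev/sda
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda               8:0    0  500G  0 disk
├─sda1            8:1    0    1G  0 part /boot
└─sda2            8:2    0  499G  0 part
  ├─fedora-root 253:0    0  491G  0 lvm  /
  └─fedora-swap 253:1    0    8G  0 lvm  [SWAP]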
With laptops and personal workstations, that's a decent enough approach. It gives an end user as much of the physical disk for whatever personal activities they want.
If you've built servers for a living, you quickly start twitching at all the potential pitfalls that ultimately result in filling up your root partition - which, if you didn't know, is bad (see https://youtu.be/wyKQe_i9yyo?t=47s):
/home : user data can grow until all space is consumed
/var/log : system logs that, even with log rotation/compression, grow over time (especially that new application which failed to put logrotate.d rules in place)
/var/cache : application cache data (software updates, e.g.) that never seems to get cleaned up over the life of the installation
/var/lib : application state data
/tmp : temporary files
Under this default layout, Murphy's Law presents itself in any number of ways, such as:
The filesystem holding your logs filled up 1-2 weeks ago, so the debug messages are lost for the issue you are experiencing with (your web app you are building|the VM you are kickstarting from your laptop|etc.)
You're downloading software that takes an hour or two, go to bed, and wake up to an incomplete transfer because the disk filled up overnight
You commit/push your code into your laptop/workstation repository only to run out of space mid-commit. Ouch.
My preference
To avoid those calamities with the default filesystem layout, I've built my own layout - none of which is terribly original as most of the structure follows guidelines from days long gone by. Given the twitches I outlined above, you might have guessed that the layout looks something like (no matter how big the system drive is):
(physical partition) /boot : 1GB
/ : 10 GB
/var : 10 GB
/tmp : 5 GB
/home : 20 GB
It's only recently (last year) that I had to bump /boot from 512MB to 1GB because of the increase in consumption (under Fedora) caused by the intersection of multiple kernels and systemd. Also, /var went from 5GB to 10GB because the yum/dnf/PackageKit cache data has started to consume a large quantity of space.
So, I have done pretty well with the layout. The best part about that starting point is that those logical volumes are not static in size - they can be grown at any time as your needs require.
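The installer handles the carving for you, but the by-hand equivalent is roughly the sketch below (the volume group name vg_system and the ext4 choice are placeholders, not a recommendation):

# create the logical volumes; the rest of the VG stays unallocated
lvcreate -n rootLV -L 10G vg_system
lvcreate -n varLV  -L 10G vg_system
lvcreate -n tmpLV  -L 5G  vg_system
lvcreate -n homeLV -L 20G vg_system

# put a filesystem on each one
for lv in rootLV varLV tmpLV homeLV; do
    mkfs.ext4 "/dev/vg_system/${lv}"
done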
Room to Grow Intelligently
And note: I stated those sizes are regardless of the drive's size. I have a standard practice of leaving the rest of the system disk unallocated. Doing so helps you in many obvious ways:
space usage discipline : namely, when you run out of space in /home, you can evaluate whether or not to clean up the usage or grow it based on actual need.
emergency reserve of space
flexibility to expand only those directory mount points (logical volumes) that really need extra space
ability to create new mount points when, for example, you want to create an isolated mount point for website content (/var/www/html) or a PostgreSQL database (/var/lib/pgsql).
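To make those last points concrete, growing a volume later (or carving out a brand new one) is a one- or two-liner - again using the placeholder vg_system naming from above, with lvextend's -r flag resizing the filesystem along with the LV:

vgs vg_system                                 # confirm unallocated space remains in the VG
lvextend -r -L +10G /dev/vg_system/homeLV     # /home genuinely needs 10GB more

lvcreate -n pgsqlLV -L 10G vg_system          # new, isolated mount point for a database
mkfs.ext4 /dev/vg_system/pgsqlLV
mount /dev/vg_system/pgsqlLV /var/lib/pgsql   # plus a matching /etc/fstab entry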
Summary
So, many times - especially for "throw away" installs - you don't really care how your filesystem is deployed. However, for longer living systems, such as a home workstation or Linux VM on a Mac or a server, a little bit of preparation can save you some headache later down the road.
Check out my next post where I pull in the Docker angle to this refresher.
Text
My Take on #NFD12 Heading Into #TFD12 and #NFD13
Networking Field Day 12 marked my first attendance as a delegate to a Tech Field Day event. I honestly didn't know what to expect from being "on the inside". I had viewed many of the previous events online, both live and in review. I had been to many Cisco Live (#CLUS) events (8 straight years) and so thought I was prepared for the technical information overload.
In the Field Day environs, though, the content for me was substantially different, being a good bit outside of my normal, day-to-day experiences. Some of the presentations involved some novel ideas dealing with newly emerging pain points in the industry.
As a result, you could (and we did) very easily have a startup from Silicon Valley, just out of stealth, making a presentation in order to convince you of (among other things):
The gravity and importance of the problem they are trying to solve
The technology they've developed to uniquely address that problem
Launch countdown on hold
Not truly knowing what was coming, I began the event with a technical conference approach: consume and absorb all the information, taking copious amounts of detailed notes in the process. Under that model, though, much of the information tends to be just a newer generation of existing experiences or products. The challenge is merely the volume of information and not necessarily the evolution of ideas.
Understanding/comprehension, relevance and perspective would typically form in discussions at the conference, on the trip home, or while reminiscing with colleagues back at the ranch.
I won't lie - it took me the better part of the first morning to truly understand that a new approach was needed in order to better participate in and benefit from the event. And ... I had to develop it fast.
Smacking the launch timer
To start with, for #NFD12, I had to quickly discard my notion of taking detailed notes. From one perspective - an extraordinary amount of very cool technologies and architectures - it was very easy to stay focused on the presentations. From another perspective - what the heck did they just say? what did they say yesterday? - it was extremely hard because detail oriented is core for me. As I internally munge the content, it isn't unusual to miss a key detail.
What ended up working pretty well for me was taking a step back and realizing something that probably should have been obvious to me going into #NFD12: simply put myself in the position of having the problems highlighted by each vendor and, based on my past and current roles, analyze the architectures and solutions as if I were considering buying the product. Does it, can it, will it ever do what they say it will do - and does it really solve a problem I might realistically have?
Achieving orbit
In the end, the conference approach ultimately was grounded in the goals of attendance: trying to drive me to become something of an SME on the issues, technologies, and products. That goal was rather unrealistic - especially for my first Field Day event.
After those adjustments, I felt a great deal more relaxed during the event. I enjoyed the presentations much more. I engaged more with my fellow delegates and had great off-camera conversations with them.
Post-launch assessment
For the next events - Tech Field Day 12/#TFD12 and Networking Field Day 13/#NFD13 in November 2016 - I am seriously contemplating using pen and paper for any small amount of note taking that might be required. The open editor for notes simply didn't prove useful.
Besides, once the videos went online at #NFD12, I was able to take much better, detailed notes because I had seen the material once already and I had the convenience of pause and rewind functions. So, other than live tweeting and side conversations, going full electronic was not all that helpful.
Text
Network Analytics That Helps You and Your Help Desk
In my previous post, I talked about the monitoring tools and visibility that you might typically see at a University. With those tools, you get a great sense of what services/hosts/devices are up/reachable. You even can extract/deduce that some degradation might be going on. I even talked about how the "typical" open source (or inexpensive) tools leave an important aspect of visibility out of the picture.
Visibility is an extremely powerful component of maintaining and supporting your network. The right data in the right hands can lead to quick and effective problem resolution.
The Wireless Dilemma
It's no secret that end users are increasingly mobile. Equally unrevealing is that troubleshooting wireless is not easy - even for seasoned network engineers. As mobile applications put more stress on the network, users increasingly experience connectivity issues that affect the business. Given the complexities involved, help desks simply resort to escalating a large number of incidents to be resolved by the (wireless) network team.
As a result, that creates a huge operational expense for large scale wireless deployments - especially at universities that are so dependent on mobility for learning spaces as well as accommodating the "residential experience" students require.
Any tool or technique that can help engineers quickly identify the root cause for a reported wireless issue provides a huge win for the university - especially as they (like many businesses) are under increasing pressure to reduce operational expenses.
Time and Materials
Because, in my mind, the problem with maintaining a wireless network isn't an engineer training issue - it's a time issue. The massive drain on time is the manual collation of information from several different sources to begin deducing the root cause: wireless controllers, authentication servers, SNMP managers, SYSLOG servers, etc. all contain various pieces of the puzzle. Typically, a site visit by the engineer is involved. Many issues have smoking guns; some do not.
For some salt in the wound, at a University, users (mostly students) typically have a low report rate for incidents. If they've come to expect that "wireless isn't so great here", they may live in silent suffering especially if some level of wired connection is available. Occasionally, though, someone may think better of it and report an issue. As a result, you are left with an incomplete picture of the scope of the problem.
Worst of all, wireless being wireless and residential spaces rife with dirty air (electromagnetic and aromatic), you have a large potential for your network equipment being completely functional and your clients not.
Nyansa at #NFD12
All of these pain points came to mind when I was a delegate at Networking Field Day 12 (http://techfieldday.com/events/nfd12). The first presentation of Thursday was Nyansa (http://www.nyansa.com) and their product Voyance.
The TL;DR of the presentation: it will tell you in simple plain English what the problem could be and possible solutions to fix it. And they showed an example of it on a live network (Brandeis University).
But how they manage to accomplish that is very interesting...
How Voyance Works
The longer version: they use appliances (OVA or physical) distributed across the network - as many as needed to get a complete view of all traffic - to collect a wide range of metrics off the packet header via port traffic replication (SPAN/Monitor). Some protocol information (HTTP response times, e.g.) is also gathered. The packet payloads are discarded.
The crawlers (as they call them) also communicate with popular wireless controllers (Cisco, Aruba, Ruckus) to pull configuration and performance information from the wireless network. This integration permits Voyance to have that visibility into client wireless experience by having access to signal strength, association information (client to AP maps), and assists with mapping users to clients.
All that "performance metadata" (as I'll call it) is sent to Nyansa's cloud analytics engine. In short, the product uses analytics on all the visibility you have given it to do a few very cool things (among others):
Collates all those sources of information to associate them to a specific client and/or user.
Recognizes commonalities between clients (e.g. 5 clients on the same AP) in order to detect common behaviors between related devices
Can determine baselines for various aspects of your network automatically
What Voyance Can Do For Operations
When it's all said and done, the two parts that really got me interested in the technology were the automated correlation/grouping and the wireless troubleshooting. As I wrote in my intro, wireless troubleshooting ain't fun and frequently comes with a lot of incomplete information.
So, to see where I'm coming from, take a look at the Nyansa Voyance Demo video, starting at the 27:30 mark and watch for 2 minutes. In short - you can search for a particular user, a start and end date and get an overview of what issues that user experienced with possible causes.
Just as I said in the TL;DR - and no, I really wasn't kidding. The analytics engine is making its best inference/deduction based on the data at hand. The best part? You can see the same underlying data behind those determinations as well, all in one spot:
Generally speaking, there is the potential for real operational savings with this tool enabling first level support to absorb the load from the engineering teams. With some workflows and training, the help desk can field/resolve some of the network tickets coming in. Or, with insight from the Voyance analytics, have a much better chance at assigning the ticket to the correct team for resolution.
On top of that, there's operational savings for your engineers as well. Not sure you are willing to wholesale trust the result from Voyance? That's okay - you will have all the related data available in one location for the user/problem at hand.
The presentations also highlighted a few other features which (for brevity) I won't discuss in depth. I didn't want to leave them unmentioned though as they are also difficult operational tasks that could be assisted by Voyance.
1. Dashboard overview of your network - yes, many products have one. But, look at this screenshot. Notice the security "Global Advisories Notifications" (my edit: big red box). This was not covered during the #NFD12 presentation but I think your security team might be interested in the demo. I would have loved to have had the time to see a deeper dive of the security advisories on which it can report.
2. Quality of Experience - how "good" was user X's Skype call? Or, why was it so bad? We didn't ask but I imagine the same questions can be asked of Avaya or Cisco telephony and/or video conferencing technologies. Maybe even WebEx? (pure conjecture on my part)
3. Capacity planning - it has all your performance data and trending information so this is a natural outcome.
End Game
The bulk of my experience derives from academic institutions. They are great, challenging, and exciting environments to work in. With students being the ultimate customer and bringing the latest gadgets from home, new and more complicated technologies continue to put pressure on staff.
For the last 6-8 years or so, many of us have been driving home the point that the (wired) network is a utility, like power and simply must be on at all times and rock solid. Now, we have the industry pushing wireless to meet those same standards as well as dropping more applications on it than ever.
So for education, I think there are several "gold stars" that make Voyance very appealing:
1. Deep packet inspection without decrypting or storing payload
2. No agent on client devices, no network topology changes
3. Does not need a map or topology of your network
4. Better guidance for help desks, which at universities can be staffed by student workers a fair amount of the time.
Bonus!
In doing a bit more leg work for this post, I ran across this link: http://www.nyansa.com/educause/
If you are going to Educause at the end of October, Nyansa is scheduling one-on-one demonstrations of the Voyance product! That would be a great opportunity to check them out and really pick their brains on the features that interest you. The team from Nyansa at #NFD12 were very approachable and, as you can see from the videos, excited about their product.
Also, after the #NFD12 presentation, I asked about pricing - it's based on the number of users or devices and comes in 4-5 different tiers (see http://www.nyansa.com/faqs/ for a slightly more wordy version). Flat-ish pricing is very attractive to academic institutions as well.
Disclaimer
I was a Network Field Day delegate - please be sure to review my disclaimer post: https://broadcaststorm.tumblr.com/post/150287365748/disclaimer-for-my-blog
Now, if only I had a network with which to try out this product...
Text
Seeing the Network Through Your User’s Eyes
Monitoring at a University
In a former life at an academic institution, we had a pretty solid set of monitoring solutions in place: Nagios and Cacti to monitor servers and services, coupled with SolarWinds NPM and Prime Infrastructure to monitor the network. Between the 4 tools, you could at least deduce the health of the data center, if not know it outright.
However, there were two pretty glaring blind spots IMO in the network: the campus wired and wireless network - okay, really it was mostly wireless. To the point though, it wasn’t whether physical switches or APs were up and functional - that’s easy - it was whether clients could get to the services they needed with a good user experience.
Those 4 tools, at least the way we had implemented them, were certainly not providing that view.
To get that visibility, you would need the analog of a Nagios server running on every single wired/wireless client checking response times of the gateways, how long it took to query the campus DNS/web/LDAP servers, Google, Office365, etc. And, honestly, having the historical performance data is the real gold mine - help desk tickets rarely are delivered in a timely fashion and the problems are often transient so being able to view performance an hour, a day, or a week ago can make all the difference!
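To give a flavor of what I mean, here is a rough sketch of that kind of per-client probe - the gateway, hostname, and URL are placeholders for your own environment, and the log path assumes it runs as root (adjust to taste):

#!/usr/bin/env bash
# Sample the user-facing basics: gateway RTT, DNS lookup time, HTTP response time.
# Logging with a timestamp is the point - the history is the real gold mine.
GATEWAY="192.0.2.1"
DNS_NAME="www.example.edu"
URL="https://www.office.com"

ts=$(date -Is)
gw_rtt=$(ping -c 3 -q "$GATEWAY" | awk -F'/' '/^rtt|^round-trip/ {print $5}')   # average RTT in ms
dns_ms=$(dig +noall +stats "$DNS_NAME" | awk '/Query time:/ {print $4}')        # query time in ms
http_s=$(curl -o /dev/null -s -w '%{time_total}' "$URL")                        # total time in seconds

echo "$ts gateway_rtt_ms=$gw_rtt dns_query_ms=$dns_ms http_total_s=$http_s" >> /var/log/client-experience.log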
NFD12
There were several vendors presenting at Networking Field Day 12 (http://techfieldday.com/event/nfd12/) where I had the privilege of being a delegate for the three days of sessions. A few of them were looking to fill exactly this kind of blind spot in the network. Each one though, in my humble opinion, had a different target “demographic” for the product they were evangelizing. There was one presentation though that wasn’t aimed (in my opinion) at ripping out your existing tools...
NetBeez
NetBeez (https://netbeez.net) overviewed and demonstrated their Beez and BeezKeeper product. The concept is fairly straightforward and not overly complex. Distributed agents with a central management console - stop me if you’ve heard that before. However, what struck me as interesting about NetBeez was:
The reason they started the company spoke exactly to the blind spots we experienced in our operations! (see https://vimeo.com/178818193 from start until 02:30 mark)
The little hardware agents you can buy are built on Raspberry Pi (or similar) devices running Ubuntu. So there is potential for monitoring anything that you can script up under a Linux environment.
The command/control functions were available in an onsite appliance (OVA) that will receive and store all the metrics. This helps calm the privacy and data security concerns that can accompany a cloud service.
Of keen interest, you were not required to install software on client devices or even touch them in any way to gather this "end user monitoring" data.
At this point, you should pause because the last reason contradicts what I had said I was originally looking for (Nagios running on every client). And, that's true. Except...
At an academic institution, you don't own a vast majority of the end user devices and therefore can't forcibly install this kind of software on them. Even the machines the institution buys for faculty are typically verboten.
That's where reason #1 comes into play. Using IT owned clients, you can still perform the monitoring you need (namely, what the network looks like from the user's perspective) without dealing with client agents on student or faculty machines.
You still have clients/agents - the Beez - but they are centrally managed and can be placed in a network closet or anywhere near a network jack. They connect to the network just like other client devices, whether by wire or wireless, in order to probe the network.
A Look at the BeezKeeper Portal Data
Below are some screenshots of the type of data (wired) that the Beez will monitor and collect. The supported tests right now are:
ping - for reachability
DNS - how quickly the name resolves
traceroute - test path to specified target
HTTP - basic HTTP GET requests
iperf - requires an iperf target
speedtest - leverages the CLI based Ookla speedtest service
VoIP - latency, jitter, delay, MOS
First up, some historical ping data from my Beez to the Tech Field Day website, during a period where some substantial changes in ping RTT occurred:
As you can make out in the plots, this is pretty late at night and I certainly wasn’t doing anything on the network. So, let’s take a look at the traceroute data, focusing around the 01:50 time frame. First, the “before” scenario:
Now, the “after” scenario:
You can see not only did I add 3 hops to my route but that I went from a single hop with 35ms RTT to multiple hops with 35ms RTT or greater. And, at 03:44, you can see another significant routing change occur:
What’s clear from these plots is that I was having some performance degradation and that some routing changes were taking place in the path to the specified target (namely, techfieldday.com). What this data doesn’t show, though, is the return path traceroute - not unexpected, given the architecture - yet the return path is a crucial component of the monitoring picture.
Why, you ask? Because the traceroutes - excluded here for the sake of length - look fairly normal and quite nice (RTT around 25ms or less, 11 hops or less), even during the 03:50 to 04:20 period. So I do not think it is unreasonable to deduce that the return path was playing a significant role here.
Fortunately, better path analysis is a feature on their near term road map!
Just for kicks, here are the historical data from the speedtest monitoring the Beez was conducting:
Sweet Nectar or Angry Swarm?
As much as I love the Raspberry Pi angle and the Linux/IoT angle to the product, the test suite is what could be termed the core, basic tests for a remote network agent. As such, I think there are a couple of checks that are really lacking - especially for small to medium businesses (which I think is a natural play for the product):
Directory checks! Whether Active Directory or OpenLDAP, I think it would be crucial to know if clients are having issues connecting or authenticating against the directory service. You definitely have to be careful about how this gets implemented though!
Web Service checks! Currently, the HTTP checks seem to only perform a “GET /” request. While that does test reachability and exercise the web server a little bit, it doesn’t necessarily exercise the login or Ruby/Tomcat/name_your_favorite_stack aspects of the site. Support for Salesforce or calling REST APIs on a site to exercise the app engine would go a long way.
Mail checks! Exchange/OWA/POP/IMAP does not matter - at the very least, test the SMTP protocol enough to know that the service is up. Ideally, with the onsite appliance, be able to trigger a special monitoring email to be sent through the email system with ultimate delivery to the appliance for metrics.
For Beez deployed at remote sites, I’d even add the ability to test VPN connectivity - whether that’s the ability to connect via remote access or site-to-site mechanisms, or simply a mechanism to recognize or classify metrics that transit a tunnel.
Honestly though, in regards to e-mail checks, as most academic institutions are running to the free email services of Google and Microsoft, this might not be as critical as it once was. Outside of academia, though, email drives communication which drives revenue so your mileage may vary.
Those suggestions I provide are on top of the roadmap they announced at NFD12 (https://vimeo.com/178818193, 17:27 mark). I would say the container support is probably the lowest hanging fruit and biggest win, especially given the continuing trend of vendors moving toward containers running directly on switch hardware.
Additionally, as suggested by another delegate, container based agents could very easily be integrated into a CI-based unit testing workflow to allow more comprehensive testing of application changes.
Too Much to Talk About
The Beez can also monitor wireless service but I don’t have the wireless model so didn’t get a chance to play with that at all. Wired is nice but we all know wireless is where almost all users reside. Blindly wishing here - I’d love to see a wireless Beez collect signal strengths for neighboring APs and any other client devices in the area.
On the portal, we delegates had a great deal of internal discussion about it (layout, visibility, etc) and opinions varied greatly. I can’t stand developing web portals myself (I miss the blink tag) so I’m not about to critique anyone else’s web design.
My only real difficulty with the web site is remembering/finding the location to add/remove agents to/from monitoring targets. Everything else to me seemed to be found in the right locations for the use cases to which they were designing.
“Super short” wrap up
As a Linux engineer, I am very intrigued by what they are doing and definitely want to see them succeed where they have a market play. At this time, I think the Beez product would be a good complement to an existing set of more sophisticated (Linux-based, scripted) service monitoring checks.
Which is a good place to be for them - after all, they even admitted in the overview talk that the product was born of a desire for more visibility from the end user perspective beyond what their existing tools were providing. To that end, I think they’ve definitely laid the foundation of that goal with the features they’ve shown at NFD12.
As the ability to extend the checks that run on the Beez is supported, more involved monitoring actions can be scripted and run in a coordinated, distributed fashion. Maybe the product develops into a comprehensive solution?
They’ve got some work ahead of them but, by using Linux at the core, they’ll have a flexible, customizable foundation that will give them a leg up on products whose features are less extensible.
Supplemental Disclaimer
As a refresher, be sure to review my full blog disclaimer.
Panos and Stefano over at NetBeez provided the NFD12 delegates with a free, hosted service account as well as one of the Raspberry Pi based Beez (for 100Mbps monitoring).
This account and the Beez were provided as a giveaway for attending Networking Field Day 12 and were not provided with any requirement to write about the product or service.
I simply love the Raspberry Pi concept and support novel uses of the technology. I think the product has a good start with the Beez concept and wish them well.
And, as stated before, I’m a Linux engineer (among other things) and have used Linux since the early 90s. I might have a (more than) slight bias for Linux based products. Just saying.
Text
Disclaimer for My Blog
At some point, I’ll find a theme or another blog site/format to allow this to be a top-level link for constant reference. Until then, here is some important information you need to know about my content!
TL;DR
This is my blog that contains my opinions, my observations, and the works of my hands (Tim Miller) - not of my employer (past, present, or future) or of any vendor. I do not get compensated for the content in my blog in order to provide any endorsements, or write any specific content, or recommend one product or technology over another.
Specifics on Content
I write content solely about the technology and trends that interest me. In the case of lab scenarios or troubleshooting, I write that content in the hopes it helps my fellow engineers just as I have been helped by other technology blogs.
Period.
That being said, you should know how I come about that information.
As an IT professional, I have access to a wide variety of information from vendors with which I interact. Most of this information is publicly available - e.g. from official websites, publicly available webinars, and conference events. Other information may come from confidential discussions that are covered by non-disclosure agreements (NDAs) or embargoes - a request by the provider to not talk about the material until a certain, specific date.
Early or privileged access to information (either by NDA or Embargo) is in no way a commitment by me to write about that information nor, when I do write about the subject, is it a commitment to write favorably about that information. I do not accept access to such information from any vendor that is conditioned upon such writing and/or favorable mentions.
Almost exclusively, access to NDA information is for the purpose of intelligently planning and recommending solutions for my employer OR for my personal understanding of technology trends.
For such privileged information, all content will remain confidential until the terms covering that content have expired.
Specifics on Expenses
From time to time, either through the course of normal employment or through conferences or through special invitation events, vendors and/or third parties will cover my expenses at events where I may hear about their products and technologies.
More often than not, these expenses are simply the meals provided during a presentation (”lunch and learns”, for example). On occasion, my travel expenses to these events are also covered, which may include mileage, airfare and/or hotel.
At some of these events, such as conferences or technology events, there are free items provided as a result of participation in the event. Many times, this is the usual “conference swag” ranging from free t-shirts to backpacks to battery packs.
Additionally, either through the course of normal employment or my special request, a vendor may provide loaner demonstration equipment or free software trials or, in rare cases, a free partially/fully functional product sample. For loaner or demonstration environments or product samples (samples costing less than $50), I may or may not indicate the circumstance for obtaining access to the technology and I do not commit to indicating so for any or every post.
For any content that is the result of a free product sample (costing more than $50), I will indicate on each post the source of the equipment.
HOWEVER, regardless of expenses or giveaways or equipment/license provided to me, at no time are these requested by me or provided by the vendor with the understanding by the vendor or a commitment by me that I write material about those products or technologies, whether favorably to the vendor or unfavorably to a competitor.
Again - as I said before - I write content solely about the technology and trends that interest me.
In the case of lab scenarios or troubleshooting, I write that content in the hopes it helps my fellow engineers just as I have been helped by other technology blogs.
Period.
Text
Final roundup of “Know Before You Go” Cisco Live 2016
I get it. I had the same week. It’s the last few days before you leave the office for the week and the project had to get done. And, there was a service outage. And, there were the ongoing operational and project handoffs.
Next thing you know - it’s Saturday and Cisco Live is here. You’ve missed out on checking twitter, or listening to podcasts, or reading your favorite blogs.
Admit it - you just now got to take a breath and have no idea what to pack or what to expect. Here’s a summary of my own and fellow #CiscoChampion tweets, some blogs, and some #CLUS (that’s Cisco Love) pages.
Trip Packing Tips
Bring a light jacket or heavy shirt for sessions. May be 113F/45C outside but inside it's below 70F/21C!
You know that dry heat? Two words: chapped lips. Don't forget your favorite lip balm.
Can’t leave space for bringing swag home? Pack empty soft bag to check on return flight.
Gunning for lots of t-shirts? USPS Priority Mail has free boxes and is cheaper than checking a bag.
General Conference Tips
Get the CL app - Shuttle schedule, conf maps and your schedule in app! http://www.ciscolive.com/us/attend/attendee-info/#mobile-app
(Courtesy @CiscoLive) #CLUS registration opens TOMORROW [Saturday] from 3-7pm. Pick up your badge a day early to beat the lines. Reg. will be open 7am-7pm on Sun. - Some years registration misses the mark or has technical issues, so you want to try to get registered at an off time (not 7am!)
Take a session to expand your bag of tricks - BRKEWN-3000 Analyzing and Fixing Wifi Issues
Be engaged in your session, ask questions, and answer them. As a bonus, some speakers give free books! Or, UADP chips (courtesy @petergjones)!
Download the session PDFs for your sessions now via #CLUS Session Catalog. Avoid 20k #Wifi d/l at start of session #Oof
A sneak peek at the conference backpack (courtesy @cantechit)
Last Minute Reminders
(Courtesy @cantechit) Video tweets!
cantechit Final work day prep before @CiscoLive I don't even have time to blog so here's my vlog on last minute prep #clus https://t.co/LtlHZ2XFpK 7/8/16, 13:12
cantechit Ok shopping done prepping for @CiscoLive #clus - off to dinner.. #20000 https://t.co/f72UonDs3w 7/8/16, 18:01
More tips from other #CiscoChampions - S3|Episode18: Cisco Live Tips & Tricks bit.ly/29hzGKf
Finally, the official “Know Before You Go” conference website: http://www.ciscolive.com/us/attend/attendee-info/know-before-you-go/
See you all there!
Text
Strategies for Last Minute Cisco Live 2016 Scheduling
You made the pitch. You got management buy-in. You registered. Rock and roll.
Except, work set in. You got busy. The session catalog opened months ago. And, now, with less than 3 weeks to go, you are looking at hundreds of possible sessions to attend. Possibly, many of the interesting ones are full.
Eyes glaze over. Panic sets in. You freeze. What now?
One of the fantastic features about Cisco Live is also its most daunting challenge - with so many sessions to learn so much about a plethora of technologies and products, you can wind up facing tough decisions between 5 different courses for the same time slot.
Even with several of them full/wait-listed :)
Another fantastic aspect to Cisco Live - there are multiple avenues to learn, from breakout sessions, to technical seminars (extra cost), to instructor led labs (also extra), to self-paced labs. So, let’s break down how to get the best bang for your buck.
First, book those Breakout Sessions
Start simple - pick a couple of areas from your current employment where you really wish you knew more. Let’s say as an example - “Voice”.
Head on over to the scheduler page (http://www.ciscolive.com/us/learn/sessions/session-catalog)
Expand “Technology” filter
Select “Voice and Unified Communication”
Expand “Session Type” filter
Select “Breakout Session”.
Voila! 36 sessions. You know your environment. You know the upcoming projects. There is certainly a session or two in there for you. Got PRIs and you bury your head in the sand when someone says “SIP” -
BRKUCC-2006 - SIP Trunk Design...
BRKUCC-2932 - Troubleshooting SIP...
BRKCOC-2008 - Inside Cisco IT: ... Consolidating Voice Circuits using SIP
Information you need to know what you’d be getting into when SIP comes bursting through the door like a battering ram.
Second, leverage technical seminars
I love the technical seminar sessions at Cisco Live for three reasons, even though they do cost extra money:
They are focused on a single subject which, depending on the 4-hour or 8-hour variety, can be extremely comprehensive or extremely detailed (sometimes both!)
They are chaired by multiple Cisco engineers which allows for a broader spectrum of questions and answers.
Many are on Sunday and so you get another day of Cisco Live!
Technical seminars are also convenient tools for condensing subject matter into an “out of band” time slot (Sunday), permitting you to take several breakout sessions worth of material and place them into a single (longer) session. This shuffle frees up available precious breakout session times (about 14 in all) for other subjects!
For example, if you are looking at learning more about Cisco’s software defined networking approaches, you can look at either:
TECACI-2009: ACI - The Policy Driven Data Center
TECSDN-3600: APIC EM - SDN in the Enterprise
If you’ve secured additional training funds to accomplish this, you’ll need to revisit your breakout schedule from step one. It’s an iterative process of sorts.
Third, meet with the engineers!
If you have an upcoming project for which you are still researching solutions, why not brainstorm with some product team members about what works well, what doesn’t work well, what they’ve seen others do?
Cisco Live has two different avenues for this level of access:
Meet the Expert (http://www.ciscolive.com/us/activities/the-hub/#meet-the-expert)
Technical Solutions Clinic (http://www.ciscolive.com/us/activities/the-hub/#technical-solutions-clinic)
The MTE is scheduled ahead of time and requires some information to be provided at the time of setup. In the past, you needed a paragraph or two describing the problem you wished to discuss but, this year, it seems to be a bit more scaled back.
The TSC is much less formal - you drop in, wait your turn, and discuss with TAC engineers or other experts what’s on your mind.
Now, the bulk of my experience has been with higher education account teams and they were fantastic! So much so that I’ve rarely used the MTE or TSC meetings much. But, the times I have used it have been phenomenal. When you come across an issue no one is clear about, you get answers pretty fast - usually during the conference.
So, if you are looking for a Cisco perspective to keep the Cisco Partners honest, these options are great opportunities.
Lastly, be flexible!
Leave some time in your schedule for three things:
The “World of Solutions” exhibit hall (http://www.ciscolive.com/us/activities/exhibitors/)
Walk-in, self-paced labs (http://www.ciscolive.com/us/activities/the-hub/#walk-in-labs)
DevNet Zone (http://www.ciscolive.com/us/activities/devnet-zone/)
You never know what interesting topic you will come across during the week. You want to leave yourself a little time to find out more about something new - whether a product or technology. The exhibit hall is one place where you can learn more about Cisco products as well as the partner ecosystem.
Just to be clear - I’m not talking about the opening night where many folks are on a mad dash to get conference giveaways. I mean during the day (around lunch time is usually the best) when the folks at the booth are available to have a real conversation. Many really do enjoy the honest conversation and are excited to talk about their products.
I personally enjoy the WISP labs though as they allow you a little “no pressure” fun time that permits playing with solutions you may not normally see or use. I tried DMVPN a few years back. The engineer overseeing the lab sent me the PDF of the lab instructions, which included network drawing and solutions.
Finally, my DevNet Zone recommendation may give you pause - especially if you are an IOS CLI junkie! Why would you care about software development? Well, remember this section’s title: be flexible!
First, you are expanding your horizons.
Second, you are going to see what crazy things people are doing with your network.
Third, you can also see what possibilities exist for you to do/see/enable more with your network.
Keep in mind, you don’t have to be the programmer - you can get excited about possibilities and pitch ideas to your in-house developers so that they can help make them a reality!
In short, don’t panic!
Yet another awesome aspect to Cisco Live - there are many opportunities for ad-hoc learning experiences. Just keep your eyes open for them once you get there.
And keep an eye out for me while you are there - I’ll be following my own advice above during the conference as well, namely TECSDN-3600 on Sunday, DevNet and WISP throughout the week.
Text
Cisco Live 2016 (#CLUS)
For the past few years, I’ve written a blog post about Cisco Live - whether it’s a motivational post to attend or a cool wrap-up of how awesome an event it was.
This year, I’d like to discuss the benefits, specifically to new network engineers, of attending the Cisco Live 2016 conference. Seasoned engineers can look at the online content (http://ciscolive365.com) and understand the value of the presentations. Cisco Live regulars (“NetVets”) have tangibly experienced the value that attendance brings personally (for career development) and professionally (real skills for your company). If you are somewhere in between, Cisco has posted a form (http://bit.ly/1YqIfGt) on the Cisco Live website to help you make the case as well.
New engineers, though, might not be seasoned enough to build a coherent case for professional development in general or specifically for Cisco Live. So, I hope the following builds a strong desire for learning and excitement in attending this amazing technical conference.
As a new engineer, you probably fall into one of two camps -
a specialist focused on supporting a single service (VoIP or Video - or, Collaboration in Cisco’s focus)
an operations person with a focus on making sure all services are functioning normally
Cisco Live 2016 has something for either camp to help you reach the next level.
Deep as you want to go
As a new specialist, you have some experience with the particulars of your area. Requests come in, tickets get resolved. There are periodic service interruptions that the senior engineer handles and even briefs you on. Now you want to take on resolving bigger issues. Let’s take the wireless specialty, as an example.
Understanding Design:
BRKEWN-2010 : Design and Deployment of Enterprise WLANs
BRKEWN-2017 : Understanding RF Fundamentals and the Radio Design for 11ac Wireless Networks
BRKEWN-3014 : Best practices to deploy high-availability in Wireless LAN Architectures
Troubleshooting:
BRKEWN-3000 : Analyzing and fixing WiFi issues - Cisco WLC tools and packet capture analysis techniques
BRKEWN-3011 : Advanced Troubleshooting of Wireless LANs
Security:
BRKEWN-2015 : Wireless LAN Security and Threat Mitigation
A pretty comprehensive, multi-topic crash course in owning/operating an Enterprise WLAN environment, don’t you think? How to find those sessions is revealed a bit later on.
Broad as you want to be
For the other end of the spectrum, where you need a solid understanding across multiple technologies, Cisco Live literally has more opportunities than can be attended to broaden your experience. Of the available breakout sessions, there are:
75 Introductory Sessions
9 General Sessions
283 Intermediate Sessions
73 Advanced Sessions
There are 14 different learning paths. Within one of them, the Enterprise Networking path, there are 15 different sub-categories such as:
Campus LAN Design and Deployment
Identity and Control
L3 VPN
MPLS
Multicast
QoS
Routing Protocols
WAN Design
You see, the most frustrating aspect of Cisco Live (to me) is that there are only 14 or so breakout session time slots during the week of the conference. So, you have to adjust your approach to finding sessions that fit the areas of technology in which you are likely to be involved.
Fortunately, there’s a tool for that!
The Cisco Live 2016 session catalog is online now (breakout sessions here - http://bit.ly/1VPc44q) and includes a filter category called “Technical Level”. I used it to get the number of sessions listed above. For example, you can identify Intermediate level Breakout Sessions in the Enterprise Networks track that cover Collaboration, Security, Switching, and Wireless and find 38 sessions (http://bit.ly/1UXitvw), such as:
Campus QoS Design-Simplified
Cisco Catalyst 3850 and 3650 Series Switching Architecture
Design and Deployment of Enterprise WLANs
Enterprise MPLS - Customer Case Studies
Multicast Troubleshooting
This year, the tool is much simpler and more streamlined. The team has made some good improvements to it to make filtering the sessions easier and hopefully you can find the sessions that interest you.
“You never walk alone”
If your company is large enough, a new engineer and a senior colleague will likely attend Cisco Live together. This arrangement is the best possible scenario because the opportunities for mentoring are immense - as new material is encountered, the senior can contextualize the material with the junior for the benefit of the company and the services they support. Even better, the senior can provide a practical perspective on the balance of positives versus negatives for solutions.
What about small companies, though, where more than one engineer attending Cisco Live is too much a burden on operations to bear?
The “Cisco Live Mentor Program” exists to pair a first time attendee with a “NetVet” (like myself). The NetVet provides guidance and answers questions before and during the conference. At the start of the conference, there is a first time attendee meet up to put a face to a name. A great way to network (with people) and have a better experience of the conference.
New first-time attendees can fill out a New Attendee form (http://goo.gl/forms/DNc6e8hRIx) to register their interest in participating.
Be sure to take advantage of it!
Advice for the Senior Engineers
While it may seem like I have forgotten you today, the senior engineer has an important role in all of this - support your colleague’s request to go to Cisco Live! Help them make sense of the session abstracts and provide guidance on what seems appropriate for their enrichment.
Look at previous content as a level setting tool. Many of the entry level courses are offered for a couple of years and have similar material with updates. A little bit of guidance in selection will vastly improve their experience, as nothing has greater potential for demotivation than being overwhelmed by material beyond your years.
If you have one of those nagging little issues in your network, one that doesn’t have priority to investigate but really should be resolved, I have a suggestion - perhaps your new engineer can visit the TAC engineers onsite and pick their brains for ideas. I found the TAC engineers to be extremely knowledgeable, eager to share what they know, and happy to whiteboard possible solutions for you. What better experience for your colleague?
Official information to help your cause
The Cisco Live website has some great information to help you, beyond what I’ve mentioned here, make your case to attend using Cisco’s “Attendee Proposal Letter” found on their “New To Cisco Live” page. It’s a quick stop, short list of good tidbits and tricks for the conference.
But best of all…
Wait, it saves money?!
Amazingly, the cost of the conference (as low as $1795 if you register early) is less than half of most traditional courses such as a CCNA boot camp or CCNP boot camp. The various locations of each conference will cause fluctuations in the travel, hotel, and meal expenses. However, for the breadth and depth of material and the sheer number of hours you can spend with dedicated experts, you simply cannot beat the value of attending the conference.
Convinced?
I hope you have found a new level of excitement for attending the conference. Ever since my first Cisco Live (2009), my number one training priority has always been Cisco Live each year. It’s an exciting time to hear great keynote speakers discuss trends and roadmaps. I simply love learning from the very technical engineers who are helping drive the latest innovations.
There are a couple of different packages that you can use to attend Cisco Live. You can find them at the registration page. But hurry, the longer you wait, the more expensive the conference becomes. Worse yet, the further your hotel is from the convention center!
Come Meet Some NetVets!
There will be a pre-conference “Tweet up” at Cisco Live 2016. We don’t bite. We are (mostly) harmless. Some wear kilts. All have a great time catching up. So, come on by and introduce yourself! When the details are finalized, I’ll post it here and on Twitter (@broadcaststorm).
See you there!
Text
Static and BGP routes - Some Cisco IOS Routing Table Fundamentals
In our last episode on fundamentals, I revealed that a site-wide power outage brought down our primary network closet and, despite the secondary network closet being fully operational, we were no longer able to reach the Internet. This post will dig into how that happened. Sadly, it was not a misconfigured server that was to blame (see t-shirt http://amzn.to/1VCzidN).
The Network
Enough time has passed such that the detailed notes of the incident have sadly gone missing. Tragic because a lot of good log entries were captured and would have been extremely useful for all you folks as well.
When all you have are lemons, make lemonade - I took the opportunity to get my simulation environment cleaned up, build out a BGP template network, and clone it with the details below. The design replicates the topology in place during the outage enough to demonstrate the fundamentals here today.
Most of the high level design goals were “the usual” - when R1 goes down, R2 services entire traffic load of the enterprise and vice-versa. Proper traffic re-routing when individual links go down as well. ISP1 simply advertises a default route to us and to ISP2 (which is then advertised to us as a backup link).
One important goal that comes into play today was that all incoming traffic to the enterprise flow through R1 even if the Gi0/0 interface to ISP1 went down. Therefore, Internet traffic coming from ISP2 into R2 (when ISP1 fails) should be routed to R1 before sending into the enterprise - assuming R1 is up, of course.
The relevant configurations of the routers are here:
! R1
router bgp 65001
 bgp router-id 172.30.0.1
 bgp log-neighbor-changes
 network 172.30.0.0
 neighbor 192.168.11.1 remote-as 65004
 neighbor 192.168.21.2 remote-as 65001
 neighbor 192.168.21.2 next-hop-self
!
ip route 172.30.0.0 255.255.0.0 Null0
!
! R2
router bgp 65001
 bgp router-id 172.30.0.2
 bgp log-neighbor-changes
 network 172.30.0.0
 neighbor 192.168.21.1 remote-as 65001
 neighbor 192.168.21.1 next-hop-self
 neighbor 192.168.22.1 remote-as 65005
!
ip route 172.30.0.0 255.255.0.0 192.168.21.1
!
! R4
router bgp 65004
 bgp router-id 172.10.1.1
 bgp log-neighbor-changes
 network 0.0.0.0
 network 172.10.0.0
 neighbor 192.168.11.2 remote-as 65001
 neighbor 192.168.11.2 prefix-list BGP_PERMIT_DEFAULT_ONLY out
 neighbor 192.168.12.2 remote-as 65005
 neighbor 192.168.12.2 prefix-list BGP_PERMIT_DEFAULT_ONLY out
!
ip prefix-list BGP_PERMIT_DEFAULT_ONLY seq 10 permit 0.0.0.0/0
ip route 0.0.0.0 0.0.0.0 Null0
ip route 172.10.0.0 255.255.0.0 Null0
!
! R5
router bgp 65005
 bgp router-id 172.20.1.1
 bgp log-neighbor-changes
 network 172.20.0.0
 neighbor 192.168.12.1 remote-as 65004
 neighbor 192.168.22.2 remote-as 65001
!
ip route 172.20.0.0 255.255.0.0 Null0
!
Lesson #1 - Static routes need a reachable next-hop
After the power was restored and chaos subsided, some investigation of the syslog entries on the secondary gateway revealed the static route for our network address space was deleted from the routing tables.
From our simulation network, you can see this in the console output with “debug ip routing” and “debug ip bgp updates” enabled on R2:
*Apr 5 03:14:20.115: RT: del 192.168.21.0 via 0.0.0.0, connected metric [0/0]
*Apr 5 03:14:20.115: RT: delete subnet route to 192.168.21.0/30
*Apr 5 03:14:20.116: RT: del 192.168.21.2 via 0.0.0.0, connected metric [0/0]
*Apr 5 03:14:20.116: RT: delete subnet route to 192.168.21.2/32
*Apr 5 03:14:20.120: RT: del 172.30.0.0 via 192.168.21.1, static metric [1/0]
*Apr 5 03:14:20.120: RT: delete subnet route to 172.30.0.0/16
*Apr 5 03:14:20.126: BGP(0): route 172.30.0.0/16 downd
Because the interface to R1 went down, the connected subnet and route to the next-hop was removed from the routing table. As such, without a next-hop, the static route was removed as well.
Despite a few years under my belt in data center environments, this was a big surprise to me back then - in my DC, the Catalyst 6500s ran EIGRP to the core and used static routes to downstream firewalls, so the next-hop never went away because the connected subnet was a VLAN interface.
Lesson #2 - BGP requires routes it originates to be in the routing table
The seasoned BGP folks out there have probably seen this moment coming since the R2 configs were shown. Without that 172.30.0.0/16 static route loaded on R2, BGP had nothing to advertise to the ISP2 router. Links are up. BGP peer state is Established. No Google to search for answers.
This was probably the most fascinating aspect to BGP that I learned - it’s not the Internet that disappears, it’s your enterprise that withdrew from the Internet. It is your responsibility to announce to the Internet that you are prefix A with ASN B and can be reached at next-hop C.
From R2, you can communicate to your ISP2 BGP neighbor without issue, giving a false sense of security in this situation.
Lesson #3 - Simulation environments are great but real hardware is better.
Okay, this lesson was learned while prepping the blog and not during the post-event root cause analysis. Even an event from years ago can provide learning opportunities so (word of advice time) always be looking for opportunities to learn, to lab up a scenario and tinker with it (in a LAB!!!).
But, at the same time, as in normal troubleshooting, ensure what you are seeing makes sense! Take for example the following two issues I ran across:
When a peer router interface goes down, the local interface that connects to it does not. Specifically, when you shutdown Gi0/1 on R1, Gi0/1 on R2 stays up/up. Now, eventually BGP times out and the BGP peer is torn down. As a result, you need to carefully consider whether you need to manually bring down local interfaces to simulate failures.
Even with shutdown on both R1 and R2 Gi0/1 interfaces, the iBGP peer status did not immediately come down despite the neighbor IP address belonging to the connected subnet. I would really love some real hardware to determine if this is the true behavior. I don’t recall this happening on the Catalyst 6500s (multi-layer switches)... but it’s been long enough I would prefer to verify which case is true.
Bringing it back to the simulation now ... In the “normal operating environment”, R1 advertises 172.30.0.0/16 via BGP:
Using eBGP to send it to R4
Using iBGP to send it to R2
As a result, for the prefix 172.30.0.0/16, R2 has a static route with Administrative Distance of 1 and an iBGP route with AD of 200. Under normal running, the AD=1 static route is loaded into the routing table.
Once the static route is withdrawn though, the AD=200 iBGP route is now eligible to be loaded into the routing table:
*Apr 5 03:14:20.127: BGP(0): Revise route installing 1 of 1 routes for 172.30.0.0/16 -> 192.168.21.1(global) to main IP table
*Apr 5 03:14:20.127: RT: updating bgp 172.30.0.0/16 (0x0) : via 192.168.21.1 0 1048577
*Apr 5 03:14:20.127: RT: add 172.30.0.0/16 via 192.168.21.1, bgp metric [200/0]
Except there are two problems with that -
With the iBGP peering not being torn down, the prefix learned via iBGP stayed in the BGP table until the hold timer finally expired, at which point the route disappeared. As I said before, I would like some real hardware to verify whether that is correct behavior.
BGP is not supposed to use a path to a prefix if the next-hop is not reachable (remember N WLLA OMNI?). That next-hop was definitely not reachable - the router clearly showed the interface down and the connected route removed - yet the route was installed anyway. Something in the simulation is not quite right.
So, in short, while simulations help with prototyping a scenario, you still must use your brain to make sure what you are seeing makes sense.
Lesson #4 - Professional development is key to you and to your employer
A stable network is great for the business, but it doesn't necessarily mean you know everything you need to do your job. This was my first real introduction to network operations back in the day. While we all learned many things about operations and our perimeter routing configuration, I personally learned a very important lesson - I was smart enough to figure out how to fix a broken thing that had previously been working.
However, I was a good many lessons short of being able to drive new solutions. Obviously, given today's post, routing was at the top of the list. To shore up the basics quickly, I spent a couple of days reading the CCNP ROUTE 642-902 sections on core routing and BGP.
Even if your employer doesn't support your professional development financially, you absolutely have to keep expanding your knowledge base on your own. Whether it is fighting hard to get certified or simply reading the certification guides - do something! Have a plan - first for what will benefit your current employer and role, and second for your future responsibilities.
Don’t know what you want to do 3-5 years down the road? Don’t worry. As you broaden and deepen your current understanding, I’ve found the road map presents itself.
The Solution to Our Perimeter Issue
So, what is the solution? We looked at two options:
On R2, set the next-hop on the static route for 172.30.0.0/16 to be Null0.
On R2, leave the existing configuration and simply add another static route for 172.30.0.0/16 with AD=201 and a next-hop of Null0. When the AD=1 static route is withdrawn because the interface is down, the AD=201 floating static will be installed. (Both options are sketched below.)
Both options solve the primary issue of the 172.30.0.0/16 prefix always being in the routing table. However, we do have to consider additional changes in this simulation to meet the design requirements that (a) all traffic flows through R1 before entering the enterprise and (b) when R1 is down, R2 forwards traffic directly to R3.
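As a sketch (same illustrative addressing as before), the two options on R2 look something like this:
! Option 1 - point the static route at Null0 so the prefix is always
! present in the routing table for BGP to originate
ip route 172.30.0.0 255.255.0.0 Null0
!
! Option 2 - keep the existing static route and add a floating static
! to Null0 with AD 201; it only installs once the AD=1 route is gone
ip route 172.30.0.0 255.255.0.0 192.168.21.1
ip route 172.30.0.0 255.255.0.0 Null0 201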
Fortunately, in our real production environment, we already had additional components in the design (not shown here) that handled the traffic flow so that option 1 was all that was needed.
Teaser
I just licensed VIRL (http://virl.cisco.com/), so I am going to get that set up and see whether the official Cisco product exhibits any of the deficiencies I ran into here. I will let you know.
0 notes
Text
Why fundamentals matter
With the wrap-up of March Madness and the inevitable withdrawal symptoms beginning, let's focus on fundamentals of a different sort. This post sets up the scenario and covers some generic IT fundamentals of monitoring. The next post will look at some network routing fundamentals.
Back story
A few months after making the official move into a senior network engineering role [see the beginnings of this blog for the back story on that], we experienced a site-wide power outage. Both primary communications closets were UPS- and generator-backed, as was the data center. Monitoring of the switches, ISP uplinks, and hosts in the DMZ (with the public-facing website) all showed green and everything was thought to be safe ...
Until closet A lost power 15 minutes into the blackout. What?!?!
“Okay, not good. But, fortunately, we have the secondary closet.” All the switches and remaining links still showed green. Someone find a car (we were 5 minutes offsite) and go check out closet A. Joe heads out the door.
Then, 5 minutes later, the systems team walks in saying that the DMZ and internal DC networks can no longer connect to the Internet. Cell service works, and we can't connect to our website. What?!?!
Monitoring hasn't changed since closet A went offline - the remaining devices and links are still up. Lots of discussion ensues.
10 minutes later (25 minutes into blackout), closet B loses power. Okay, screw why we can’t get to the website. We have no Internet connections now so it’s all moot.
10 minutes later (35 minutes into blackout), closet A power is restored. Switches online. Internet online. Website online. Joe reports back - the generator was running, but a surge from the power failure had tripped a breaker and the UPS was not getting power.
10 minutes later (45 minutes into blackout), closet B power is restored. Joe took the facilities guy to closet B and found the same issue with its generator and UPS.
Many hours later, site power restored, breathing resumes.
Lesson #1 - Monitoring the easy things is easy. Monitoring the right thing is important.
The generators were too old to have any means of being monitored remotely. Upon a site-wide power failure, facility personnel have standing procedures to drive the site and verify the generators are operational. And they did - by driving up to the building, rolling down the window, hearing the generator running, and moving on to the next one.
The tripped breaker that Joe reported was a new surprise - it had never happened before. The entire site losing power is a BIG DEAL with a lot of safety issues that come with it. People are rushed because (rightly so) personnel safety comes first.
Okay, live and learn - continual process improvement with facilities who, by the way, were extremely apologetic and eager to modify their process. They now check that the generator is actually delivering output power in addition to making sure it's running.
Lesson #2 - Monitoring is of little use if it is not reliable. Monitoring that is automated tends to be more reliable.
Having a human, “manual” process was a crucial component of verifying the generators were operational in a power-loss scenario, especially since there was no electronic means of remote communication.
But, what about those UPS units? They were only 2 years old ... and both were equipped with SNMP-based monitoring cards.
Unfortunately, neither management card was set up or actively monitored. “Loss of input power” would have been an important alert to receive during the outage while the UPS still had battery charge.
Automated monitoring does not stop to weigh the priority of checking a service versus taking care of public safety. It has one dedicated function that runs continuously and reliably.
Lesson #3 - Condition green does not necessarily mean all is well.
When primary closet A failed, all the critical links in secondary closet B showed green on the board (except for connections to closet A). We had connectivity to our secondary peer. We had connectivity internally. Yet, “the Internet was down” or, more fatally, our website was offline.
Oddly enough, it wasn't an automated process that revealed the website was down. Systems monitoring had shown it all green as well. Someone from outside IT noticed it was down (IIRC) and contacted someone they knew in IT.
Services that go unmonitored frequently lead to the assumption that they are working whenever something related to them is working. If the link to the secondary ISP is up, routing to the Internet must be working... except when it is not.
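One way to close that gap is to measure the service itself rather than inferring it from link state. As a rough sketch of what that could look like on an IOS edge router (the probe target and source interface are placeholders, not anything from our environment), IP SLA can continuously test reachability to something beyond the ISP edge:
! Probe a destination beyond the ISP so "can we reach the Internet"
! is measured directly rather than assumed from an up/up interface
ip sla 10
 icmp-echo 198.51.100.1 source-interface GigabitEthernet0/2
 frequency 30
ip sla schedule 10 life forever start-time now
!
! Track the probe so reachability shows up as its own monitored object,
! separate from interface status
track 10 ip sla 10 reachability
The tracked object can then be polled by the NMS or tied to a syslog alert, so "the Internet is down" becomes an alarm in its own right rather than something discovered by a person outside IT.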
Lesson #4 - Mistakes happen. Learn from them. “Always leave your site better than you found it.”
My last statement in lesson #3 says it all. Most of our monitoring was actually fairly comprehensive. But, as we all run the risk of doing, you can spend 90% of the time on the last 10% of the project over-engineering a service that then increases its cost by 90%.
Okay, I made those numbers up. The point though is...
Monitoring comprehensively is time-consuming and, without the right tool, can lead to a large number of alarms being triggered by a single event. Monitoring that is too noisy can be ignored (unmanage it all!) or, more fatally, the real problem can be lost in a sea of alarms from dependent services. Anyone who has dealt with IDS/IPS can preach a sermon or two about that.
So, here’s a strategy to tackle the right balance in monitoring:
Monitor as many aspects of a service as you reasonably can when deploying it.
Have a peer review of your service and monitoring with your technical lead or a colleague.
Discover what else should be monitored and add it to the project.
Identify what metrics reasonably should not be monitored.
Document those unmonitored metrics and the reasons!
Usually, when you have to put justification in writing, your intuition quickly informs you whether you should really put in the effort to monitor something or whether it seems safe to defer it.
But, keep in mind, we are all under deadlines. The caffeine isn't always firing on all cylinders for you or your peer (blame the all-hands meeting). Something will be missed or forgotten, or your reasoning for not monitoring something will turn out to be wrong.
You will most likely discover the truth during an outage. When you do, leverage your documentation as best you can, fix it, and notch another “continual service improvement” metric in your functioning ITIL operations.
Stay tuned for the next blog...
“How the Internet went down” or a crash course in the behavior of Cisco IOS routing tables
0 notes