#manual stacktrace
Stacktrace Podcast 159: “A really powerful new toy”
After a discussion about Swift generics and when to use them, Rambo shares his initial impressions of using the new iPad mini, and John explains why he’s going back to the Mac for his portable computing needs. Sponsored by Kandji: A modern, cloud-based platform to manage and secure your Mac, iPhone, iPad, and Apple TV devices. Kandji saves IT teams hours of manual work with features like…


A Dirty Way of Cleaning Data (ft. Pandas & SQL)
Code Snippet Corner ft. Pandas & SQL
Warning: the following is FANTASTICALLY not-secure. Do not put this in a script that's going to be running unsupervised. This is for interactive sessions where you're prototyping the data-cleaning methods you're going to use, and/or just manually entering stuff - especially if there's any chance there could be something malicious hiding in the data to be uploaded. We're going to be executing unsanitized, formatted strings of SQL. Also, this will lead to LOTS of silent failures, which are arguably The Worst Thing - if guaranteed correctness is a requirement, leave this for the tinkering table. Alternatively, if it's a project where "getting something in there is better than nothing", this can provide a lot of bang for your buck. Actually, it's purely for entertainment purposes and not for human consumption.

Let's say you were helping someone take a bunch of scattered Excel files and CSVs and input them all into a MySQL database. This is a very iterative, trial & error process, and we certainly don't want to be re-entering a bunch of boilerplate. Pandas to the rescue! We can painlessly load those files into a DataFrame, then just export them to the db!

Well, not so fast. First off, loading stuff into a DB is a task all its own - Pandas and your RDBMS have different kinds of tolerance for mistakes, and they differ in often-unpredictable ways. For example, one time I was performing a task similar to the one described here (taking scattered files and loading them into a DB). I was speeding along nicely, but then ran into a speedbump: it turns out Pandas generally doesn't infer that a column is a date unless you tell it specifically, and will generally parse dates as strings. Now, this was fine when the dates were present - MySQL is pretty smart about accepting different forms of dates & times. But one thing it doesn't like is accepting an empty string '' into a date or time column. Not a huge deal, just had to cast the column as a date:

    df['date'] = pd.to_datetime(df['date'])

Now the blank strings are NaT, which MySQL knows how to handle!

This was simple enough, but there are all kinds of little hiccups that can happen. And, unfortunately, writing a DataFrame to a DB table is an all-or-nothing affair - if there's one error, none of the rows will write. That can get pretty annoying if you were trying to write a decent-sized DataFrame, especially if the first error doesn't show up until one of the later rows.

Waiting sucks. And it's not just about being impatient - long waiting times can disrupt your flow. Rapid prototyping & highly-interactive development are some of Python's greatest strengths, and they are great strengths indeed! Paul Graham (one of the guys behind Y Combinator) once made a comparison between REPL-heavy development and the popularization of oil paints (he was talking about Lisp, but it's also quite true of Python, as Python took a lot of its cues from Lisp):

Before oil paint became popular, painters used a medium, called tempera, that cannot be blended or overpainted. The cost of mistakes was high, and this tended to make painters conservative. Then came oil paint, and with it a great change in style. Oil "allows for second thoughts". This proved a decisive advantage in dealing with difficult subjects like the human figure. The new medium did not just make painters' lives easier. It made possible a new and more ambitious kind of painting. Janson writes: Without oil, the Flemish Masters' conquest of visible reality would have been much more limited.
Thus, from a technical point of view, too, they deserve to be called the "fathers of modern painting", for oil has been the painter's basic medium ever since. As a material, tempera is no less beautiful than oil. But the flexibility of oil paint gives greater scope to the imagination - that was the deciding factor. Programming is now undergoing a similar change... Meanwhile, ideas borrowed from Lisp increasingly turn up in the mainstream: interactive programming environments, garbage collection, and run-time typing, to name a few. More powerful tools are taking the risk out of exploration. That's good news for programmers, because it means that we will be able to undertake more ambitious projects. The use of oil paint certainly had this effect. The period immediately following its adoption was a golden age for painting. There are signs already that something similar is happening in programming.

(Emphasis mine.) From here: http://www.cs.oswego.edu/~blue/xhx/books/ai/ns1/section02/main.html

A little scenario to demonstrate: let's pretend we have a MySQL instance running, and have already created a database named items.

    import pymysql
    from sqlalchemy import create_engine
    import sqlalchemy
    import pandas as pd

    cnx = create_engine('mysql+pymysql://analyst:badsecuritykills@localhost:3306/items')

    pd.io.sql.execute("""CREATE TABLE books(
        id VARCHAR(40) PRIMARY KEY NOT NULL
        ,author VARCHAR(255)
        ,copies INT)""", cnx)

    df = pd.DataFrame({
        "author": ["Alice", "Bob", "Charlie"],
        "copies": [2, "", 7],  # Notice that one of these has the wrong data type!
    }, index=[1, 2, 3])

    df.to_sql(name='books', con=cnx, if_exists='append', index_label='id')

Yeah, I'm not listing this whole stacktrace. Fantastic package with some extremely helpful Exceptions, but you've gotta scroll a whole bunch to find 'em. Here's the important part:

    InternalError: (pymysql.err.InternalError) (1366, "Incorrect integer value: '' for column 'copies' at row 1")
    [SQL: 'INSERT INTO books (id, author, copies) VALUES (%(id)s, %(author)s, %(copies)s)']
    [parameters: {'id': 2, 'author': 'Bob', 'copies': ''}]
    (Background on this error at: http://sqlalche.me/e/2j85)

Soo, let's tighten this feedback loop, shall we? We'll iterate through the DataFrame with the useful iterrows() method. This essentially gives us an enumeration of our DataFrame - a bunch of tuples with the index as the first element and the row, as its own Pandas Series, as the second.

    for x in df.iterrows():
        pd.DataFrame(x[1]).transpose().to_sql(name='books', con=cnx, if_exists='append', index_label='id')

Let's unpack that a bit. Remember that we're getting a two-element tuple, with the good stuff in the second element, so x[1]. Next, we convert the Series to a one-entry DataFrame, because a Series doesn't have the DataFrame's to_sql() method: pd.DataFrame(x[1]). The default behavior will treat this as a single column, with each field ending up in a different row - MySQL isn't going to be having it. Sooo, we transpose! pd.DataFrame(x[1]).transpose(). And finally, we use our beloved to_sql method on that. Let's check our table now!

    pd.io.sql.read_sql_table("books", cnx, index_col='id')

        author  copies
    id
    1    Alice       2

It wrote the first row! Not much of a difference with this toy example, but once you were writing a few thousand rows and the error didn't pop up until the 3000th, this would make a pretty noticeable difference in your ability to quickly experiment with different cleaning schemes.
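If that x[1]-and-transpose dance feels opaque, here's a quick illustrative peek (assuming the toy df defined above) at what a single element of iterrows() actually looks like:

    # Peek at one element of iterrows() - purely illustrative, using the toy df above.
    index_label, row = next(df.iterrows())
    print(index_label)                           # 1  <- the index value for this row
    print(type(row))                             # <class 'pandas.core.series.Series'>
    print(pd.DataFrame(row).shape)               # (2, 1) - one COLUMN, not what the table expects
    print(pd.DataFrame(row).transpose().shape)   # (1, 2) - one ROW, ready for to_sql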
Note that this will still short-circuit as soon as we hit the error. If we wanted to make sure we got all the valid input before working on our tough cases, we could make a little try/except block.

    for x in df.iterrows():
        try:
            pd.DataFrame(x[1]).transpose().to_sql(name='books', con=cnx, if_exists='append', index_label='id')
        except:
            continue

This will try to write each line, and if it encounters an Exception it'll continue the loop.

    pd.io.sql.read_sql_table("books", cnx, index_col='id')

         author  copies
    id
    1     Alice       2
    3   Charlie       7

Alright, now the bulk of our data's in the db! Whatever else happens, you've done that much! Now you can relax a bit, which is useful for stimulating the creativity you'll need for the more complicated edge cases.

So, we're ready to start testing new cleaning schemes? Well, not quite yet... Let's say we went and tried to think up a fix. We go to test it out and...

    # Note that we want to see our exceptions here, so either do without the try/except block
    for x in df.iterrows():
        pd.DataFrame(x[1]).transpose().to_sql(name='books', con=cnx, if_exists='append', index_label='id')

    # OR have it print the exception
    for x in df.iterrows():
        try:
            pd.DataFrame(x[1]).transpose().to_sql(name='books', con=cnx, if_exists='append', index_label='id')
        except Exception as e:
            print(e)
            continue

    # Either way, we get...
    (pymysql.err.IntegrityError) (1062, "Duplicate entry '1' for key 'PRIMARY'")
    [SQL: 'INSERT INTO books (id, author, copies) VALUES (%(id)s, %(author)s, %(copies)s)']
    [parameters: {'id': 1, 'author': 'Alice', 'copies': 2}]
    (Background on this error at: http://sqlalche.me/e/gkpj)

    (pymysql.err.InternalError) (1366, "Incorrect integer value: '' for column 'copies' at row 1")
    [SQL: 'INSERT INTO books (id, author, copies) VALUES (%(id)s, %(author)s, %(copies)s)']
    [parameters: {'id': 2, 'author': 'Bob', 'copies': ''}]
    (Background on this error at: http://sqlalche.me/e/2j85)

    (pymysql.err.IntegrityError) (1062, "Duplicate entry '3' for key 'PRIMARY'")
    [SQL: 'INSERT INTO books (id, author, copies) VALUES (%(id)s, %(author)s, %(copies)s)']
    [parameters: {'id': 3, 'author': 'Charlie', 'copies': 7}]
    (Background on this error at: http://sqlalche.me/e/gkpj)

The error we're interested in is in there, but what's all this other nonsense crowding it? Well, one of the handy things about a database is that it'll enforce uniqueness based on the constraints you give it. It's already got an entry with an id value of 1, so it's going to complain if you try to put another one in. In addition to providing a lot of distraction, this'll also slow us down considerably - after all, part of the point was to make our experiments with data-cleaning go faster! Luckily, Pandas' wonderful logical indexing will make it a snap to ensure that we only bother with entries that aren't in the database yet.

    # First, let's get the indices that are in there
    usedIDs = pd.read_sql_table("books", cnx, columns=["id"])["id"].values

    df[~df.index.isin(usedIDs)]

      author  copies
    2    Bob

    # Remember how the logical indexing works: we want every element of the DataFrame
    # whose index ISN'T in our array of IDs that are already in the DB

This will also be shockingly quick - Pandas' logical indexing takes advantage of all that magic going on under the hood. Using it instead of manual iteration can literally bring you from waiting minutes to waiting seconds. Buuut, that's a lot of stuff to type! We're going to be doing this A LOT, so how about we just turn it into a function?
    # Ideally we'd make a much more modular version, but for this toy example
    # we'll be messy and hardcode some parameters
    def filterDFNotInDB(df):
        usedIDs = pd.read_sql_table("books", cnx, columns=["id"])["id"].values
        return df[~df.index.isin(usedIDs)]

So, next time we think we've made some progress on an edge case, we just call...

    # Going back to the to_sql method here - we don't want to have to loop through
    # every single failing case, or get spammed with every variety of error message
    # the thing can throw at us.
    filterDFNotInDB(cleanedDF).to_sql(name='books', con=cnx, if_exists='append', index_label='id')

Actually, let's clean that up even more - the more keys we hit, the more opportunities to make a mistake! The most bug-free code is the code you don't write.

    def writeNewRows(df):
        filterDFNotInDB(df).to_sql(name='books', con=cnx, if_exists='append', index_label='id')

So, finally, we can work on our new cleaning scheme, and whenever we think we're done...

    writeNewRows(cleanedDF)

And boom! Instant feedback!
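For what it's worth, the "much more modular version" hinted at above might look something like the following sketch - the table name, connection, and key column become parameters instead of hardcoded values. The names here are just illustrative, not part of the original post:

    # Hedged sketch of a more modular variant - purely illustrative.
    def filter_df_not_in_db(df, table, cnx, key_col="id"):
        used_keys = pd.read_sql_table(table, cnx, columns=[key_col])[key_col].values
        return df[~df.index.isin(used_keys)]

    def write_new_rows(df, table, cnx, key_col="id"):
        filter_df_not_in_db(df, table, cnx, key_col).to_sql(
            name=table, con=cnx, if_exists='append', index_label=key_col)

    # Usage would then be, e.g.:
    # write_new_rows(cleanedDF, "books", cnx)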
- Matthew Alhonte
Running an Infinispan server using Testcontainers
Recently I discovered a library called Testcontainers. I already wrote about using it on my current project here. It helps you to run software that your application depends on in a test context by providing an API to start docker containers. It’s implemented as a JUnit 4 rule currently, but you can also use it manually with JUnit 5. Native support for JUnit 5 is on the roadmap for the next major release. Testcontainers comes with a few pre-configured database- and selenium-containers, but most importantly it also provides a generic container that you can use to start whatever docker image you need to.
In my current project we are using Infinispan for distributed caching. For some of our integration tests caching is disabled, but others rely on a running Infinispan instance. Up until now we have been using a virtual machine to run Infinispan and other software on developer machines and build servers. The way we are handling this poses a few problems, and isolated Infinispan instances would help mitigate them. This post shows how you can get Infinispan running in a generic container. I’ll also try to come up with a useful abstraction that makes running Infinispan as a test container easier.
Configuring a generic container for Infinispan
Docker Hub provides a readymade Infinispan image: jboss/infinispan-server. We’ll be using the latest version at this time, which is 9.1.3.Final. Our first attempt to start the server using Testcontainers looks like this:
    @ClassRule
    public static GenericContainer infinispan =
        new GenericContainer("jboss/infinispan-server:9.1.3.Final");

    @Before
    public void setup() {
        cacheManager = new RemoteCacheManager(new ConfigurationBuilder()
            .addServers(getServerAddress())
            .version(ProtocolVersion.PROTOCOL_VERSION_26)
            .build());
    }

    @Test
    public void should_be_able_to_retrieve_a_cache() {
        assertNotNull(cacheManager.getCache());
    }

    private String getServerAddress() {
        return infinispan.getContainerIpAddress() + ":" + infinispan.getMappedPort(11222);
    }
You can see a few things here:
We’re configuring our test class with a class rule that will start a generic container. As a parameter, we use the name of the infinispan docker image alongside the required version. You could also use latest here.
There’s a setup method that creates a RemoteCacheManager to connect to the Infinispan server running inside the docker container. We extract the network address from the generic container and retrieve the container IP address and the mapped port number for the hotrod port in getServerAddress()
Then there’s a simple test that will make sure we are able to retrieve an unnamed cache from the server.
Waiting for Infinispan
If we run the test, it doesn’t work and throws a TransportException, though. It mentions an error code that hints at a connection problem. Looking at other pre-configured containers, we see that they have some kind of waiting strategy in place. This is important so that the test only starts after the container has fully loaded. The PostgreSQLContainer waits for a log message, for example. There are other wait strategies available, and you can implement your own as well. One of the default strategies is the HostPortWaitStrategy, and it seems like a straightforward choice. With the Infinispan image at least, it doesn’t work though: one of the commands that is used to determine the readiness of the tcp port has a subtle bug in it, and the other relies on the netcat command line tool being present in the docker image. We’ll stick to the same approach as the PostgreSQLContainer rule and check for a suitable log message to appear on the container’s output. We can find such a message by manually starting the docker container on the command line:
    docker run -it jboss/infinispan-server:9.1.3.Final
The configuration of our rule then changes to this:
    @ClassRule
    public static GenericContainer container =
        new GenericContainer("jboss/infinispan-server:9.1.3.Final")
            .waitingFor(new LogMessageWaitStrategy()
                .withRegEx(".*Infinispan Server.*started in.*\\s"));
After this change, the test still doesn’t work correctly. But at least it behaves differently: It waits for a considerable amount of time and again throws a TransportException before the test finishes. Since the underlying TcpTransportFactory swallows exceptions on startup and returns a cache object anyway, the test will still be green. Let’s address this first. I don’t see a way to ask the RemoteCacheManager or the RemoteCache about the state of the connection, so my approach here is to work with a timeout:
    private ExecutorService executorService = Executors.newCachedThreadPool();

    @Test
    public void should_be_able_to_retrieve_a_cache() throws Exception {
        Future<?> result = executorService.submit(() -> cacheManager.getCache());
        assertNotNull(result.get(1500, TimeUnit.MILLISECONDS));
    }
The test will now fail should we not be able to retrieve the cache within 1500 milliseconds. Unfortunately, the resulting TimeoutException will not be linked to the TransportException, though. I’ll take suggestions for how to better write a failing test and leave it at that for the time being.
Running Infinispan in standalone mode
Looking at the stacktrace of the TransportException we see the following output:
    INFO: ISPN004006: localhost:33086 sent new topology view (id=1, age=0) containing 1 addresses: [172.17.0.2:11222]
    Dez 14, 2017 19:57:43 AM org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory updateTopologyInfo
    INFO: ISPN004014: New server added(172.17.0.2:11222), adding to the pool.
It looks like the server is running in clustered mode and the client gets a new server address to talk to. The IP address and port number seem correct, but looking more closely we notice that the hotrod port 11222 refers to a port number inside the docker container. It is not reachable from the host. That’s why Testcontainers gives you the ability to easily retrieve port mappings. We already use this in our getServerAddress() method. Infinispan, or rather the hotrod protocol, however, is not aware of the docker environment and communicates the internal port to the cluster clients, overwriting our initial configuration.
To confirm this analysis we can have a look at the output of the server when we start the image manually:
    19:12:47,368 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-6) ISPN000078: Starting JGroups channel clustered
    19:12:47,371 INFO [org.infinispan.CLUSTER] (MSC service thread 1-6) ISPN000094: Received new cluster view for channel cluster: [9621833c0138|0] (1) [9621833c0138]
    ...
    Dez 14, 2017 19:12:47,376 AM org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory updateTopologyInfo
    INFO: ISPN004016: Server not in cluster anymore(localhost:33167), removing from the pool.
The server is indeed starting in clustered mode and the documentation on Docker Hub also confirms this. For our tests we need a standalone server though. On the command line we can add a parameter when starting the container (again, we get this from the documentation on Docker Hub):
$ docker run -it jboss/infinispan-server:9.1.3.Final standalone
The output now tells us that Infinispan is no longer running in clustered mode. In order to start Infinispan as a standalone server using Testcontainers, we need to add a command to the container startup. Once more we change the configuration of the container rule:
    @ClassRule
    public static GenericContainer container =
        new GenericContainer("jboss/infinispan-server:9.1.3.Final")
            .waitingFor(new LogMessageWaitStrategy()
                .withRegEx(".*Infinispan Server.*started in.*\\s"))
            .withCommand("standalone");
Now our test has access to an Infinispan instance running in a container.
Adding a specific configuration
The applications in our project use different caches, which can be configured in the Infinispan standalone configuration file. For our tests, we need them to be present. One solution is to use the .withClasspathResourceMapping() method to link a configuration file from the (test-)classpath into the container. This configuration file contains the cache configurations. Knowing the location of the configuration file in the container, we can once again change the testcontainer configuration:
    public static GenericContainer container =
        new GenericContainer("jboss/infinispan-server:9.1.3.Final")
            .waitingFor(new LogMessageWaitStrategy()
                .withRegEx(".*Infinispan Server.*started in.*\\s"))
            .withCommand("standalone")
            .withClasspathResourceMapping(
                "infinispan-standalone.xml",
                "/opt/jboss/infinispan-server/standalone/configuration/standalone.xml",
                BindMode.READ_ONLY);

    @Test
    public void should_be_able_to_retrieve_a_cache() throws Exception {
        Future<?> result = executorService.submit(() -> cacheManager.getCache("testCache"));
        assertNotNull(result.get(1500, TimeUnit.MILLISECONDS));
    }
Now we can retrieve and work with a cache from the Infinispan instance in the container.
Simplifying the configuration
You can see how it can be a bit of a pain getting an arbitrary docker image to run correctly using a generic container. For Infinispan we now know what we need to configure. But I really don’t want to think of all this every time I need an Infinispan server for a test. However, we can create our own abstraction similar to the PostgreSQLContainer. It contains the configuration bits that we discovered in the first part of this post and since it is an implementation of a GenericContainer, we can also use everything that’s provided by the latter.
    public class InfinispanContainer extends GenericContainer {

        private static final String IMAGE_NAME = "jboss/infinispan-server";

        public InfinispanContainer() {
            this(IMAGE_NAME + ":latest");
        }

        public InfinispanContainer(final String imageName) {
            super(imageName);
            withStartupTimeout(Duration.ofMillis(20000));
            withCommand("standalone");
            waitingFor(new LogMessageWaitStrategy().withRegEx(".*Infinispan Server.*started in.*\\s"));
        }
    }
In our tests we can now create an Infinispan container like this:
    @ClassRule
    public static InfinispanContainer infinispan = new InfinispanContainer();
That’s a lot better than dealing with a generic container.
Adding easy cache configuration
You may have noticed that I left out the custom configuration part here. We can do better by providing builder methods to create caches programmatically using the RemoteCacheManager. Creating a cache is as easy as this:
cacheManager.administration().createCache("someCache", null);
In order to let the container automatically create caches, we use the callback method containerIsStarted(). We can override it in our abstraction, create a RemoteCacheManager, and use its API to create the caches that we configured upfront:
    ...
    private RemoteCacheManager cacheManager;
    private Collection cacheNames;
    ...

    public InfinispanContainer withCaches(final Collection cacheNames) {
        this.cacheNames = cacheNames;
        return this;
    }

    @Override
    protected void containerIsStarted(final InspectContainerResponse containerInfo) {
        cacheManager = new RemoteCacheManager(new ConfigurationBuilder()
            .addServers(getServerAddress())
            .version(getProtocolVersion())
            .build());

        this.cacheNames.forEach(cacheName -> cacheManager.administration().createCache(cacheName, null));
    }

    public RemoteCacheManager getCacheManager() {
        return cacheManager;
    }
You can also retrieve the CacheManager from the container and use it in your tests. There’s also a problem with this approach: you can only create caches through the API if you use Hotrod protocol version 2.0 or above. I’m willing to accept that, as it makes the usage in tests really comfortable:
    @ClassRule
    public static InfinispanContainer infinispan = new InfinispanContainer()
        .withProtocolVersion(ProtocolVersion.PROTOCOL_VERSION_21)
        .withCaches("testCache");

    @Test
    public void should_get_existing_cache() {
        assertNotNull(infinispan.getCacheManager().getCache("testCache"));
    }
If you need to work with a protocol version below 2.0, you can still use the approach from above, linking a configuration file into the container.
Conclusion
While it sounds very easy to run any docker image using Testcontainers, there are a lot of configuration details to know, depending on the complexity of the software that you need to run. In order to work effectively with such a container, it’s a good idea to encapsulate this in your own specific container. Ideally, these containers will end up in the Testcontainers repository so that others can benefit from your work as well. I hope this will be useful for others; if you want to see the full code, have a look at this repository.
The post Running an Infinispan server using Testcontainers appeared first on codecentric AG Blog.
Automated root cause analysis for Spark application failures
Reduce troubleshooting time from days to seconds.
Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new big data applications are being built with Spark in fields like health care, genomics, financial services, self-driving technology, government, and media. Things are not so rosy, however, when a Spark application fails.
Similar to applications in other distributed systems that have a large number of independent and interacting components, a failed Spark application throws up a large set of raw logs. These logs typically contain thousands of messages, including errors and stacktraces. Hunting for the root cause of an application failure from these messy, raw, and distributed logs is hard for Spark experts—and a nightmare for the thousands of new users coming to the Spark platform. We aim to radically simplify root cause detection of any Spark application failure by automatically providing insights to Spark users like what is shown in Figure 1.
Figure 1. Insights from automatic root cause analysis improve Spark user productivity. Source: Adrian Popescu and Shivnath Babu.
Spark platform providers like Amazon, Azure, Databricks, and Google clouds as well as application performance management (APM) solution providers like Unravel have access to a large and growing data set of logs from millions of Spark application failures. This data set is a gold mine for applying state-of-the-art artificial intelligence (AI) and machine learning (ML) techniques. In this blog, we look at how to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root causes have been identified. These models can then automatically predict the root cause when an application fails[1]. Such actionable root-cause identification improves the productivity of Spark users significantly.
Clues in the logs
A number of logs are available every time a Spark application fails. A distributed Spark application consists of a driver container and one or more executor containers. The logs generated by these containers have information about the application as well as how the application interacts with the rest of the Spark platform. These logs form the key data set that Spark users scan for clues to understand why an application failed.
However, the logs are extremely verbose and messy. They contain multiple types of messages, such as informational messages from every component of Spark, error messages in many different formats, stacktraces from code running on the Java Virtual Machine (JVM), and more. The complexity of Spark usage and internals make things worse. Types of failures and error messages differ across Spark SQL, Spark Streaming, iterative machine learning and graph applications, and interactive applications from Spark shell and notebooks (e.g., Jupyter, Zeppelin). Furthermore, failures in distributed systems routinely propagate from one component to another. Such propagation can cause a flood of error messages in the log and obscure the root cause.
Figure 2 shows our overall solution to deal with these problems and to automate root cause analysis (RCA) for Spark application failures. Overall, the solution consists of:
Continuously collecting logs from a variety of Spark application failures
Converting logs into feature vectors
Learning a predictive model for RCA from these feature vectors
Of course, as with any intelligent solution that uses AI and ML techniques, the devil is in the details!
Figure 2. Root cause analysis of Spark application failures. Source: Adrian Popescu and Shivnath Babu.

Data collection for training
As the saying goes: garbage in, garbage out. Thus, it is critical to train RCA models on representative input data. In addition to relying on logs from real-life Spark application failures observed on customer sites, we have also invested in a lab framework where root causes can be artificially injected to collect even larger and more diverse training data.
Structured versus unstructured data
Logs are mostly unstructured data. To keep the accuracy of model predictions to a high level in automated RCA, it is important to combine this unstructured data with some structured data. Thus, whenever we collect logs, we are careful to collect trustworthy structured data in the form of key-value pairs that we additionally use as input features in the predictive models. These include Spark platform information and environment details of Scala, Hadoop, OS, and so on.
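As a rough illustration of that idea (not the actual pipeline, and with invented field names), such key-value metadata can be turned into numeric features with something like scikit-learn's DictVectorizer and then concatenated with the features extracted from the logs:

    # Illustrative sketch: vectorizing structured key-value environment metadata.
    # The field names and values are made up for the example.
    from sklearn.feature_extraction import DictVectorizer

    env_records = [
        {"spark_version": "2.2.0", "scala_version": "2.11", "os": "linux", "executor_memory_gb": 4},
        {"spark_version": "2.1.1", "scala_version": "2.11", "os": "linux", "executor_memory_gb": 8},
    ]

    vectorizer = DictVectorizer(sparse=False)
    structured_features = vectorizer.fit_transform(env_records)  # one row per failed application
    print(structured_features.shape)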
Labels
ML techniques for prediction fall into two broad categories: supervised learning and unsupervised learning. We use both techniques in our overall solution. For the supervised learning part, we attach root-cause labels with the logs collected from an application failure. This label comes from a taxonomy of root causes that we have created based on millions of Spark application failures seen in the field and in our lab. Broadly speaking, the taxonomy can be thought of as a tree data structure that categorizes the full space of root causes. For example, the first non-root level of this tree can be failures caused by:
Configuration errors
Deployment errors
Resource errors
Data errors
Application errors
Unknown factors
The leaves of this taxonomy tree form the labels used in the supervised learning techniques. In addition to a text label representing the root cause, each leaf also stores additional information such as: (a) a description template to present the root cause to a Spark user in a way that she will easily understand (like the message in Figure 1), and (b) recommended fixes for this root cause. We will cover the root-cause taxonomy in a future blog.
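To make the leaf structure concrete, here is a hypothetical sketch of what such a leaf record could carry - a label for learning, a description template for the user, and recommended fixes (the contents are invented for illustration):

    # Hypothetical sketch of a taxonomy leaf; contents are illustrative only.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class RootCauseLeaf:
        label: str                      # label used as the target for supervised learning
        description_template: str       # user-facing explanation, filled in at prediction time
        recommended_fixes: List[str] = field(default_factory=list)

    oom_leaf = RootCauseLeaf(
        label="resource.executor_out_of_memory",
        description_template="Executor ran out of memory while processing stage {stage_id}.",
        recommended_fixes=["Increase spark.executor.memory", "Repartition to reduce partition size"],
    )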
The labels are associated with the logs in one of two ways. First, the root cause is already known when the logs are generated, as a result of injecting a specific root cause we have designed to produce an application failure in our lab framework. The second way in which a label is given to the logs for an application failure is when a Spark domain expert manually diagnoses the root cause of the failure.
Input Features
Once the logs are available, there are various ways in which the feature vector can be extracted from these logs. One way is to transform the logs into a bit vector (e.g., 1001100001). Each bit in this vector represents whether a specific message template is present in the respective logs. A prerequisite to this approach is to extract all possible message templates from the logs. A more traditional approach for feature vectors from the domain of information retrieval is to represent the logs for a failure as a bag of words. This approach is mostly similar to the bit vector approach except for a couple of differences: (a) each bit in the vector now corresponds to a word instead of a message template, and (b) instead of 0s and 1s, it is more common to use numeric values generated using techniques like TF-IDF.
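A minimal scikit-learn sketch of those two featurizations is shown below; the toy log lines are invented, and for brevity the binary vector is built over raw tokens rather than over extracted message templates:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # One string of concatenated log text per failed application (invented examples).
    logs = [
        "ERROR Executor lost java.lang.OutOfMemoryError GC overhead limit exceeded",
        "ERROR FileNotFoundException input path does not exist hdfs://data/events",
    ]

    # Bit-vector style: 1 if a token appears in the log, 0 otherwise.
    bit_vectors = CountVectorizer(binary=True).fit_transform(logs)

    # Bag-of-words with TF-IDF weights instead of 0s and 1s.
    tfidf_vectors = TfidfVectorizer().fit_transform(logs)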
More recent advances in ML have popularized vector embeddings. In particular, we use the doc2vec technique[2]. At a high level, these vector embeddings map words (or paragraphs, or entire documents) to multidimensional vectors by evaluating the order and placement of words with respect to their neighboring words. Similar words map to nearby vectors in the feature vector space. The doc2vec technique uses a three-layer neural network to gauge the context of the document and relate similar content together.
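A minimal doc2vec sketch using gensim might look like the following; the tokenized logs and hyperparameters are placeholders, not the ones used in the production models:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tokenized_logs = [
        "error executor lost outofmemoryerror gc overhead".split(),
        "error filenotfoundexception input path does not exist".split(),
    ]
    tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(tokenized_logs)]

    # Hyperparameters here are arbitrary illustration values.
    model = Doc2Vec(tagged, vector_size=50, window=5, min_count=1, epochs=40)

    # Embed a new, unseen failure log into the same vector space.
    new_vector = model.infer_vector("error connection refused to driver".split())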
Once the feature vectors are generated along with the label, a variety of supervised learning techniques can be applied for automatic RCA. We have evaluated both shallow as well as deep learning techniques, including random forests, support vector machines, Bayesian classifiers, and neural networks.
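As a sketch of that last step (with random stand-in data in place of real feature vectors and labels), training and applying one such classifier could look like this:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # X: one feature vector per failed application (e.g., doc2vec plus structured features).
    # y: root-cause labels from the taxonomy leaves. Both are random stand-ins here.
    X = np.random.rand(200, 64)
    y = np.random.choice(["configuration_error", "resource_error", "data_error"], size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)

    # Predicted root cause (and the model's confidence) for a newly failed application.
    predicted_cause = clf.predict(X_test[:1])[0]
    confidence = clf.predict_proba(X_test[:1]).max()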
Conclusion
The overall results produced by our solution are very promising. We are currently enhancing the solution in some key ways. One of these is to quantify the degree of confidence in the root cause predicted by the model in a way that users will easily understand. Another key enhancement is to speed up the ability to incorporate new types of application failures. The bottleneck currently is in generating labels. We are working on active learning techniques[3] that nicely prioritize the human efforts required in generating labels. The intuition behind active learning is to pick the unlabeled failure instances that provide the most useful information to build an accurate model. The expert labels these instances and then the predictive model is rebuilt.
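One common way to implement that prioritization is uncertainty sampling: score the unlabeled failures with the current model and send the least-confident predictions to the expert first. This is a generic sketch, not the specific technique from [3]:

    import numpy as np

    def pick_failures_to_label(model, unlabeled_X, batch_size=10):
        """Return indices of the unlabeled failures the current model is least sure about."""
        probabilities = model.predict_proba(unlabeled_X)   # shape: (n_samples, n_classes)
        confidence = probabilities.max(axis=1)             # confidence in the top prediction
        return np.argsort(confidence)[:batch_size]         # least confident first

    # The expert labels these instances, they join the training set,
    # and the predictive model is rebuilt on the enlarged labeled data.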
Manual failure diagnosis in Spark is not only time consuming, but highly challenging due to correlated failures that can occur simultaneously. Our unique RCA solution enables the diagnosis process to function effectively even in the presence of multiple concurrent failures as well as noisy data prevalent in production environments. Through automated failure diagnosis, we remove the burden of manually troubleshooting failed applications from the hands of Spark users and developers, enabling them to focus entirely on solving business problems with Spark.
References:
[1] S. Duan, S. Babu, and K. Munagala, “Fa: A System for Automating Failure Diagnosis”, International Conference on Data Engineering, 2009.
[2] Q. Le and T. Mikolov, “Distributed Representations of Sentences and Documents”, International Conference on Machine Learning, 2014.
[3] S. Duan and S. Babu, “Guided Problem Diagnosis through Active Learning”, International Conference on Autonomic Computing, 2008.
To learn how to use analytic tools to manage your big data infrastructure, check out Shivnath Babu's session "Using Machine Learning to Simplify Kafka Operations" at the Strata Data Conference in San Jose, March 5-8, 2018—registration is now open.
Related resource:
Spark: The Definitive Guide, by Bill Chambers and Matei Zaharia
from FEED 10 TECHNOLOGY http://ift.tt/2m06KlG
0 notes
Text
Automated root cause analysis for Spark application failures
Automated root cause analysis for Spark application failures
Reduce troubleshooting time from days to seconds.
Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new big data applications are being built with Spark in fields like health care, genomics, financial services, self-driving technology, government, and media. Things are not so rosy, however, when a Spark application fails.
Similar to applications in other distributed systems that have a large number of independent and interacting components, a failed Spark application throws up a large set of raw logs. These logs typically contain thousands of messages, including errors and stacktraces. Hunting for the root cause of an application failure from these messy, raw, and distributed logs is hard for Spark experts—and a nightmare for the thousands of new users coming to the Spark platform. We aim to radically simplify root cause detection of any Spark application failure by automatically providing insights to Spark users like what is shown in Figure 1.
Figure 1. Insights from automatic root cause analysis improve Spark user productivity. Source: Adrian Popescu and Shivnath Babu.
Spark platform providers like Amazon, Azure, Databricks, and Google clouds as well as application performance management (APM) solution providers like Unravel have access to a large and growing data set of logs from millions of Spark application failures. This data set is a gold mine for applying state-of-the-art artificial intelligence (AI) and machine learning (ML) techniques. In this blog, we look at how to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root causes have been identified. These models can then automatically predict the root cause when an application fails[1]. Such actionable root-cause identification improves the productivity of Spark users significantly.
Clues in the logs
A number of logs are available every time a Spark application fails. A distributed Spark application consists of a driver container and one or more executor containers. The logs generated by these containers have information about the application as well as how the application interacts with the rest of the Spark platform. These logs form the key data set that Spark users scan for clues to understand why an application failed.
However, the logs are extremely verbose and messy. They contain multiple types of messages, such as informational messages from every component of Spark, error messages in many different formats, stacktraces from code running on the Java Virtual Machine (JVM), and more. The complexity of Spark usage and internals make things worse. Types of failures and error messages differ across Spark SQL, Spark Streaming, iterative machine learning and graph applications, and interactive applications from Spark shell and notebooks (e.g., Jupyter, Zeppelin). Furthermore, failures in distributed systems routinely propagate from one component to another. Such propagation can cause a flood of error messages in the log and obscure the root cause.
Figure 2 shows our overall solution to deal with these problems and to automate root cause analysis (RCA) for Spark application failures. Overall, the solution consists of:
Continuously collecting logs from a variety of Spark application failures
Converting logs into feature vectors
Learning a predictive model for RCA from these feature vectors
Of course, as with any intelligent solution that uses AI and ML techniques, the devil is the details!
Figure 2. Root cause analysis of Spark application failures. Source: Adrian Popescu and Shivnath Babu. Data collection for training
As the saying goes: garbage in, garbage out. Thus, it is critical to train RCA models on representative input data. In addition to relying on logs from real-life Spark application failures observed on customer sites, we have also invested in a lab framework where root causes can be artificially injected to collect even larger and more diverse training data.
Structured versus unstructured data
Logs are mostly unstructured data. To keep the accuracy of model predictions to a high level in automated RCA, it is important to combine this unstructured data with some structured data. Thus, whenever we collect logs, we are careful to collect trustworthy structured data in the form of key-value pairs that we additionally use as input features in the predictive models. These include Spark platform information and environment details of Scala, Hadoop, OS, and so on.
Labels
ML techniques for prediction fall into two broad categories: supervised learning and unsupervised learning. We use both techniques in our overall solution. For the supervised learning part, we attach root-cause labels with the logs collected from an application failure. This label comes from a taxonomy of root causes that we have created based on millions of Spark application failures seen in the field and in our lab. Broadly speaking, the taxonomy can be thought of as a tree data structure that categorizes the full space of root causes. For example, the first non-root level of this tree can be failures caused by:
Configuration errors
Deployment errors
Resource errors
Data errors
Application errors
Unknown factors
0 notes
Text
Automated root cause analysis for Spark application failures
Reduce troubleshooting time from days to seconds.
Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new big data applications are being built with Spark in fields like health care, genomics, financial services, self-driving technology, government, and media. Things are not so rosy, however, when a Spark application fails.
Similar to applications in other distributed systems that have a large number of independent and interacting components, a failed Spark application throws up a large set of raw logs. These logs typically contain thousands of messages, including errors and stacktraces. Hunting for the root cause of an application failure in these messy, raw, and distributed logs is hard for Spark experts—and a nightmare for the thousands of new users coming to the Spark platform. We aim to radically simplify root cause detection for any Spark application failure by automatically providing Spark users with insights like the one shown in Figure 1.
Figure 1. Insights from automatic root cause analysis improve Spark user productivity. Source: Adrian Popescu and Shivnath Babu.
Spark platform providers like Amazon, Azure, Databricks, and Google clouds as well as application performance management (APM) solution providers like Unravel have access to a large and growing data set of logs from millions of Spark application failures. This data set is a gold mine for applying state-of-the-art artificial intelligence (AI) and machine learning (ML) techniques. In this blog, we look at how to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root causes have been identified. These models can then automatically predict the root cause when an application fails[1]. Such actionable root-cause identification improves the productivity of Spark users significantly.
Clues in the logs
A number of logs are available every time a Spark application fails. A distributed Spark application consists of a driver container and one or more executor containers. The logs generated by these containers have information about the application as well as how the application interacts with the rest of the Spark platform. These logs form the key data set that Spark users scan for clues to understand why an application failed.
However, the logs are extremely verbose and messy. They contain multiple types of messages, such as informational messages from every component of Spark, error messages in many different formats, stacktraces from code running on the Java Virtual Machine (JVM), and more. The complexity of Spark usage and internals make things worse. Types of failures and error messages differ across Spark SQL, Spark Streaming, iterative machine learning and graph applications, and interactive applications from Spark shell and notebooks (e.g., Jupyter, Zeppelin). Furthermore, failures in distributed systems routinely propagate from one component to another. Such propagation can cause a flood of error messages in the log and obscure the root cause.
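To make the scale of this hunt concrete, here is a minimal sketch (not part of the system described in this blog) of scanning raw driver and executor logs for error lines and exception headers. The timestamp pattern assumes Spark's default log4j format, and the file layout (one directory of .log files per application) is an assumption for illustration.

```python
import re
from pathlib import Path

# Default Spark log4j lines look like "18/02/12 10:15:31 ERROR ..."
ERROR_LINE = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ERROR (.+)$")
# First line of a JVM stacktrace, e.g. "java.lang.OutOfMemoryError: Java heap space"
STACKTRACE_HEAD = re.compile(r"^(\S+(?:Exception|Error)): (.*)$")

def extract_clues(log_dir):
    """Collect ERROR lines and exception headers from every log file under log_dir."""
    clues = []
    for log_file in Path(log_dir).glob("*.log"):
        for line in log_file.read_text(errors="replace").splitlines():
            for pattern in (ERROR_LINE, STACKTRACE_HEAD):
                if pattern.match(line):
                    clues.append(line)
                    break
    return clues

# Example (hypothetical path): clues = extract_clues("logs/application_1518000000000_0042")
```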
Figure 2 shows our overall solution to deal with these problems and to automate root cause analysis (RCA) for Spark application failures. Overall, the solution consists of:
Continuously collecting logs from a variety of Spark application failures
Converting logs into feature vectors
Learning a predictive model for RCA from these feature vectors
Of course, as with any intelligent solution that uses AI and ML techniques, the devil is in the details!
Figure 2. Root cause analysis of Spark application failures. Source: Adrian Popescu and Shivnath Babu.
Data collection for training
As the saying goes: garbage in, garbage out. Thus, it is critical to train RCA models on representative input data. In addition to relying on logs from real-life Spark application failures observed on customer sites, we have also invested in a lab framework where root causes can be artificially injected to collect even larger and more diverse training data.
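As an illustration of the fault-injection idea (the blog does not describe how the lab framework is implemented), the hedged sketch below submits an application with a deliberately undersized spark.executor.memory so that the resulting failure can be labeled up front. The application class and label name are hypothetical; spark-submit and the spark.executor.memory setting are standard Spark.

```python
import subprocess

def run_with_injected_root_cause(app_jar, label="resource.executor-oom"):
    """Run a Spark app with a known, injected misconfiguration and keep the label with its output."""
    cmd = [
        "spark-submit",
        "--class", "com.example.MemoryHungryJob",   # hypothetical application class
        "--conf", "spark.executor.memory=512m",     # injected misconfiguration likely to cause failure
        app_jar,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # The injected root-cause label travels with the collected logs for supervised training.
    return {"label": label,
            "stdout": result.stdout,
            "stderr": result.stderr,
            "exit_code": result.returncode}
```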
Structured versus unstructured data
Logs are mostly unstructured data. To keep the accuracy of model predictions to a high level in automated RCA, it is important to combine this unstructured data with some structured data. Thus, whenever we collect logs, we are careful to collect trustworthy structured data in the form of key-value pairs that we additionally use as input features in the predictive models. These include Spark platform information and environment details of Scala, Hadoop, OS, and so on.
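A minimal sketch, with assumed field names, of how this trustworthy structured metadata can be carried alongside the unstructured log text so that both reach the feature-extraction step:

```python
def build_training_record(raw_log_text, environment):
    """Pair the unstructured logs of one failed application with structured environment details."""
    return {
        # unstructured part: featurized later (bit vector, TF-IDF, or doc2vec)
        "log_text": raw_log_text,
        # structured part: key-value pairs used directly as input features
        "spark_version": environment.get("spark_version"),   # e.g., "2.2.1"
        "scala_version": environment.get("scala_version"),   # e.g., "2.11"
        "hadoop_version": environment.get("hadoop_version"),
        "os": environment.get("os"),
    }
```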
Labels
ML techniques for prediction fall into two broad categories: supervised learning and unsupervised learning. We use both techniques in our overall solution. For the supervised learning part, we attach root-cause labels with the logs collected from an application failure. This label comes from a taxonomy of root causes that we have created based on millions of Spark application failures seen in the field and in our lab. Broadly speaking, the taxonomy can be thought of as a tree data structure that categorizes the full space of root causes. For example, the first non-root level of this tree can be failures caused by:
Configuration errors
Deployment errors
Resource errors
Data errors
Application errors
Unknown factors
The leaves of this taxonomy tree form the labels used in the supervised learning techniques. In addition to a text label representing the root cause, each leaf also stores additional information such as: (a) a description template to present the root cause to a Spark user in a way that she will easily understand (like the message in Figure 1), and (b) recommended fixes for this root cause. We will cover the root-cause taxonomy in a future blog.
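Based only on the description above, a leaf of the taxonomy might be modeled roughly as follows; the field names and the example label, template, and fixes are illustrative, not the actual taxonomy.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RootCauseLeaf:
    label: str                       # text label used as the supervised-learning class
    description_template: str        # user-facing explanation, as in Figure 1
    recommended_fixes: List[str] = field(default_factory=list)

# Hypothetical leaf for an executor out-of-memory failure.
oom_leaf = RootCauseLeaf(
    label="resource.executor-out-of-memory",
    description_template=("The application failed because executor {executor_id} "
                          "ran out of memory while {stage} was running."),
    recommended_fixes=[
        "Increase spark.executor.memory",
        "Reduce spark.executor.cores to lower per-task memory pressure",
    ],
)
```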
Labels are associated with the logs in one of two ways. First, the root cause may already be known when the logs are generated, because we injected that specific root cause in our lab framework to produce the application failure. Second, a Spark domain expert may manually diagnose the root cause of a failure and attach the corresponding label to its logs.
Input features
Once the logs are available, there are various ways to extract a feature vector from them. One way is to transform the logs into a bit vector (e.g., 1001100001), where each bit represents whether a specific message template is present in the respective logs. A prerequisite for this approach is extracting all possible message templates from the logs. A more traditional approach from the domain of information retrieval is to represent the logs for a failure as a bag of words. This approach is similar to the bit vector approach except for a couple of differences: (a) each position in the vector now corresponds to a word instead of a message template, and (b) instead of 0s and 1s, it is more common to use numeric weights generated with techniques like TF-IDF.
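A hedged sketch of these two featurizations using scikit-learn (the blog does not name a library). A binary CountVectorizer over tokens stands in for the bit vector over message templates, and TfidfVectorizer produces the TF-IDF weighted bag of words.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny illustrative documents; in practice each string is the preprocessed log
# (or sequence of extracted message templates) of one failed application.
log_documents = [
    "ERROR executor lost java.lang.OutOfMemoryError GC overhead limit exceeded",
    "ERROR FileNotFoundException input path does not exist",
]

bit_vectorizer = CountVectorizer(binary=True)   # presence/absence, analogous to the bit vector
X_bits = bit_vectorizer.fit_transform(log_documents)

tfidf_vectorizer = TfidfVectorizer()            # TF-IDF weighted bag of words
X_tfidf = tfidf_vectorizer.fit_transform(log_documents)

print(X_bits.toarray())
print(tfidf_vectorizer.get_feature_names_out())
```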
More recent advances in ML have popularized vector embeddings. In particular, we use the doc2vec technique[2]. At a high level, these vector embeddings map words (or paragraphs, or entire documents) to multidimensional vectors by evaluating the order and placement of words with respect to their neighboring words. Similar words map to nearby vectors in the feature vector space. The doc2vec technique uses a three-layer neural network to gauge the context of the document and relate similar content together.
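A minimal doc2vec sketch using gensim, which is an assumption about tooling rather than a detail from the blog. Each failure's log text becomes a TaggedDocument, and infer_vector() maps the logs of a newly failed application into the same embedding space.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One (tiny) document per failed application; real inputs are full log texts.
log_documents = [
    "ERROR executor lost java.lang.OutOfMemoryError GC overhead limit exceeded",
    "ERROR FileNotFoundException input path does not exist",
]
tagged = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(log_documents)]

# min_count=1 only because the toy corpus is so small.
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, epochs=40)

# Embed the logs of a new failure for classification.
new_vector = model.infer_vector("ERROR executor lost OutOfMemoryError".lower().split())
```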
Once the feature vectors are generated along with the label, a variety of supervised learning techniques can be applied for automatic RCA. We have evaluated both shallow as well as deep learning techniques, including random forests, support vector machines, Bayesian classifiers, and neural networks.
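As one possible instantiation of this supervised step (the specific model and toy data below are illustrative, not the production setup), a random forest can be trained directly on featurized logs and their root-cause labels:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-ins for failure logs and their taxonomy labels.
logs = [
    "ERROR java.lang.OutOfMemoryError GC overhead limit exceeded",
    "ERROR executor lost OutOfMemoryError heap space",
    "ERROR FileNotFoundException input path does not exist",
    "ERROR InvalidInputException path not found on HDFS",
]
labels = ["resource.executor-oom", "resource.executor-oom",
          "data.missing-input", "data.missing-input"]

# Featurization and classification chained into one model.
model = make_pipeline(TfidfVectorizer(),
                      RandomForestClassifier(n_estimators=200, random_state=0))
model.fit(logs, labels)

print(model.predict(["ERROR OutOfMemoryError while shuffling"]))
```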
Conclusion
The overall results produced by our solution are very promising. We are currently enhancing the solution in some key ways. One of these is to quantify the degree of confidence in the predicted root cause in a way that users will easily understand. Another key enhancement is to speed up the ability to incorporate new types of application failures. The current bottleneck is generating labels. We are working on active learning techniques[3] that prioritize the human effort required to generate labels. The intuition behind active learning is to pick the unlabeled failure instances that provide the most useful information for building an accurate model. The expert labels these instances, and the predictive model is then rebuilt.
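One common way to realize this intuition is uncertainty sampling; the blog does not say which active learning strategy is used, so the sketch below is only illustrative: send the failures the current model is least confident about to the expert, then retrain on the enlarged labeled set.

```python
import numpy as np

def pick_for_labeling(model, X_unlabeled, budget=5):
    """Return indices of the unlabeled failures the model is least confident about."""
    proba = model.predict_proba(X_unlabeled)   # class probabilities per failure
    confidence = proba.max(axis=1)             # how sure the model is about its top guess
    return np.argsort(confidence)[:budget]     # least confident first

# Hypothetical usage:
# indices = pick_for_labeling(model, X_unlabeled_logs)
# -> send these failures to a Spark expert, attach their labels, and refit the model.
```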
Manual failure diagnosis in Spark is not only time-consuming but also highly challenging due to correlated failures that can occur simultaneously. Our unique RCA solution enables the diagnosis process to function effectively even in the presence of multiple concurrent failures as well as the noisy data prevalent in production environments. Through automated failure diagnosis, we remove the burden of manually troubleshooting failed applications from Spark users and developers, enabling them to focus entirely on solving business problems with Spark.
References:
[1] S. Duan, S. Babu, and K. Munagala, “Fa: A System for Automating Failure Diagnosis,” International Conference on Data Engineering, 2009.
[2] Q. Le and T. Mikolov, “Distributed Representations of Sentences and Documents,” International Conference on Machine Learning, 2014.
[3] S. Duan and S. Babu, “Guided Problem Diagnosis through Active Learning,” International Conference on Autonomic Computing, 2008.
To learn how to use analytic tools to manage your big data infrastructure, check out Shivnath Babu's session "Using Machine Learning to Simplify Kafka Operations" at the Strata Data Conference in San Jose, March 5-8, 2018—registration is now open.
Related resource:
Spark: The Definitive Guide, by Bill Chambers and Matei Zaharia
Continue reading Automated root cause analysis for Spark application failures.
http://ift.tt/2m06KlG
0 notes
Photo
Like many pieces of closed-source software, the inner workings of Unity Engine can sometimes be a mystery. Luckily, some answers can be fished out with a few well-placed Debug.Log calls.
A few months ago, I wanted to find out the order in which methods get called when you call AddComponent. Sometimes you need to instantiate components that instantiate their own components, and you don’t get a chance to initialize their fields first.
More to the point: it turns out that by the time AddComponent returns, that new Component's Awake will already have been called. Check out the screenshots above.
Unity Boston (2015) featured a talk on Unity's feature/development roadmap. Beyond what's outlined on the page for it, it sounds like they have plans for a new scripting component base class (essentially "a new MonoBehaviour") that fixes the "design mistakes from 10 years ago" made when they first settled on the architectural decisions behind MonoBehaviour, like magic methods and lifetime callback order. I don't expect to see this for another couple of years, but it's nice to see a bright API future ahead of us.
2 notes
Text
Stacktrace Podcast 158: “Double unleashed”
The 14- and 16-inch Apple Silicon-based MacBook Pros have finally been revealed, and John and Rambo share their initial impressions of these new products and everything else that Apple announced during their “Unleashed” event. Sponsored by Kandji: A modern, cloud-based platform to manage and secure your Mac, iPhone, iPad, and Apple TV devices. Kandji saves IT teams hours of manual work with…

View On WordPress
0 notes