okal · 6 years
Text
Is Safaricom routing my traffic through US DoD servers?
I was quite pleased when I finally got Safaricom Fiber at my new place. Cheap (KES 3,400 p/m for 10 Mbps), reliable (no downtime thus far), and backed by responsive customer service. What’s not to like?
Over the past few months, I’ve taken to connecting to the Internet exclusively through a VPN. I got a little worried when I couldn’t connect to my VPN provider’s servers on the new connection, though I could do so just fine on the 4G network. It was time for a bit of sleuthing.
I ran traceroute against one of the VPN provider’s IPs, among other hostnames.
[Screenshot: traceroute output showing the hops to the VPN server]
The second IP address looked suspicious, so I looked it up online. It appears to belong to the US Department of Defense, per whois.
[Screenshot: whois record for the second-hop IP address]
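If you want to repeat the check yourself, here is a rough sketch of the same investigation scripted in Python. It assumes the traceroute and whois binaries are installed; the target hostname below is a placeholder, not my actual VPN endpoint.

import re
import subprocess

TARGET = "vpn.example.com"  # placeholder; substitute your own VPN endpoint

# Run traceroute with numeric output and pull out the hop IP addresses.
trace = subprocess.run(
    ["traceroute", "-n", TARGET],
    capture_output=True, text=True, check=False
).stdout
hops = re.findall(r"\d+\.\d+\.\d+\.\d+", trace)

# Ask whois who each hop belongs to, and print the organisation/netname lines.
for ip in hops:
    record = subprocess.run(
        ["whois", ip], capture_output=True, text=True, check=False
    ).stdout
    owners = [line for line in record.splitlines()
              if line.lower().startswith(("orgname", "org-name", "netname"))]
    print(ip, owners[:1])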
I shared these findings with Safaricom on Twitter, but their rep did not give me a satisfying answer. My VPN client eventually started connecting, and I forgot all about it, until a friend did the same checks on his own Safaricom Fiber router, and reached a similarly worrying conclusion.
I’ve since raised this with Safaricom again, but the representatives I was put in contact with, while well-meaning, were unhelpful, so I’ve decided to go public with it. I’d be keen to hear about the experiences of other Safaricom Home Fiber customers.
okal · 7 years
Text
Adventures in Python static analysis
I had an opportunity, in mid-2017, to work on tooling to weed out dead code from one of the codebases I work on at Jumo¹. We use an in-house framework for building USSD² applications, the main mode of interaction users have with our lending platform.
There isn’t much of a FOSS USSD ecosystem, so the framework had to solve a lot of basic problems in novel ways. Its most interesting aspects are session management - USSD being a stateless protocol with no widely adopted session-management standard - and a metaclass³-based routing model. We experienced some growing pains around these as both the team and the product pipeline expanded. Debugging, for instance, often proved difficult because entire user journeys had been left behind by feature development. The routing registry uses string-based keys rather than actual references, making it difficult for automated tools to tell dead code apart from “living” code. You could rely on grep, to an extent, to naively check which handler classes - somewhat analogous to a view/controller hybrid in a typical MVC model - are linked to by others, but that didn’t reveal end-to-end flow information.
This set of problems proved to be a headache for feature development, as modules grew to unmanageable sizes, slowing down delivery.
Having had some prior experience building parsers in Python, this seemed as good a chance as any to exercise what little knowledge I had lingering from Compilers 101.
Tool survey
During an internship with Wezatele, one of the predecessor companies to Jumo, I had a chance to work on a pseudo-natural language driven SMS-based ordering system. I don’t know if it ever went into production, but it was some of the most technically fulfilling work I had done (and have done) up until now. The “pseudo” is important. It was actually just a constrained DSL allowing customers to place orders without access to an Internet connected device. This was a time before wide smartphone penetration in Kenya, so it was essential to target low tech devices.⁴ I still remember my prized Nokia N97, then, to my young mind, the pinnacle of technical refinement, before the infamous burning platform memo that presaged the downfall of Nokia. But I digress…
For that particular problem, the CTO suggested PyParsing⁵, a library for - you guessed it - building parsers in Python. With that background, my first thought was to build a parser that understood the intricacies of the USSD framework’s routing protocol. I quickly realised it would be a fool’s errand to write a full-blown Python parser when several already existed, but whatever I used had to expose high-level semantics for me to work with comfortably. That’s when I stumbled upon the Python ast⁶ module. It ships with the standard library, and is essentially a wrapper around the lower-level _ast module. (Note the underscore.)
Static analysis enables you to inspect code without actually running it, which helps keep build times short.
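As a small, contrived example of what the module gives you, the snippet below parses a string of source code and lists the classes defined in it, along with their base classes, without ever importing or executing that code.

import ast

source = """
class Welcome(Handler):
    next_handler = 'Feed'

def helper():
    pass
"""

tree = ast.parse(source)

# Walk the top-level nodes and report class definitions and their bases.
for node in tree.body:
    if isinstance(node, ast.ClassDef):
        base_names = [base.id for base in node.bases]
        print(node.name, base_names)  # prints: Welcome ['Handler']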
The approach I took boiled down, in essence, to:

1. Reading source files into strings.

2. Parsing the source into a high-level Abstract Syntax Tree representation using ast.parse.

3. Filtering to only subclasses of the base handler class, which were the most affected by the dead code scourge.

4. Flattening the AST nodes nested within each class to pull out the routing information. There were multiple ways of declaring routes, each suited to a different context. These were, thankfully, widely understood within the team; as a precursor to this line of work, I had written a document detailing the internals of the framework, since none existed yet - development had been driven organically by product needs up until that point.

5. Constructing all possible routes from each handler, and using them to build a tree detailing, from start to finish, every user journey supported by the handlers in question.

6. Filtering out the paths that didn’t start from the known entry point; handlers that appeared only on those paths were returned as a list of “dead handlers”.

7. Wrapping all of this in a Django⁷ management command that accepts, as parameters, the file path to a module containing handlers and the starting point (a sketch of what such a command might look like follows below).

This led to a great reduction in dead code, though some of the checks produced false positives. Our CI⁸ tool would flag any dead code with a broken build, and reviewers could add false positives to a whitelist to reduce noise.
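A hypothetical sketch of what such a management command might look like, built on the detector functions from the primer below (the module layout and option names here are my own invention, not the actual Jumo code):

from django.core.management.base import BaseCommand, CommandError

# Assumed module; see the detectors.py listing in the primer below.
from detectors import (
    get_ast,
    get_handler_nodes,
    get_handler_exit_paths,
    is_handler_in_journey,
)


class Command(BaseCommand):
    help = "Report handlers unreachable from the given starting point."

    def add_arguments(self, parser):
        parser.add_argument("module_path", help="Path to a module of handlers")
        parser.add_argument("entry_point", help="Name of the starting handler")

    def handle(self, *args, **options):
        tree = get_ast(options["module_path"])
        handlers = get_handler_nodes(tree)
        registry = dict(map(get_handler_exit_paths, handlers))

        dead = [
            name for name in registry
            if not is_handler_in_journey(name, options["entry_point"], registry)
        ]
        if dead:
            raise CommandError("Dead handlers: %s" % ", ".join(sorted(dead)))
        self.stdout.write("No dead handlers found.")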
A quick primer
Given the following code, loosely mimicking the USSD framework handler model, your task is to determine which handlers are no longer part of the user journey.
handlers.py
class Welcome(Handler):
    def handle(self, request):
        if request.user.is_logged_in:
            return request.redirect('Feed')
        else:
            return request.redirect('SignIn')


class SignIn(Handler):
    next_handler = 'Feed'


class Feed(Handler):
    next_handler = 'Quit'

    def handle(self, request):
        return self.display("What's happening")


class LegacySignIn(Handler):
    previous_handler = 'LegacyWelcome'

    def handle(self, request):
        if request.user.has_active_subscription:
            return request.redirect('LegacyFeed')
        else:
            return self.display('Please supply your credentials')


class LegacyFeed(Handler):
    next_handler = 'Quit'

    def handle(self, request):
        return self.display("What happened back then")


class SomeUtilityClass:
    def do_useful_stuff(self):
        return self.done()


class Quit(Handler):
    def handle(self, request):
        return self.display('Adios!')


def some_utility_function(params):
    return do_stuff_with_the_params(params)
detectors.py
from ast import parse
from _ast import ClassDef, Assign, Call, Return
# reduce is a builtin on Python 2; importing it keeps this working on Python 3.
from functools import reduce


def is_class(node):
    return isinstance(node, ClassDef)


def is_handler(node):
    # A class counts as a handler if any of its bases is named 'Handler'.
    type_ids = map(lambda base: base.id, node.bases)
    return reduce(
        lambda found, type_id: found or type_id == 'Handler',
        type_ids,
        False
    )


def get_ast(module_path):
    with open(module_path) as module_file:
        source = module_file.read()
    return parse(source).body


def get_handler_nodes(tree):
    classes = filter(is_class, tree)
    handlers = filter(is_handler, classes)
    return list(handlers)


def deconstruct_body(node):
    # Recursively unpack nested bodies (classes, methods, if/else branches),
    # producing nested lists that flatten_nodes collapses afterwards.
    accumulator = []
    if hasattr(node, 'body'):
        accumulator.extend(map(deconstruct_body, node.body))
        if hasattr(node, 'orelse'):
            accumulator.extend(map(deconstruct_body, node.orelse))
    else:
        accumulator.append(node)
    return accumulator


def flatten_nodes(deconstructed_handler_body):
    accumulator = []
    for item in deconstructed_handler_body:
        if isinstance(item, list):
            accumulator.extend(flatten_nodes(item))
        else:
            accumulator.append(item)
    return accumulator


def is_assignment_exit_path(node):
    return (
        isinstance(node, Assign)
        and node.targets[0].id in ['next_handler', 'previous_handler']
    )


def extract_exit_path(node):
    # Exit paths are either class attributes (next_handler/previous_handler)
    # or calls to request.redirect('SomeHandler') in a return statement.
    if is_assignment_exit_path(node):
        return node.value.s
    elif (
        isinstance(node, Return)
        and isinstance(node.value, Call)
        and node.value.func.attr == 'redirect'
        and node.value.func.value.id == 'request'
    ):
        return node.value.args[0].s


def get_handler_exit_paths(handler_class_ast):
    deconstructed_body = deconstruct_body(handler_class_ast)
    flat_node_list = flatten_nodes(deconstructed_body)
    exit_paths = [
        exit_path
        for exit_path in map(extract_exit_path, flat_node_list)
        if exit_path is not None
    ]
    return (handler_class_ast.name, exit_paths)


def is_handler_in_journey(test_handler, initial_handler, exit_path_registry):
    if test_handler == initial_handler:
        return True
    # Handlers that redirect outside this module simply have no exits here.
    exit_paths = exit_path_registry.get(initial_handler, [])
    for handler in exit_paths:
        if test_handler == handler:
            return True
        if is_handler_in_journey(test_handler, handler, exit_path_registry):
            return True
    return False
To Test
from detectors import (
    get_ast,
    get_handler_nodes,
    get_handler_exit_paths,
    is_handler_in_journey,
)


def test_handler_is_in_journey(handler_to_test, initial_handler, module_path):
    module_ast = get_ast(module_path)
    handler_nodes = get_handler_nodes(module_ast)
    exit_path_registry = dict(map(get_handler_exit_paths, handler_nodes))
    assert is_handler_in_journey(handler_to_test, initial_handler, exit_path_registry)
The below code passes silently
test_handler_is_in_journey('SignIn', 'Welcome', 'handlers.py')
The below code throws an AssertionError
test_handler_is_in_journey('LegacySignIn', 'Welcome', 'handlers.py')
Last words
While this particular illustration is quite specific to the Jumo context, these same tools and approaches can be used to reduce manual effort in your development workflow. An example would be detecting unused views in a web application that uses string-based routing.
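To make that concrete, here is a rough, hypothetical sketch of the idea: collect the view functions defined in one module, collect every string literal that appears in the URL configuration, and report the views whose names never show up in any routing string. The file names and the string-based routing convention are assumptions, not any particular framework’s API.

import ast


def defined_views(views_path):
    # Names of the top-level functions defined in the views module.
    with open(views_path) as source:
        tree = ast.parse(source.read())
    return {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}


def referenced_strings(urls_path):
    # Every string literal appearing anywhere in the URL configuration.
    with open(urls_path) as source:
        tree = ast.parse(source.read())
    return {
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    }


def unused_views(views_path="views.py", urls_path="urls.py"):
    strings = referenced_strings(urls_path)
    # A view counts as unused if its name never appears in a routing string.
    return {
        name for name in defined_views(views_path)
        if not any(name in candidate for candidate in strings)
    }


if __name__ == "__main__":
    print(unused_views())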
¹ https://jumo.world
² Unstructured Supplementary Service Data: https://en.wikipedia.org/wiki/Unstructured_Supplementary_Service_Data
³ http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Metaprogramming.html
⁴ As a side note, it’s only now struck me that most of my career has been around targeting low tech delivery media.
⁵ PyParsing: http://pyparsing.wikispaces.com/
⁶ https://docs.python.org/2/library/ast.html
⁷ Django: https://www.djangoproject.com/
⁸ Continuous Integration: https://www.thoughtworks.com/continuous-integration
okal · 7 years
Audio
Sunny War - If It Wasn’t Broken
okal · 7 years
Text
Dev Misadventures, Part I: Boy Meets Asynchronous World
While working on a consulting gig a few years ago, I was tasked with implementing a content syncing system to support a mobile app intended to work in limited connectivity contexts. 
When I took over development of the platform, the distribution model involved pre-installing the application onto low-cost Android tablets, downloading the content, and then sending the tablets off to user sites. This was made necessary by the fact that the content was a huge, monolithic, multi-GB dump, and most of our users had limited Internet access, if any at all, making it impossible for them to set it up themselves. It proved to be a costly logistical nightmare whenever the app or content was updated, as the tablets had to be returned to the office. This had the additional unfortunate effect of lengthening development cycles, since we could not send out regular updates.
My brief was, basically, to:
Reduce the need for content downloads.
Reduce the amount of data transferred when downloading content, a problem that arose from the monolithic nature of the data dump.
Make it possible to update the content on-site, rather than having the tablets returned whenever an update was needed.
I designed and built a system that:
Provided endpoints for specific, filtered content. The original design pushed all the data to the app, and filtered on the device.
Sent a push notification to client devices whenever content was updated. The device would then query an endpoint that returned a list of content changes, each with a unique ID, and use those IDs to determine what content needed to be pulled, rather than indiscriminately downloading the entire dump (a rough sketch of this follows below).
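A sketch of the client side of that scheme, with made-up endpoint and field names, might look like this; the point is the diffing, so only content the device hasn’t seen gets fetched.

import requests

API = "https://content.example.com/api"  # placeholder endpoint


def sync(last_change_id, local_store):
    # Ask for everything that changed since the last successful sync.
    changes = requests.get(
        "%s/changes" % API, params={"since": last_change_id}, timeout=30
    ).json()

    # Pull only the items we don't already have, not the whole dump.
    for change in changes:
        content_id = change["content_id"]
        if content_id not in local_store:
            item = requests.get("%s/content/%s" % (API, content_id), timeout=30)
            local_store[content_id] = item.json()

    return max([c["change_id"] for c in changes] + [last_change_id])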
This is where my problems started. The notification system hinged on sending a Google Cloud Messaging (GCM) request whenever a model was persisted. The flaw in my approach was that the GCM request was sent within the same thread that served the content editors’ request. During the initial test phase, the system worked like a charm thanks to low load. It still stands as one of the proudest moments of my career: I had built a feature that measurably reduced a business pain point. I basked in the glow of my seemingly flawless architecture for two long weeks of testing. The celebration was cut short when we finally launched the sync feature. It broke down quite badly, and in ways that, in hindsight, I felt I should have been able to predict.
If, for some reason, the GCM request took too long, the request updating or creating the content would time out, leaving me at the mercy of (justifiably) frustrated content editors, who often wondered if they’d lost their changes. This happened often: the service was hosted by a now-defunct Kenyan cloud provider, chosen to take advantage of low in-country latency since they peered at the Kenya Internet Exchange Point, but they throttled external traffic, which is exactly what the GCM requests were.
The long-running requests also tied up server resources, making the system unresponsive for other users.
It was clear that I needed an asynchronous notification model. At that point, my experience with async programming consisted entirely of working with Camel on a pre-existing Grails/Spring codebase, so I had little knowledge, beyond light reading, of how to go about redesigning the Django-based system I was working on. The launch fiasco came at a pretty bad time, as I was preparing to start a new job. While I didn’t get a chance to do the work myself, I drafted a proposal in my handover notes for the dev who took my place to rebuild the system using Celery, an asynchronous task queue, which, from conversations with him, I believe he ended up implementing.
I was quite happy when, at a later engagement, I found the same problem at play in a different codebase, after the team working on it had struggled for a while to figure out why a server was choking under what seemed like perfectly normal request loads. I felt a sudden sense of déjà vu. It turned out that a user had enabled webhooks that fired whenever new content was submitted. The webhook requests were sent from a post-save signal handler, slowing the entire app to a crawl whenever the destination was unresponsive. It felt good to work that out as quickly as I did, thanks to that earlier humbling experience. The team ended up reworking the post-save signal handler to use an asynchronous task, and all was right with the world once more.
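Both fixes come down to the same pattern: keep the web request/response cycle free of outbound network calls, and hand the slow work to a task queue. Here is a minimal sketch of that pattern with Django signals and Celery; the model, task, and endpoint names are invented, not the actual code from either project.

# tasks.py
import requests
from celery import shared_task


@shared_task(bind=True, max_retries=3)
def notify_devices(self, content_id):
    # The push/webhook request now happens outside the editor's request cycle.
    try:
        requests.post(
            "https://push.example.com/notify",  # placeholder endpoint
            json={"content_id": content_id},
            timeout=10,
        )
    except requests.RequestException as exc:
        # A slow or unreachable destination no longer blocks anyone;
        # the task simply retries later.
        raise self.retry(exc=exc, countdown=60)


# signals.py
from django.db.models.signals import post_save
from django.dispatch import receiver

from .models import Content  # hypothetical model
from .tasks import notify_devices


@receiver(post_save, sender=Content)
def schedule_notification(sender, instance, created, **kwargs):
    # Enqueue and return immediately; the editor's save never waits on the push.
    notify_devices.delay(instance.pk)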
okal · 8 years
Video
Abstractions. An ambient bass/guitar duet.
okal · 9 years
Text
Commander's Intent
There is a wonderful section in the book “Made to Stick” that applies 100% to interaction design. It is the section on what is called “Commander’s Intent”. 
The book describes Commander’s Intent as a tactic the U.S. Army uses to prioritize decision-making. As you might imagine, it’s impossible to devise and communicate a strategy at the start of a military operation that holds throughout the engagement. Too many things change along the way: the situation becomes more or less intense, the enemy introduces some difficult variable, the battle morphs in some other unpredictable way. 
Commander’s Intent is a simple goal set at the beginning of an engagement that holds throughout, no matter how the situation changes. Instead of describing in detail a step-by-step, day-by-day strategy of how your battalion is going to take the mountaintop from the enemy, as the initial conditions might suggest, you simply say “We will take control of the mountaintop within ten days”. This leaves autonomy in the hands of the commanders on the ground, the people who know a lot more about the situation as it changes during battle. No matter what happens there, the Commander’s Intent will hold. 
You arrive at a Commander’s Intent by asking a simple question: “If we do nothing else during tomorrow’s mission, we must…”. The answer to this question becomes the overarching goal of the mission and allows all who work on that mission to prioritize each action they make. 
It turns out that asking this question of each screen we design is extremely valuable. “If this screen does nothing else, it must…”. This can help designers focus completely on the most important action the screen is designed to support, even as design conditions change as the result of user testing and feedback. Even though the screen may change over time, with designers emphasizing some parts or de-emphasizing others, the original purpose should stay intact over time. 
Some tips when using the Commander’s Intent in design: 
If you don’t know what the one purpose of the screen is, then get rid of the screen. 
If you think a screen has two primary purposes, you can probably break it into two screens. 
If something on the screen doesn’t support the one purpose, get rid of it. 
If the one purpose does not add to a positive user experience, then get rid of the screen. 
In most cases bad screen design is caused by a lack of prioritization on the part of the design team. Commander’s Intent came out of a completely different world, but can nonetheless help designers ruthlessly prioritize the decisions they have to make. 
okal · 9 years
Text
For Sandra Bland. For Sam DuBose.
There’s room enough, for family,
Room enough, for you.
We have little,
But we’ll make do.
Come back home,
Will you?
okal · 10 years
Text
An Idea: Open Data Bounties
Background
I needed (well, I still do) historical data on annual Kenyan budgetary allocations from 1963, the year of independence from Britain, to the present. The first place I looked was http://www.opendata.go.ke, which yielded little of use for my particular case. It emerged, after a short conversation on Twitter with @ItsSifa, that the best place to look would be the Kenya National Bureau of Statistics (KNBS). The problem is that the data they publish for free is, at best, patchy, and you'd have to send in a request before they'd let you even pay for a non-free dataset. The idea below applies in broader contexts, so I thought I'd share it.
Why should it be free, anyway?
The KNBS is funded largely from the public coffers. Any work it produces should, ethically, go into the public domain. The people who need to see it most are often the least able to afford it, which stands in the way of an informed citizenry.
Okay, I'm listening. Why ISN'T it free?
Good research costs good money to produce, and it's only natural that the KNBS and other similar organizations would want monetary compensation for their work.
Can it be free?
Yes. Progressive governments all over the world, at national and local levels, provide data on their operations, as well as statistical data they've collected, free-of-charge to their citizens. This, however, requires a greater amount of goodwill and financial investment than can be typically expected of most governments. This leaves little incentive for organizations charged with data collection to release it for free, as it simply wouldn't be sustainable.
A proposal
To get around these complications, I propose a system where datasets have bounties placed on them, fulfilled via crowdfunding. The provider sets whatever price they deem fair. If/when this is met, the dataset is automatically released into the public domain, with the author ceding all (or most) copyright claims. It would then be distributed as a torrent, moving the cost of hosting and distribution into the hands of those interested.
Disclaimer
I'm very sleepy, and this is probably a pretty useless idea :-)
okal · 10 years
Text
The Sound of November Dying
okal · 10 years
Text
Truism
“Broken gets fixed. Shoddy lasts forever.”
One of the developers I work with said this after I complained about a lingering issue in one of our products. It rings true. When deadlines are tight, and there is more work to get done than there are developers or hours in the schedule, it’s not the squeaky wheel, but the jammed one that gets the grease. The lesson, then, is to make sure it gets done right the first time. You never know when you’ll have the opportunity to revisit it.
okal · 10 years
Audio
Many evolutionists believe that humans have a drive for waging war. But they are wrong and the idea is dangerous. Read by Sam Dresser.
okal · 10 years
Text
The Tyranny of Language
I find that practically all of my conscious thought, since I left high school, the voice in my head that explains things to itself, has been expressed in English. Mostly because the words for the meat of my experience simply don't exist in the other languages I grew up with: Kiswahili and Dholuo. I wonder if this has led to a thinning of said experience, how much the language in which I think shapes and confines me. I more readily relate to the stories of melancholic White American men, shouting themselves hoarse on stage, than I do to what I imagine Eva, my grandmother's, story was when she was 25 like me, told in a richness I can't comprehend for my fragile grasp of the only language we share, Dholuo.

I wonder, too, what thoughts English keeps me from thinking. Does an Anglo-normative world even allow for the possibility that there are thoughts closed off to it? That there are parts of a wider human experience that a single language simply can't capture by means of a widening vocabulary? My mind wanders to the Newspeak of Orwell's Nineteen Eighty-Four. Thinking bad, sorry, "ungood", thoughts about Big Brother is impossible, because you can't speak them. Paul Graham's essay, Beating the Averages, captures the idea quite well, though it concerns itself with programming, rather than spoken languages: "...they're satisfied with whatever language they happen to use, because _it dictates the way they think_ about programs."

Perhaps the larger problem is that I even need a tool as broken as language for conscious thought. I think of my nephew, Dungani, the beautiful place his mind must be, as yet untainted by speech. I'd love to have such purity of thought. What story does he tell himself when he takes a pan to the couch and pretends he's cooking? What does this story sound like? Does he realize the spoon is empty, when he lovingly offers it up to me to taste his efforts? Does he have any conception of "empty"? Is the world he's contrived for himself bigger, or more constrained for his lack of language?

I want to go back 24 years. I want to know what it's like to know, without the awkward contortions of speech. I can't, and I'm incredibly, embarrassingly saddened by that.
okal · 11 years
Text
Setting up a VIM-based Common Lisp development environment on Ubuntu
Spurred on by Adam Tornhill's book, Lisp for the Web [1], I restarted my on-again, off-again attempt at picking up Lisp. Getting up and running on my machine wasn't as easy as I'd hoped. Specifically, getting Quicklisp to work with GNU Clisp wasn't that straightforward - operator error, as it turned out :-) - and SBCL, which was easy to get working with Quicklisp, has what is, in my opinion, an unusable REPL. What I wanted was:
1. Ability to persist REPL sessions so I could pick up where I left off at a later time.
2. Easy integration with Quicklisp.
3. Tab-completion in the REPL. It's 2014, this shouldn't be too much to ask.
I live by my ~/.vimrc, so SLIME [2] - Superior Lisp Interaction Mode for Emacs - wasn't an option. I didn't feel inclined to pick up a new environment, Emacs, to work through a 45-page book. Enter SLIMV [3], a vim port of SLIME. It's meant to work out of the box, but I had a hard time getting it running. This “guide” should help you avoid some of the pain I went through. At the end of it, you should have Quicklisp and SLIMV running, all without suffering the indignity of using Emacs :-)
NOTE: I wound up using SBCL, since I couldn't get SWANK running on GNU Clisp, because of a threading issue of some sort, if memory serves.
That's it. Just open up your .lisp file and enter ",c" in normal mode to connect to the SWANK server. You may also want to check out the SLIMV tutorial to get a sense of its capabilities. [4]
Footnotes
1. “Lisp for the Web” - https://leanpub.com/lispweb. A great little introductory text for writing web applications in Common Lisp. Author's a friendly guy, too.
2. SLIME on Wikipedia - http://en.wikipedia.org/wiki/SLIME
3. SLIMV public repository - https://bitbucket.org/kovisoft/slimv/
4. The SLIMV tutorial, Part I - http://kovisoft.bitbucket.org/tutorial.html