A Tree Codes in Brooklyn
32 posts
atreecodesinbrooklyn · 8 years ago
Volunteering -- It is Fun
I had a great time volunteering this week with Women in Technology and Entrepreneurship in New York. My co-coach Joe Kohlmann of the NYTimes and I helped five CUNY incoming freshmen go from zero Python to running a Flask web server. They also got up and running with JavaScript and jQuery, collaborating via their own new GitHub accounts. Here’s the stylish prototype for a senior center kiosk that they wire-framed, coded, debugged, and shipped to production, in just two days! Joe and I were so proud we teared up. We hope these noobs will study computer science and will not be underestimated.  Note to self: Save the window.alert() lesson for after the UX lesson.
atreecodesinbrooklyn · 8 years ago
Great Tutorial for Getting Started with Kafka Streams
Sorry for the lack of updates. I’m doing a lot of things here, including getting up to speed on Kafka. I recently attended the NYC Kafka Summit and then helped Michael Noll (one of the speakers, from Confluent) improve a tutorial on the Kafka Streams API.
atreecodesinbrooklyn · 8 years ago
Going Serverless
I just tried out AWS Lambda a few weeks ago. It’s part of the evolution of software architecture. We are going from patching the same servers for years to using tools like Kubernetes to spin up a whole new server for any code change. The latter is the concept of “immutable infrastructure.” And now, for some apps, we're going “serverless,” which really just means letting AWS handle the server scaling, upgrades, and security.
It was super fast to get the app deployed. We used Node.js and Serverless, a library that simplifies deploying to AWS Lambda. “Serverless” is both a concept and a specific free and open source framework.
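For flavor, here's the basic shape of a Lambda handler: a function that takes an event and a context and returns an API-Gateway-style response. Ours was Node.js; this is just an illustrative sketch in Python (the names and fields here are mine, not our actual app):

```python
# Minimal AWS Lambda handler sketch (Python runtime).
# API Gateway passes query string parameters inside the event dict.
def handler(event, context):
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    # API Gateway expects a dict with statusCode and body
    return {"statusCode": 200, "body": f"hello, {name}"}
```

The whole "server" is just this function; AWS decides where and how often it runs.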
A few tips:
pryjs
My favorite Ruby debugging tool has a JavaScript version, which works great with Serverless. Make a breakpoint with: eval(pry.it)
Invoke the function locally with specific parameters:
serverless invoke local --stage dev -f experiment -d '{ "httpMethod": "GET", "path": "/search.json", "queryStringParameters": { "plate": "1883NZB", "site": "example-site", "device": "cheesefries", "signature": "262e8a26829bd54789f38" }, "headers": { "Accept": "application/json" } }'
"experiment" is the name of the function we're calling. "stage" refers to what we usually call the environment.
Tail the logs:
serverless logs -f experiment --stage dev
atreecodesinbrooklyn · 8 years ago
Keeping It Simple
One of my first editors in journalism told me one of journalism's most important principles: KISS, “keep it simple, stupid.” If readers can’t understand the first paragraph, they will probably stop reading. The lesson stuck with me, because it is so… simple.
A career change later, I found out that KISS was actually coined by engineers. The term is credited to Kelly Johnson, the head of Skunk Works, Lockheed's top-secret R&D team during the Cold War.
Why is KISS so important, and so necessary? There are many incentives to overcomplicate our code. For one, complexity may seem more creative and more clever. In writing, overwrought language is called “purple prose.” In engineering, it’s “reinventing the wheel.” For obsessives or perfectionists, it might be hard to know when enough is enough, so they keep “gilding the lily.”
Another incentive to overcomplicate comes from the mistaken sense that if others can't understand, they will be impressed. The most cynical may even do so to make themselves indispensable, because no one else can read their work, not to mention take it over.
A third is that some people think: oh, the more labor, the better. They may measure their own contribution by sweat and time rather than skills and knowledge. It’s easy to see why: If you were to call a lock picker who takes only thirty seconds to open the door, might you resent paying the same $150 fee as for a lock picker who took an hour? In the first case, you might be paying for thirty years of experience, but it may not feel the same.
The above are bad reasons to complicate code, but there are also some legitimate ones. My favorite is Joel Spolsky’s parable on the life and death of Netscape. He says that a codebase gets hairy after years of bug fixes, and that hairiness is accumulated knowledge we can’t simply streamline away. Another reason code may be hard to read is that jargon saves words.
Having given some legit reasons why code is hard to understand, here’s an overview of the philosophy of simplicity, spanning many disciplines:
In science, there’s the age-old principle of Occam's razor, a heuristic to guide scientists in developing theoretical models. Occam was a medieval friar, and the razor is for shaving off the hairy, overcomplicated theories. “All evidence being equal, simpler explanations are generally better than more complex ones.” The heliocentric model of the solar system is a lot simpler than the geocentric one, and it ended up being true, although there may be some new debate about that, as the Flat Earth model has gained popularity lately.
In writing, there’s Strunk and White’s The Elements of Style, one of the most influential books written in English in the last century: “Make every word tell.” “Omit needless words.”
In UX design, there’s “Don't Make Me Think,” by Steve Krug. The main idea is, 'Don’t be subtle. Make it obvious.' An app should be easy to understand and easy to use. “It doesn’t matter how many times I have to click, as long as each click is a mindless, unambiguous choice.”
In programming, there’s Sandi Metz, who wrote POODR, a classic on Ruby design. The cardinal rule of coding, as Brian Kernighan put it, is that debugging is twice as hard as writing the code in the first place; or as Joel Spolsky says, it’s harder to read code than to write it. Hell is other people's code, as Sartre might have said, if he were a computer programmer.
~From a lightning talk I gave
atreecodesinbrooklyn · 8 years ago
First flight with Kubernetes and Terraform
Just deployed with Kubernetes and Terraform for the first time. It’s our shift towards “immutable infrastructure,” which means that instead of tinkering with a long-running Linux instance -- one of ours has been running for about a year and a half, since I moved it to AWS EC2 -- any code change means a fresh total rebuild.
Our deployment process: When we push code changes to GitHub, GitHub posts a webhook notification to CircleCI, which builds the project. If it’s on a designated branch, CircleCI will deploy, which means building a new Docker container image, replacing the old image in our AWS ECR container registry with the latest, and rebuilding the app with that latest image on our Terraform-provisioned resources.
atreecodesinbrooklyn · 8 years ago
Beyond Conventional Testing
Conventional testing frameworks have fixed inputs and outputs. Developers spend a lot of time writing up rules for what code should do and then implementing them. We create fake inputs called fixtures, then make sure the output matches exactly at each step. The philosophy of unit testing: the smaller the steps we break down and verify, the better.
Sometimes we can allow a little flexibility by writing a regex or a JSON schema and then checking whether the output fits that outline. But for the most part, the input and output are rigid, and each test is pass/fail: match or mismatch. If we want to change one thing, we may have to modify all the other tests down the line too, a cascade of work. And what about when we are refactoring code that is already live? Or when we want to test on live data, where the results vary by the second?
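To make the rigidity concrete, here's a toy sketch of the conventional style (the function and data are invented for illustration): a fake input fixture, and an expected output that must match exactly.

```python
# An invented function under test
def summarize(visits):
    return {"total": sum(visits), "days": len(visits)}

# Conventional unit-test style: fixture in, exact expected output asserted
fixture = [3, 0, 5]                    # fake input data
expected = {"total": 8, "days": 3}     # output must match exactly, field by field
assert summarize(fixture) == expected
```

Change what summarize returns, even harmlessly, and every assertion like this breaks.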
Scientist, a gem that engineers at GitHub released last year, is perfect for this. You know how a biologist studying living things compares a control group to an experimental group? That's the idea here. When you're using Scientist, it doesn't matter whether the program is adhering to rules internally. Scientist just compares the outward results of methods written in different ways, so that you can see the impact and pick the best code.
By default it prints a table of statistics including percent match and runtime. You can override its publish method, as done here by Doug Patti, who implemented Scientist in Node.js. See the objects experiment, control, and candidates.
class Scientist::Default
  def publish(result)
    puts "Experiment results for '#{name}':"
    if result.matched?
      puts " -> values matched"
    elsif result.ignored?
      puts " -> values ignored"
    else
      puts " -> values mismatched"
      puts "expected: #{observed(result.control)}"
      puts "got: #{observed(result.candidates.first)}"
    end
  end

  def observed(observation)
    if observation.raised?
      observation.exception
    else
      observation.cleaned_value
    end
  end
end

class ScienceRunner
  include Scientist

  def test
    science "first test" do |experiment|
      experiment.use { old_method } # the existing method / API call / query
      experiment.try { new_method } # the new implementation to compare
    end
  end
end
See how useful this could be? To me it seems like part of a greater trend in computer science right now, the same one behind neural networks. Sorry to anthropomorphize computer programs, but developers are going beyond micromanaging the behavior of our code. It takes so much brainpower to micromanage, as in conventional unit testing: verifying each step of the program, writing specs, mocks, fixtures, and stubs, and changing them all down the line after every change. So the flexibility saves a lot of time.
We still need conventional unit tests, of course, to write programs in the first place. But Scientist allows us to refactor code with great speed. I am using it first to rewrite an API, and then, in migrating our SQL Server data warehouse to a Druid column store, to rewrite the queries for our analytics charts.
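The pattern itself is tiny. Here's a language-agnostic sketch of the idea in Python (the names are mine, not the gem's API): run both paths, compare, publish the verdict, and always return the control so production behavior doesn't change.

```python
# Sketch of the Scientist pattern: compare old and new code paths in production
def run_experiment(name, use, try_, publish=print):
    control = use()        # the existing, trusted code path
    candidate = try_()     # the refactored code path
    verdict = "matched" if control == candidate else "mismatched"
    publish(f"Experiment '{name}': values {verdict}")
    return control         # callers always get the control's result

# Usage: old sorting code vs. a rewrite; both paths run, the old answer wins
result = run_experiment("first test",
                        use=lambda: sorted([3, 1, 2]),
                        try_=lambda: sorted({1, 2, 3}))
```

The real gem adds timing, sampling, and error handling on top, but this is the core comparison.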
atreecodesinbrooklyn · 8 years ago
Docker for Development
Just wanted to say that Docker is not just for devops. One of our rustiest legacy applications took hours to set up locally: lots of makefile errors for essential libraries. Compiling the dependencies shouldn’t be that difficult. Maybe the multiple versions of Ruby and their gems were in conflict on my machine, not separated well enough by rbenv in my filesystem. I have yet to dockerize the production deployment for this app, but dockerizing my development environment alone was a helpful step.
Words to the wise:
-The Homebrew installation of Docker was buggy. I had to download it from docker.com instead.
-Get clear on the difference between Docker Toolbox and Docker for Mac. Docker for Mac is newer and better. Don't mix up advice for the two.
-Read up on the difference between images and containers.
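For reference, a minimal Dockerfile for a legacy Ruby app might look something like this (the base image version and commands here are assumptions for illustration, not our actual setup):

```dockerfile
# Pin the Ruby version so local machines stop fighting over rbenv
FROM ruby:2.3

WORKDIR /app

# Install gems in their own layer so they cache between code changes
COPY Gemfile Gemfile.lock ./
RUN bundle install

COPY . .
CMD ["bundle", "exec", "rails", "server", "-b", "0.0.0.0"]
```

With something like this, docker build replaces the hours of makefile debugging, and each app keeps its Ruby and gems isolated inside its own image.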
atreecodesinbrooklyn · 9 years ago
A new carousel
My blog might be a little confusing because I’m jumping back and forth between a few different projects. One of the apps I’m responsible for is our parking business analytics application. For that I am migrating two billion rows of data from a SQL Server database to a Postgres database. I’m also redesigning and refactoring the charts there.
The other app I’m responsible for runs on a kiosk to help people find their cars in big parking garages. Drivers who have forgotten where they parked can type in part of their license plate number, and it presents two maps: “You are here” and “Your car is here.”
For the kiosk app I spent the past two days reworking the carousel that shows pictures of cars similar to the plate the driver searched for. The design previously asked users to swipe between car images. But we found that the kiosk screen was not easy to swipe on. For one, some of the kiosks weren’t sensitive to swiping, even though they were the same make and model as the others. For another, users are different heights, so it’s an awkward movement. I think swiping was invented for small screens, and that’s what it’s best for.
So now it’s a carousel. Drivers tap on the smaller, greyed-out car image to the left or right of the main car image in order to move it to the center. I might have coded the carousel from scratch, but after some research I really liked the stylish and sensible transition animations of a plugin called Moving Boxes. I noticed it’s maintained by the same coder whose multilingual keyboard I added to our kiosk app a few months ago. Highly recommend both. The code is very well written, so it was easy to modify.
atreecodesinbrooklyn · 9 years ago
Debugging SQL Queries
Part of the data migration I’ve been working on since August involves refactoring the app for a Postgres data warehouse. We’re leaving behind Microsoft SQL Server. Microsoft’s T-SQL is slightly different from Postgres: the datatypes differ, and so does the query language. In good form, my supervisor also wanted the new queries refactored in ActiveRecord rather than raw SQL. So over the past two weeks I’ve gotten the hang of debugging massive SQL queries.
Today I debugged a 75-line SQL query that was originally written as four ActiveRecord subqueries, including a cross join, a left join, and several case statements. The query was adding up statistics over a user-selected timespan of days or weeks. It would be simple to just average out all the days if we were counting each whole day. But what made it complicated was that the user also selects just a few hours of each day. You can picture it as a slice through a calendar. The problem turned out to be that a case statement -- which determines the start and end times of each calculation, depending on whether the params start time or the visit start time happened first -- was inadvertently replacing null values with a default when we wanted to keep them null. So we were counting even time slices where there was no activity.
SELECT
  ...,
  CASE
    WHEN visit_start_timestamp IS NULL THEN NULL
    -- without this line, even records in the join table without a visit were
    -- getting added in, with params_start_timestamp as the default
    -- start_time_per_day_slice
    WHEN visit_start_timestamp > params_start_timestamp THEN visit_start_timestamp
    ELSE params_start_timestamp
  END AS start_time_per_day_slice
FROM #{CROSS-JOIN-TABLE} LEFT JOIN VISITS TABLE...
It was a great relief to see that Sublime does SQL syntax highlighting like any other language. That way I could just copy the query from the Rails log, drop it into Sublime, and run the subqueries one by one to see what was wrong with the join-table subqueries.
atreecodesinbrooklyn · 9 years ago
How to move two billion rows
I’m a few months into a migration of our data warehouse. It doesn’t qualify as Big Data -- that would mean billions of rows per table. This is just a total of two billion rows spread across a few dozen tables. I’m migrating them from Microsoft SQL Server to a Postgres database. I started in August and gave it 16 agile points. Five months later, I’ve learned a great deal, not the least of which is the importance of managing expectations.
First thing I learned was what not to do. I spent a few weeks trying to get a janky piece of software, DBMoto, to do the job. I spent a few days with the company trying to figure out how to use it -- it was not clear from the console. The user interface of the console felt a few decades old, but I told myself the makers just weren’t that superficial -- that they must have been focusing on the underlying performance. But when it finished moving the tables, I ran basic QA queries -- GROUP BY, then count the groups -- and the counts were off by a few million on some tables. So I had to change tack. Up next in this series: bulk copying, converting and dumping.
atreecodesinbrooklyn · 9 years ago
A new solution to a kiosk problem
At my company, one of our products is a kiosk with an app that helps drivers who have forgotten where they parked. They type in their license plate, and the Rails app I wrote displays maps of where they are and where their car is. Previously the kiosk app was not in the cloud but stored locally. It was a huge hassle to upgrade and maintain. I put it on Rails in the cloud, and that made it much easier to improve.
However, there was one problem. Once in a blue moon the new app -- or more specifically the Elastic Load Balancer -- would send back a 5xx server error. Though I put three app instances behind the load balancer to handle high traffic, and the instances weren’t going down, the ELB still occasionally sent back a 5xx for no known reason. (Even when AWS monitoring showed the instances all healthy.) And when that happened, the kiosk would just blank out. The browser would lose its connection to the app. On a blanked-out browser in kiosk mode, drivers had no button to click to make a new request. Our cloud-hosted screensaver, which usually runs a video and then at the end of the video requests itself again, would break out of its loop. Because of the nature of the HTTP protocol -- the client must initiate all communication with the server -- the kiosk would, essentially, go rogue. For hours. Days. However long it took someone on site to notify us.
At first we set up an alarm to email and text me, my supervisor, and a network engineer whenever the load balancer returned a 5xx error. Inevitably this would happen on a Friday night or a Saturday afternoon -- when parking garage traffic was highest and we were furthest from our desks. We had to be on call. I would rush home and immediately log in remotely to every kiosk, just to find the one that had blanked out and press Command-R to refresh the page.
That’s until I made a Google Chrome extension to recover the kiosk. It intercepts a 5xx response and redirects the browser to a new page that looks like it’s part of our Rails app but is actually hosted elsewhere. It lingers a bit on that external page -- which has a button for the person standing in front of the kiosk to click -- and then automatically redirects back to the app screensaver. The Google Chrome API is easy to get the hang of and extremely useful for kiosk apps.
atreecodesinbrooklyn · 10 years ago
Making Heatmaps with Pandas
Everyone likes giant pandas, but I really like Python pandas now, after taking Sergey Fogelson’s Data Science Workshop at the Flatiron School two weeks ago. The name pandas comes from “panel data,” and it doubles as a loose acronym of “Python data analysis library.” Either way, it’s a good metaphor -- these tools crush datasets the way an average male bear crunches through 80 pounds of bamboo every day.
It was an incredibly productive workshop, two days that boosted my data coding skills to another level. In the weeks afterward I made a heat-map of New York bikeshare routes. My process involved the IPython notebook, pandas’ groupby and sort methods, and the Google Directions and Heatmap APIs.
Citibike provides a list of all bike trips, including start and stop station latitude and longitude. With pandas, I grouped the trips by start and stop station and then counted total trips per route.
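The groupby step looks roughly like this (column names simplified here; the real Citibike schema differs):

```python
import pandas as pd

# A toy slice of the trip data: one row per trip
trips = pd.DataFrame({
    "start_station": ["Grove St", "Grove St", "W 4 St", "Grove St"],
    "end_station":   ["W 4 St",   "W 4 St",   "Grove St", "W 4 St"],
})

# Count trips per (start, end) route, busiest routes first
route_counts = (trips
                .groupby(["start_station", "end_station"])
                .size()
                .sort_values(ascending=False))
```

Each row of route_counts is then one route, weighted by trip count, ready to feed the heat-map.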
Then I made Google Directions API calls to get a polyline, a very abbreviated representation of each route. It was easy to decode each polyline with the Python polyline module. Each polyline translated into a list of latitude and longitude coordinates. My Sinatra app uses the Google Heatmaps API: it takes the list of latitudes and longitudes and layers them over the map.
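The polyline module does the decoding for you, but the algorithm is short enough to sketch in pure Python. This decoder follows Google's published encoding scheme (deltas scaled by 1e5, zig-zag signed, packed in 5-bit chunks offset by 63), checked against the example string from Google's own docs:

```python
def decode_polyline(encoded):
    """Decode a Google encoded polyline into (lat, lng) tuples."""
    coords = []
    index = lat = lng = 0
    while index < len(encoded):
        for is_lng in (False, True):
            result = shift = 0
            while True:
                b = ord(encoded[index]) - 63   # undo the +63 ASCII offset
                index += 1
                result |= (b & 0x1f) << shift  # accumulate 5-bit chunks
                shift += 5
                if b < 0x20:                   # continuation bit cleared: done
                    break
            # undo the zig-zag sign encoding
            delta = ~(result >> 1) if result & 1 else result >> 1
            if is_lng:
                lng += delta
            else:
                lat += delta
        coords.append((lat / 1e5, lng / 1e5))
    return coords
```

Google's documentation example decodes to three points near (38.5, -120.2), (40.7, -120.95), and (43.252, -126.453).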
The demo is at https://citibike-heatmap.herokuapp.com/ and the repo is at https://github.com/aprilrabkin/heatmap_of_ny_bike_routes
Now if I can just get ahold of the numbers on “panda diplomacy” — which is a thing that exists — I can use pandas to make a heat-map of international panda routes. 
atreecodesinbrooklyn · 10 years ago
A heat-map I built with pandas and two Google APIs
atreecodesinbrooklyn · 11 years ago
The Physical Part of Information Security
Last week Matt Bentz, Fog Creek's sysadmin and an adjunct professor at NYU, took me along to replace a hard drive and a battery at their data center around the corner.
Trello and Fog Creek keep their servers in the historic International Telephone & Telegraph Corporation building, which has this mosaic of a winged, early-20th-century telecom angel. See the bolt of lightning in his armspan. As a fig leaf, he has the earth's two hemispheres, trailed by a ribbon of ticker tape.
Technology has always flown along blissfully faster than security, and the internet today is no exception. This was the first data center I'd seen outside of the House of Cards hacker scene. It really is that easy to hack into a system with just a USB key. The military secures theirs by filling the USB ports with epoxy. Here, just a stone's throw from the New York Stock Exchange, once you get past the doorman and up the elevator, you could hack into many companies' systems with no more than a USB key on a pole stuck through the chain-link fence.
“Data centers are where physical security meets information security,” says Matt, a former hacker. “If you get access to the machines, you can access all the information. The real security is the people who run it.”
atreecodesinbrooklyn · 11 years ago
Scripting CSV with Fog Creek Mentor Doug Patti
Today I coded a bit for a Democracy Works project to help people find polling places and early voting hours over the next few months. I thought it would be easy to scrape, parse, and merge data from two sites into one spreadsheet. But as the evening wore on, I felt frustrated and embarrassed to be one of the last people in the office, scripting something that would have taken a fraction of the time to copy and paste.
And then Trello developer Doug Patti walked by at just the right moment. He asked me what I was working on, tossed out a few pointers, and eventually sat down. We ended up pair-programming for hours. Here are some awesome tools I picked up from him:
1. Tracing the Network log in Chrome
Strangely, when I click on any jurisdiction on this site, the URL doesn't change, though it seems like a redirect. What appear to be more than a hundred static pages all have the same URL. It was impossible to scrape with Nokogiri alone. Not RESTful. What was going on? 
By watching the Network log in Chrome, Doug pointed out that rather than changing the page with a GET request, each click triggered a POST request, as if the website were filling out its own form. It's a little archaic, but he wasn't surprised: "For a government webpage, it was kind of what I expected." You can inspect this and much more in the Network tab. 
2. Using pipes with cURL in bash
curl <the url> | grep 'state:il' | grep -v 'place' | sed 's/,.*//' | subl
This script grabs everything on the page, then keeps only the rows containing "state:il", deletes the rows containing "place", strips everything after the comma in each row (with sed, the "stream editor"), and finally puts it all in a Sublime file.
3. Chaining blocks in Ruby
[s,c,r,a,p,e,d,d,a,t,a].select do |i|
      <something>
end.reject do |i|
      <something else>
end.map do |i|
       <yet another action>
end 
This is similar to piping results of cURL in bash (as shown above).  
4. Returning an array of values from a block
meal.map do |dish|
     [dish.appetizer, dish.entree, dish.dessert]
end
Useful for lining up data in rows, getting them ready to go into a spreadsheet. (So cool that Ruby has a CSV (comma-separated values) class, with its CSV.open method, for my final step: copying it all into a separate CSV file.)
5. Ruby methods to parse CSV data
.lines handles data pulled out of a spreadsheet. It splits on the \n, creating an array of rows. 
.drop(1) can be used to ignore the first row (usually the column headers).
6. Working with data rows copied and pasted into the text editor
DATA.read.lines.map do |line|
  # do something with each line
end

__END__
<Rows of CSV follow here, directly copied and pasted from a spreadsheet into Sublime>
Yet another way to work with CSV data. 
It was very productive to pair-program with Doug, and I felt lucky that my classmate Charlotte, whom he was originally assigned to mentor, found a job and left him with spare time for me.
atreecodesinbrooklyn · 11 years ago
The Fog Creek Fellowship
Tuning up for the Fog Creek/Trello Talent Show
atreecodesinbrooklyn · 11 years ago
ball tree
I'm really fascinated by machine learning, but I haven't had a chance to study it in depth yet. Here I'll try to explain ball trees based on what I could glean online.
I've heard that ball trees are how OkCupid matches people. Imagine people's answers to the questionnaire as data points thrown onto a graph. Each point represents one person. (If there were only two personality traits in question, an x-y graph would be enough. But since there are countless personality traits, based on hundreds of questions, the graph has countless dimensions.)
The distance between two points represents the similarity between two people. The closer they are, the more alike their questionnaire responses. "Nearest neighbor" is one of the most basic algorithms in machine learning, and it is how you might find a closest match.
A ball tree is one step more advanced: it can find all matches within a certain radius. There's a construction algorithm, the k-d construction algorithm, that splits the entire data set up front in order to expedite nearest-neighbor searches in advance. Maybe it divides the entire dataset into layers. I'm not really sure, so I'm looking forward to listening to these Stanford lectures from a machine learning class on Coursera.
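Under the hood, both queries can be brute-forced in a few lines; a ball tree exists to make them fast on huge datasets by pruning whole regions ("balls") of points at once. A sketch of the brute-force versions (the coordinates are invented):

```python
import math

# Invented 2-D "people" for illustration; real data has many more dimensions
points = [(38.5, -120.2), (40.7, -120.9), (43.2, -126.4)]

# Brute-force nearest neighbor: compute every distance, O(n) per query.
# A ball tree skips most of these distance computations entirely.
def nearest(query, pts):
    return min(pts, key=lambda p: math.dist(query, p))

# Brute-force radius query: the "all matches within a certain radius" idea.
def within_radius(query, pts, r):
    return [p for p in pts if math.dist(query, p) <= r]
```

Same answers as a ball tree, just without the speed; the tree only changes how fast you get there.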