Simon Willison’s Weblog
spwillison · 12 years ago
Text
How exactly do you buy a Kindle book on Amazon?
On Amazon.com...
[screenshot]
Click through to Amazon.co.uk (which goes to the index page, not the page for the book you were looking at):
[screenshot]
OK... search for that book on the amazon.co.uk store:
[screenshot]
I'm stumped.
spwillison · 12 years ago
Quote
The way to seem most formidable as an inexperienced founder is to stick to the truth.
Paul Graham
spwillison · 12 years ago
Photo
[photo]
spwillison · 15 years ago
Text
Comprehensive notes from my three hour Redis tutorial
Last week I presented two talks at the inaugural NoSQL Europe conference in London. The first was presented with Matthew Wall and covered the ways in which we have been exploring NoSQL at the Guardian. The second was a three hour workshop on Redis, my favourite piece of software to have the NoSQL label applied to it.
I've written about Redis here before, and it has since earned a place next to MySQL/PostgreSQL and memcached as part of my default web application stack. Redis makes write-heavy features such as real-time statistics feasible for small applications, while effortlessly scaling up to handle larger projects as well. If you haven't tried it out yet, you're sorely missing out.
For the workshop, I tried to give an overview of each individual Redis feature along with detailed examples of real-world problems that the feature can help solve. I spent the past day annotating each slide with detailed notes, and I think the result makes a pretty good stand-alone tutorial. Here's the end result:
Redis tutorial slides and notes
In unrelated news, Nat and I both completed the first ever Brighton Marathon last weekend, in my case taking 4 hours, 55 minutes and 17 seconds. Sincere thanks to everyone who came out to support us - until the race I had never appreciated how important the support of the spectators is to keep going to the end. We raised £757 for the Have a Heart children's charity. Thanks in particular to Clearleft who kindly offered to match every donation.
spwillison · 15 years ago
Text
WildlifeNearYou talk at £5 app, and being Wired (not Tired)
Two quick updates about WildlifeNearYou. First up, I gave a talk about the site at £5 app, my favourite Brighton evening event which celebrates side projects and the joy of Making Stuff. I talked about the site's genesis on a fort, crowdsourcing photo ratings, how we use Freebase and DBpedia and how integrating with Flickr's machine tags gave us a powerful location API for free. Here's the video of the talk, courtesy of Ian Ozsvald:
£5 App #22 WildLifeNearYou by Simon Willison and Natalie Downe from IanProCastsCoUk on Vimeo.
Secondly, I'm excited to note that WildlifeNearYou spin-off OwlsNearYou.com is featured in UK Wired magazine's Wired / Tired / Expired column... and we're Wired!
spwillison · 15 years ago
Text
Some questions about the "blocking" of HTML5
When people say that the publication of HTML5 has been "blocked" by Larry Masinter's "formal objection", what exactly do they mean?
Why does the private w3c-archive mailing list exist? Why can't anyone reveal what happens on there? What are the consequences for doing so? Who gets to be on that list in the first place?
Can anyone raise a "formal objection"?
Is anyone calling for the HTML Working Group to be "rechartered"? If so, what does that involve?
If there are concerns about the inclusion of Canvas 2D in the specification, why were these not resolved earlier?
Some background reading. I was planning to fill in answers as they arrive, but I screwed up the moderation of the comments and got flooded with detailed responses - I strongly recommend reading the comments.
spwillison · 15 years ago
Text
WildlifeNearYou: It began on a fort...
Back in October 2008, myself and 11 others set out on the first /dev/fort expedition. The idea was simple: gather a dozen geeks, rent a fort, take food and laptops and see what we could build in a week.
The fort was Fort Clonque on Alderney in the Channel Islands, managed by the Landmark Trust. We spent an incredibly entertaining week there exploring Nazi bunkers, cooking, eating and coding up a storm. It ended up taking slightly longer than a week to finish, but 14 months later the result of our combined efforts can finally be revealed: WildlifeNearYou.com!
WildlifeNearYou is a site for people who like to see animals. Have you ever wanted to know where your nearest Llama is? Search for "llamas near brighton" and you'll see that there's one 18 miles away at Ashdown Forest Llama Farm. Or you can see all the places we know about in France, or all the trips I've been on, or everywhere you can see a Red Panda.
The data comes from user contributions: you can use WildlifeNearYou to track your trips to wildlife places and list the animals that you see there. We can only tell you about animals that someone else has already spotted.
Once you've added some trips, you can import your Flickr photos and match them up with trips and species. We'll be adding a feature in the future that will push machine tags and other metadata back to Flickr for you, if you so choose.
You can read more about WildlifeNearYou on the site's about page and FAQ. Please don't hesitate to send us feedback!
What took so long?
So why did it take so long to finally launch it? A whole bunch of reasons. Week-long marathon hacking sessions are an amazing way to generate a ton of interesting ideas and build a whole bunch of functionality, but it's very hard to get a single cohesive whole at the end of it. Tying up the loose ends is a pretty big job and is severely hampered by the fort residents returning to their real lives, where hacking for 5 hours straight on a cool Easter egg suddenly doesn't seem quite so appealing. We also got stuck in a cycle of "just one more thing". On the fort we didn't have internet access, so internet-dependent features like Freebase integration, Google Maps, Flickr imports and OpenID had to be left until later ("they'll only take a few hours" no longer works once you're off /dev/fort time).
The biggest problem though was perfectionism. The longer a side-project drags on for, the more important it feels to make it "just perfect" before releasing it to the world. Finally, on New Year's Day, Nat and I decided we had had enough. Our resolution was to "ship the thing within a week, no matter what state it's in". We're a few days late, but it's finally live.
WildlifeNearYou is by far the most fun website I've ever worked on. To all twelve of my intrepid fort companions: congratulations - we made a thing!
spwillison · 15 years ago
Text
Crowdsourced document analysis and MP expenses
As you may have heard, the UK government released a fresh batch of MP expenses documents a week ago on Thursday. I spent that week working with a small team at Guardian HQ to prepare for the release. Here's what we built:
http://mps-expenses2.guardian.co.uk/
It's a crowdsourcing application that asks the public to help us dig through and categorise the enormous stack of documents - around 30,000 pages of claim forms, scanned receipts and hand-written letters, all scanned and published as PDFs.
This is the second time we've tried this - the first was back in June, and can be seen at mps-expenses.guardian.co.uk. Last week's attempt was an opportunity to apply the lessons we learnt the first time round.
Writing crowdsourcing applications in a newspaper environment is a fascinating challenge. Projects have very little notice - I heard about the new document release the Thursday before, giving us less than a week to put everything together. In addition to the fast turnaround for the application itself, the 48 hours following the release are crucial. The news cycle moves fast, so if the application launches but we don't manage to get useful data out of it quickly the story will move on before we can impact it.
ScaleCamp on the Friday meant that development work didn't properly kick off until Monday morning. The bulk of the work was performed by two server-side developers, one client-side developer, one designer and one QA on Monday, Tuesday and Wednesday. The Guardian operations team deftly handled our EC2 configuration and deployment, and we had some extra help on the day from other members of the technology department. After launch we also had a number of journalists helping highlight discoveries and dig through submissions.
The system was written using Django, MySQL (InnoDB), Redis and memcached.
Asking the right question
The biggest mistake we made the first time round was that we asked the wrong question. We tried to get our audience to categorise documents as either "claims" or "receipts" and to rank them as "not interesting", "a bit interesting", "interesting but already known" and "someone should investigate this". We also asked users to optionally enter any numbers they saw on the page as categorised "line items", with the intention of adding these up later.
The line items, with hindsight, were a mistake. 400,000 documents makes for a huge amount of data entry and for the figures to be useful we would need to confirm their accuracy. This would mean yet more rounds of crowdsourcing, and the job was so large that the chance of getting even one person to enter line items for each page rapidly diminished as the news story grew less prominent.
The categorisations worked reasonably well but weren't particularly interesting - knowing if a document is a claim or receipt is useful only if you're going to collect line items. The "investigate this" button worked very well though.
We completely changed our approach for the new system. We dropped the line item task and instead asked our users to categorise each page by applying one or more tags, from a small set that our editors could control. This gave us a lot more flexibility - we changed the tags shortly before launch based on the characteristics of the documents - and had the potential to be a lot more fun as well. I'm particularly fond of the "hand-written" tag, which has highlighted some lovely examples of correspondence between MPs and the expenses office.
Sticking to an editorially assigned set of tags provided a powerful tool for directing people's investigations, and also ensured our users didn't start creating potentially libellous tags of their own.
Breaking it up in to assignments
For the first project, everyone worked together on the same task to review all of the documents. This worked fine while the document set was small, but once we had loaded in 400,000+ pages the progress bar became quite depressing.
This time round, we added a new concept of "assignments". Each assignment consisted of the set of pages belonging to a specified list of MPs, documents or political parties. Assignments had a threshold, so we could specify that a page must be reviewed by at least X people before it was considered reviewed. An editorial tool let us feature one "main" assignment and several alternative assignments right on the homepage.
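For illustration, the assignment concept might be modelled along these lines (a hypothetical sketch - these are not the production models or field names):

from django.db import models

class Assignment(models.Model):
    title = models.CharField(max_length=100)
    # An assignment covers the pages belonging to any of these
    mps = models.ManyToManyField('MP', blank=True)
    documents = models.ManyToManyField('Document', blank=True)
    parties = models.ManyToManyField('Party', blank=True)
    # A page only counts as reviewed once this many people have looked at it
    threshold = models.PositiveIntegerField(default=1)
    # The featured assignment takes the "main" slot on the homepage
    featured = models.BooleanField(default=False)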
Clicking "start reviewing" on an assignment sets a cookie for that assignment, and adds the assignment's progress bar to the top of the review interface. New pages are selected at random from the set of unreviewed pages in that assignment.
The assignments system proved extremely effective. We could use it to direct people to the highest value documents (our top hit list of interesting MPs, or members of the shadow cabinet) while still allowing people with specific interests to pick an alternative task.
Get the button right!
Having run two crowdsourcing projects I can tell you this: the single most important piece of code you will write is the code that gives someone something new to review. Both of our projects had big "start reviewing" buttons. Both were broken in different ways.
The first time round, the mistakes were around scalability. I used a SQL "ORDER BY RAND()" statement to return the next page to review. I knew this was an inefficient operation, but I assumed that it wouldn't matter since the button would only be clicked occasionally.
Something like 90% of our database load turned out to be caused by that one SQL statement, and it only got worse as we loaded more pages in to the system. This caused multiple site slow downs and crashes until we threw together a cron job that pushed 1,000 unreviewed page IDs in to memcached and made the button pick one of those at random.
This solved the performance problem, but meant that our user activity wasn't nearly as well targeted. For optimum efficiency you really want everyone to be looking at a different page - and a random distribution is almost certainly the easiest way to achieve that.
The second time round I turned to my new favourite in-memory data structure server, redis, and its SRANDMEMBER command (a feature I requested a while ago with this exact kind of project in mind). The system maintains a redis set of all IDs that needed to be reviewed for an assignment to be complete, and a separate set of IDs of all pages that had been reviewed. It then uses redis set difference (the SDIFFSTORE command) to create a set of unreviewed pages for the current assignment and then SRANDMEMBER to pick one of those pages.
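In Python, that page-picking logic looks roughly like this (a sketch using redis-py style commands - the key names are illustrative):

import redis

r = redis.Redis()

def next_unreviewed_page(assignment_id):
    pool = 'assignment:%s:pages' % assignment_id         # every page in the assignment
    reviewed = 'assignment:%s:reviewed' % assignment_id  # pages already reviewed enough times
    todo = 'assignment:%s:todo' % assignment_id
    # Unreviewed pages = pool minus reviewed, materialised as a set...
    r.sdiffstore(todo, pool, reviewed)
    # ...then pick a member at random, so simultaneous reviewers mostly see different pages
    return r.srandmember(todo)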
This is where the bug crept in. Redis was just being used as an optimisation - the single point of truth for whether a page had been reviewed or not stayed as MySQL. I wrote a couple of Django management commands to repopulate the denormalised Redis sets should we need to manually modify the database. Unfortunately I missed some - the sets that tracked what pages were available in each document. The assignment generation code used an intersection of these sets to create the overall set of documents for that assignment. When we deleted some pages that had accidentally been imported twice I failed to update those sets.
This meant the "next page" button would occasionally turn up a page that didn't exist. I had some very poorly considered fallback logic for that - if the random page didn't exist, the system would return the first page in that assignment instead. Unfortunately, this meant that when the assignment was down to the last four non-existent pages, every single user was directed to the same page - which subsequently attracted well over a thousand individual reviews.
Next time, I'm going to try and make the "next" button completely bullet proof! I'm also going to maintain a "denormalisation dictionary" documenting every denormalisation in the system in detail - such a thing would have saved me several hours of confused debugging.
Exposing the results
The biggest mistake I made last time was not getting the data back out again fast enough for our reporters to effectively use it. It took 24 hours from the launch of the application to the moment the first reporting feature was added - mainly because we spent much of the intervening time figuring out the scaling issues.
This time we handled this a lot better. We provided private pages exposing all recent activity on the site. We also provided public pages for each of the tags, as well as combination pages for party + tag, MP + tag, document + tag, assignment + tag and user + tag. Most of these pages were ordered by most-tagged, with the hope that the most interesting pages would quickly bubble to the top.
This worked pretty well, but we made one key mistake. The way we were ordering pages meant that it was almost impossible to paginate through them and be sure that you had seen everything under a specific tag. If you're trying to keep track of everything going on in the site, reliable pagination is essential. The only way to get reliable pagination on a fast moving site is to order by the date something was first added to a set in ascending order. That way you can work through all of the pages, wait a bit, hit "refresh" and be able to continue paginating where you left off. Any other order results in the content of each page changing as new content comes in.
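As a sketch of what that ordering looks like in Django (assuming a hypothetical TaggedPage model that records when each page was first given a tag):

from myapp.models import TaggedPage  # hypothetical model

def tag_listing(tag, page_number, per_page=50):
    # Ascending "first tagged" order: existing items never change position, so a
    # reader can page through, refresh, and carry on exactly where they left off
    qs = TaggedPage.objects.filter(tag=tag).order_by('added')
    start = (page_number - 1) * per_page
    return qs[start:start + per_page]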
We eventually added an undocumented /in-order/ URL prefix to address this issue. Next time I'll pay a lot more attention to getting the pagination options right from the start.
Rewarding our contributors
The reviewing experience the first time round was actually quite lonely. We deliberately avoided showing people how others had marked each page because we didn't want to bias the results. Unfortunately this meant the site felt like a bit of a ghost town, even when hundreds of other people were actively reviewing things at the same time.
For the new version, we tried to provide a much better feeling of activity around the site. We added "top reviewer" tables to every assignment, MP and political party as well as a "most active reviewers in the past 48 hours" table on the homepage (this feature was added to the first project several days too late). User profile pages got a lot more attention, with more of a feel that users were collecting their favourite pages in to tag buckets within their profile.
Most importantly, we added a concept of discoveries - editorially highlighted pages that were shown on the homepage and credited to the user that had first highlighted them. These discoveries also added valuable editorial interest to the site, showing up on the homepage and also the index pages for political parties and individual MPs.
Light-weight registration
For both projects, we implemented an extremely light-weight form of registration. Users can start reviewing pages without going through any signup mechanism, and instead are assigned a cookie and an anon-454 style username the first time they review a document. They are then encouraged to assign themselves a proper username and password so they can log in later and take credit for their discoveries.
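A rough sketch of that flow (illustrative only - not the code the site actually uses):

import random
from django.contrib.auth.models import User

def reviewer_for(request, response):
    "Return the User for this visitor, creating an anon-454 style account on first review."
    user_id = request.COOKIES.get('reviewer_id')
    if user_id:
        return User.objects.get(pk=user_id)
    username = 'anon-%d' % random.randint(1, 99999)
    while User.objects.filter(username=username).count():
        username = 'anon-%d' % random.randint(1, 99999)
    user = User.objects.create(username=username)
    # In production this cookie should be signed to stop people impersonating each other
    response.set_cookie('reviewer_id', str(user.pk))
    return user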
It's difficult to tell how effective this approach really is. I have a strong hunch that it dramatically increases the number of people who review at least one document, but without a formal A/B test it's hard to tell how true that is. The UI for this process in the first project was quite confusing - we gave it a solid makeover the second time round, which seems to have resulted in a higher number of conversions.
Overall lessons
News-based crowdsourcing projects of this nature are both challenging and an enormous amount of fun. For the best chances of success, be sure to ask the right question, ensure user contributions are rewarded, expose as much data as possible and make the "next thing to review" behaviour rock solid. I'm looking forward to the next opportunity to apply these lessons, although at this point I really hope it involves something other than MPs' expenses.
spwillison · 15 years ago
Text
Node.js is genuinely exciting
I gave a talk on Friday at Full Frontal, a new one day JavaScript conference in my home town of Brighton. I ended up throwing away my intended topic (JSONP, APIs and cross-domain security) three days before the event in favour of a technology which first crossed my radar less than two weeks ago.
That technology is Ryan Dahl's Node. It's the most exciting new project I've come across in quite a while.
At first glance, Node looks like yet another take on the idea of server-side JavaScript, but it's a lot more interesting than that. It builds on JavaScript's excellent support for event-based programming and uses it to create something that truly plays to the strengths of the language.
Node describes itself as "evented I/O for V8 javascript". It's a toolkit for writing extremely high performance non-blocking event driven network servers in JavaScript. Think similar to Twisted or EventMachine but for JavaScript instead of Python or Ruby.
Evented I/O?
As I discussed in my talk, event driven servers are a powerful alternative to the threading / blocking mechanism used by most popular server-side programming frameworks. Typical frameworks can only handle a small number of requests simultaneously, dictated by the number of server threads or processes available. Long-running operations can tie up one of those threads - enough long running operations at once and the server runs out of available threads and becomes unresponsive. For large amounts of traffic, each request must be handled as quickly as possible to free the thread up to deal with the next in line.
This makes certain functionality extremely difficult to support. Examples include handling large file uploads, combining resources from multiple backend web APIs (which themselves can take an unpredictable amount of time to respond) or providing comet functionality by holding open the connection until a new event becomes available.
Event driven programming takes advantage of the fact that network servers spend most of their time waiting for I/O operations to complete. Operations against in-memory data are incredibly fast, but anything that involves talking to the filesystem or over a network inevitably involves waiting around for a response.
With Twisted, EventMachine and Node, the solution lies in specifying I/O operations in conjunction with callbacks. A single event loop rapidly switches between a list of tasks, firing off I/O operations and then moving on to service the next request. When the I/O returns, execution of that particular request is picked up again.
(In the talk, I attempted to illustrate this with a questionable metaphor involving hamsters, bunnies and a hyperactive squid).
What makes Node exciting?
If systems like this already exist, what's so exciting about Node? Quite a few things:
JavaScript is extremely well suited to programming with callbacks. Its anonymous function syntax and closure support is perfect for defining inline callbacks, and client-side development in general uses event-based programming as a matter of course: run this function when the user clicks here / when the Ajax response returns / when the page loads. JavaScript programmers already understand how to build software in this way.
Node represents a clean slate. Twisted and EventMachine are hampered by the existence of a large number of blocking libraries for their respective languages. Part of the difficulty in learning those technologies is understanding which Python or Ruby libraries you can use and which ones you have to avoid. Node creator Ryan Dahl has a stated aim for Node to never provide a blocking API - even filesystem access and DNS lookups are catered for with non-blocking callback based APIs. This makes it much, much harder to screw things up.
Node is small. I read through the API documentation in around half an hour and felt like I had a pretty comprehensive idea of what Node does and how I would achieve things with it.
Node is fast. V8 is fast and keeps getting faster. Node's event loop uses Marc Lehmann's highly regarded libev and libeio libraries. Ryan Dahl is himself something of a speed demon - he just replaced Node's HTTP parser implementation (already pretty speedy due to its Ragel / Mongrel heritage) with a hand-tuned C implementation with some impressive characteristics.
Easy to get started. Node ships with all of its dependencies, and compiles cleanly on Snow Leopard out of the box.
With both my JavaScript and server-side hats on, Node just feels right. The APIs make sense, it fits a clear niche and despite its youth (the project started in February) everything feels solid and well constructed. The rapidly growing community is further indication that Ryan is on to something great here.
What does Node look like?
Here's how to get Hello World running in Node in 7 easy steps:
git clone git://github.com/ry/node.git (or download and extract a tarball)
./configure
make (takes a while, it needs to compile V8 as well)
sudo make install
Save the below code as helloworld.js
node helloworld.js
Visit http://localhost:8080/ in your browser
Here's helloworld.js:
var sys = require('sys'),
    http = require('http');

http.createServer(function(req, res) {
  res.sendHeader(200, {'Content-Type': 'text/html'});
  res.sendBody('<h1>Hello World</h1>');
  res.finish();
}).listen(8080);

sys.puts('Server running at http://127.0.0.1:8080/');
If you have Apache Bench installed, try running ab -n 1000 -c 100 'http://127.0.0.1:8080/' to test it with 1000 requests using 100 concurrent connections. On my MacBook Pro I get 3374 requests a second.
So Node is fast - but where it really shines is concurrency with long running requests. Alter the helloworld.js server definition to look like this:
http.createServer(function(req, res) {
  setTimeout(function() {
    res.sendHeader(200, {'Content-Type': 'text/html'});
    res.sendBody('<h1>Hello World</h1>');
    res.finish();
  }, 2000);
}).listen(8080);
We're using setTimeout to introduce an artificial two second delay to each request. Run the benchmark again - I get 49.68 requests a second, with every single request taking between 2012 and 2022 ms. With a two second delay, the best possible performance for 1000 requests served 100 at a time is 1000 requests / ((1000 / 100) * 2 seconds) = 50 requests a second. Node hits it pretty much bang on the nose.
The most important line in the above examples is res.finish(). This is the mechanism Node provides for explicitly signalling that a request has been fully processed and should be returned to the browser. By making it explicit, Node makes it easy to implement comet patterns like long polling and streaming responses - stuff that is decidedly non trivial in most server-side frameworks.
djangode
Node's core APIs are pretty low level - it has HTTP client and server libraries, DNS handling, asynchronous file I/O etc, but it doesn't give you much in the way of high level web framework APIs. Unsurprisingly, this has led to a Cambrian explosion of lightweight web frameworks built on top of Node - the projects using node page lists a bunch of them. Rolling a framework is a great way of learning a low-level API, so I've thrown together my own - djangode - which brings Django's regex-based URL handling to Node along with a few handy utility functions. Here's a simple djangode application:
var dj = require('./djangode');

var app = dj.makeApp([
  ['^/$', function(req, res) {
    dj.respond(res, 'Homepage');
  }],
  ['^/other$', function(req, res) {
    dj.respond(res, 'Other page');
  }],
  ['^/page/(\\d+)$', function(req, res, page) {
    dj.respond(res, 'Page ' + page);
  }]
]);
dj.serve(app, 8008);
djangode is currently a throwaway prototype, but I'll probably be extending it with extra functionality as I explore more Node related ideas.
nodecast
My main demo in the Full Frontal talk was nodecast, an extremely simple broadcast-oriented comet application. Broadcast is my favourite "hello world" example for comet because it's both simpler than chat and more realistic - I've been involved in plenty of projects that could benefit from being able to broadcast events to their audience, but few that needed an interactive chat room.
The source code for the version I demoed can be found on GitHub in the no-redis branch. It's a very simple application - the client-side JavaScript simply uses jQuery's getJSON method to perform long-polling against a simple URL endpoint:
function fetchLatest() {
  $.getJSON('/wait?id=' + last_seen, function(d) {
    $.each(d, function() {
      last_seen = parseInt(this.id, 10) + 1;
      ul.prepend($('<li></li>').text(this.text));
    });
    fetchLatest();
  });
}
Doing this recursively is probably a bad idea since it will eventually blow the browser's JavaScript stack, but it works OK for the demo.
The more interesting part is the server-side /wait URL which is being polled. Here's the relevant Node/djangode code:
var message_queue = new process.EventEmitter();

var app = dj.makeApp([
  // ...
  ['^/wait$', function(req, res) {
    var id = req.uri.params.id || 0;
    var messages = getMessagesSince(id);
    if (messages.length) {
      dj.respond(res, JSON.stringify(messages), 'text/plain');
    } else {
      // Wait for the next message
      var listener = message_queue.addListener('message', function() {
        dj.respond(res, JSON.stringify(getMessagesSince(id)), 'text/plain');
        message_queue.removeListener('message', listener);
        clearTimeout(timeout);
      });
      var timeout = setTimeout(function() {
        message_queue.removeListener('message', listener);
        dj.respond(res, JSON.stringify([]), 'text/plain');
      }, 10000);
    }
  }]
  // ...
]);
The wait endpoint checks for new messages and, if any exist, returns immediately. If there are no new messages it does two things: it hooks up a listener on the message_queue EventEmitter (Node's equivalent of jQuery/YUI/Prototype's custom events) which will respond and end the request when a new message becomes available, and also sets a timeout that will cancel the listener and end the request after 10 seconds. This ensures that long polls don't go on too long and potentially cause problems - as far as the browser is concerned it's just talking to a JSON resource which takes up to ten seconds to load.
When a message does become available, calling message_queue.emit('message') will cause all waiting requests to respond with the latest set of messages.
Talking to databases
nodecast keeps track of messages using an in-memory JavaScript array, which works fine until you restart the server and lose everything. How do you implement persistent storage?
For the moment, the easiest answer lies with the NoSQL ecosystem. Node's focus on non-blocking I/O makes it hard (but not impossible) to hook it up to regular database client libraries. Instead, it strongly favours databases that speak simple protocols over a TCP/IP socket - or even better, databases that communicate over HTTP. So far I've tried using CouchDB (with node-couch) and redis (with redis-node-client), and both worked extremely well. nodecast trunk now uses redis to store the message queue, and provides a nice example of working with a callback-based non-blocking database interface:
var db = redis.create_client();
var REDIS_KEY = 'nodecast-queue';

function addMessage(msg, callback) {
  db.llen(REDIS_KEY, function(i) {
    msg.id = i; // ID is set to the queue length
    db.rpush(REDIS_KEY, JSON.stringify(msg), function() {
      message_queue.emit('message', msg);
      callback(msg);
    });
  });
}
Relational databases are coming to Node. Ryan has a PostgreSQL adapter in the works, thanks to that database already featuring a mature non-blocking client library. MySQL will be a bit tougher - Node will need to grow a separate thread pool to integrate with the official client libs - but you can talk to MySQL right now by dropping in DBSlayer from the NY Times which provides an HTTP interface to a pool of MySQL servers.
Mixed environments
I don't see myself switching all of my server-side development over to JavaScript, but Node has definitely earned a place in my toolbox. It shouldn't be at all hard to mix Node in to an existing server-side environment - either by running both behind a single HTTP proxy (being event-based itself, nginx would be an obvious fit) or by putting Node applications on a separate subdomain. Node is a tempting option for anything involving comet, file uploads or even just mashing together potentially slow loading web APIs. Expect to hear a lot more about it in the future.
Further reading
Ryan's JSConf.eu presentation is the best discussion I've seen anywhere of the design philosophy behind Node.
Node's API documentation is essential reading.
Streaming file uploads with node.js illustrates how well suited Node is to accepting large file uploads.
The nodejs Google Group is the hub of the Node community.
spwillison · 16 years ago
Text
Why I like Redis
I've been getting a lot of useful work done with Redis recently.
Redis is typically categorised as yet another of those new-fangled NoSQL key/value stores, but if you look closer it actually has some pretty unique characteristics. It makes more sense to describe it as a "data structure server" - it provides a network service that exposes persistent storage and operations over dictionaries, lists, sets and string values. Think memcached but with list and set operations and persistence-to-disk.
It's also incredibly easy to set up, ridiculously fast (30,000 reads or writes a second on my laptop with the default configuration) and has an interesting approach to persistence. Redis runs in memory, but syncs to disk every Y seconds or after every X operations. Sounds risky, but it supports replication out of the box so if you're worried about losing data should a server fail you can always ensure you have a replicated copy to hand. I wouldn't trust my only copy of critical data to it, but there are plenty of other cases for which it is really well suited.
I'm currently not using it for data storage at all - instead, I use it as a tool for processing data using the interactive Python interpreter.
I'm a huge fan of REPLs. When programming Python, I spend most of my time in an IPython prompt. With JavaScript, I use the Firebug console. I experiment with APIs, get something working and paste it over in to a text editor. For some one-off data transformation problems I never save any code at all - I run a couple of list comprehensions, dump the results out as JSON or CSV and leave it at that.
Redis is an excellent complement to this kind of programming. I can run a long running batch job in one Python interpreter (say loading a few million lines of CSV in to a Redis key/value lookup table) and run another interpreter to play with the data that's already been collected, even as the first process is streaming data in. I can quit and restart my interpreters without losing any data. And because Redis semantics map closely to Python native data types, I don't have to think for more than a few seconds about how I'm going to represent my data.
Here's a 30 second guide to getting started with Redis:
$ wget http://redis.googlecode.com/files/redis-1.01.tar.gz
$ tar -xzf redis-1.01.tar.gz
$ cd redis-1.01
$ make
$ ./redis-server
And that's it - you now have a Redis server running on port 6379. No need even for a ./configure or make install. You can run ./redis-benchmark in that directory to exercise it a bit.
Let's try it out from Python. In a separate terminal:
$ cd redis-1.01/client-libraries/python/
$ python
>>> import redis
>>> r = redis.Redis()
>>> r.info()
{u'total_connections_received': 1, ... }
>>> r.keys('*') # Show all keys in the database
[]
>>> r.set('key-1', 'Value 1')
'OK'
>>> r.keys('*')
[u'key-1']
>>> r.get('key-1')
u'Value 1'
Now let's try something a bit more interesting:
>>> r.push('log', 'Log message 1', tail=True)
>>> r.push('log', 'Log message 2', tail=True)
>>> r.push('log', 'Log message 3', tail=True)
>>> r.lrange('log', 0, 100)
[u'Log message 3', u'Log message 2', u'Log message 1']
>>> r.push('log', 'Log message 4', tail=True)
>>> r.push('log', 'Log message 5', tail=True)
>>> r.push('log', 'Log message 6', tail=True)
>>> r.ltrim('log', 0, 2)
>>> r.lrange('log', 0, 100)
[u'Log message 6', u'Log message 5', u'Log message 4']
That's a simple capped log implementation (similar to a MongoDB capped collection) - push items on to the tail of a 'log' key and use ltrim to only retain the last X items. You could use this to keep track of what a system is doing right now without having to worry about storing ever increasing amounts of logging information.
See the documentation for a full list of Redis commands. I'm particularly excited about the RANDOMKEY and new SRANDMEMBER commands (git trunk only at the moment), which help address the common challenge of picking a random item without ORDER BY RAND() clobbering your relational database. In a beautiful example of open source support in action, I requested SRANDMEMBER on Twitter yesterday and antirez committed just 12 hours later.
I used Redis this week to help create heat maps of the BNP's membership list for the Guardian. I had the leaked spreadsheet of the BNP member details and a (licensed) CSV file mapping 1.6 million postcodes to their corresponding parliamentary constituencies. I loaded the CSV file in to Redis, then looped through the 12,000 postcodes from the membership and looked them up in turn, accumulating counts for each constituency. It took a couple of minutes to load the constituency data and a few seconds to run and accumulate the postcode counts. In the end, it probably involved less than 20 lines of actual Python code.
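The script was roughly this shape (a reconstruction rather than the actual code - the file names and column positions are illustrative):

import csv
import redis

r = redis.Redis()

# Pass 1: load the postcode -> constituency lookup table (1.6 million rows)
for postcode, constituency in csv.reader(open('postcode-constituencies.csv')):
    r.set('postcode:' + postcode.replace(' ', ''), constituency)

# Pass 2: look up each of the 12,000 member postcodes and accumulate counts
counts = {}
for row in csv.reader(open('membership.csv')):
    constituency = r.get('postcode:' + row[3].replace(' ', ''))  # column index is a guess
    if constituency:
        counts[constituency] = counts.get(constituency, 0) + 1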
A much more interesting example of an application built on Redis is Hurl, a tool for debugging HTTP requests built in 48 hours by Leah Culver and Chris Wanstrath. The code is now open source, and Chris talks a bit more about the implementation (in particular their use of sort in Redis) on his blog. Redis also gets a mention in Tom Preston-Werner's epic writeup of the new scalable architecture behind GitHub.
spwillison · 16 years ago
Text
This shouldn't be the image of Hack Day
I love hack days. I was working in the vicinity of Chad Dickerson when he organised the first internal Yahoo! Hack Day back in 2005, and I've since participated in hack day events at Yahoo!, Global Radio and the Guardian. I've also been to every one of Yahoo!'s Open Hack Day events in London. They're fantastic, and the team that organises them should be applauded.
As such, I care a great deal about the image of hack day - and the videos that emerged from last weekend's Taiwan Hack Day are hugely disappointing.
(These are still images from the video - the original has been taken down).
Seriously, what the hell?
I've heard arguments that this kind of thing is culturally acceptable in Taiwan - in fact it may even be expected for technology events, though I'd love to hear further confirmation. I don't care. The technology industry has a serious, widely recognised problem attracting female talent. The ratio of male to female attendees at most conferences I attend is embarrassing - An Event Apart last week in Chicago was a notable and commendable exception.
Our industry is still young. If we want an all-encompassing technology scene, we need to actively work to cultivate an inclusive environment. This means a zero tolerance approach to this kind of entertainment. Booth babes, tequila girls, and scantily clad gyrating women simply set the wrong tone, here or abroad. Heck, this isn't just about offending women - many guy geeks I know would be mortified by this kind of thing.
Hack days are a celebration of ingenuity and creativity. Past US hack days have featured performances from Beck and Girl Talk, both of whom embody the creative spirit of the event. Sexy dancing girls? Not so much.
I'm not the only one who's disappointed.
Caterina Fake:
@Yahoo, for shame : http://flic.kr/p/78btX1 I'm frankly disgusted.
Chad Dickerson:
i am *so* disappointed: http://flic.kr/p/78btX1. remember, a team of women delivered the winning hack at the 1st one:http://bit.ly/FokfF
There was a flurry of activity about this on Twitter yesterday. I sat on this entry for most of today, partly because writing this kind of thing is really hard but also because I was hoping someone at Yahoo! would wake up and release some kind of statement. So far, nothing.
Update (1:30am): Chris Yeh of YDN has responded with an appropriately worded apology.
spwillison · 16 years ago
Text
Django ponies: Proposals for Django 1.2
I've decided to step up my involvement in Django development in the run-up to Django 1.2, so I'm currently going through several years worth of accumulated pony requests figuring out which ones are worth advocating for. I'm also ensuring I have the code to back them up - my innocent AutoEscaping proposal a few years ago resulted in an enormous amount of work by Malcolm and I don't think he'd appreciate a repeat performance.
I'm not a big fan of branches when it comes to exploratory development - they're fine for doing the final implementation once an approach has been agreed, but I don't think they are a very effective way of discussing proposals. I'd much rather see working code in a separate application - that way I can try it out with an existing project without needing to switch to a new Django branch. Keeping code out of a branch also means people can start using it for real development work, making the API much easier to evaluate. Most of my proposals here have accompanying applications on GitHub.
I've recently got in to the habit of including an "examples" directory with each of my experimental applications. This is a full Django project (with settings.py, urls.py and manage.py files) which serves two purposes. Firstly, it allows developers to run the application's unit tests without needing to install it in to their own pre-configured project, simply by changing in to the examples directory and running ./manage.py test. Secondly, it gives me somewhere to put demonstration code that can be viewed in a browser using the runserver command - a further way of making the code easier to evaluate. django-safeform is a good example of this pattern.
Here's my current list of ponies, in rough order of priority.
Signing and signed cookies
Signing strings to ensure they have not been tampered with is a crucial technique in web application security. As with all cryptography, it's also surprisingly difficult to do correctly. A vulnerability in the signing implementation used to protect the Flickr API was revealed just today.
One of the many uses of signed strings is to implement signed cookies. Signed cookies are fantastically powerful - they allow you to send cookies safe in the knowledge that your user will not be able to alter them without you knowing. This dramatically reduces the need for sessions - most web apps use sessions for security rather than for storing large amounts of data, so moving that "logged in user ID" value to a signed cookie eliminates the need for session storage entirely, saving a round-trip to persistent storage on every request.
This has particularly useful implications for scaling - you can push your shared secret out to all of your front end web servers and scale horizontally, with no need for shared session storage just to handle simple authentication and "You are logged in as X" messages.
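The core idea fits in a few lines of standard library Python (a minimal sketch, not the django-signed API):

import hashlib
import hmac

SECRET = 'your-secret-key'  # e.g. settings.SECRET_KEY

def sign(value):
    mac = hmac.new(SECRET, value, hashlib.sha1).hexdigest()
    return '%s:%s' % (value, mac)

def unsign(signed_value):
    value, mac = signed_value.rsplit(':', 1)
    if hmac.new(SECRET, value, hashlib.sha1).hexdigest() != mac:
        # A production implementation also needs a constant-time comparison here
        raise ValueError('Signature does not match - value may have been tampered with')
    return value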
The latest version of my django-openid library uses signed cookies to store the OpenID you log in with, removing the need to configure Django's session storage. I've extracted that code in to django-signed, which I hope to evolve in to something suitable for inclusion in django.utils.
Please note that django-signed has not yet been vetted by cryptography specialists, something I plan to fix before proposing it for final inclusion in core.
django-signed on GitHub
Details of the Signing proposal on the Django wiki
Signing discussion on the django-developers mailing list
Improved CSRF support
This is mainly Luke Plant's pony, but I'm very keen to see it happen. Django has shipped with CSRF protection for more than three years now, but the approach (using middleware to rewrite form HTML) is relatively crude and, crucially, the protection isn't turned on by default. Hint: if you aren't 100% positive you are protected against CSRF, you should probably go and turn it on.
Luke's approach is an iterative improvement - a template tag (with a dependency on RequestContext) is used to output the hidden CSRF field, with middleware used to set the cookie and perform the extra validation. I experimented at length with an alternative solution based around extending Django's form framework to treat CSRF as just another aspect of validation - you can see the result in my django-safeform project. My approach avoids middleware and template tags in favour of a view decorator to set the cookie and a class decorator to add a CSRF check to the form itself.
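To make that concrete, here's a heavily simplified sketch of the decorator-based idea (the names and details are illustrative, not the django-safeform API):

import hashlib
import os
from django import forms

def csrf_cookie(view):
    "View decorator: make sure the browser has a CSRF token cookie."
    def wrapper(request, *args, **kwargs):
        token = request.COOKIES.get('_csrf') or hashlib.sha1(os.urandom(16)).hexdigest()
        request.csrf_token = token
        response = view(request, *args, **kwargs)
        response.set_cookie('_csrf', token)
        return response
    return wrapper

def csrf_form(form_class):
    "Class decorator: add a hidden csrf_token field that is checked against the cookie."
    class CsrfForm(form_class):
        csrf_token = forms.CharField(widget=forms.HiddenInput, required=False)

        def __init__(self, request, *args, **kwargs):
            super(CsrfForm, self).__init__(*args, **kwargs)
            self.request = request
            self.fields['csrf_token'].initial = request.csrf_token

        def clean_csrf_token(self):
            if self.cleaned_data.get('csrf_token') != self.request.COOKIES.get('_csrf'):
                raise forms.ValidationError('Form session expired - please resubmit')
            return self.cleaned_data.get('csrf_token')
    return CsrfForm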
While my approach works, the effort involved in upgrading existing code to it is substantial, compared to a much easier upgrade path for Luke's middleware + template tag approach. The biggest advantage of safeform is that it allows CSRF failure messages to be shown inline on the form, without losing the user's submission - the middleware check means showing errors as a full page without redisplaying the form. It looks like it should be possible to bring that aspect of safeform back to the middleware approach, and I plan to put together a patch for that over the next few days.
Luke's CSRF branch on bitbucket
My django-safeform on GitHub
Details of the CSRF proposal on the Django wiki
CSRF discussion on the django-developers mailing list
Better support for outputting HTML
This is a major pet peeve of mine. Django's form framework is excellent - one of the best features of the framework. There's just one thing that bugs me about it - it outputs full form widgets (for input, select and the like) so that it can include the previous value when redisplaying a form during validation, but it does so using XHTML syntax.
I have a strong preference for an HTML 4.01 strict doctype, and all those <self-closing-tags /> have been niggling away at me for literally years. Django bills itself as a framework for "perfectionists with deadlines", so I feel justified in getting wound up out of proportion over this one.
A year ago I started experimenting with a solution, and came up with django-html. It introduces two new Django template tags - {% doctype %} and {% field %}. The doctype tag serves two purposes - it outputs a particular doctype (saving you from having to remember the syntax) and it records that doctype in Django's template context object. The field tag is then used to output form fields, but crucially it gets to take the current doctype in to account.
The field tag can also be used to add extra HTML attributes to form widgets from within the template itself, solving another small frustration about the existing form library. The README describes the new tags in detail.
The way the tags work is currently a bit of a hack - if merged in to Django core they could be more cleanly implemented by refactoring the form library slightly. This refactoring is currently being discussed on the mailing list.
django-html on GitHub
Improved HTML discussion on the django-developers mailing list
Logging
This is the only proposal for which I don't yet have any code. I want to add official support for Python's standard logging framework to Django. It's possible to use this at the moment (I've done so on several projects) but it's not at all clear what the best way of doing so is, and Django doesn't use it internally at all. I posted a full argument in favour of logging to the mailing list, but my favourite argument is this one:
Built-in support for logging reflects a growing reality of modern Web development: more and more sites have interfaces with external web service APIs, meaning there are plenty of things that could go wrong that are outside the control of the developer. Failing gracefully and logging what happened is the best way to deal with 3rd party problems - much better than throwing a 500 and leaving no record of what went wrong.
I'm not actively pursuing this one yet, but I'm very interested in hearing people's opinions on the best way to configure and use the Python logging module in production.
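As a starting point for that discussion, a minimal hand-rolled setup looks something like this (an illustration, not an official Django API - configure the standard logging module from settings.py and use named loggers in application code):

import logging
import urllib2

# In settings.py (or a module it imports): configure a root handler once
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(name)s %(levelname)s %(message)s',
    filename='django.log',  # illustrative path
)

# In application code: named loggers make it easy to filter output per component
logger = logging.getLogger('myapp.external_api')

def fetch_profile(url):
    try:
        return urllib2.urlopen(url).read()
    except urllib2.URLError:
        # Fail gracefully, but keep a record of what went wrong with the third party
        logger.exception('Could not fetch %s', url)
        return None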
A replacement for get_absolute_url()
Django has a loose convention of encouraging people to add a get_absolute_url method to their models that returns that object's URL. It's a controversial feature - for one thing, it's a bit of a layering violation since URL logic is meant to live in the urls.py file. It's incredibly convenient though, and since it's good web citizenship for everything to have one and only one URL I think there's a pretty good argument for keeping it.
The problem is, the name sucks. I first took a look at this in the last few weeks before the release of Django 1.0 - what started as a quick proposal to come up with a better name before we were stuck with it quickly descended in to a quagmire as I realised quite how broken get_absolute_url() is. The short version: in some cases it means "get a relative URL starting with /", in other cases it means "get a full URL starting with http://" and the name doesn't accurately describe either.
A full write-up of my investigation is available on the Wiki. My proposed solution was to replace it with two complementary methods - get_url() and get_url_path() - with the user implementing one hence allowing the other one to be automatically derived. My django-urls project illustrates the concept via a model mixin class. A year on I still think it's quite a neat idea, though as far as I can tell no one has ever actually used it.
ReplacingGetAbsoluteUrl on the wiki
django-urls on GitHub
Recent get_absolute_url discussion on the django-developers mailing list
Comments on this post are open, but if you have anything to say about any of the individual proposals it would be much more useful if you posted it to the relevant mailing list thread.
spwillison · 16 years ago
Text
Hack Day tools for non-developers
We're about to run our second internal hack day at the Guardian. The first was an enormous amount of fun and the second one looks set to be even more productive.
There's only one rule at hack day: build something you can demonstrate at the end of the event (Powerpoint slides don't count). Importantly though, our hack days are not restricted to just our development team: anyone from the technology department can get involved, and we extend the invitation to other parts of the organisation as well. At the Guardian, this includes journalists.
For our first hack day, I put together a list of "tools for non-developers" - sites, services and software that could be used for hacking without programming knowledge as a pre-requisite. I'm now updating that list with recommendations from elsewhere. Here's the list so far:
Freebase
Originally a kind of structured version of Wikipedia, Freebase changed its focus last year towards being a "social database about things you know and love". In other words, it's the most powerful OCD-enabler in the history of the world. Create your own "Base" on any subject you like, set up your own types and start gathering together topics from the millions already available in Freebase - or add your own. Examples include the Battlestar Galactica base, the Tall Ships base and the fabulous Database base. If you are a developer the tools in the Make Things with Freebase section are top notch.
Dabble DB
Dabble is a weird combination of a spreadsheet, an online database and a set of visualisation tools. Watch the 8 minute demo to get an idea of how powerful this is - you can start off by loading in an existing spreadsheet and take it from there. You'll need to sign up for the free 30 day trial.
Google Docs
You can always build a hack in Excel, but Google Spreadsheets is surprisingly powerful and means that you can collaborate with others on your hack (including developers, who can use the Google Docs API to get at the data in your spreadsheet). Check out the following tutorials, which describe ways of using Google Spreadsheets to scrape in data from other webpages and output it in interesting formats:
Data Scraping Wikipedia with Google Spreadsheets
Calling Amazon Associates/Ecommerce Web Services from a Google Spreadsheet
There's also a simple way to create a form that submits data in to a Google Spreadsheet.
Yahoo! Pipes
Visual tools for combining, filtering and modifying RSS feeds. Combine with the large number of full-content feeds on guardian.co.uk for all sorts of interesting possibilities. Here's a tutorial that incorporates Google Docs as well.
Google My Maps
Google provide a really neat interface for adding your own points, lines and areas to a Google Map. Outputs KML, a handy file format for carting geographic data around between different tools.
If you already have a KML or GeoRSS feed URL from somewhere (e.g. the output of a Yahoo! Pipe), you can paste it directly in to the Google Maps search box to see the points rendered on a map.
Google SketchUp
A simple to use 3D drawing package that lets you create 3D models of real-world buildings and then import them in to Google Earth.
OpenStreetMap
Try your hand at some open source cartography on OpenStreetMap, the geographic world's answer to Wikipedia. If you have the equipment you can contribute GPS traces, otherwise there's a clever online editor that will let you trace out roads from satellite photos - or you could just make sure your favourite pub is included on the map. The export tools can provide vector or static maps, and if you export as SVG you can further edit your map in Illustrator or Inkscape.
CloudMade Maps
Commercial tools built on top of OpenStreetMap, the most exciting of which allows you to create your own map theme by setting your preferred colours and line widths for various types of map feature.
Many Eyes
IBM Research's suite of data visualisation tools, with a wiki-style collaboration platform for publishing data and creating visualisations.
Dapper
Dapper provides a powerful tool for screen scraping websites, without needing to write any code. Output formats include RSS, iCalendar and Google Maps.
TiddlyWiki
TiddlyWiki is a complete wiki in a single HTML file, which you can save locally and use as a notebook, collaboration tool and much more. There's a large ecosystem of plugins and macros which can be used to extend it with new features - see TiddlyVault for an index.
WolframAlpha
The "computational knowledge engine" with the hubristic search-based interface, potentially useful as a source of data and a tool for processing and visualising that data.
Tumblr
Useful as both an input and an output for feeds processed using other tools, and with a smart bookmarklet for collecting bits and pieces from around the web.
The UCSB Toy Chest
An outstanding list of tools that people "without programming skills (but with basic computer and Internet literacy) can use to create interesting projects", compiled by the English department at UC Santa Barbara.
Your help needed
There must be dozens, if not hundreds of useful tools missing from the above. Tell me in the comments and I'll add them to the list.
spwillison · 16 years ago
Text
Teaching users to be secure is a shared responsibility
Ryan Janssen: Why an OAuth iframe is a Great Idea.
The reason the OAuth community prefers that we open up a new window is that if you look at the URL in the window (the place you type in a site’s name), you would see that it says www.netflix.com* and know that you are giving your credentials to Netflix.
Or would you? I would! Other technologists would! But would you? Would you even notice? If you noticed would you care? The answer for the VAST majority of the world is of course, no. In fact to an average person, getting taken to an ENTIRELY other site with some weird little dialog floating in a big page is EXTREMELY suspicious. The real site you are trusting to do the right thing is SetJam (not weird pop-up window site).
I posted a reply comment on that post, but I'll replicate it in full here:
Please, please don't do this.
As web developers we have a shared responsibility to help our users stay safe on the internet. This is becoming ever more important as people move more of their lives online.
It's an almost sisyphean task. If you want to avoid online fraud, you need to understand an enormous stack of technologies: browsers, web pages, links, URLs, DNS, SSL, certificates... I know user education is never the right answer, but in the case of the Web I honestly can't see any other route.
The last thing we need is developers making the problem worse by encouraging unsafe behaviour. That was the whole POINT of OAuth - the password anti-pattern was showing up everywhere, and was causing very real problems. OAuth provides an alternative, but we still have a long way to go convincing users not to hand their password over to any site that asks for it. Still, it's a small victory in a much bigger war.
If developers start showing OAuth in an iframe, that victory was for nothing - we may as well not have bothered. OAuth isn't just a protocol, it's an ambitious attempt to help users understand the importance of protecting their credentials, and the fact that different sites should be granted different permissions with regards to accessing their stuff. This is a difficult but critical lesson for users to learn. The only real hope is if OAuth, implemented correctly, spreads far enough around the Web that people start to understand it and get a feel for how it is meant to work.
By implementing OAuth in an iframe you are completely undermining this effort - and in doing so you're contributing to a tragedy of the commons where selfish behaviour on the behalf of a few causes problems for everyone else. Even worse, if the usability DOES prove to be better (which wouldn't be surprising) you'll be actively encouraging people to implement OAuth in an insecure way - your competitors will hardly want to keep doing things the secure way if you are getting higher conversion rates than they are.
So once again, please don't do this.
I hope my argument is convincing. In case it isn't, I'd strongly suggest that any sites offering OAuth protected APIs add frame-busting JavaScript to their OAuth verification pages. Thankfully, in this case there's a technical option for protecting the commons.
Update: It turns out Netflix already use a frame-busting script on their OAuth authentication page.
spwillison · 16 years ago
Text
Facebook Usernames and OpenID
Today's launch of Facebook Usernames provides an obvious and exciting opportunity for Facebook to become an OpenID provider. Facebook have clearly demonstrated their interest in becoming the key online identity for their users, and the new usernames feature is their acknowledgement that URL-based identities are an important component of that, no doubt driven in part by Twitter making usernames trendy again.
It's interesting to consider Facebook's history with regards to OpenID and single sign on in general. When I started publicly advocating for OpenID back in 2007, my primary worry was that someone would solve the SSO problem in a proprietary way, irreparably damaging the decentralised nature of the Web - just as Microsoft had attempted a few years earlier with Passport.
When Facebook Connect was announced a year ago it seemed like my worst fears had been realised. Facebook Connect's user experience was a huge improvement over OpenID - with only one provider, the sign-in UI could be reduced to a single button. Their use of a popup window for the sign-in flow was inspired - various usability studies have since shown that users are much more likely to complete an SSO flow if they can see the site they are signing in to in a background window.
Thankfully, Facebook seem to understand that the industry isn't willing to accept a single SSO provider, no matter how smooth their implementation. Mark Zuckerberg made reassuring noises about OpenID support at both FOWA 2008 and SxSW 2009, but things really stepped up earlier this year when Facebook joined the OpenID Foundation Board (accompanied by a substantial financial donation). Facebook's board representative, Luke Shepherd, is an excellent addition and brings a refreshingly user-centric approach to OpenID. Luke was previously responsible for much of the work on Facebook Connect and has been advocating OpenID inside Facebook for a long time.
Facebook may not have committed to becoming a provider yet (at least not in public), but their decision to become a consumer first is another interesting data point. They may be trying to avoid the common criticism thrown at companies who provide but don't consume - if they're not willing to eat their own dog food, why should anyone else?
At any rate, their consumer implementation is fascinating. It's live right now, even though there's no OpenID login box anywhere to be seen on the site. Instead, Facebook take advantage of the little known checkid_immediate mode. Once you've associated your OpenID with your Facebook account (using the "Linked Accounts" section of the settings pane) Facebook sets a cookie remembering your OpenID provider, which persists even after you log out of Facebook. When you later visit the Facebook homepage, a checkid_immediate request is silently sent to your provider, logging you in automatically if you are already authenticated there.
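For the curious, here's a rough sketch of the kind of checkid_immediate request a relying party constructs, using the standard OpenID 2.0 parameter names (illustrative Python, not Facebook's code); the provider either returns a positive assertion immediately or answers setup_needed, without ever showing the user a login page:

from urllib import urlencode  # Python 2, as was current at the time

def checkid_immediate_url(provider_endpoint, claimed_id, return_to, realm):
    # Build the redirect URL for a "silent" authentication attempt
    params = {
        'openid.ns': 'http://specs.openid.net/auth/2.0',
        'openid.mode': 'checkid_immediate',
        'openid.claimed_id': claimed_id,
        'openid.identity': claimed_id,
        'openid.return_to': return_to,
        'openid.realm': realm,
    }
    return provider_endpoint + '?' + urlencode(params)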
While it's great to see innovation with OpenID at such a large scale, I'm not at all convinced that they've got this right. The feature is virtually invisible to users (it took me a bunch of research to figure out how to use it) and not at all intuitive - if I've logged out of Facebook, how come visiting the home page logs me straight back in again? I guess this is why Luke is keen on exploring single sign out with OpenID. It sounds like the current OpenID consumer support is principally intended as a developer preview, and I'm looking forward to seeing how they change it based on ongoing user research.
An OpenID provider implementation is an obvious next step, and it can't be that far off - I wouldn't be surprised to hear an announcement within a month or two.
HTTP redirect codes
As an aside, I decided to check that Facebook were using the correct 3xx HTTP status code to redirect from my old profile page to my new one. I was horrified to discover that they are using a 200 code, followed by a chunk of JavaScript to implement the redirect! The situation for logged out users is better but still fundamentally flawed: if you enable your public search listing (using an option tucked away on www.facebook.com/privacy/?view=search) and curl -i your old profile URL you get a 302 Found, when the correct status code is clearly a 301 Moved Permanently.
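For reference, in Django the distinction is a one-line choice between two response classes - a quick sketch with a hypothetical view (Facebook's stack isn't Django, of course):

from django.http import HttpResponsePermanentRedirect, HttpResponseRedirect

def old_profile(request, profile_id):
    new_url = 'http://www.facebook.com/example-username'  # illustrative only
    return HttpResponsePermanentRedirect(new_url)  # 301 Moved Permanently
    # return HttpResponseRedirect(new_url)         # would send a 302 Found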
One final note: it almost goes without saying, but one of the best things about OpenID is that you can register a real domain name that you can own, instead of just having another URL on Facebook.
spwillison · 16 years ago
Text
djng - a Django powered microframework
djng is nearly two weeks old now, so it's about time I wrote a bit about the project.
I presented a keynote at EuroDjangoCon in Prague earlier this month entitled Django Heresies. The talk followed the noble DjangoCon tradition (established last year with the help of Mark Ramm and Cal Henderson) of pointing a spotlight at Django's flaws. In my case, it was a chance to apply the benefit of hindsight to some of the design decisions I helped make back at the Lawrence Journal-World in 2004.
I took a few cheap shots at things like the {% endifequal %} tag and error silencing in the template system, but the three substantial topics in my talk were class-based generic views (I'm a fan), my hatred of settings.py and my interest in turtles all the way down.
Why I hate settings.py
In the talk, I justified my dislike for settings.py by revisiting the problems behind PHP's magic quotes feature (finally going away for good in PHP 6). Magic quotes were one of the main reasons I switched to Python from PHP.
My main problem with magic quotes was that they made it extremely difficult to write reusable PHP code. The feature was configured globally, which led to a quandary. What if you have two libraries, one expecting magic quotes on and the other expecting it off? Your library could check get_magic_quotes_gpc() and call stripslashes() on the input if the setting was turned on, but this would break in the presence of the common idiom where stripslashes() is applied to all incoming $_GET and $_POST data.
Unfortunately, global settings configured using settings.py have a similar smell to them. Middleware and context processors are the best example here - a specific setting might be needed by just one installed application, but the effects are felt by everything in the system. While I haven't yet seen two "reusable" Django apps that require conflicting settings, per-application settings are an obvious use case that settings.py fails to cover.
Global impact aside, my bigger problem with settings.py is that I almost always end up wanting to reconfigure individual settings at run-time.
This is possible in Django today, but comes at a price:
Only some settings can actually be changed at run-time - others (such as USE_I18N) are lazily evaluated once and irreversibly reconfigure parts of Django's plumbing. Figuring out which ones can be changed requires exploration of Django's source code.
If you change a setting, you need to reliably change it back at the end of a request or your application will behave strangely. Uncaught exceptions could cause problems here, unless you remember to wrap dynamic setting changes in a try/finally block (sketched after this list).
Changing a setting isn't thread-safe (without doing some extra work).
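To illustrate the try/finally dance from the list above, here's a rough sketch - the hypothetical render_mobile_page stands in for whatever work needs the modified setting, and this still isn't thread-safe:

from django.conf import settings

def mobile_view(request):
    old_dirs = settings.TEMPLATE_DIRS
    settings.TEMPLATE_DIRS = ('/path/to/mobile/templates',) + tuple(old_dirs)
    try:
        return render_mobile_page(request)  # hypothetical helper
    finally:
        # Restore the original value even if the view raises
        settings.TEMPLATE_DIRS = old_dirs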
Almost every setting in Django has legitimate use-cases for modification at run-time. Here are just a few examples:
Requests from mobile phones may need a different TEMPLATE_DIRS setting, to load the mobile-specific templates in preference to the site defaults.
Some sites offer premium accounts which in turn gain access to more reliable servers. Premium users might get to send e-mail via a separate pool of SMTP servers, for example.
Some sections of code may want to use a different cache backend, or talk to a different set of memcache servers - to reduce the chance of one rapidly changing component causing other components' cache entries to expire too early.
Errors in one area of a site might need to be sent to a different team of developers.
Admin users might want DEBUG=True, while regular site visitors get DEBUG=False.
Finally, settings.py is behind the dreaded "Settings cannot be imported, because environment variable DJANGO_SETTINGS_MODULE is undefined" exception. Yuck.
Turtles all the way down
The final section of the talk was about turtles. More precisely, it was about their role as an "infinite regression belief about cosmology and the nature of the universe". I want to apply that idea to Django.
My favourite thing about Django is something I've started to call the "Django Contract": the idea that a Django view is a callable which takes a request object and returns a response object. I want to expand that concept to other parts of Django as well:
URLconf: takes a request, dispatches based on request.path, returns a response.
Application: takes a request, returns a response
Middleware: takes a request, returns a response (conditionally transforming either)
Django-powered site: hooked in to mod_wsgi/FastCGI/a Python web server, takes a request, returns a response
So instead of a Django site consisting of a settings.py, urls.py and various applications and middlewares, a site would just be a callable that obeys the Django Contract and composes together dozens of other callables.
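As an illustrative sketch (not djng's actual API), composition under that contract can be as simple as functions wrapping functions:

from django.http import HttpResponse

def hello(request):
    # A view: takes a request, returns a response
    return HttpResponse('Hello, world')

def lowercase_paths(wrapped):
    # "Middleware" under the contract: takes a request -> response callable
    # and returns another callable with the same shape
    def inner(request):
        request.path_info = request.path_info.lower()
        return wrapped(request)
    return inner

site = lowercase_paths(hello)  # the whole "site" is itself just a callable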
At this point, Django starts to look a lot like WSGI. What if WSGI and the Django Contract were interchangeable? WSGI is a wrapper around HTTP, so what if that could be swapped in and out (through proxies) as well? Django, WSGI and HTTP, three breeds of turtle arranged on top of each other in various configurations. Turtles all the way down.
djng
djng is my experiment to see what Django would look like without settings.py and with a whole lot more turtles. It's Yet Another Python Microframework.
What's a microframework? The best examples are probably web.py (itself a result of Aaron Swartz's frustrations with Django) and Sinatra, my all time favourite example of Ruby DSL design. More recent examples in Python include juno, newf, mnml and itty.
Microframeworks let you build an entire web application in a single file, usually with only one import statement. They are becoming increasingly popular for building small, self-contained applications that perform only one task - Service Oriented Architecture reborn as a combination of the Unix development philosophy and RESTful API design. I first saw this idea expressed in code by Anders Pearson and Ian Bicking back in 2005.
Unlike most microframeworks, djng has a pretty big dependency: Django itself. The plan is to reuse everything I like about Django (the templates, the ORM, view functions, the form library etc) while replacing just the top level plumbing and removing the requirement for separate settings.py and urls.py files.
This is what "Hello, world" looks like in djng:
import djng

def index(request):
    return djng.Response('Hello, world')

if __name__ == '__main__':
    djng.serve(index, '0.0.0.0', 8888)
djng.Response is an alias for Django's HttpResponse. djng.serve is a utility function which converts anything fulfilling the Django Contract into a WSGI application, then exposes it over HTTP.
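Roughly speaking, a utility like that has to adapt between the WSGI calling convention and the Django Contract - something like this sketch (not the real djng implementation):

from wsgiref.simple_server import make_server
from django.core.handlers.wsgi import WSGIRequest

def serve_sketch(view, host, port):
    def wsgi_app(environ, start_response):
        request = WSGIRequest(environ)   # WSGI environ -> Django request
        response = view(request)         # the Django Contract in action
        # Hand the Django response back to WSGI (reason phrase simplified)
        start_response(str(response.status_code) + ' OK',
                       list(response.items()))
        return [response.content]
    make_server(host, port, wsgi_app).serve_forever()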
Let's add URL routing to the example:
app = djng.Router(
    (r'^hello$', lambda request: djng.Response('Hello, world')),
    (r'^goodbye$', lambda request: djng.Response('Goodbye, world')),
)

if __name__ == '__main__':
    djng.serve(app, '0.0.0.0', 8888)
The implementation of djng.Router is just a few lines of glue code adding a nicer API to Django's internal RegexURLResolver class.
Services, not settings
The trickiest problem I still need to solve is how to replace settings.py. A group of developers (including Adrian, Armin, Alex and myself) had an excellent brainstorming session at EuroDjangoCon about this. We realised that most of the stuff in settings.py can be recast as configuring services which Django makes available to the applications it is hosting. Services like the following:
Caching
Templating
Sending e-mail
Sessions
Database connection - django.db.connection
Higher level ORM
File storage
Each of the above needs to be configured, and each also might need to be reconfigured at runtime. Django already points in this direction by providing hooks for adding custom backends for caching, template loading, file storage and session support. What's missing is an official way of swapping in different backends at runtime.
I'm currently leaning towards the idea of a "stack" of service implementations, one for each of the service categories listed above. A new implementation could be pushed on to the stack at any time during the Django request/response cycle, and will be automatically popped back off again before the next request is processed (all in a thread-safe manner). Applications would also be able to instantiate and use a particular service implementation directly should they need to do so.
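As a very rough sketch of the direction I'm thinking in (an entirely hypothetical API, not working djng code), each service could be a thread-local stack with context-manager style push/pop semantics:

import threading
from contextlib import contextmanager

class ServiceStack(object):
    def __init__(self, default):
        self._local = threading.local()
        self._default = default

    def _stack(self):
        # Each thread gets its own stack, seeded with the default backend
        if not hasattr(self._local, 'stack'):
            self._local.stack = [self._default]
        return self._local.stack

    def current(self):
        return self._stack()[-1]

    @contextmanager
    def push(self, implementation):
        self._stack().append(implementation)
        try:
            yield implementation
        finally:
            # Popped automatically, even if the wrapped code raises
            self._stack().pop()

# Hypothetical usage: swap the cache backend for one block of code
# cache_service = ServiceStack(default=LocMemCache())
# with cache_service.push(MemcachedCache(['10.0.0.1:11211'])):
#     fragment = cache_service.current().get('expensive-fragment')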
A few days ago I heard about Contextual, which appears to be trying to solve a similar problem. Just a few minutes ago I stumbled across paste.registry's StackedObjectProxy which seems to be exactly what I've been busily reinventing.
My current rough thoughts on an API for this can be found in services_api_ideas.txt. I'm eager to hear suggestions on how to tackle this problem.
djng is very much an experiment at the moment - I wouldn't suggest building anything against it unless you're willing to maintain your own fork. That said, the code is all on GitHub partly because I want people to fork it and experiment with their own API concepts as much as possible.
If you're interested in exploring these concepts with me, please join me on the brand new djng mailing list.
spwillison · 16 years ago
Text
rev=canonical bookmarklet and designing shorter URLs
I've watched the proliferation of URL shortening services over the past year with a certain amount of dismay. I care about the health of the web and try to ensure that URLs I am responsible for will last as long as possible, and I think it's very unlikely that all of these new services will still be around in twenty years' time. Last month I suggested that the Internet Archive start mirroring redirect databases, and last week I was pleased to hear that Archiveteam, a different organisation, had already started crawling.
The most recent discussion was kicked off by Joshua Schachter and Dave Winer, and a solution has emerged, driven by some lightning fast hacking by Kellan Elliott-McCrea. The idea is simple: sites get to choose their preferred source of shortened URLs (including self-hosted solutions) and specify it from individual pages using <link rev="canonical" href="... shorter URL here ...">.
By hosting their own shorteners, the reliability should match that of the host site - and the amount of damage caused by a major shortener going missing can be dramatically reduced.
I've been experimenting with this new pattern today. Here are a few small contributions to the wider discussion.
A URL shortening bookmarklet
Kellan's rev=canonical service exposes rev=canonical links using a server-side script running on App Engine. An obvious next step is to distil that logic into a bookmarklet. I decided to combine the rev=canonical logic with my json-tinyurl web service (also on App Engine), which allows browsers to look up or create TinyURLs using a cross-domain JSONP request. The resulting bookmarklet will display the site's rev=canonical link if it exists, or create and display a TinyURL link otherwise:
Bookmarklet: Shorten (drag to your browser toolbar)
You can also grab the uncompressed source code.
Designing short URLs
I've also implemented rev=canonical on this site. I ended up buying a new domain for this, since simonwillison.net is both difficult to spell and 17 characters long. I went with swtiny.eu - 9 characters, and keeping tiny in the domain helps people guess the nature of the site from just the URLs it generates. Be warned: the DNS doesn't appear to have finished resolving yet.
For the path component, I turned to a variant of base 62 encoding. Decimal integers are represented using 10 digits (0-9), but base 62 uses those digits plus the letters of the alphabet in both lower and upper case. A 13 character integer such as 7250397214971 compresses down to just 8 characters (CDeIPpOD) using base62. My baseconv.py module implements base62, among others. I considered using base 57 by excluding o, O, 0, 1 and l as being too easily confused but decided against it.
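Base 62 encoding itself is only a few lines of Python. Here's a generic sketch rather than the code from baseconv.py - the output depends on the order of the digits in your alphabet, so it won't necessarily reproduce CDeIPpOD for that integer:

ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def base62_encode(num, alphabet=ALPHABET):
    # Repeatedly divide by 62, collecting the remainders as digits
    if num == 0:
        return alphabet[0]
    digits = []
    while num:
        num, remainder = divmod(num, 62)
        digits.append(alphabet[remainder])
    return ''.join(reversed(digits))

def base62_decode(encoded, alphabet=ALPHABET):
    num = 0
    for char in encoded:
        num = num * 62 + alphabet.index(char)
    return num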
This site has three key types of content: entries, blogmarks and quotations. Each one is a separate Django model, and hence each has its own underlying database table and individual ID sequence. Since the IDs overlap, I need a way of separating out the shortened URLs for each content type.
I decided to spend a byte on namespacing my shortened URLs. A prefix of E means an entry, Q means a quotation and B means a blogmark. For example:
http://swtiny.eu/EZ8: Entry with ID 1584
http://swtiny.eu/BBEQ: Blogmark with ID 4108
http://swtiny.eu/QE5: Quotation with ID 279
By using upper case letters for the prefixes, I can later define custom paths starting with a lower case letter. I also have another 23 upper case prefix letters reserved in case I need them.
I asked on Twitter and consensus opinion was that a 301 permanent redirect was the right thing to do (as opposed to a 302), both for SEO reasons and because the content will never exist at the shorter URL.
Implementation using Django and nginx
I run all of my Django sites using Apache and mod_wsgi, proxied behind nginx. Each site gets an Apache running on a high port, and nginx deals with virtual host configuration (proxying each domain to a different Apache backend) and static file serving. I didn't want to set up a full Django site just to run swtiny.eu, especially since my existing blog engine was required in order to resolve the shortened URLs.
Instead, I implemented the shortened URL redirection as just another view within my existing site: http://simonwillison.net/shorter/EZ8. I then configured nginx to invisibly proxy requests to swtiny.eu through to that URL. The correct incantation took a while to figure out, so here's the relevant section of my nginx.conf:
server {
    listen 80;
    server_name www.swtiny.eu swtiny.eu;
    location / {
        rewrite (.*) /shorter$1 break;
        proxy_pass http://simonwillison.net;
        proxy_redirect off;
    }
}
proxy_redirect off is needed to prevent nginx from replacing simonwillison.net in the resulting location header with swtiny.eu. My Django view code is relatively shonky, but if you're interested you can find it here.
The nice thing about this approach is that it makes it trivial to add custom URL shortening domains to other projects - a quick view function and a few lines of nginx configuration are all that is needed.
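If you want to replicate the pattern, the view function can be little more than the following sketch - hypothetical model names, reusing the base62_decode helper sketched above, and not my actual (shonkier) implementation:

from django.http import Http404, HttpResponsePermanentRedirect
from django.shortcuts import get_object_or_404

# Hypothetical models; map each prefix letter to a content type
PREFIXES = {'E': Entry, 'B': Blogmark, 'Q': Quotation}

def shorter(request, token):
    # token looks like 'EZ8': one prefix letter plus a base 62 encoded ID
    try:
        model = PREFIXES[token[0]]
    except (KeyError, IndexError):
        raise Http404
    obj = get_object_or_404(model, pk=base62_decode(token[1:]))
    return HttpResponsePermanentRedirect(obj.get_absolute_url())  # 301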
Update: The bookmarklet now supports the rev attribute on A elements as well - thanks for the suggestion, Jeremy.