Partying with Disco
It turns out that the Disco community was quite active in 2012, and it looks like 2013 will be even more busy! Just for the record, here are some very interesting presentations and blogs that mentioned Disco in 2012, and some more already in the first few weeks of 2013.
Chris Mueller from Life Technologies gave an excellent introduction to Disco at PyData in March 2012, and again (!) in PyData NYC in October 2012.
Davin Potts described a very interesting image data exploration project that used Disco, also at PyData NYC. Davin also participated in an interesting panel discussion that weighed the pros and cons of some big data frameworks, and where Disco came up occasionally.
Pavlo Baron used Disco (and Riak, RabbitMQ and NLTK!) for an amazing potpourri project last year.
Benjamin Zaitlen has often been blogging about Disco at the very interesting Continuum Analytics blog. Check out his latest that illustrates chaining several Disco jobs together.
Ian Ozsvald recently blogged about using Disco to process Twitter data.
Just a few weeks away, PyCon 2013 and PyData SV 2013 look like they will have more Disco talks, so you should try to attend if you are in the Bay Area!
And please note that Disco has stiff competition from another very cool open source namesake.
(Thanks to Benjamin for several of the above links!)
Disco 0.4.4
It is time for another birthday release (happy birthday Pia!) - Disco 0.4.4 is out!
The main feature of this release is the modernization of the Disco Python library to support Python 3. The same codebase now supports Python 2.6.6 (or higher), Python 2.7.2 (or higher), and Python 3.2 (or higher). The API presented by the library has not changed, so existing Disco jobs should continue to work with this version of the library. Note also that the requirement for a homogeneous cluster remains: the same version of Python should be installed on the clients as well as on the Disco cluster.
In addition, there have been some bugfixes and cleanup; thanks in particular to Daniel Graña and Yamamoto Takashi. More details are available in the release notes.
Disco 0.4.3
After a long summer soak, a new release of Disco is out. This is a clean-up, tune-up, and write-up release.
The documentation has been improved, thanks to the contribution of an extended Disco tutorial by Davin Potts. There is also information on how to use the proxy mode and how to recover from a master failure.
DDFS has undergone a tune-up. The re-replication done by DDFS is more fault-tolerant and will help speed up node removal. Removing more than one node at a time is better supported. We also create less unnecessary garbage in DDFS, which speeds up DDFS garbage collection. DDFS now supports a "local-cluster" mode, which simulates a DDFS cluster on a single machine. This is a DDFS-only mode and cannot be used to run Disco jobs; we plan on using it to perform some DDFS testing. The "local-cluster" mode is thanks to the work of Harry Nakos.
As always, there are some bug-fixes; more information can be found in the release notes.
Disco 0.4.2
Disco 0.4.2 is here finally, in time for summer data crunching!
The highlight of the new release is a new garbage collector (GC) and re-replicator (RR) for DDFS. The previous GC/RR followed a conservative strategy, and terminated early on faults that the implementation could not easily handle. This meant that GC often did not run to completion in large and/or busy clusters. The new GC/RR tries to handle and recover from more such faults, and hence has a higher chance of running to completion. It also computes some basic statistics during its run, which are presented in the UI.
The new GC implementation also allowed us to implement the scheduled removal of a node from DDFS. Earlier, one could remove a node from the DDFS cluster, but one got no indication of when the replicas on that node had been replaced by new copies on other nodes by GC/RR. Combined with the old GC's likelihood of not running to completion, this meant that it could take a long time before DDFS data availability was restored to the advertised number of replicas. With this release, one can mark a node for removal from DDFS (proactively, or retroactively after a node has died) using a "DDFS blacklist", and receive an indication in the UI when its data has been completely replicated elsewhere. Note that it might take several runs of the new GC/RR for a node to be safe for removal; since runs are scheduled at intervals of one day by default, this might take several days. It is not advisable to blacklist multiple nodes at a time; although this is supported, it hasn't been as well tested.
If you have been having issues with the GC not deleting stale DDFS data, you should consider upgrading to this release. In case you are understandably wary of the new GC unexpectedly deleting your valuable data, you can consider using the PARANOID mode, which just renames files instead of deleting them. However, please note that precisely because in this mode data is not actually deleted, coupled with the fact that the new GC/RR has a higher likelihood of re-replicating data when needed, your DDFS disk usage might increase substantially.
As mentioned in the release notes, there are also some bug fixes, as well as other smaller improvements.
Disco 0.4.1
Happy birthday grandma - Disco 0.4.1 has been released! This is a fairly minor release, which fixes a number of bugs and adds some enhancements to DiscoDB. The biggest reason for making this release now was to move the repository to its new home at: https://github.com/discoproject/disco/.
It's great to see all the activity in the Disco community - especially of note are two new projects: Java Disco Worker and Disco SLCT. Thanks for making Disco even more awesome!
Disco 0.4
And what would Cinco de Mayo be without a Disco release? Also known as "the worker release", Disco 0.4 liberates the worker process, which performs the actual computation of an individual job task. There are two major aspects to the worker's newfound independence:
1) workers are now completely self-contained, and
2) the master supports a communication protocol over stdin/stdout/stderr
From the user point of view, the most important consequences are that you can:
1) make changes / update the worker without updating the master, and
2) write custom Disco workers, in any language, such as OCaml.
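To give a feel for the stdin/stdout protocol, here is a minimal framing sketch in Python. This is an illustration of the idea only: the message name and wire format below are simplified assumptions, not the authoritative protocol specification.

```python
import json

def frame(name, payload):
    """Encode a message as: NAME <payload-length> <payload>\\n
    (a simplified assumption of the wire format)."""
    body = json.dumps(payload)
    return "%s %d %s\n" % (name, len(body), body)

def parse(line):
    """Decode a framed message back into (name, payload)."""
    name, length, rest = line.split(' ', 2)
    return name, json.loads(rest[:int(length)])

# A hypothetical worker announces itself to the master:
hello = frame("WORKER", {"version": "1.0", "pid": 4242})
name, payload = parse(hello)
```

Because the protocol is just framed messages over standard streams, a worker in any language only needs to read and write lines like these.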
Another feature worth pointing out is the greatly enhanced documentation. We've received lots of questions about building indices and using DiscoDB, so we added a DiscoDB Tutorial. If you haven't followed the Disco Tutorial in a while, you might want to have another look at it to introduce yourself to some of the new features (especially on the command line).
This release has taken almost 6 months to get out the door, and we hope it was worth the wait, but there are always more issues to address. So as we send this release out, we begin work on the next one. If there's something you would like to do with Disco but can't, or don't know how, please get involved; we are always looking for ways to improve.
Thanks, as always, to the awesome Disco community, who are constantly helping out on IRC, the mailing list, reporting issues, and submitting patches. We got to meet quite a few of you in person these past few months and it has been a real pleasure! Happy Discoing!
P.S. - We are really excited to start *using* the new worker protocol, and are hoping others are as well. We can't wait to see what other languages Disco jobs will soon be written in :)
Disco 0.3.2
If you've been slighted by your family and friends this gift-giving season, brood no more - because the very special "Disco 0.3.2: Holiday Release" was created just for you! Among the highlights of this release are chunking, DDFS tag attributes and authentication, and some new goodies in disco.func and discodb. For a more complete list of changes, see the release notes.
One of the most requested features for Disco/DDFS has been automatic splitting of inputs, and we're glad to announce that it's finally supported in the form of chunking. If you prefer, you can still push raw blobs to DDFS, but we now provide another layer on top of 'push' called 'chunk', which uses some input streams and/or a reader (default is lines of text) to break your data into records, convert them to Disco's compressed internal format, and store them in size-limited blobs. Read the tutorial for a primer on chunking, or check out 'ddfs chunk --help' from the command line.
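The core idea of chunking can be sketched in a few lines. This is a simplified illustration only, not the actual DDFS implementation (the real 'chunk' also converts records to Disco's compressed internal format before storing them):

```python
def chunks(records, max_bytes=64 * 1024 * 1024):
    """Group an iterator of records into size-limited chunks,
    mimicking how 'chunk' breaks a stream into blobs."""
    chunk, size = [], 0
    for record in records:
        chunk.append(record)
        size += len(record)
        if size >= max_bytes:
            yield chunk
            chunk, size = [], 0
    if chunk:
        yield chunk

# For illustration, lines of text split with a tiny size limit:
lines = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
small = list(chunks(lines, max_bytes=12))
```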
For those of you tired of writing and re-writing combiners and reducers to do basic summing, disco.func now includes some handy functions for those common tasks. We've also included a 'gzip_line_reader', for conveniently tearing through possibly corrupted, gzipped text files (unfortunately, these beasts seem to appear quite often in the wild).
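A summing reducer of this kind might look roughly as follows. The names here are illustrative sketches of the idea, not the actual functions in disco.func:

```python
from itertools import groupby
from operator import itemgetter

def sum_reduce(kviter, out):
    """Sum the values of identical keys, assuming the input
    iterator is sorted by key (as it is before reduce)."""
    for key, kvs in groupby(kviter, key=itemgetter(0)):
        out.add(key, sum(v for _, v in kvs))

class ListOutput:
    """Stand-in for an output stream, collecting (key, value) pairs."""
    def __init__(self):
        self.pairs = []
    def add(self, key, value):
        self.pairs.append((key, value))

out = ListOutput()
sum_reduce([("a", 1), ("a", 2), ("b", 5)], out)
```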
As always, there are a number of bugfixes in this release, so upgrading is recommended (just beware the note about deleting OOB data in the release notes, in case you have been relying on that). Also, there are many other new features not covered here, so make sure to read through the docs if you want to get the most out of this release. We at NRC are especially delighted to see how the Disco community continues to grow. Thanks to everyone who has been asking questions on the IRC channel and on the mailing list, your feedback has been invaluable. Please enjoy!
Disco 0.3.1
Disco 0.3.1 was initially meant to be a rather minor release, mainly for the purpose of getting out some bugfixes before the next major release (which already includes chunking and authentication!). Things got a bit delayed due to the discovery of a major bug which has long existed in Disco (see the release notes). By the time we were ready to release, we realized there were actually a handful of substantial new features going into this release as well. We are really excited and proud to share 0.3.1 with everyone, and hopefully people get a chance to enjoy the new features and improved stability right away.
DiscoDB got a major facelift, as it now compresses values by default, and supports new means of construction, including using an unsorted iterator.
Users should notice a clear improvement in usability in this release, as we have made significant enhancements to the command line tools as well as the web UI. One thing that will be immediately obvious is the new disk space monitoring on the status page. This is really great for eyeballing how much space you have left on your DDFS volumes. For better or for worse, most users probably use the same disks for DDFS and temporary results, which means the tool can also be used to monitor temp disk space. Over the next few releases, the web UI is going to get some much needed care, and this is just the first of those enhancements (let us know if there's something you would find particularly useful there).
DDFS also got some powerful new features in this release. The delayed commit option allows the master to receive multiple updates to metadata cheaply (within some small time window), before it goes through the more expensive process of writing metadata to disks. For some applications, this will provide a vast performance improvement over forcing every client operation to commit its metadata changes right away. The set update feature allows for clients to tag urls, only if they don't already exist in the tag. This is a really powerful atomic operation that the master is now providing for tags, which can be utilized by applications that are concerned about race conditions when tags are being updated.
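The semantics of the set update amount to a conditional union. Here is an illustrative sketch of those tag semantics in Python (not the master's actual Erlang implementation, and the URLs are made up):

```python
def set_update(tag_urls, new_urls):
    """Append each url to the tag only if it is not already
    present -- the check-and-add the master performs atomically."""
    seen = set(tag_urls)
    added = []
    for url in new_urls:
        if url not in seen:
            tag_urls.append(url)
            seen.add(url)
            added.append(url)
    return added

tag = ["disco://node1/a", "disco://node2/b"]
added = set_update(tag, ["disco://node2/b", "disco://node3/c"])
```

Because the master serializes this check-and-add, two clients racing to tag the same url cannot create duplicates.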
When it comes to the major bug that was discovered and fixed, it could only occur in certain rare circumstances, which is why it wasn't discovered until now. The bug was a result of a serious design flaw in the way that partition files were being updated (since partition files are shared by map tasks running on the same node). The partition files have long been a known weakness in Disco, so the introduction of the shuffle phase is a welcome enhancement that opens up the possibility for future optimizations (such as sorting at the end of map, instead of before reduce). Given all the features that have been added recently, and especially what will be included in the next major release, you can bet we are going to spend some major effort to concentrate on performance tuning in the coming months.
This release represents another milestone in Disco history, as we welcome several new authors, and we look forward to their future contributions :) Thanks and congratulations to the whole Disco community!
Disco 0.3
Finally, we've released Disco 0.3! As promised, DDFS and Discodex are now included with Disco. Alongside Discodex, we've released the lightning-fast, Python DiscoDB objects which Discodex uses to store indices. DiscoDBs don't have any dependencies on Disco, so you can use them as a general-purpose, highly efficient, immutable, persistent dict (read more about them in the docs).
Not only have we added lots of new features, but we've been busily fixing bugs as well. The docs should also be much improved, especially we've tried to clarify the different ways of arranging data flow in Disco. Thanks for everyone who submitted patches and asked questions. Hope you enjoy this release as much as we do!
Disco 0.2.4
We snuck the 0.2.4 release out in February without saying too much about it, so I think it's time for a long overdue discussion of some release highlights. Overall, 0.2.4 represents a gain in momentum towards the ambitious 0.3.0, which should include awesome new additions such as ddfs and discodex.
improved scheduler:
The new scheduler and scheduler framework provides fair scheduling and fifo scheduling options. The fifo scheduling works fine in small restricted environments, but fair scheduling is a must when jobs are competing for resources. This is the single biggest feature provided in 0.2.4 and a great reason to upgrade.
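The difference between the two policies can be sketched with a toy model (illustrative only; the real scheduler also deals with nodes, data locality, and available slots):

```python
def fifo_pick(jobs):
    """FIFO: always serve the oldest job that has pending tasks."""
    for job in jobs:
        if job["pending"]:
            return job["name"]

def fair_pick(jobs):
    """Fair: serve the job currently using the fewest slots,
    so a small job is not starved behind a big one."""
    ready = [j for j in jobs if j["pending"]]
    return min(ready, key=lambda j: j["running"])["name"]

jobs = [{"name": "big",   "pending": 90, "running": 8},
        {"name": "small", "pending": 2,  "running": 1}]
```

Under FIFO the "big" job keeps every slot it can get; under fair scheduling the "small" job gets the next free slot.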
input/output streams:
These were created to make it easier/possible to do things like read and write compressed inputs, and they make disco's input mechanism way more flexible. An input stream has a similar signature to a map reader (which will eventually be deprecated in favor of input streams), except input streams take the additional params object. Input/output streams can also be chained together by providing a list of them.
The default input stream function uses the scheme of the input URL to delegate to a more appropriate input stream. For instance, in discodex we build indices that store distributed index chunks which can later be queried against using a normal disco job. An output stream writes the resulting urls as `discodb://...`, and query jobs which provide the list of these URLs as inputs automatically use the discodb input stream to open the discodbs and get the appropriate iterator for the map function.
Input streams can also be used to do things like read from a database, or pretty much anything else you can think of that has to do with reading the input and providing appropriate objects for map.
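The chaining mechanism can be sketched like this. The signatures are simplified assumptions (the real streams also receive size, url, and params arguments), and the "raw://" scheme here is a toy stand-in for scheme dispatch:

```python
def chain_streams(streams, source, params):
    """Apply each input stream in order, each one wrapping or
    transforming the stream produced by the previous one."""
    for stream in streams:
        source = stream(source, params)
    return source

def open_url(url, params):
    # Toy scheme dispatch: pretend raw:// carries its payload inline.
    scheme, _, payload = url.partition("://")
    assert scheme == "raw"
    return payload

def line_reader(stream, params):
    # Break the opened stream into records for map.
    return stream.split("\n")

records = chain_streams([open_url, line_reader], "raw://a\nb\nc", {})
```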
new-style unittests:
The `disco test ...` command uses unittests that were rewritten to be safer and more controlled. The idea here is to eventually provide a test suite that makes step-by-step debugging of an installation easier, even filing reports to the issues list automatically when possible.
events format and exception handling:
Disco 0.2.4 uses a new format for sending events to the disco master, providing the groundwork for the new events API that will be part of 0.3.0. Related to this are changes in the way exceptions are handled in the disco worker, so that errors and their causes are always bubbled up to the master. Corner cases where certain types of exceptions were being masked were eliminated by these changes.
These are the highest profile changes from 0.2.4, but under the hood there was a lot more going on. As Disco (and the contributing community) grows, it's important that we streamline the processes for developing Disco. That's why 0.2.4 includes many readability and style-converging improvements as well as functional changes. We've started using the github issues list more as a roadmap for Disco, not just for reporting bugs. The issues list is a great way to prioritize and tag TODO items while improving the visibility of Disco development. We are constantly looking for ways to refine the development process (all suggestions are welcome :). Congratulations and thanks to everyone who helped make this release possible!
disco{db,dex} @ erlang user conference 2009
We gave our talk this morning at EUC2009, overall it was a very nice conference. I forgot to mention that we are currently going through the Nokia internal process for open-sourcing discodb and discodex. Hopefully we can release them by the end of the year. We will put pointers on discoproject.org when they are available.
http://discoproject.org/media/talks/ErlangUserConference2009/slides/
Why Not Hadoop?
We are flying back from Boston after an excellent week at the Architecture Technology Review. This was my first interaction with the Nokia architecture community at large, and I was really pleased (and I have to admit, somewhat surprised), to see how awesome many of the developments coming down the pipe are. Ville gave a talk on what we have been doing with Disco, and we also gave a demo during one of the 'speed geeking' sessions. One of the most common questions we were asked was, "why not Hadoop?", so I thought I'd give my opinion on the subject.
Prior to coming to the NRC, I was using Hadoop for about a year and a half (doing bioinformatics), and I must say that it served me quite well. To be sure, there were problems along the way, but Hadoop enabled me to do analyses that I would not otherwise have done, not because they would be impossible without Hadoop, but because mapreduce makes it so easy to parallelize a huge class of problems, that the overhead of doing things with big data becomes amazingly small.
Even when using Hadoop, I always used Python (with Hadoop Streaming) to write map/reduce functions, because Python is such a pleasure to write, and because I am much more productive writing Python than Java (or pretty much any other language). Because of my love for Python, I often wondered why no one had yet written a Python implementation of mapreduce, and even considered writing my own. I think it is natural for anyone who thinks about the design of systems to question the validity of architecture decisions and to wonder how those designs might be improved. Of course, actually implementing a new design is a whole other story, and finding the impetus to do so, especially when a reasonably good implementation (with lots of high-profile developers) already exists, is not always easy.
When I discovered the Disco project, which is part Erlang, part Python, I was deeply intrigued. I questioned the choice of Erlang (not knowing much about it), but Ville's argument was extremely pragmatic: Erlang is really good at distributed stuff (that's what it was built to do), and Python is awesome for high-level programming (i.e. it's fun, easy to read/write, expressive, etc.). But I guess the question remains, why not Hadoop? This question is hard to answer because it is largely a matter of taste. The bottom line is that neither Hadoop nor Disco is really a mature project (Hadoop IS more highly developed than Disco though), while it seems to me the choice of framework is a long-term question. For me, wanting to use Python to improve the framework itself is a no-brainer (additionally, Jython is currently too far behind CPython for me to consider it a replacement).
Why Disco? Because of its philosophy: massive data - "minimal code". Lightweight is a design goal in Disco, and we really, truly, care about programmer overhead. Framework development should be as agile as possible if we are trying to optimize programmer productivity. My vision of Disco is a framework that can be shaped to the needs of its users (including myself), by its users. For me, the reality of Hadoop was quite different.
Erlang User Conference 2009
We will be speaking about discodex at the Erlang User Conference next week. discodex is our new index-building tool implemented using disco and a really awesome data structure called discodb. Check out our slides here:
http://discoproject.org/media/talks/ErlangUserConference2009/slides/
git init disco.posterous.com
Hello, world (and welcome)!
Ville and I are starting a blog, so we can share our thoughts as we continue to develop Disco (http://discoproject.org) and related tools at the Nokia Research Center (http://research.nokia.com). Disco is 100% open-source, but a lot of its development takes place behind-the-scenes at NRC. We're looking for a way to share our future plans, tips and tricks, design discussions, or whatever else we might be thinking about. Let us know if there's something in particular you'd like to hear more about!