tomrijntjes · 7 years
Headless browsing with Firefox on Windows and Linux
I'm developing a web crawler with Selenium on my Windows dev box, to be deployed in a Docker container on a cloud host. To do so effectively, I want to be able to inspect the behavior of the crawler visually before switching to headless mode, without any discrepancies between the two. Thus far I've been developing in Firefox, followed by a local test in PhantomJS before deploying to the server. This has led to a host of problems, because PhantomJS and Firefox differ in subtle and not so subtle ways.
Recently, Mozilla released headless browsing for Firefox on Windows. I expect that using the same browser under the hood while toggling between headless and normal mode will make my life a lot easier. To replicate the setup on Linux/Docker, use the following Dockerfile.
# start with a base image
FROM ubuntu:16.04

# install dependencies
RUN apt-get update && apt-get install -y software-properties-common curl
RUN apt-get update && apt-add-repository ppa:mozillateam/firefox-next
RUN apt-get update && apt-get install -y python3-pip python3-dev python3-mysqldb python3-dateutil firefox

# install geckodriver
RUN export BASE_URL=https://github.com/mozilla/geckodriver/releases/download \
 && export VERSION=$(curl -sL https://api.github.com/repos/mozilla/geckodriver/releases/latest | grep tag_name | cut -d '"' -f 4) \
 && curl -sL $BASE_URL/$VERSION/geckodriver-$VERSION-linux64.tar.gz | tar -xz \
 && mv geckodriver /usr/local/bin/geckodriver
ENV MOZ_HEADLESS=1
ADD . /
RUN pip3 install selenium
To quickly toggle between headless and normal mode, add something like the following to your crawler:
import os
import sys

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

def get_browser(self, headless=False):
    if headless:
        # headless mode currently requires the Nightly binary on Windows
        binary = FirefoxBinary(r'C:\Nightly\firefox.exe', log_file=sys.stdout)
        os.environ['MOZ_HEADLESS'] = '1'
        return webdriver.Firefox(firefox_binary=binary)
    else:
        return webdriver.Firefox(executable_path=r'C:\geckodriver\geckodriver.exe')
tomrijntjes · 8 years
Developing a Synthetic Organism-Enterprise: lessons learned
The project is nearing completion, which makes this a good time to take a step back and reflect. About a year ago I watched a clip on cryptocurrency (subtitles available) that talks about fully autonomous vending machines managing their own inventory and funds. I thought that was a pretty cool idea and wanted to build one. A researcher called Matthew Gladden described the concept in depth (pdf), which gave me the scientific grounds to do so under his supervision.
Lesson 1: completing the business cycle without supervision is not the hardest part
In fact, there are tons of revenue-generating schemes out there that require no further supervision, as popularized by the passive income community. This community consists of people who build web-based assets by exploiting quirks in search engines or jumping into niches where little competition exists. Less benign examples are spam bots. I have yet to find an example of an unsupervised business surviving in a highly competitive market when pitted against human creativity, so generally the stakes are low.
I made an affiliate marketing web application that sells watering cans. All moving parts could be automated or outsourced. No system is in place to drive traffic, but, you know, first things first. 
Lesson 2: don't use genetic programming if you're dealing with continuous variables
Many strategies exist for continuous optimization. I'm very fond of set-and-forget tools that relieve you from worrying about one component of your business, or your life for that matter. Take for example this excellent blog on a set-and-forget approach to split testing. It's clever and simple: all you have to do is add another version of your webpage to the pool, and the best version will emerge eventually. It works, it's easy to implement and the maths is simple to grasp.
I used an evolutionary computation module called deap in combination with a selection mechanism of my own design to drive learning over generations. I'm not particularly proud of this part because I made some mistakes.
Kind of a no-brainer in hindsight, but I still want to mention it: it's super inefficient to pick random boundary values over and over again and see how they perform when the search space is likely to have a smooth, curved fitness landscape. A next iteration would drastically reduce the search space, and I would dedicate much more time to designing a clever encoding.
Think that shit through, people. Don't be like me.
Lesson 3: artificial life lives in simulations
Artificial life is a field dedicated to understanding life by creating it. Typical a-life research generates agents on a grid, each with a strategy for gathering food, runs five billion generations and proceeds to describe changes in the gene pool.
I wanted to build upon this work because I was going to build artificial life; that's how Gladden's model defined it. However, nobody does that. It's not well understood how to create artificial life outside simulation. The standard reproduction algorithms are designed to approach biological life as closely as possible, and are not suited to constraints like steep fitness evaluation costs.
I designed a non-generational selection scheme. Each unit has an internal energy level that varies during its lifetime through interactions with the environment. In my case, the interactions were serving a page (-), a conversion at a third-party seller (+++) and birthing a new instance (-----). Pretty clever, no? This approach does not appear in the literature as far as I know, so to this day I'm busy trying to understand its properties.
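To make this concrete, here is a minimal sketch of such an energy-based unit in Python. The class and the specific energy amounts are hypothetical; they merely mirror the signs above.

class Unit:
    def __init__(self, genome, energy=50.0):
        self.genome = genome
        self.energy = energy  # internal energy, the unit's "cash"

    def serve_page(self, converted):
        self.energy -= 1.0      # serving a page costs a little (-)
        if converted:
            self.energy += 9.0  # a conversion pays well (+++)

    @property
    def alive(self):
        return self.energy > 0  # running out of energy means death

    def maybe_reproduce(self, mutate):
        # birthing a new instance is by far the biggest expense (-----)
        if self.energy >= 100.0:
            self.energy -= 50.0
            return Unit(mutate(self.genome), energy=50.0)
        return None

Because reproduction only fires when a unit can afford it, selection pressure emerges from the economics alone, without discrete generations.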
The real lesson for myself is I should not try to tackle too many hard problems at the same time. Hard for me, that is, not necessarily fundamentally hard. It would've been wise to stick to well understood techniques to create an SOE rather than creating technology on the fly without the time or the knowledge to really understand its dynamics. That's hard enough in itself.
Lesson 4: not everybody wants to set and forget
At least it's a more subtle value proposition than I imagined. This is more of a hunch than a lesson; it emerged not from the project itself but from thinking about how to get ungodly rich off of my newly acquired knowledge.
In general, we don't like to trust machines when the stakes are high. We are too fond of micromanaging stuff to surrender ourselves to automation, even when the gains are obvious. The author of the split testing article I mentioned before notices the same thing.
I haven't worked much in the industry, but judging by the discourse I follow online and among my peers, the situation in the data science field is as follows: we have the technology to automate almost every process, but we don't want to. We build dashboards instead, so humans can make the last 5% of the decisions.
Hunch: SOE's will make their mark before the arrival of Strong AI
When building this thing I expected to encounter fundamental objections to the concept. It didn't happen.
Will we have machine-governed businesses that translate and adapt high-level policies during their lifetime? It's a challenge, but I can't see why not. Will we have bots generating and pitching ideas at your next hackathon? Not tomorrow, perhaps. Entrepreneurial opportunities are very hard to model. However, techniques like the lean startup approach give us the means to validate business ideas efficiently through an iterative process of testing and pivoting. If there's a recipe, a machine can follow it.
Generating business ideas? Maybe not in the sense we like to think of young guns developing stuff after hours in their dorm rooms. But machines do excel at spotting trends in massive torrents of streaming data. Starting a business in market X is not fundamentally different from an algorithmic trading system placing a put option based on changes in the environment, as long as the environment can be quantified and there's a recipe for starting the business. Think for example of franchises. Are there indicators of growing demand for fast food in this region? Plunk down a McDonald's. There's no need for periodic human research if the parameters for success are well understood. Create a learning system that manages said franchise by adapting to changes in the local market through well-researched heuristics, and suddenly you have an SOE governing a large part of the business process. Far from impossible. It's up to you whether you find it desirable.
tomrijntjes · 8 years
5 Things you Didn't Know about Genetic Programming
Now that I've baited your click, I will need even more of them. It's for science though, so stick with me.
Sold when you read 'science'? Click here and read the instructions.
I'm still very much in love with the idea of fully autonomous robotic entrepreneurs that live, build a business and perish as if they were living things. This post is about a second incarnation of this concept in the form of a species of affiliate web shops.
The analogy is simple. Basic life forms gather nutrients for energy. If they run out of energy, they perish. If they have abundant energy, they reproduce and pass on their genetic code.
This system is no different. Every web shop is treated as an individual enterprise with a startup cost of 50 cents; yes, cash is the lifeblood of any enterprise. Serving a visitor imposes a virtual cost of 0.25 cents, a combination of marginal costs and opportunity costs. When the business runs out of cash, it perishes.
The affiliate marketing model generates revenue by charging a percentage of every sale made by the third-party web shops it refers to. Every clicked outbound link has an expected value: the average conversion rate times the product price, times the percentage the referrer receives. Say the referrer makes 5% on each sale and the average conversion rate is 6%; a click on a $30 product is then worth 0.05 * 0.06 * 30 = 0.09 dollars, or 9 cents.
When two businesses make a combined profit of 50 cents, that's enough to afford the startup cost of a new instance. Their genetic code is recombined, hopefully passing on the traits that made them successful.
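As a sketch, that economy fits in a few lines of Python; the constants are the numbers quoted above, with everything expressed in cents:

STARTUP_COST = 50.0      # cents: the price of spawning a new web shop
COST_PER_VISIT = 0.25    # cents: marginal plus opportunity cost of serving one visitor

def click_value(product_price_cents, conversion_rate=0.06, referral_cut=0.05):
    # expected revenue of one outbound click, in cents
    return referral_cut * conversion_rate * product_price_cents

click_value(3000)  # 9.0 cents for a $30 product

def can_reproduce(profit_a_cents, profit_b_cents):
    # two shops that jointly cleared the startup cost can fund a child
    return profit_a_cents + profit_b_cents >= STARTUP_COST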
It's unusual to optimize business units in their entirety rather than the individual processes. For my graduation project I'm testing this approach. That's why I need you. Take a minute to visit the experimental web application and subscribe for a daily reminder. Thanks a bunch :)
tomrijntjes · 8 years
Evolving mySQL queries with Deap, pt 1: trees
I was drawn to the concept of evolutionary programming before having ever written any code. Programs writing programs!? Holy shit, that's cool.
And it's not that hard either.
We're going to follow the Genetic Programming tutorial from the docs and modify it to spit out SQL queries. We need to define primitives and terminals first. To keep it simple, let's stick to AND, OR and the comparisons greater than, less than and equals as primitives.
The terminals are a random float between 0 and 100 and one of the column names of our SQL table. Notice that all terminals are floating points, allowing the strongly typed primitive set to enforce well-formedness.
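A sketch of that primitive set with DEAP's gp module (the column name price is an example; the primitive names match the generated queries shown below):

import operator
import random

from deap import gp

pset = gp.PrimitiveSetTyped("QUERY", [], bool)

# boolean connectives
pset.addPrimitive(operator.and_, [bool, bool], bool)  # rendered as and_(..., ...)
pset.addPrimitive(operator.or_, [bool, bool], bool)   # rendered as or_(..., ...)

# comparisons take two floats and return a bool
pset.addPrimitive(operator.gt, [float, float], bool)
pset.addPrimitive(operator.lt, [float, float], bool)
pset.addPrimitive(operator.eq, [float, float], bool)

# terminals: a random float between 0 and 100, and a column name typed as float
pset.addEphemeralConstant("rand100", lambda: random.uniform(0, 100), float)
pset.addTerminal('price', float)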
This suffices to generate a valid full tree by running:
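# a minimal sketch, assuming the pset above; the exact depth parameters are guesses
expr = gp.genFull(pset, min_=1, max_=3)
tree = gp.PrimitiveTree(expr)
str(tree)  # e.g. "or_(lt('price', 'price'), eq('price', 31.31186591403997))"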
Unfortunately, Deap produces queries using a syntax SQL can't understand and it does not supply a built-in parser, so we'll have to write one ourselves to turn or_(lt('price', 'price'), eq('price', 31.31186591403997)) into (price < price) OR (price = 31.31186591403997).
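Here is a sketch of one way to write such a parser: evaluate the tree's string form with SQL-emitting stand-ins for each primitive. It assumes the primitive names from the pset above and trusts its input, so feed it nothing but DEAP's own output.

def _infix(op):
    return lambda a, b: '({} {} {})'.format(a, op, b)

SQL_OPS = {
    'and_': _infix('AND'),
    'or_': _infix('OR'),
    'gt': _infix('>'),
    'lt': _infix('<'),
    'eq': _infix('='),
}

def to_sql(tree_str):
    # the quoted terminal 'price' evaluates to the bare column name;
    # float literals pass through untouched
    return eval(tree_str, {'__builtins__': {}}, SQL_OPS)

to_sql("or_(lt('price', 'price'), eq('price', 31.31186591403997))")
# -> '((price < price) OR (price = 31.31186591403997))'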
This does the trick (most of the time). Please use at your own risk.
Let's fire up SQL to verify the syntax and, ultimately, poll the data to determine the fitness of the individual. Note that IndexErrors are caught to deal with the randomly occurring inability to satisfy all type constraints. See also the note in the GP tutorial.
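A sketch of what that evaluation might look like, with a hypothetical products table and throwaway credentials; the retry loop shows the IndexError workaround:

import pymysql
from deap import gp

conn = pymysql.connect(host='localhost', user='root', password='password', db='db')

def evaluate(individual):
    # toy fitness: the number of rows the evolved WHERE clause matches
    query = 'SELECT COUNT(*) FROM products WHERE ' + to_sql(str(individual))
    with conn.cursor() as cursor:
        cursor.execute(query)  # also verifies the syntax against a real server
        return cursor.fetchone()[0],

def random_tree():
    while True:
        try:
            return gp.PrimitiveTree(gp.genFull(pset, min_=1, max_=3))
        except IndexError:
            # the pset has no bool terminals, so typed generation can fail; retry
            continue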
I rather enjoyed watching the program invent nonsensical queries like (price < price). I haven't made up my mind yet whether it's necessary to prune the well-formed yet ostentatiously daft queries, or let the GP algorithm deal with its own quirks. Have fun writing programs writing programs!
Full code.
Requirements: Python, the Scipy stack (eg. Anaconda), Deap, and pymysql because it plays well with 'conda.
tomrijntjes · 8 years
Grappling with a HUGE SQL dump, Docker, Compose and Flask
I just want to jot down some pointers to prevent you from running into the same walls I did.
#hurdle 1: A container running the SQL server takes time to boot. Docker Compose doesn't like to wait.
The MySQL image executes .sql and .sh scripts found in /docker-entrypoint-initdb.d. However, if the linked SQL server isn't booted yet, the script fails.
#solution
I put the following shell script in /init:
#!/bin/bash
sleep 5
mysql -u root -ppassword db < /sql/dump.sql
And docker-compose.yml mounts that script in docker-entrypoint-initdb.d
data:
  build: data/.
mysql:
  image: mysql
  ports:
    - "3306:3306"
  environment:
    MYSQL_ROOT_PASSWORD: password
    DATABASE: 'db'
  volumes:
    - ./sql:/sql
    - ./init:/docker-entrypoint-initdb.d
  volumes_from:
    - data
Note that two containers are built: a persistent data container and a mysql server that mounts the volume from the data container (source). The Dockerfile of the data container:
FROM n3ziniuka5/ubuntu-oracle-jdk:14.04-JDK8
VOLUME /var/lib/mysql
CMD ["true"]
I used this pattern with a 12 GB SQL dump file. Conveniently, the shell script is run only once. If you want to rerun the script for debugging purposes, remove the containers by calling
docker-compose rm
#hurdle 2: The hostname of the mysql container doesn't show up in the environment variables. You need the hostname to query the database from your webapp.
A flask application can look like this:
#!/usr/bin/env python
from flask import Flask
import MySQLdb
import os

app = Flask(__name__)

conn = MySQLdb.connect(host="localhost",
                       user="root",
                       passwd=os.environ['DB_ENV_MYSQL_ROOT_PASSWORD'],
                       db=os.environ['DB_ENV_DATABASE'])
c = conn.cursor()

@app.route('/')
def home():
    c.execute("SHOW TABLES")
    raise Exception(c.fetchall())

if __name__ == "__main__":
    app.run(host='0.0.0.0', debug=True)
Unfortunately, the sql server doesn't run on localhost but in a different container. The name of the host doesn't show up in the environment variables either. Change home() to the following to verify this:
raise Exception(os.environ)
I like using 'raise' because it allows you to inspect data with unknown structure through the Werkzeug debugger.
#solution
The hostname is equal to the name of the container that runs sql. 
It took me a long time of digging through the networking docs to figure this out. I had to do some discovery by running:
docker exec -i -t yourwebcontainer /bin/bash
to get a shell in the container that holds your web app, then
cat /etc/hosts
This should output something like 
127.0.0.1    localhost
::1          localhost ip6-localhost ip6-loopback
fe00::0      ip6-localnet
ff00::0      ip6-mcastprefix
ff02::1      ip6-allnodes
ff02::2      ip6-allrouters
172.17.0.3   db 01129da80e56 sqlcontainer_mysql_1
172.17.0.3   sqlcontainer_mysql_1_1 01129da80e56
172.17.0.3   mysql_1 01129da80e56 sqlcontainer_mysql_1
172.17.0.2   9618c6adaa6a
These lines show the IP addresses of all linked containers and their aliases. Modify the Flask application accordingly and you're good to go.
conn = MySQLdb.connect(host="sqlcontainer_mysql_1",
                       user="root",
                       passwd=os.environ['DB_ENV_MYSQL_ROOT_PASSWORD'],
                       db=os.environ['DB_ENV_DATABASE'])
tomrijntjes · 8 years
Artificial Life with Docker. Part 1: Death
I think artificial life is a truly fascinating field, but it's a pity that scientists aim almost exclusively at creating life in simulated environments. It would be nice to create a life form that interfaces with the real world in an accessible way.
In this series I try to create a species of web applications using the container service Docker. I expect Docker's caching abilities to come in handy when generating a large number of similar but distinct instances of a web application species.
Life (1) is composed of particular individuals, that (2) reproduce (which involves transferring their identity to progeny) and (3) evolve (their identity can change from generation to generation).
In this first installment we set up a basic Flask application with the capability to turn itself off.
This may seem a strange place to start. The end goal is to have multiple web applications compete for resources to achieve advancement of the species as a whole. Inevitably, the weaker ones have to die.
We create a basic Dockerfile that builds a Flask webserver on top of Ubuntu.
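Something along these lines (a minimal reconstruction; the base image tag, file layout and the Docker SDK package name are assumptions):

FROM ubuntu:16.04
RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install flask docker
ADD app.py /app.py
CMD ["python3", "/app.py"]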
This is the Flask app. Notice the GENOME environment variable for future use. The name of the image is assumed to be equal to the genome. This is a design choice: we want every instance to be unique.
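A sketch of such an app: the genome comes in through the environment, and the self-kill goes through the Docker SDK's low-level client (newer releases of docker-py call it APIClient).

import os
import random
import socket

import docker
from flask import Flask

app = Flask(__name__)
GENOME = os.environ['GENOME']  # by design equal to the name of the image

@app.route('/')
def greet():
    if random.random() < 0.1:  # a ten percent chance of perishing per request
        cli = docker.APIClient(base_url='unix://var/run/docker.sock')
        cli.kill(socket.gethostname())  # a container's default hostname is its own id
    return 'Hello, I am {}.'.format(GENOME)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)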
Finally, we tie everything together by building and running the image. Notice the -v flag to expose the unix socket on the host machine to the container.
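For example, with a hypothetical genome string as the image name:

docker build -t genome42 .
docker run -d -p 80:80 -e GENOME=genome42 -v /var/run/docker.sock:/var/run/docker.sock genome42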
The application greets us at localhost. Or does it? It has a ten percent chance of perishing at every request. The docker-py module interfaces with the docker instance on the host through the remote API, thus allowing it to run cli.kill("its own name"). If you think this is tragic, just know that we're not talking about sentient life... yet.
tomrijntjes · 9 years
My research project begins to take shape.
tomrijntjes · 10 years
Computer Science in Movies
Rotten Tomatoes Average Rating on the Y-axis, years on the X-axis. 
tomrijntjes · 10 years
On Soccer and Public Data
tomrijntjes · 10 years
(Vimeo video)
Neuroscientific efforts propel increasingly accurate methods of mapping brain activity to stimuli. What if these mappings were recorded and represented outside grey matter? Is the ability to think exclusively a property of the human brain? Or could one's mind be copied, compressed and stored as if it were virtual? A project by Dirrik Emmen, Coen Dekker and Tom Rijntjes
tomrijntjes · 10 years
Gambling and Honesty
A few years ago I flunked college and started playing poker full time. It was fun, challenging, and I made much more money on an hourly basis than I would have stacking shelves in the supermarket. After months of practice and studying statistics I had a reasonable edge on other players, but taking the house's cut into account I had to play over a thousand hands hourly, on multiple online tables simultaneously, to grind out a nice income. At the time poker was quite popular, and making a living as a professional gambler made me somewhat well-known in the local pub circuit.

I was invited to play with the bar's employees and their friends every now and again on Mondays, when the bar was closed. This may conjure an image of rich men wearing fedoras and smoking cigars, but in reality the players' wealth and the stakes were quite modest. We played a tournament in which one stood to lose about €25, and due to a tiered payout structure the winner would take home slightly over ten times the buy-in. After a few fruitless attempts, I won the tournament twice in a row. I was accused of cheating and subsequently banned from my favourite bar.
In their defence, winning a 40-player tournament twice in a row is a streak of luck bordering on a statistical anomaly. Secondly, I won most of the hands I dealt myself, due to the advantage of being the final actor - which happens to coincide with dealing the cards. Thirdly, live poker is about fifty times slower than what I was used to, so I spent a lot of time fiddling with chips and cards, leading to what may be interpreted as considerable sleight of hand.
Now, cheating is an interesting phenomenon. According to Dan Ariely, people will cheat when given the chance, but only by a little bit. The gravity of our misbehaviour is affected by neither the potential gains nor the chances of getting caught; we respond irrationally to incentives. Ariely suggests that when cheating is possible, we are torn between economic incentives and a desire to think of ourselves as honest people. This conflict results in an irrational tendency to cheat by a small amount. Alongside this omnipresent cheating there exists the golden rule, a heuristic of practical ethics stating that one should treat others as one would like to be treated oneself. There are many opportunities to steal, rob or extort, but in reality we adhere to the golden rule most of the time.
Back to the pickle I got myself into. Even in the light of opportunity, I never cheated, not even a little bit. The social penalty of being removed from my favourite drinking hole was far heftier than the potential monetary gain. However, I do not believe that choosing fair play was a calculated decision. You've got to trust me on that one.
This column is part of an assignment for Cool Science at Leiden University. 
tomrijntjes · 10 years
PACMAN: keeping nature conservation Real
I propose the founding of PACMAN, an organization that keeps conservation real. Here's why. A few years ago I visited the Boston Zoo. One pen drew a particularly large crowd. Inside I discovered a young panda bear, facing the crowd without caring much and chomping away at bamboo at an incredible rate. It was the coolest thing. In the back of the pen, a rat scurried through the scene. A girl shrieked, pointed at the rat and said something to her mother about 'filthy animals.' Her mother wondered aloud why the zoo staff didn't eradicate the rats. This is nature conservation in a nutshell.
The gist of most conservation schemes is that nature conservation equals reducing biodiversity loss. Promoting biodiversity in this narrow sense assumes that more species is always better. This is wrong, and it's easy to see why. Take for example the threatened species Yersinia pestis, which is extinct outside containment. Would you say we need to invest in the conservation of this species, knowing it is the bacterium responsible for the bubonic plague? It would maintain biodiversity! Beyond this obvious example there are countless species we would never miss, simply because we are unaware of them or because they serve no purpose. Some species are instrumental to things important to us, like food security, but many more are not. The WWF estimates that up to 100,000 species go extinct every year. Yet we can only name about ten species that have gone extinct since 2000. Apparently, we care enough about 0.0001% of all species to notice they are no longer in existence.
The IUCN Red List features only 20,000 species threatened with extinction, the majority of which are mammals and amphibians. Less sexy species like fungi and bacteria are vastly under-represented. Do these not contribute to biodiversity? This puts conservation institutes in an awkward position. On the one hand they persuade us to hold all suffering in equal regard, human and animal alike. On the other hand, conservation is prioritized aggressively. One cannot promote equality and inequality at the same time. The truth is, we do not want a world full of fungi and bacteria. We like fluffy animals like pandas. I always prefer pandas over mosquitoes, because they inspire awe rather than annoyance. I think it's perfectly okay to have preferences based on my own subjective experience. In fact, it's okay to actively reduce certain aspects of biodiversity. Did I mention PACMAN is a backronym for Pandas Are Cool, Mosquitoes Are Not? The first one hundred subscribers get a Yersinia pestis plushy.
Sources:
WWF projections of biodiversity loss
IUCN Red List meta
This blog is part of an assignment for the course Cool Science at Leiden University.
tomrijntjes · 10 years
Nanobucks
How curiosity-driven science struggles for resources is the talk of the town, but apparently some fields have been subject to an astronomical surge of capital. According to NOS, a Dutch news agency, €1500 billion was invested in nanotechnology in 2010, roughly the same amount as Italy's GDP of that year. The vastness of this number was explained by a major contribution from the paint and coating industry, which apparently employs passive nanostructures on a grand scale.
But it doesn't add up. The global coating industry sold about $90 billion worth of paint and such. Even if we assume that every coating company uses nanotechnology - which is not the case - this still leaves roughly €1400 billion unaccounted for. BBC research came up with a much more conservative number: the institute valued the global market of nanotech at $20.1 billion as of 2011, about 1% of the estimate mentioned above.
One might wonder how a hundredfold overstatement popped up in NOS's news bulletin. My best guess is that the hype surrounding nanotechnology in the mid-2000s led to rosy projections by venture capitalists seeking to boost demand for their assets. But that's not too interesting. What's interesting is how a respectable news source can miss the mark by such a wide margin. It's hardly a consequential mistake, but it demonstrates a crucial lack of intuition.
The reason for this is twofold. The first is that stakeholders in the capital market sometimes have an incentive for 'bending the truth': if a dubious investment can be made to look worthwhile, huge amounts of money can be earned. The second is that global trade is so vast that only specialists have some notion of its scope; you and I never encounter anything close to it in our daily lives. I'm used to managing a mere €2000, one billionth of NOS's estimate of the nanotech industry. The prefix for something that is a billionth of a unit happens to be nano. Coincidence?
This column is an assignment for the course Cool Science at Leiden University.
tomrijntjes · 10 years
(YouTube video)
Made with Kinect, Processing and SimpleOpenNi 1.96.
Code excerpt: http://pastebin.com/6pcbGWnZ
tomrijntjes · 11 years
Fingertracking with Kinect and Processing.
Pretty cool, huh? I plan to build a gesture based interface with this.
(source code)
tomrijntjes · 11 years
High-speed photographs of ink dropped into water.