Copperblog
New Site
Hopefully GitHub Pages remains a thing!
https://bcoppersmith.github.io/
Silver Lake Parking Tickets II: Neighborhoods
After months of skating by, undetected, unpermitted, it finally happened: we got a parking ticket.
Our neighborhood, just south of the Silver Lake Reservoir, restricts overnight parking between 11pm and 6am to those with District 98 parking permits. My wife normally parks her car on the road just north of our home. It's on a steep incline, and no one has received a ticket on that stretch of road in the past 6 years, for any reason.
But that night her usual spot was taken, so she parked maybe 15 feet away from it, on the street just to the west of us -- an area that still receives around 2-3 tickets a month. Not a ton, but enough to keep people honest. And we were kept honest.
While it didn’t keep us from getting burned that night, we knew these ticket stats from studying the City of Los Angeles’s massive parking citation dataset. At the time of writing, it contains over 10 million unique citations, issued all over the city between February 2010 and September 2016 (it took me a while to post this!). Parsing this data tells us where it’s safest to park around our house, but we can also use it to spot neighborhood trends.
I joined the citation data against the LA Times’s neighborhood shapefiles to map how hard each Eastside neighborhood is hit by parking citations:
East Hollywood and Boyle Heights come out on top -- proximity to Downtown or Hollywood does not help here, as those are far and away the hardest-hit neighborhoods in the entire city. Also of note: Atwater Village has 15 times as many tickets as Mount Washington, even though they’re roughly the same size (likely because Atwater Village has 15 times more road than Mount Washington).
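(For reference, here's a minimal sketch of how a citation-to-neighborhood spatial join like this can be done with geopandas. The file and column names are assumptions for illustration -- the actual pipeline is in the parser repo linked at the end of this post.)

```python
import geopandas as gpd
import pandas as pd

# Hypothetical inputs: a cleaned citation CSV with lat/lon columns and the
# LA Times neighborhood shapefile. File and column names are assumptions.
citations = pd.read_csv("citations_clean.csv")
points = gpd.GeoDataFrame(
    citations,
    geometry=gpd.points_from_xy(citations["lon"], citations["lat"]),
    crs="EPSG:4326",
)
neighborhoods = gpd.read_file("la_times_neighborhoods.shp").to_crs("EPSG:4326")

# Tag each citation with the neighborhood polygon it falls inside,
# then count citations per neighborhood.
joined = gpd.sjoin(points, neighborhoods, how="inner", predicate="within")
counts = joined.groupby("name").size().sort_values(ascending=False)
print(counts.head(10))
```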
And here’s the top violation for each neighborhood (you can see that Silver Lake stands alone when it comes to doling out Preferential Parking citations):
Finally, joining this data with the list of the City’s parking meters and their meter zones, we can find the Eastside’s most dangerous meters. That honor belongs to ER106-ER108, the cluster of meters at Caspar Ave and Colorado Blvd in Eagle Rock (ER106 has registered 456 tickets over the past 6 years, worth over $28,000 in fines):
[image: the Eastside's most dangerous parking meters]
Also high on the list: the meter outside of Scoops on Heliotrope and the area around the Vintage Theater in Los Feliz (Sunset Junction has some dangerous parking spots, but those meters aren’t officially listed in the meters dataset).
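(Once citations are matched to meter IDs, the per-meter tallies above are a simple aggregation. A rough sketch, with assumed column names:)

```python
import pandas as pd

# Hypothetical columns on the joined citation data: "meter_id" and "fine_amount".
citations = pd.read_csv("citations_with_meters.csv")
per_meter = (
    citations.groupby("meter_id")
    .agg(tickets=("meter_id", "size"), total_fines=("fine_amount", "sum"))
    .sort_values("tickets", ascending=False)
)
print(per_meter.head(10))
```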
Overall, the Parking Citations dataset is a little challenging to parse, but it’s ultimately a goldmine for data analysis. And if I can remember to renew my permits and park in the right spots, I’ll be doing my part to prevent the dataset from getting too big.
(for technical details on how this data was produced, see https://github.com/bcoppersmith/lacity-citation-parser).
Star Wars: Let’s Put Some Data On It
Star Wars: The Force Awakens is going to be big. Its midichlorian readings are off the charts. I mean, this is looking like a steam engine so fast even Sebulba couldn’t keep up!  We’re going to have to get Nute Gunray to negotiate some new trade deals to accommodate the insatiable demand for new toys! 
(Coming soon: my 15,000 word piece on why The Phantom Menace is actually fine and pretty fun).
So how much money will Star Wars make this weekend? I reran my model and updated some comparisons. Let’s dig in:
The New Model
Again, my box office prediction model uses a film’s budget (proxy for star power and distribution reach) and its Wikipedia pageviews (proxy for popularity) to predict opening weekend box office grosses.
I updated my movie data with all films from the back-half of 2015 and fixed some bugs in my workflow. All said, I increased the size of the dataset from 133 films to over 300. We’re really working with some big medium data here!
The model itself is slightly more predictive now -- the median error is a little smaller and the Adjusted R-Squared is up to 0.62 from 0.56. I also changed one of the independent variables: the Wikipedia page view variable is now calculated as the average number of views 2-to-14 days before a film’s release.
Star Wars: The Force Awakens has a $200 million budget and has averaged about 70k Wikipedia pageviews for the past two weeks. Plugging that in, the model predicts an opening weekend total of $123,691,557. The 95% prediction interval tops out at $161,369,690.
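(For the curious, here's a rough sketch of how a prediction interval like that can be computed with statsmodels. The dataframe and column names are assumptions, not the model's actual code.)

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical training data: one row per past film.
films = pd.read_csv("films.csv")  # assumed columns: opening_gross, budget, avg_views

model = smf.ols("opening_gross ~ budget + avg_views", data=films).fit()

# Star Wars: The Force Awakens, using the numbers cited above.
new_film = pd.DataFrame({"budget": [200_000_000], "avg_views": [70_000]})
pred = model.get_prediction(new_film).summary_frame(alpha=0.05)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])
```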
Model Errors
That prediction is going to be wrong, though. 
The model does a bad job at predicting the absolute biggest films. It missed Jurassic World’s opening box office total by over $140 million, and Minions’ cume by $80 million. The average error for the Top 10 opening weekend box office totals from the past three years is $65 million.
This simple linear regression model, like most models, tends to break down at the extremes. This is partly because there are fewer examples from which to learn. It’s also likely because huge films are dominated by different factors than smaller films. For most films, the biggest problem is making sure people know they exist. That isn’t a real problem for blockbusters; once a film becomes extremely well known, effects from franchise quality, word of mouth, and competing premieres start to really matter. I can’t capture those right now, though, so I’ll have to live with this error. I think of it as The Subway Corollary: If a movie is big enough to have a corporate tie-in with Subway, my model won’t have much predictive power.
(If this seems too pessimistic, maybe I should have done a better job publicizing some of the movies it predicted really well. To pick a recent example: remember Magic Mike XXL? Remember how charismatic Channing Tatum is? That came out just a few months ago! Everyone loved it! Well, Magic Mike XXL earned $12 million its opening weekend -- and my model’s prediction was off by about $393k.)
Comparisons
The most novel component of this model is the Wikipedia page view data (it’s really hard to collect and aggregate! Believe me!). It’s also cool to see how Star Wars: The Force Awakens does against some comparable films:
Avengers: Age of Ultron’s Wikipedia page was actually significantly more popular than Star Wars: The Force Awakens 4-to-10 days before its release. Surprising! It does seem pretty erratic, though, so perhaps there’s another factor at play (like a large volume website dumping traffic to Wikipedia).
The Star Wars page does have some spectacular highs from even earlier, though -- it got 306,034 views on October 20th. That’s the day its final trailer premiered during Monday Night Football (per CNN; that article notes the trailer got enormous traction on social media that night, and the fact that the pageview data shows the same huge spike gives me confidence in its validity as a proxy for Facebook Shares, Twitter Tweets, Bing Searches, MySpace Posts, Google Plus Blorts, etc.). Avengers: Age of Ultron had a similar peak when its very first trailer was released, but it was 25% smaller.
All together: Star Wars: The Force Awakens has one of the most popular Wikipedia pages of the past 3 years, and is primed for a huge opening. It’s so big that I have very little confidence in the ability of this model to predict its opening weekend box office gross. But that’s ok -- come back to me if there’s another mid-budget comedy about male strippers; I can do those just fine.
Rich Get Richer: Hot 100 Edition
Top Tracks
Check out this story from “Hit Charade” by Nathaniel Rich (sourced from John Seabrook):
[Max Martin and Dr. Luke] are listening, reportedly, to the Yeah Yeah Yeahs’ “Maps”—an infectious love song, at least by indie-rock standards. Martin is being driven crazy by the song’s chorus, however, which drops in intensity from the verse. Dr. Luke says, “Why don’t we do that, but put a big chorus on it?” He reworks a guitar riff from the song and creates Kelly Clarkson’s breakout hit, “Since U Been Gone.”
Top Forty radio is dominant again, thanks in large part to a small group of super producers reverse-engineering pleasing sounds on a MacBook.  It really seems that almost every major hit is produced by a small cabal of Scandinavians -- Max Martin (Sweden), Mikkel Eriksen and Tor Hermansen of StarGate (Norway), and Dr. Luke (America, but seems awfully Swedish by association).
Ok -- so producers are big right now. But how big? How many top songs are from the same producers? And if it’s a lot -- has it always been that way?
The short answers: pretty big, a bunch, and not really.
The long answer involves more data.
More Data
To investigate, I grabbed data for all Billboard Hot 100 Top 10 tracks from Wikipedia. It comes from pages like this, and there’s data all the way back to 1958. I wrote a script to grab each of the songs, along with the list of producers for each song from the individual song pages (like this one, for example).
Once I had the data, I ranked all artists and producers for each year by how many songs they registered in the Top 10. I took the top 10% of artists and producers for each year, and charted the historical trends for their proportion of Top 10 hits:
As expected, the top producers and artists from a given year produce a disproportionate share of the Billboard Hot 100 Top 10 hits. More unexpected: the top 10% of producers are producing an increasing share of all Top 10 hits -- peaking at 60% in 2010, when Dr. Luke charted ten Top 10 hits!
This measure favors producers by nature, since there are multiple producers per song and only one artist per song (I did not count artists who received producing credit on their own songs, though). Still, it's remarkable that about 7 producers are responsible for nearly half of all recent Hot 100 Top 10 hits.
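(If you want to follow along, here's a minimal sketch of that yearly share calculation. It assumes a table with one row per producer credit on a Top 10 song; the names are mine, not the actual script's.)

```python
import pandas as pd

# Hypothetical input: one row per producer credit on a Hot 100 Top 10 song.
credits = pd.read_csv("top10_producer_credits.csv")  # columns: year, song, producer

shares = {}
for year, group in credits.groupby("year"):
    songs_per_producer = group.groupby("producer")["song"].nunique()
    n_top = max(1, int(len(songs_per_producer) * 0.10))  # top 10% of producers
    top_producers = songs_per_producer.nlargest(n_top).index
    covered = group[group["producer"].isin(top_producers)]["song"].nunique()
    shares[year] = covered / group["song"].nunique()

print(pd.Series(shares).sort_index())
```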
Top Talent
As it turns out, it doesn't just seem like Max Martin has produced all of your favorite songs. He's actually produced the most Billboard Hot 100 Top 10 hits of all-time, with 46. Just behind Max is George Martin with 43 career Top 10 hits, 26 of which were with The Beatles. (Max Martin passed him in 2014 with Blank Space. So The Beatles were bigger than Jesus, just not a Swedish guy).
Most of Max Martin and Dr. Luke’s top 10 songs are by Katy Perry. In fact, they both get producing credit on a fair number of these hits. This dynamic -- multiple producers on the same track -- is also likely behind the increase in songs by the top 10% of producers. On the graph below, you’ll see that the average number of producers per track has increased by one whole producer since the mid-70s:
The total number of all producers credited on any Top 10 hit has bounced around a ton, but doesn’t show as strong of an upward trend:
The total number of producers isn’t really increasing, but the producers we do have are getting credited on more tracks. So part of the secret to becoming a super producer is super sharing the credit on these tracks.
I'd Like An Advance Copy, Please
The top producers are producing more of the Top 10 hits than ever before, and that share appears to be increasing -- while the top artists’ share is not. It may be that hit production has been boiled down to an artful algorithm dominated by a pioneering and capable few (a theory that I think is argued by this interesting-looking book that is coming out soon).
And if producers become the stars, then that may make the artists themselves a commodity. That’s a trend that, uh, Taylor Swift will have a hard time trying to, ahem,… shake it off. 
Ok. Jeez - I’m sorry. That was not good. Gah. Ok, I, uh, sure don’t want to create bad blood with my readers. Ok, oh no. Another pun! Time to go. Goodbye.
Notes:
H/t to my friend and bondsman Alex for pointing out the tangled web of producers and artists
Data and scripts to recreate the graphs above are on my GitHub page
Best Open Government Datasets: Vol. I
We truly live in the golden age of open government data:
h/t louis
source: DATA.NY.GOV
Fantastic Four: Popular, But Still Not Good
Fantastic Four looks bad. But it's still getting a bunch of eyeballs.
Here are the pre-release Wikipedia Page Views for this week’s big release, Fantastic Four, along with some related titles:
Fantastic Four is looking pretty good with these comps. But it’s getting nothing but derision and mockery from critics (but before you feel bad, read this profile of Miles Teller).
Early industry reports have it doing very poorly. Maybe Jupiter Ascending is the best comparison here -- both were sci-fi pics that looked intriguing months in advance, but got killed critically when they hit theaters. (Jupiter pulled in less than $20 million on opening weekend). The Amazing Spider-Man 2 also got harsh reviews, but it was significantly more popular than the other films in this chart (which is why it wasn't a total financial bomb).
My primitive model predicts a gross of $57 million for Fantastic Four, but this is looking like a big overestimate. One thing I don’t price into my model is quality. This is obviously a big factor, even for the opening weekend. 
I didn’t include a movie critic variable in my model because I want the predictions to be effective at least a month in advance. Unfortunately, professional reviews are only available the week a movie is released. There’s probably a way to capture “buzz” via Tweets or YouTube trailer comments further in advance. I don’t have access to that exact data, but I’m going to see if I can find a good enough proxy.
As for Ricki and The Flash: looks like the Wikipedia game isn't kind to the indies.
Update: I had previously underestimated the page views for Spider-Man -- there was a bug that scraped the wrong premiere date. It's fixed here, but still incorrect in my previous post. Should have an update soon.
Wikipedia Page Views and Opening Weekend Box Office (An Investigation)
Wikipedia Page Views were predictive for Ant-Man. But what about bug-free films?
Here's how the average daily views for a movie’s Wikipedia article relate to its opening weekend box office:
You can view a larger graph here
The Data
Collecting the data was a trip.
To aggregate the page view logs, I downloaded about a year and a half of data, from January 2014 to June 2015. Since these logs are hourly, I compressed them into daily files and threw out the records for all non-English and other irrelevant pages to save space. This took about 3 months of human time, all said and done.
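(To give a sense of that aggregation step, here's a rough sketch of collapsing one day of hourly pagecount dumps into daily per-article totals. The file-naming pattern and filtering are simplified assumptions; the real dumps have quirks this ignores.)

```python
import glob
import gzip
from collections import Counter

def daily_counts(day_glob="pagecounts-20150601-*.gz"):
    """Sum hourly Wikipedia pagecount files into daily per-article totals.

    Each dump line looks like: "<project> <title> <views> <bytes>".
    Only English Wikipedia ("en") rows are kept, as described above.
    """
    totals = Counter()
    for path in glob.glob(day_glob):
        with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) != 4 or parts[0] != "en":
                    continue
                totals[parts[1]] += int(parts[2])
    return totals

# e.g. daily_counts()["Minions_(film)"]
```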
For the movie data, I used a quick and dirty script to get all film titles from these Wikipedia pages (2014, 2015). I parsed the individual pages to get things like release date, box office, budget, runtime, and top-billed actor.
I joined both of these datasets on their Wikipedia page title (ex: "Minions_(film)"). After filtering out all movies with less than $3 million in opening weekend grosses, I was left with 118 movies.
The page view data is still huge and expansive at this point, so I calculated a few statistics. The most important was Average 60 -- the average daily page view count from 60 days before a film was released to just one day before. (I used a 60-day window because it was big enough to smooth out any weird outliers and far enough out so it could make cooler-looking time-trend graphs. I experimented with other time ranges but they were all about equally predictive). I plotted Average 60 against Opening Weekend Grosses, and, well, here we are today!
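(As a sketch, computing Average 60 from a table of daily page views might look something like this -- column names are assumptions.)

```python
import pandas as pd

def average_60(daily_views: pd.DataFrame, title: str, release_date: str) -> float:
    """Average daily views from 60 days before release through 1 day before.

    `daily_views` is assumed to have columns: title, date (datetime), views.
    """
    release = pd.Timestamp(release_date)
    window = daily_views[
        (daily_views["title"] == title)
        & (daily_views["date"] >= release - pd.Timedelta(days=60))
        & (daily_views["date"] <= release - pd.Timedelta(days=1))
    ]
    return window["views"].mean()

# e.g. average_60(views_df, "Minions_(film)", "2015-07-10")
```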
Building a Model
The underlying theory for this whole investigation is that the view count for a film’s Wikipedia page is a proxy for the overall interest in that film: if people are interested in a film, they’re going to Google for it and wind up on the Wikipedia page. So if a Wikipedia page is getting a lot of hits, it means a lot of people are interested. And if a lot of people are interested, then a lot of them will become paying customers.
Studios can (and probably do) use this kind of predictive data to help marketing and distribution strategies. It may also give them a head start on lining everyone up for the sequel.
But Wikipedia page views alone have only an OK correlation with opening weekend grosses, as seen in the graph above. The trendline looks pretty good, but just 12 of the 118 films have a “predicted” opening gross that’s within 10% of their actual opening gross -- not so hot.
To make the model more predictive, I threw a bunch of other independent variables into the regression:
View Velocity (i.e., if a movie’s Wikipedia article page views shoot up right before it gets released, does that matter more than its spread-out average?)
Budget
Sequel or Remake vs. Original Property
Genre
Gender of Top Billed Actor or Actress
Of all of these variables, only budget was significant at any reasonable p-value.
Why Don’t Sequels Matter?
That’s a pretty big shock -- I certainly expected the sequel/original variable to be significant. After all, the top grossing films are almost all sequels or remakes. And isn’t everyone always grumbling that studios only make sequels nowadays? 
My best guess is that sequels and remakes make more money because of the widespread awareness and interest that has been built up by the first film. But awareness and interest are already priced into the model through the Wikipedia page views, making the sequel variable irrelevant (I’ll dig into this idea in a future post).
I also think that budget is significant because it captures the “studio expectations” for a film. After all, they only spend what they think they’ll make back. I think this variable captures things like star power, special effects, overall quality, and marketing, none of which is captured in my model (yet!).
The Regression
Here’s what the simple multivariate linear regression looks like:
[image: regression summary output]
Again, that’s using Average 60 and Budget to predict opening weekend box office gross.
The model has a modest R-squared value, but both independent variables are statistically significant. The median error is -$2.1 million, which seems pretty reasonable.
The residuals have a very wide spread, and appear to exhibit some heteroskedasticity (i.e., the variability of errors is positively correlated with the prediction size). That may indicate that I’m missing some explanatory variables:
[image: residuals vs. predicted values]
The upshot here is that some log transformations may increase the accuracy of the model.
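(A quick sketch of what that transformation could look like with statsmodels -- the dataframe and column names are assumptions, not my actual workflow.)

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

films = pd.read_csv("films.csv")  # assumed columns: opening_gross, budget, average_60

# Original specification: dollar amounts and view counts in levels.
levels = smf.ols("opening_gross ~ budget + average_60", data=films).fit()

# Log-log specification: often tames right-skewed dollar figures and can
# reduce the heteroskedasticity visible in the residual plot.
logs = smf.ols(
    "np.log(opening_gross) ~ np.log(budget) + np.log(average_60)", data=films
).fit()

print(levels.rsquared_adj, logs.rsquared_adj)
```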
Biggest Misses?
The biggest underestimate was for Jurassic World. I don’t feel too bad about this, since most people missed pretty big -- this piece from Deadline is a good overview of that. The article also considers a few key factors that are missing from my model: holiday weekend release dates, competition, and word-of-mouth.
The biggest overestimate was The Expendables 3. It appears the fanbase for this film was 12,000 boys refreshing the Wikipedia page every day before it came out (the film also leaked 3 weeks before its premiere).
Next Steps
The model discussed here is fine for making general claims, but can’t be used to make particularly accurate predictions. I’m going to try to work on improving that by growing the movie database beyond 118 films. I also want to try to source more explanatory variables. (I’m thinking about using Wikipedia page views to capture “star power” for the model. That’s interesting on its own, too -- who gets more page views, Chris Pratt or Tom Cruise?)
I’ll also try to keep posting articles similar to the Ant-Man one-- I think it’s generally interesting and predictive to look at how similar movies stack up against each other on Wikipedia page views.
References
The movie database is in “rough draft” mode -- as are the data parsing scripts -- but request a share on this Google Sheet if you want to crunch your own numbers.
All movie data is from Wikipedia except for opening weekend box office totals. I hand-entered all opening weekend box office numbers from Box Office Mojo.
I filtered out movies that had an opening weekend gross less than $3 million because many of these were “rolling” releases for prestige or independent films -- movies that appear in a few cities at first, then spread out, or have a heavy VoD component. My model is made for predicting bigger, national release weekends.
Budget data is sourced from Wikipedia, and appears to be very generic -- it’s almost always rounded off to the nearest $5 million.
I made the scatter plot using Plotly (plotlyblog). It was really easy!
No One Is Reading About Ant-Man
It’s getting fewer Wikipedia views than any other recent Marvel film. Does that mean it’s going to make a lot less money, too?
I’m still cooking on a larger investigation, but I found an interesting intermediate result that I wanted to put up. It has to do with Ant-Man.
Ant-Man is a Marvel cash-extraction device that seems funny and weird. In a lot of ways, it is very similar to last summer’s Guardians of the Galaxy. This weekend, Ant-Man is probably going to make a fair amount of money. But is it going to make Guardians of the Galaxy-level money?
Probably not.
That prediction isn’t based on any fancy polling data or customer insights – just cold, hard Wikipedia page view data.
The Data
Wikipedia keeps a log of all page views for every article. The data is raw and enormous. I’ve built a historical database of page views for the past year and a half, though, so I can access aggregate stats (I’ll detail more on this in the future).
But for now, check out the daily page views for Ant-Man and Guardians of the Galaxy, starting 30 days before each movie was released:
[chart: daily Wikipedia page views, Ant-Man vs. Guardians of the Galaxy]
(There’s a lot of volatility in the original Wikipedia data, so each day’s total page views is actually a 5-day rolling average – same day + 2 prior + 2 future days).
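(In pandas, that kind of centered rolling average is essentially a one-liner, assuming a series of daily view counts indexed by date.)

```python
import pandas as pd

def smooth(views: pd.Series) -> pd.Series:
    """5-day centered rolling average: same day + 2 prior + 2 future days."""
    return views.rolling(window=5, center=True, min_periods=1).mean()
```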
It’s pretty clear that more people were reading the Guardians of the Galaxy Wikipedia article before its release date than the Ant-Man Wikipedia article. As it turns out, Ant-Man is the least popular pre-release Wikipedia article of the last few Marvel releases (I also threw in Jurassic World … because I was interested):
[chart: pre-release Wikipedia page views for recent Marvel releases, plus Jurassic World]
But what does it mean? Does it really matter that no one is reading Ant-Man’s Wikipedia page?
Well… it matters a little. Wikipedia page views are a pretty solid proxy for audience interest and awareness of a film. Some academic studies have even shown that they’re a statistically significant predictor of a film’s box office gross (this blog will show that, too, just … soon. Not quite ready yet. Data is hard!).
So given that this data has some predictive power, it does seem like Ant-Man may be in trouble. Both Guardians of the Galaxy and Captain America: Winter Soldier went into theaters getting more than 20,000 daily Wikipedia page views. Both had opening box office grosses above $90 million (domestic). Avengers and Jurassic got even more views, and opened with even more money.
Again, Wikipedia page views only carry a small bit of predictive power. They don’t tell the whole story. But they seem to do a good job of capturing broad interest in a film, and this interest often translates into boatloads of cold hard cash.
Or in the case of Ant-Man… maybe a half-boatload.
(At least it’ll probably be more than Trainwreck:)
[chart: Ant-Man vs. Trainwreck pre-release page views]
References
Box office information was gleaned from Box Office Mojo and Wikipedia
Silver Lake Parking Tickets
The City of Los Angeles recently released a dataset that contains parking citations from 2012-2015.
The Data
This is a pretty amazing dataset. When it comes to parking, we all constantly wonder how far we can toe the line – do you really need to feed the meter right before parking restrictions end? Do you actually need to move for street sweeping? If the car is a-rockin’, is it true that the police are legally prohibited from coming a-knockin’? Well, now we have a way to really answer at least some of these questions.
The Problem
Angelenos get a ton of tickets – there are at least 7.3 million citations since 2012. That’s a ton of data to analyze and there are many ways to break it down. As a start, I decided to keep it simple. I wanted to examine how tickets were assigned in the Silver Lake area of Los Angeles. More specifically: exactly how strict is the enforcement of the Permit Parking zone in my neighborhood?
The Tickets
I downloaded the raw ticket data and filtered out the non-Silver Lake tickets. The ticket data comes with the date and time at which each ticket was issued, so I compressed all tickets into the same 24-hour window and then dropped them on a map. This makes it possible to visualize the various times of day that parking citations are issued:
(The sliding-scale is the time of day; this map shows all tickets from 2012-2015. “Preferential Parking” tickets are given to vehicles that do not have visible permits during the permit-only window from 11pm-6am).
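(Here's a rough sketch of that "collapse everything onto one clock" step. The real data's columns and timestamp format are messier than this assumes.)

```python
import pandas as pd

tickets = pd.read_csv("silver_lake_tickets.csv")  # assumed column: issue_time like "2315"

# Normalize every ticket onto a single 24-hour clock: keep only hour and minute,
# expressed as minutes after midnight, regardless of the original date.
t = tickets["issue_time"].astype(str).str.zfill(4)
tickets["minutes_after_midnight"] = t.str[:2].astype(int) * 60 + t.str[2:].astype(int)

# e.g. the share of "Preferential Parking" tickets written between 11:00pm and 11:30pm
# (the violation column name is also an assumption):
pp = tickets[tickets["violation"].str.contains("PREFERENTIAL", case=False, na=False)]
in_window = (pp["minutes_after_midnight"] >= 23 * 60) & (pp["minutes_after_midnight"] < 23 * 60 + 30)
print(f"{in_window.mean():.0%} of preferential parking tickets land between 11:00pm and 11:30pm")
```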
And holy cow! That’s a huge explosion of “Preferential Parking” citations at 11pm! It’s amazing how many tickets are assigned right as the permit-only window begins. But it’s more remarkable how *few* tickets are given out after 11:30pm. Even though permits are required until 6am, it really seems like enforcement goes to bed at midnight.
Here’s another look at when tickets are given out:
[chart: preferential parking tickets by time of day]
92% of preferential parking tickets are issued before midnight! 83% before 11:30pm! There have been no recorded preferential parking tickets assigned from 3:30am to 6am!
So - the next time you are enjoying a post-punk noise show at The Satellite, and you glance at your Samsung Gear S Smartwatch to see it’s already midnight, and you’ve illegally parked your PT Cruiser on West Silver Lake Blvd (full disclosure, I have no idea who my readers are) – chill. You’ve either already gotten a ticket or you’re in the clear.
Notes and References
The raw ticket data is pretty messy – the timestamps are weird and the coordinates are in an esoteric, California-centric projection. You can find the script I used to clean up and filter the data on my GitHub page.
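(For anyone reproducing this: the coordinate cleanup can be done with pyproj. I believe the source coordinates are NAD83 / California State Plane Zone 5 in US feet -- EPSG:2229 -- but verify that against the dataset's documentation before trusting the sketch below.)

```python
from pyproj import Transformer

# Assumption: source coordinates are NAD83 / California State Plane Zone 5,
# US survey feet (EPSG:2229); target is plain WGS84 lat/lon (EPSG:4326).
transformer = Transformer.from_crs("EPSG:2229", "EPSG:4326", always_xy=True)

def to_latlon(x_feet, y_feet):
    lon, lat = transformer.transform(x_feet, y_feet)
    return lat, lon

# e.g. to_latlon(6471857.9, 1842338.5)  # hypothetical State Plane coordinates near LA
```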
It was incredibly easy to make a fancy map with CartoDB. The only problem was the time attribute kept getting automatically shifted to UTC time. I wound up hacking through this (I added 7 hours to the raw timestamps). So, uh, if you click through to my underlying dataset on CartoDB, that’s why it has some weird columns. (I believe this is the issue being tracked by CartoDB here).
7.3 million is a lot of tickets! @shalomsweetheart should feel encouraged that she’s only responsible for ~0.0001% of all tickets.