#spark-jobserver
spark-jobserver 0.7.0
#Scala #akka @ApacheSpark
Python job support! (mostly @mattinbits, some @CalebFenton)
Some settings like job-jar-paths renamed to job-bin-paths (@nikkiSproutling)
New Scala API! The expanded API passes in a JobEnvironment, which provides access to the job ID and context information; it also offers more type safety for the job return type, validation, etc. (a rough sketch appears after this list) (@velvia, others)
Spark 2.0 support (see spark-2.0-preview branch) (@f1yegor, @addisonj)
Changed minor HTTP responses to all be JSON (@roarking)
Deprecate JobFileDAO. H2 is now the default. Furthermore, H2 and JobFileDAO cannot be used with context-per-jvm. (@noorul)
Context creation API (POST /contexts) now takes a config in the POST body! (@casey-green)
Make the create job API return more information (@noorul)
Upgrade Slick to 3.1.1 (@hntd187)
Spark driver cluster mode for Mesos (@LeightonWong)
Fix broken links to ec2 scripts, #369 (@noorul)
EC2 VPC deploy fixes (@mcnels)
Only set spark.executor.uri if env var is set (@addisonj)
Add Scala version to Docker image (@mariussoutier)
Integrate coverage into CI using the codecov.io service (@hntd187)
Remove akka dependency from api module (@f1yegor)
Eliminate POST and DEL /job race conditions (@addisonj)
Improve Chinese document (@stanzhai)
Return error if data file can't be deleted (@CalebFenton)
Make dbcp optional, default: disabled (@noorul)
Fix for logging issue in dev mode, #475 (@noorul)
Fix flaky tests (@TimMaltGermany)
Increase size of config/input that can be submitted via custom Akka serializer (@rmettu-rms)
Update build plugins (@hntd187)
README fix (@Vincibean)
Ensure Scala compiler dependency has correct version (@aganim-chariot)
Docs for YARN queue config option (@ash211)
Make JMX port configurable (@casey-green)
Update test description, PR #481 (@oranda)
Per-user authenticated contexts, PR #469 (@TimMaltGermany)
Fix for Delete file API issue, #507 (@noorul)
Fix UI not showing running jobs if completed jobs fill the limit, #547 (@TianLangStudio)
Fix Jar name issue on Windows (@TianLangStudio)
Forked JVM processes must have their own JobDAO, #353, First step towards HA (@noorul)
Change cluster status of removed contexts to down (@derSascha)
Include Python exception stacktrace on failure (@CalebFenton)
UI - Use relative paths so it works if running in a context-path (@sjoerdmulder)
Bash style fixes (@sks)
EC2 deployment script fixes (@mcnels1)
Python Job API - Format standard output from process (@windelinckx)
Fix test issues on Windows (@doernemt)
Use OpenJDK for Docker base image (@xjodoin)
Cleanup LDAP authentication and add filter config. (@derSascha)
Giter8 template available at https://github.com/spark-jobserver/spark-jobserver.g8 (@noorul)
Delete binary API (@f1yegor)
Fix start-manager.sh for Docker environments (@fmcgough)
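The new job API mentioned above, roughly sketched as a word-count job (the exact trait and helper names here — SparkJob, JobEnvironment, SingleProblem and the scalactic-based validate signature — are from my reading of the 0.7.0 API and may differ slightly in detail):

    import scala.util.Try
    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import org.scalactic._
    import spark.jobserver.api.{JobEnvironment, SingleProblem, SparkJob, ValidationProblem}

    object WordCountJob extends SparkJob {
      // The new API lets each job declare its own input and output types.
      type JobData = Seq[String]
      type JobOutput = collection.Map[String, Long]

      // runJob now receives a JobEnvironment carrying the job ID and context info.
      def runJob(sc: SparkContext, runtime: JobEnvironment, data: JobData): JobOutput =
        sc.parallelize(data).countByValue()

      // validate parses and checks the input config, returning typed JobData or problems.
      def validate(sc: SparkContext, runtime: JobEnvironment, config: Config): JobData Or Every[ValidationProblem] =
        Try(config.getString("input.string").split(" ").toSeq)
          .map(words => Good(words))
          .getOrElse(Bad(One(SingleProblem("No input.string config param"))))
    }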
spark-jobserver is a REST job server for Apache Spark. It provides Spark as a service.
Creating A Spark Server For Every Job With Livy
One of the frustrations most people who are new to Spark have is how exactly to run Spark. Before running your first Spark job you're likely to hear about YARN or Mesos, and it might seem like running a Spark job is a world unto itself. This barrier to entry makes it harder for beginners to imagine what's possible with Spark. Livy provides a RESTful interface to Apache Spark: it hides some of the details of Spark's execution mechanics and lets developers submit programs to a Spark cluster and get results. This post is a summary of my notes on using Livy to send jobs queued from webhooks to a Spark cluster. My aim is to talk about the benefits and drawbacks of using this setup, as well as to give a small tutorial on Livy. If you're interested in using Livy, the documentation is excellent.
The Basics
When running a Spark job, you typically submit it via a Spark shell or spark-submit. This can be in Python or Scala, but running a Spark job looks something like this:
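Something along these lines — a minimal, self-contained job that gets packaged into a jar and handed to spark-submit (the class name, paths and master setting are just for illustration):

    // Build into a jar (e.g. with `sbt package`), then run with something like:
    //   spark-submit --class example.WordCount --master local[2] target/scala-2.11/wordcount_2.11-0.1.jar input.txt
    package example

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        val counts = sc.textFile(args(0))   // read the input file passed on the command line
          .flatMap(_.split("\\s+"))         // split lines into words
          .map(word => (word, 1))
          .reduceByKey(_ + _)               // count occurrences of each word
        counts.take(10).foreach(println)
        sc.stop()
      }
    }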
There are some exceptions, notably if you're working in a notebook context like Jupyter Notebook, Zeppelin or Beaker Notebooks. In these cases, the notebooks are bound to a Spark shell, so you can run jobs dynamically instead of submitting Jar files or Python files.
In either context, you need to have a Spark Context (either created in the notebook or within the file submitted to the shell), and your code is isolated to your environment. This is fine for most workloads and for development, but it limits the kinds of programs you can write in Spark and the number of services that can communicate with Spark.
For example, if we built a regression model in Spark and wanted to run live data through it, it's not immediately obvious how we'd do that, or over what protocol. It all seems too boxed in and tightly coupled to the machine it's running on. That's where Livy is helpful. Livy exposes a REST endpoint to Spark, allowing you to submit jobs from anywhere*. How it accomplishes this is a bit tricky, and I'll walk through the mechanics of it.
Mechanics
Spark doesn't have a RESTful protocol to its engine; however, with a little work you can create a REST API server that translates Python, Scala or R code into Spark job lingo and returns the results. This is essentially what Livy does (forgive the oversimplification). This allows us, for example, to write a DSL that submits Spark jobs over REST and gets data back. (There are other ways to go about this, like MLeap, that I'll cover in a future post.)
The power of doing this should be immediately obvious, but the drawbacks might be as well. I worked through two examples to explore the API behind Livy and then to try and actually use REST to do something interesting.
A RESTful Endpoint Example
My first example is just an endpoint that squares the integer it receives in a POST request. For example, POST /2 would reply with 2^2 = 4. I chose this deliberately, given the complexity of putting one of these endpoints together. My example is in Scala, but you could do the same thing in PySpark or SparkR. Here is the endpoint. I commented in the code about each part and what it's doing; I find that much easier than posting the code and explaining it after the fact:
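The Livy-facing core looks roughly like this (the real version wraps it in an HTTP route and parses the JSON properly; here I cheat with string matching, assume the Livy server is on localhost:8998, and reuse an already-created session 0):

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    object LivySquare {
      // Livy's REST endpoint. A session is created once at startup with
      // POST /sessions and body {"kind": "spark"}; here I assume it already exists as session 0.
      val livyUrl = "http://localhost:8998"

      // Tiny helper: fire an HTTP request at Livy and return the response body as a string.
      private def http(method: String, path: String, body: Option[String]): String = {
        val conn = new URL(livyUrl + path).openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod(method)
        conn.setRequestProperty("Content-Type", "application/json")
        body.foreach { b =>
          conn.setDoOutput(true)
          conn.getOutputStream.write(b.getBytes("UTF-8"))
        }
        Source.fromInputStream(conn.getInputStream).mkString
      }

      // Squares n by shipping a one-line Scala statement to the session,
      // then polling the statement until Livy reports it as available.
      def square(n: Int): String = {
        val posted = http("POST", "/sessions/0/statements", Some(s"""{"code": "$n * $n"}"""))
        val stmtId = """"id"\s*:\s*(\d+)""".r.findFirstMatchIn(posted).map(_.group(1)).get
        var result = http("GET", s"/sessions/0/statements/$stmtId", None)
        while (!result.contains("\"available\"")) {   // a real client would parse the JSON state field
          Thread.sleep(500)
          result = http("GET", s"/sessions/0/statements/$stmtId", None)
        }
        result  // the squared value comes back under output.data."text/plain"
      }
    }

The HTTP endpoint itself then does little more than call square(n) for whatever integer it receives and hand the result back.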
Predict My Weight
In order to do something a bit more applicable to an actual workload, I created a silly model. The model predicts what my weight will be one week from today, based on how many calories I ate and how many calories I burned today. It's wildly inaccurate, but good enough for the purposes of this blog. I enter my weight and calories burned in a Google Sheet, and I use Microsoft Flow to trigger an HTTP event that fires to my Livy server and calculates my weight.
Here is a rough sketch of what will happen.
This will work a little differently from the example I shared above. Instead of writing a Scala HTTP client, I can just make a POST request from the Microsoft Flow HTTP client. I won't walk through how to do that, as the above example already illustrates it and the UI is intuitive. Essentially, I'll add my weight and the calories I burned today into a spreadsheet; that will trigger an event to predict my weight and add it as a new column in a separate spreadsheet. Here is the function:
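In spirit it's just one statement that loads the saved model and scores a single feature vector — roughly this (the model path, feature order and use of MLlib's LinearRegressionModel are placeholders for whatever the training job actually produced):

    // Sent as the "code" field of a Livy statement; `sc` is the SparkContext Livy provides.
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LinearRegressionModel

    // Current weight and calories burned today, substituted in from the Flow HTTP request.
    val weight = 199.0
    val burned = 1500.0

    // A (very rough) regression model trained earlier and saved to disk.
    val model = LinearRegressionModel.load(sc, "/models/weight-in-a-week")

    // Predicted weight one week from now.
    model.predict(Vectors.dense(weight, burned))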
I entered 199 as my current weight and 1500 as calories burned today (both fake numbers), and it predicted my weight would be 188.99 a week from now.
Summary
Livy provides an interesting way to use Spark as a RESTful service. In my opinion, though, this is not an ideal way to interact with Spark. There is just a tad too much language-interoperability overhead to make it worth it. For starters, sending strings of Scala code over the wire doesn't inspire a lot of confidence. It's also not immediately clear what advantage executing pre-defined JAR files over REST has. On the positive side, I expected something much slower than what I got out of Livy. For a use case as contrived as the one I made up for this blog it's pretty solid, but the model in general might be hard to scale and reason about.
Notes
Microsoft Flow is very cool. I know it seems like an IFTTT clone, but with the ability to send HTTP requests and web hooks it’s much more customizable. Also the free tier is much more generous than something like Zapier.
This stuff takes forever to configure and use the first time.
There are some alternative projects that aim to accomplish the same task as Livy, most notably spark-jobserver, which I think is a little bit easier to use, but I didn't find out about it until long after I started experimenting with Livy. If anyone would be interested in a tutorial about that, feel free to let me know.
spark-jobserver 0.6.2
#Scala #akka @ApacheSpark
1 context per JVM. This forks off a separate process for each SparkContext, and is a major change in architecture. (EXPERIMENTAL) (@velvia, @Ankit1010, many others)
Update to Spark 1.6.0 (@RaafatAkkad, @apivovarov)
NamedObjects, a generalization of NamedRDDs to other types of objects (@TimMaltGermany / KNIME)
NamedBroadcasts (@CBribiescas)
Per-job dependent jars (@amarouni)
DEBUG_PORT env var allows remote debugging (@addisonj)
UI links to status/config (@h0ke)
Fix bug: invalid error with no user/pass, #329 (@slater-ben)
Chunked stream encoding for Stream[_] job output (@leone17)
Workaround - custom error classes for context-per-jvm (@ishassan)
Spark Hadoop configuration in spark.context-settings.hadoop (@hntd187)
Support user impersonation for an already Kerberos authenticated user (@mateilucianstefan)
Expose MAXDIRECTMEMORY to adjust direct memory buffer size (@addisonj)
Expose JAVA_VERSION for building with say JDK8 (@stannie42)
Fix docker build (@stannie42)
Redirect /dev/null to stdin in server_start.sh (@apivovarov)
Database migration scripts and better table definitions (@v-gerasimov)
Chinese doc updates (@zp-j, @liujuncheng)
Minor improvements in style (@apivovarov)
SBT and other dependency/plugin upgrades, including to Mesos 0.28 (@hntd187, @jaceklaskowski)
README change to document overriding default meta data DB backend (@noorul)
README typo fix (@qayshp)
spark-jobserver is a REST job server for Apache Spark. It provides Spark as a service.
spark-jobserver 0.4.0
#Scala #akka @ApacheSpark
Updated for Spark 1.0.2
spark-jobserver is a REST job server for Apache Spark. It provides Spark as a service.
spark-jobserver 0.6.1
#Scala #akka @ApacheSpark
New route POST /contexts?reset=reboot to close and restart all contexts defined in config (@wildwood)
CORS support and EC2 deploy scripts (@David-Durst)
Updated to Spark 1.5.2 (@zeitos, @apivovarov)
Updated Docker container, much smaller (based on jre-java7) with configurable Mesos version
Return job status uniformly for both /jobs and /jobs/ routes (@noorul)
Fix issue #247: Allow non-computed named RDDs (@TimMaltGermany / KNIME)
Fix for PostgreSQL varchar issue (@ffarozan)
Update Yarn docs, set --master yarn-client in start script (@lant)
Update Spray to 1.3.3 (@v-gerasimov)
Allow Web API timeouts to be configurable via short-timeout (@v-gerasimov)
Chinese documentation at doc/chinese/ ! (@linrunzhang)
spark-jobserver is a REST job server for Apache Spark. It provides Spark as a service.
spark-jobserver 0.6.0
#Scala #akka @ApacheSpark
Default to Spark 1.4.1 (@zeitos)
LDAP Authentication support via Apache Shiro (@TimMaltGermany / KNIME)
HTTPS Support! (@koettert @TimMaltGermany / KNIME)
Load jars automatically on startup (@lossyrob / Azavea)
/data routes (The KNIME team)
JavaSparkJob
Fix server-stop.sh so it works with spark-submit (@addisonj @RussellSpitzer @jmelching)
Upgrade to spray-json 1.3.2 (@v-gerasimov)
Upgrade joda-time to 2.2 (@jwalgran)
Docker fixes, such as for logging (@MeiSign)
Fixes to context config merging (@zeitos)
JobSqlDAO to support username, password, DB connection pooling (@ffarozan)
Doc: Instructions for running job-server docker container in yarn-client (@MeiSign)
Doc: Various fixes to README (@noorul) and deploy instructions (@mvle @lewismc) and troubleshooting (@eranation)
Fix issue #229: Always use limit while reading from DB (@noorul)
Fix typo in SparkStreamingJob that used to be called SparkStramingJob. This will break your code if you used SparkStramingJob (@zeitos)
spark-jobserver is a REST job server for Apache Spark. It provides Spark as a service.
spark-jobserver 0.5.2
#Scala #akka @ApacheSpark
Spark streaming context support!! (@zeitos)
Change server_start.sh to use spark-submit. This should fix some edge case bugs.
Configurable driver memory (@acidghost)
Be able to accept environment vars in job server config files, eg master = ${?MY_SPARK_HOST} (see the Typesafe Config docs)
spark-jobserver is a REST job server for Apache Spark. It provides Spark as a service.