Hosting an R Shiny Application on Amazon EC2


**For those who missed the first part of this series, read that blog post first for details about the NFL Play Predictor.**

In the second part of this three-part series I will discuss how I host the Shiny app on an Amazon EC2 instance, using Amazon's Route 53 DNS service to set up a custom domain name. The third and final post will dig into the methodology behind the models that support the site.

EC2 Setup

To start, either create or log in to your Amazon Web Services account and navigate to the EC2 service. From there, click the launch instance button in your region of choice.


Select your desired AMI. I use Ubuntu Server 16.04, and all code used in this post reflects that choice.


From there select an EC2 instance type. See the EC2 pricing scheme to select the instance right for you. R is very memory intensive, so memory optimized instances will have better performance. I’m using the r3.large instance, which at current prices costs a little over $100 monthly.


Stealing from this excellent tutorial – the next steps are as follows:

Click the grey “Next: Configure Instance Details” button. You won’t need to change any of the default settings here.

Click the grey “Next: Add Storage” button. You won’t need to change any of the default settings here.

Click the grey “Next: Tag Instance” button. You won’t need to change any of the default settings here.

Click the grey “Next: Configure Security Group” button. Shiny uses port 3838 for inbound web connections, so we’ll need to add a rule for this. Click the “Add Rule” button and on the “Custom TCP Rule” line, set the “Port Range” to 3838 and the “Source” to “Anywhere.”

Click the blue “Review and Launch” button. You won’t need to change any of the settings here.

Click the blue "Launch" button. A window will now appear prompting you to "Select an existing key pair or create a new key pair." The key is what you will use to log in to your Linux server; without it you won't be able to access your server and you'll need to start over (losing all of your work).

In the first dropdown select “Create a New Pair,” and in the “Key pair name” box type whatever you would like to name your pair, I will use “test_pair.pem”. For now, just keep this file in your downloads folder.

Congratulations on launching your instance! Now it’s time to load your Shiny app!

SSH into the EC2 Instance

The first step is to connect to your instance via SSH. Click on the blue "Actions" dropdown and select "Connect," or just click the "Connect" button.


The next page gives detailed instructions on how to connect. As indicated, the first step is to ensure your .pem file is not publicly viewable.


Open up a terminal window, change the directory to your Downloads folder (or wherever you put the .pem file), and type the following:

chmod 400 your_pem_file_name.pem

From there, on that same screen, copy the command starting with "ssh -i" immediately under the "Example" line in section 4. You're now in control of the EC2 instance!

To make this easier in the future, set up an SSH config file so you can connect with a simple command (use whatever text editor you prefer; I'm using Vim here):

vim ~/.ssh/config

Then create a file with the following information (replace the "insert" placeholders with the values specific to your instance):

Host *
    ServerAliveInterval 240

Host insert_your_name
    HostName ec2-<your-ec2-ip>
    User ubuntu
    IdentityFile ~/.ssh/your_pem_file_name.pem
    ForwardX11 yes
    ForwardX11Trusted yes

Install R, Shiny Server & Required R packages into your EC2 instance

Now the next step is to install R & all R packages required to run your Shiny app. Below is the code I used for my application:

sudo bash -c "echo 'deb trusty/' >> /etc/apt/sources.list"

sudo apt-get update

sudo apt-get install r-base

sudo apt-get install nginx

Next, install Shiny Server. Note that the Shiny Server version number will change in the future.

sudo apt-get install gdebi-core


sudo gdebi shiny-server-

## Relevant R packages

sudo Rscript -e 'install.packages("shiny", repos = "https://cloud.r-project.org/")'

sudo Rscript -e 'install.packages("dplyr", repos = "https://cloud.r-project.org/")'

sudo Rscript -e 'install.packages("ggplot2", repos = "https://cloud.r-project.org/")'

sudo Rscript -e 'install.packages("xgboost", repos = "https://cloud.r-project.org/")'

sudo Rscript -e 'install.packages("tidyr", repos = "https://cloud.r-project.org/")'

For reference, if you're using xgboost you may run into installation failures due to memory constraints. To get around this, use an EC2 instance with more memory available.

Configure Shiny Server & the Shiny Application

To test whether the installation was successful, enter your instance's public IP address followed by :3838 into your web browser (your-ip-address:3838), and you should see the following screen.

[Screenshot: the Shiny Server welcome page]

Now it's time to load all relevant files onto the EC2 instance using secure copy (scp in the terminal). To do this, again click "Connect" in the EC2 console to pull up the same screen as before, and from the same line of code used to SSH into the instance, select the portion of the line starting at ubuntu@.

Open a new terminal window, navigate to the location on your machine that houses your Shiny app code and .pem file, and type the following to upload the ui.R file associated with your Shiny application:

scp -i your_pem_file.pem ui.R ubuntu@<your-ec2-address>:~

Repeat this process for all files that you need for your Shiny application. I also loaded a custom index.html file containing instructions for how to use the application.

From this point forward it is up to you how you would like to present the application to users. The sample apps provided embed the applications within the sample index.html page. As mentioned above I chose to create a custom index.html page containing instructions on how to use the application with a link to the Shiny app included, so that is what the instructions below produce.

Go back to the terminal window ssh’d into your EC2 instance and change your file location to /srv/shiny-server. From there, remove the sample index.html file & applications and create a new folder in the same location:

sudo rm index.html

sudo rm -r sample-apps

sudo mkdir insert-folder-name

Then go back to the home directory within the EC2 instance & copy all the files loaded up via scp into the new folder location

cd /home/ubuntu/

cp -r * /srv/shiny-server/insert-folder-name/

cp index.html /srv/shiny-server/

The second command will copy all files into the newly created folder, and the third will copy the index.html file into the shiny server directory. Now put the link to your IP address & port 3838 back into the browser (your ip address:3838), and you should see your shiny application!

Creating a Custom Domain for your Application

To take this application to the next level, let's give it a custom domain name. To start, purchase & register the domain name of your choosing using Amazon's Route 53 service. Next, assign an Elastic IP to your EC2 instance; an Elastic IP lets you mask the failure of an instance or software by rapidly remapping the address to another instance in your account. Finally, follow Amazon's Route 53 documentation to direct traffic for your chosen domain name to your Shiny application.

You'll also need to open port 80 in addition to port 3838 so the public can reach your application over the standard web port. Click on your security group (a selection in the "Description" section of the EC2 console), and add the following rule:

[Screenshot: security group rule opening port 80 (HTTP) to Anywhere]

To ensure this works, run the following code to see what exists in the shiny-server.conf file.

cd /etc/shiny-server

sudo vim shiny-server.conf

If the block below is not in there, add it, leaving the existing text (which references port 3838) untouched.

# Instruct Shiny Server to run applications as the user "shiny"
run_as shiny;

# Define a server that listens on port 80
server {

  listen 80;

  # Define a location at the base URL
  location / {

    # Host the directory of Shiny Apps stored in this directory
    site_dir /srv/shiny-server;

    # Log all Shiny output to files in this directory
    log_dir /var/log/shiny-server;

    # When a user visits the base URL rather than a particular application,
    # an index of the applications available in this directory will be shown.
    directory_index on;

  }
}



You should now be able to access your application using your custom domain! This is what my site looks like:

[Screenshot: the finished site]

All of the code powering the site can be found on my GitHub page.

Reach out to @WesleyPasfield on Twitter, or wesley dot pasfield at gmail dot com via email with any questions!

Predicting NFL Plays with the xgboost Decision Tree Algorithm



In all levels of football, on-field trends are typically discerned exclusively through voluminous film study of opponent history, and decisions are made using anecdotal evidence and gut instinct. These methods in isolation are highly inefficient and prone to human error.

Enter the play predictor. This tool aims to enhance in-game NFL decision making by predicting, in real time and at high accuracy, the type of play the opposing team will run. On average the tool predicts pass or run with 73.6% accuracy, with performance varying by the teams playing and, mostly, by game situation.

This tool is able to predict in close to real-time, meaning you can follow along during a game and generate accurate predictions of what type of play will be run next, where it will go, how that decision compares to all other teams in the NFL, and what the likelihood of success is (as defined by scoring a touchdown or getting a first down).

This post will focus on the results of the study, touching on some of the technical methodology behind the tool as well as details on how to use it. Later posts will go into greater technical detail about the modeling process as well as the operational side of hosting the site and generating real-time predictions using the AWS tech stack.


Data was collected from two primary sources: the nflscrapR package for all variables outside of offensive formation, with formation data pulled from a separate site. For future versions, offensive formation data will be pulled directly from the NFL API thanks to a tip from MLB's Daren Willman. All data manipulation, analysis, modeling & visualization was completed using R. The web app was developed using Shiny in R, and published using an AWS EC2 instance (since taken down).

Five models were created leveraging the xgboost algorithm, with different evaluation metrics depending on the target variable. The models predicting play type, TD/first down likelihood, & TD drive likelihood were all binary classifiers with a logloss evaluation metric, while the play direction model was a multi-class classifier also leveraging the logloss evaluation metric. The variables included in the models are offensive formation, down, distance, score differential, quarter, time remaining, offensive & defensive teams, a home team binary indicator and team timeouts remaining. The full training & testing sample includes all offensive plays from 2013-2016. Only plays where the team with the ball lined up in an offensive formation were considered (so special teams plays were ignored).
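For readers unfamiliar with the metric, here is what binary log loss looks like in base R. This is an illustrative sketch, not the project's actual model code; the toy inputs are made up:

```r
# Binary log loss: lower is better; confident wrong predictions are heavily penalized
log_loss <- function(actual, predicted, eps = 1e-15) {
  predicted <- pmin(pmax(predicted, eps), 1 - eps)  # clip away from exact 0 and 1
  -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}

# Toy example: actual play types (1 = pass, 0 = run) vs. predicted pass probabilities
log_loss(c(1, 0, 1, 1, 0), c(0.9, 0.2, 0.7, 0.6, 0.4))
```

This is the quantity xgboost minimizes during training when the `logloss` evaluation metric is selected for a binary classifier.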


The play type model is able to predict whether a team is going to run or pass with 73.60% accuracy across all situations on the held out testing set. The rate fluctuates by team & by situation; the chart below is an example of variable performance. The increasing model performance from the first to the second quarter indicates that offenses grow predictable over the course of the first half. The decline in the third quarter is an indication that teams do make “half-time adjustments,” while the increase in predictability in the fourth quarter is likely driven by game situation.

[Chart: play type prediction accuracy by quarter]

There was also highly variable prediction performance by team. The chart below shows a boxplot of play type prediction performance on the held out test set for all teams from 2013-2016. On its surface lack of predictability seems like an asset, but in the case of the 2016 49ers that just reflects their ultra-conservative run-first approach, which despite its novelty did not yield positive results.

[Chart: boxplot of play type prediction accuracy by team, 2013-2016]

Pundits often talk about “playing to the sticks” and avoiding third & long scenarios that make the offense one-dimensional. That notion is borne out in this tool, as teams are much more predictable on 3rd down (and 4th down, when not punting or kicking) than they are on 1st & 2nd down, as seen in the chart below.

[Chart: play type prediction accuracy by down]

Offensive formation is also an important differentiator in prediction success. It’s harder to predict what a team is going to do when the QB is under center compared to in the shotgun. Part of this is due to game situation (you are more likely to be in shotgun in obvious passing situations), but it’s an important consideration when evaluating likelihood of play type.

[Chart: play type prediction accuracy by offensive formation]

The play direction predictor is quite weak, as shown by the relatively even predictions across the six different options, but the TD drive, TD play & First Down play predictors performed at the same accuracy level and often slightly higher compared to the play type model. The win model performs at over 95% accuracy; obviously it offers significantly better performance later in games.


The app allows users to leverage these models and simulate any NFL in-game scenario to understand what the offense is likely to do, what the likely outcome of the play will be and how the selected offensive team's decision making compares to all other teams in the league.

The built-in scenario highlights how variable strategies are for each NFL team. Playing the Denver Broncos & starting the first quarter with the ball on the 50-yard line, the 2016 Arizona Cardinals are predicted to pass 76.39% of the time compared to just 41.96% for the 2016 Buffalo Bills.

[Screenshot: predictions for the built-in scenario]

The nature of football forces coaches to make complex decisions about highly variable situations in real-time. Often coaches rely on anecdotal evidence or gut instinct to make decisions, which can reduce their team’s chance of winning. This tool aims to complement that decision-making process through advanced statistical modeling. It’s also a fun game-watching companion for the statistics-inclined fan!

Any Questions? Reach out to @wesleypasfield on Twitter, or wesley dot pasfield at gmail dot com.


NFL Teams are Terrified of Going for Two

Time and time again we see football coaches make risk averse decisions because it will protect them from criticism. The most obvious of these choices is whether to go for it on 4th & 1 when in FG range, on 4th & 2-5 between the other team’s 40-50 yard line, and my personal favorite, whether to go for two.

Announcers will almost always say coaches “shouldn’t chase the points” early in the game, or should “kick the extra point and make it a one-possession game” when a team scores a touchdown down 15 to get within 8. Both of these assertions, notably the second, can be objectively incorrect, yet if coaches made the opposite decisions they would almost certainly be vilified in the media when unsuccessful.

During the 2016 season, the success rate on extra points was 93.6%, meaning on average teams scored 0.936 points per extra point attempt. The success rate on two point conversions was 48.6%, so on average teams scored 0.972 points per two point conversion attempt. Based on this alone, teams would score more if they went for two every time.
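The arithmetic behind that claim is simple, and it also gives the break-even conversion rate. A quick sketch using the 2016 rates cited above:

```r
# Expected points per attempt using the 2016 success rates cited above
xp_rate  <- 0.936  # extra point success rate
two_rate <- 0.486  # two point conversion success rate

ev_xp  <- 1 * xp_rate   # expected points from kicking the extra point
ev_two <- 2 * two_rate  # expected points from going for two

# Going for two is the higher-expected-value choice whenever a team's
# conversion rate exceeds half its extra point rate
break_even <- xp_rate / 2

c(ev_xp = ev_xp, ev_two = ev_two, break_even = break_even)
```

At these league-average rates the break-even two point rate is 46.8%, just below the observed 48.6%.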

Obviously looking at this purely as an average over the course of the whole season across all teams is not fair for every team’s individual situation. If your team has a terrible offense and an amazing kicker, your odds may flip in the other direction, but if you have a top tier offense why not consider going for two every time? Or more specifically, say your team has a solid offense that struggles to make big plays – why not go for two each time to maximize the potential of each scoring opportunity? Either way, there’s a discussion to be had and a coach can make a rational argument in either direction in almost all situations.

On the other hand, the second scenario of kicking the extra point to go down 8 points, especially well into the 4th quarter, is often objectively the wrong decision. Everyone agrees that down 8 a team needs a touchdown and two point conversion to tie the game. If you believe that a team’s chance of actually converting the two point conversion is the same no matter when the play occurs, then postponing that 2 point conversion until later in the game serves no purpose.

Let's operate under the scenario that the team will not convert the two point conversion. If the team decides to go for two after the first score and doesn't get it, they will be down 9 and the coach will get destroyed by the announcers since it is no longer a "one-possession game." But in reality, the coach now knows he needs two scores in order to win the game and plays accordingly, likely taking more risks & playing faster. If the coach waits to go for two later and doesn't convert, he will likely have coached as if his team only needed one more score to tie the game, which diminishes their chances of winning.

This can easily be described through a practical situation. A team is down 28-13 and scores a touchdown to go down 28-19 with 4 minutes left. The coach subsequently decides to kick the extra point leaving his team at a 28-20 deficit. His defense then produces a stop and the offense gets the ball back and methodically drives all the way down the field burning all their time-outs in the process and scores a touchdown as time expires to go down 28-26. Unfortunately, the two point conversion comes up short and the game ends. No one bats an eye or considers criticizing the coach.

In an alternate reality, the coach decides to go for two right away and fails to get it, leaving his team down 28-19. His defense gets a stop and the offense scores quickly conserving all their timeouts to go down 28-26 with 1:30 remaining. The coach then tries an onside kick, fails to get it and ultimately comes up short when the opposing team gets the first down they need to end the game. The coach is universally criticized for his decision to go for two early.

Everyone would agree that being down 28-26 with 1:30 remaining and all your timeouts is better than being down 28-26 with no time left, correct? The information gained from attempting the two point conversion earlier gave the alternate-reality coach a better chance of winning than the first coach who followed convention. Going for two early gives you the highest probability of the worst outcome, scoring no points after the touchdown, but it also gives you the highest probability of the best outcome, and allows you to coach fully understanding the nature of the situation. The article below does a better job explaining this concept than I ever could.

The day an announcer accurately assesses this situation will be a very, very happy day.


Derek Norris 2016 – A Season to Forget


While it may not be the most exciting Nationals story of the offseason, Wilson Ramos signing with the Rays and the subsequent trade for Derek Norris to replace him is a very big change for the Nats. Prior to tearing his ACL in September, Ramos was having an incredible 2016 and really carried the Nationals offense through the first part of the year (with the help of Daniel Murphy, of course) when Harper was scuffling and Anthony Rendon was still working back from last season’s injury. Given Ramos’ injury history it makes sense to let him walk, but Nationals fans have reasons to be concerned about Norris.

After a few seasons of modest success including an All-Star appearance in 2014, Norris batted well under the Mendoza line (.186) in 2016 with a significant increase in strikeout rate. What was the cause of this precipitous decline? Others have dug into this lost season as well; this article will focus on using PITCHf/x pitch-by-pitch data via the pitchRx package in R, as well as Statcast batted-ball data manually downloaded into CSV files and then loaded into R. Note that the Statcast data has some missing values so it is not comprehensive, but it still tells enough to paint a meaningful story.

To start, Norris' strikeout rate increased from 24% in 2015 to 30% in 2016, but that's not the entire story. Norris' BABIP dropped from .310 in 2015 to .238 in 2016 as well, while his ISO stayed relatively flat (.153 in 2015 vs. .142 in 2016). Given the randomness that can be associated with BABIP, this could be good news for Nats fans, but upon further investigation there's reason to believe this drop was not an aberration.

Using the batted-ball Statcast data, it doesn’t appear that Norris is making weaker contact, at least from a velocity standpoint (chart shows values in MPH):

[Chart: batted-ball velocity, 2015 vs. 2016 (MPH)]

Distance, on the other hand, does show a noticeable difference (chart shows values in feet):

[Chart: batted-ball distance, 2015 vs. 2016 (feet)]

So Norris is hitting the ball further in 2016, but with less success, which translates to lazy fly balls. This is borne out by the angle of balls he put in play in 2015 vs. 2016 (values represent the vertical angle of the ball at contact).

[Chart: vertical angle at contact, 2015 vs. 2016]

The shifts in distance & angle year over year are both statistically significant (velocity is not), indicating these are meaningful changes, and they appear to be caused at least in part by the way pitchers are attacking Norris.
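A year-over-year comparison like this can be checked with a Welch two-sample t-test in base R. The sketch below uses simulated data for illustration; the real analysis ran on the Statcast batted-ball CSVs, and the column names here are hypothetical:

```r
# Simulated stand-in for the Statcast batted-ball data (column names hypothetical)
set.seed(42)
batted <- data.frame(
  season   = rep(c(2015, 2016), each = 100),
  distance = c(rnorm(100, mean = 170, sd = 40),  # simulated 2015 distances (feet)
               rnorm(100, mean = 185, sd = 40))  # simulated 2016 distances (feet)
)

# Welch two-sample t-test: a small p-value indicates the shift in mean
# distance between seasons is statistically significant
t.test(distance ~ season, data = batted)$p.value
```

The same call, swapping in the angle and velocity columns, covers the other two comparisons.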

Switching to the PitchFx data, it appears pitchers have begun attacking Norris up and out of the zone more in 2016. The below chart shows the percentage frequency of all pitches thrown to Derek Norris in 2015 & 2016 based on pitch location. Norris has seen a noticeable increase in pitches in Zones 11 & 12, which are up and out of the strike zone.

[Chart: pitch frequency by zone, 2015 vs. 2016]

Norris has also seen a corresponding jump in fastballs, which makes sense given this changing location. This shift isn't as noticeable as location, but Norris has seen fewer change-ups (CU) & sinkers (SI) and an increase in two-seam (FT) & four-seam fastballs (FF).

[Chart: pitch type frequency, 2015 vs. 2016]

The net results are striking. The below chart shows Norris' "success" rate for pitches in Zones 11 & 12 (represented by "Yes" values, bars on the right below) compared to all other zones, for outcome pitches only, i.e. the last pitch of a given at-bat. In this case success is defined as getting a hit of any kind, and failure is any non-productive out (so excluding sacrifices). All other plate appearances were excluded.

[Chart: outcome-pitch success rate, zones 11 & 12 vs. all other zones]

While Norris was less effective overall in 2016, the drop in effectiveness on zone 11 & 12 pitches is extremely noticeable. Looking at the raw numbers makes this even more dramatic:

[Tables: raw counts of at-bats ending on pitches in zones 11 & 12, 2015 (left) vs. 2016 (right)]

So not only did more at-bats end on pitches in zones 11 & 12, Norris went a shocking 2 for 81 in these situations in 2016.

In short, Norris should expect a steady stream of fastballs up in the zone in 2017, and if he can't figure out how to handle them, the Nationals may seriously regret handing him the keys to the catcher position.

wERA – Rethinking Inherited Runners in the ERA Calculation


There are many things to harp on about traditional ERA, but one thing that has always bothered me is the inherited runner portion of the base ERA calculation. Why do we treat it in such a binary fashion? Shouldn’t the pitcher who allowed the run shoulder some of the accountability?

As a Nationals fan, the seminal example of the fallacy of this calculation was game 2 of the 2014 Division Series vs. the Giants. Jordan Zimmermann had completely dominated all day, and after a borderline ball-four call Matt Williams replaced him with Drew Storen, who entered the game with a runner on first and two outs in the top of the 9th and the Nats clinging to a one-run lead. Storen proceeded to give up a single to Buster Posey & a double to Pablo Sandoval to tie the game, but escaped the inning when Posey was thrown out at the plate. So taking a look at the box score, Zimmermann, who allowed an innocent two-out walk, takes the ERA hit and is accountable for the run, while Storen, who was responsible for the lion's share of the damage, gets completely off the hook. That doesn't seem fair to me!

I’ve seen other statistics target other flawed elements of ERA (park factors, defense), but RE24 is the closest thing I’ve found to a more context-based approach to relief pitcher evaluation. RE24 calculates the change in run expectancy over the course of a single at-bat, so it’s applicable beyond relief pitchers & pitchers in general, and is an excellent way to determine how impactful a player is on the overall outcome of the game. But at the same time, it does not tackle the notion of assignment, simply the change in probability based on a given situation.

wERA is an attempt to retain the positive components of ERA (assignment, interpretability), but do so in a fashion that better represents a pitcher’s true role in allowing the run.

The calculation works in the exact same way as traditional ERA, but assigns inherited runs according to the probability that the run will score given the position of the runner & the number of outs at the start of the at-bat when the relief pitcher enters the game. These probabilities were calculated using every outcome from the 2016 season where inherited runners were involved.

Concretely, here is a chart showing the probability, and thus the run responsibility, in each possible situation. So in the top example – if there’s a runner on 3rd and no one out when the RP enters the game, the replaced pitcher is assigned 0.72 of the run, and the pitcher who inherits the situation is assigned 0.28 of the run. On the flip side, if the relief pitcher enters the game with 2 outs & a runner on first, they will be assigned 0.89 of the run, since it is primarily the relief pitcher’s fault the runner scored.
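To make the mechanics concrete, here is a minimal sketch of the assignment in R. It uses only the two example shares quoted above (the 0.11 departing share in the two-out case follows from the shares summing to one); the full table covers every base/out state:

```r
# Run responsibility shares for two of the base/out states quoted above
inherit_share <- data.frame(
  outs            = c(0, 2),
  runner_on       = c(3, 1),
  departing_share = c(0.72, 0.11),  # charged to the pitcher who allowed the runner
  relieving_share = c(0.28, 0.89)   # charged to the pitcher who let the runner score
)

# wERA keeps the standard ERA formula, but earned runs can be fractional
wera <- function(earned_runs, innings) 9 * earned_runs / innings

# e.g. a reliever charged 0.89 of one inherited run plus 3 full runs over 30 IP
wera(3 + 0.89, 30)
```

Every run scored by an inherited runner is split this way between the departing and relieving pitcher before the usual 9 * ER / IP calculation.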

[Chart: run responsibility by runner position & number of outs when the relief pitcher enters]

Let's take a look at the 2016 season, and see which starting & relief pitchers would be least & most affected by this version of the ERA calculation (note: only showing starters with at least 100 IP, and relievers with over 30 IP).

[Table: starting pitchers, traditional ERA vs. wERA]

The Diamondbacks starting pitchers had a rough year this year, but they were not helped out by their bullpen. Patrick Corbin would shave off almost 10 runs & over half a run in season-long ERA using the wERA calculation over the traditional ERA calculation.

On the relief pitcher side the ERA figures shift much more severely.

[Table: relief pitchers, traditional ERA vs. wERA]

Cam Bedrosian had, by normal standards, an amazing year with an ERA of just 1.12. Factoring in inherited runs scored, his ERA jumps up over 2 runs to a still-solid 3.18, but clearly he was the "beneficiary" of the traditional ERA calculation. To be concrete about the wERA calculation: it is saying that Bedrosian was responsible for an additional 9.22 runs this season, stemming directly from his role in allowing the runners he inherited to score.

The below graph shows relief pitchers' wERA vs. traditional ERA in scatter plot form. The blue line shows the slope of the relationship between regular ERA and wERA, and the black line shows a perfectly linear relationship. It's clear that the result of this new ERA is an overall increase in RP ERA, albeit to varying degrees based on individual pitcher performance.

[Chart: relief pitcher wERA vs. traditional ERA]

While I believe this represents an improvement over traditional ERA, there are two flaws in this approach:

  • In the complete opposite fashion from traditional ERA, wERA disproportionately "harms" relief pitcher ERA, because relievers enter games in situations that starters do not, which are more likely to cause a run to be allocated against them.
  • This does not factor in pitchers who allow inherited runners to advance but do not allow that runner to score. Essentially a pitcher could leave a situation worse off than he found it, but not be negatively impacted.

The possible solution to both of these would be to employ a calculation similar to RE24 and compute both RP & SP expected vs. actual runs. This would lose the nature of run assignment to a degree, but would be a less biased way to evaluate how much better or worse a pitcher is compared to expectation. I will attempt to refactor this code to perform those calculations over the holidays this year.

All analysis was performed using the incredible pitchRx package within R, and the code can be found on my GitHub page.





Methodology Deep Dive – Gradient Descent

An excellent way to better understand technical subjects is to write about them, so today I'm going to do a deep dive into the gradient descent algorithm. I've utilized gradient descent for years, largely with gradient boosted decision & regression trees (gbm, xgboost, etc.), but this will be a discussion of the mathematical underpinnings that make gradient descent an efficient optimization algorithm. I'm stealing a lot of this from Andrew Ng's excellent Machine Learning course, which I highly recommend.


Gradient descent is a broadly applied algorithm in machine learning that is designed to minimize a given cost function based on a user-specified constant learning rate (alpha). The formula is as follows:

theta_j <- theta_j - alpha * (∂/∂theta_j) J(theta)    (repeated for each parameter j until convergence)

I'm using the <- notation to indicate assignment (a la R). In this simplified example there are two major factors that dictate how large each "step" in gradient descent is: the cost function itself, and the learning rate chosen.


The larger the learning rate, the larger each "step" in the descent becomes. An advantage of a smaller learning rate is a higher likelihood of converging on the local minimum. The downsides: if the surface is not convex, the algorithm will converge to a local rather than the global minimum (not an issue for convex surfaces), and smaller steps mean more steps are necessary to actually converge to that minimum, which, depending on the size of the data set you're working with, can be extremely time consuming. On the flip side, a larger learning rate can speed up the process, but risks never actually converging on the local minimum.


The derivative portion of the equation captures the minimization of the cost function based on the features in the model. As a practical example from a modeling context: if in step one there's a huge minimization opportunity, the rate of change of the cost function will be quite large, so the impact of the derivative term will be large. If in the next step there is a smaller minimization of the cost function, the derivative term will have a smaller impact. So even though alpha remains constant in each step of a gradient descent based model, the overall minimization of the cost function becomes smaller as the impact of the derivative decreases.
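That intuition, a constant alpha but shrinking steps as the gradient flattens, is easy to see in code. Here is a minimal R sketch (my own illustration, not from the course) minimizing the convex cost J(theta) = (theta - 3)^2, whose derivative is 2 * (theta - 3):

```r
# Gradient descent on J(theta) = (theta - 3)^2, which has its minimum at theta = 3
grad_descent <- function(theta0, alpha, n_steps) {
  theta <- theta0
  for (i in seq_len(n_steps)) {
    gradient <- 2 * (theta - 3)        # derivative of the cost at the current theta
    theta <- theta - alpha * gradient  # constant alpha; the step shrinks as the gradient shrinks
  }
  theta
}

grad_descent(theta0 = 0, alpha = 0.1, n_steps = 100)  # converges to ~3
```

With alpha = 0.1 each step multiplies the remaining error by 0.8, so the early steps are large and the later ones tiny, exactly the behavior described above. Setting alpha above 1 on this cost would overshoot and diverge.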


There’s far more to dig into on this subject, but this should serve as a nice introduction to the intuition behind how the gradient descent algorithm works.


Why Tableau Should be Worried

Let me start by saying Tableau will be totally fine for a long, long time. They have software that produces beautiful visualizations & have established themselves at many companies (big & small) already. They have done a really excellent job of making data visualization approachable for users who are comfortable in an Excel or SQL environment.

BUT – even with all of that said, if Tableau wants to be seen as an end-to-end analytics & data science tool I think they should be worried about 2 things:

  • Open source visualization packages that are compatible with common data science languages, e.g. Plotly.

Tableau’s model is that you take your data, stick it in their tool, and visualize from there. When your data starts to get really big, or you’re interested in applying significant manipulation or some type of machine learning or statistical modeling, you can get stuck.

With something like Plotly, which just operates as a package within common data science languages (R, Python, Scala), I can perform analyses & manipulate data in any language I want, not worry about porting the output of that data into another system and simply use Plotly to visualize. Data science as an industry iterates so fast that the most important aspect of any tool in my mind is flexibility, and that’s exactly what Plotly provides.

  • New end-to-end tools that start at cluster management and end at dashboard creation, e.g. Databricks’ new platform.

Databricks recently released a tool that allows users to do everything from cluster management, to flexible data manipulation (at scale since it’s Spark, with the ability to code in R/Python for certain tasks if you’d like), to producing dashboards that can be shared via URL. They came in and gave a demo at work this week, and it was extremely impressive.

Again – Tableau has nothing to worry about in the immediate future, and they still can become the next version of Excel, but right now they are sold as an end-to-end analytics solution, and I just don’t agree with that. I know that drag-and-drop clustering is coming in Tableau 10 (which is a conversation for another day, but potentially very dangerous), but as of right now Tableau is ideal for smaller datasets that just need visualization, or simple analysis.

But – for my personal workflow, I very much prefer Plotly for the flexibility it provides, as well as for letting me stay in one environment to complete an analysis. In addition, a tool like Databricks’ new platform is a better fit for how data science will be performed in the future. This is 100% just my opinion, but I think Tableau needs to either revamp its offering to better handle large data sets & allow more flexibility, or commit to being the successor to Excel.

Daily Fantasy Predictions – MLB


I finished up my MSPA (MS in Predictive Analytics) from Northwestern this past month, so I now have more free time to follow my own selfish data interests … and actually update this blog. One of those interests is predicting fantasy sports – I started developing a model with a team in my final course in the MSPA, and have extended it from there. The aim of the model is to use historic at-bat level information to create an optimal lineup that outperforms the competition in 50/50 MLB competitions on Daily Fantasy sites – I put the model into production (at a small scale) in mid-June, and so far I’ve been using Fanduel.


For those unfamiliar with MLB Daily Fantasy, each day you pick a new team based on positional & salary constraints provided by daily fantasy sites, and each player on your team gets points for favorable baseball outcomes, such as getting hits & scoring runs for batters, and striking batters out for pitchers. Each site has its own salary & scoring system, and even after normalizing for overall salary constraints, the value of a player differs site-to-site. So essentially this boils down to an optimization problem for each site – you need to assemble a team that maximizes the possible point output under the salary & positional constraints provided. The ‘50/50’ means the top 50% of players win, and the bottom 50% of players lose.
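
As a toy illustration of the optimization framing above, here’s a brute-force sketch in Python. The players, salaries, and point projections are entirely made up, and real contests add the positional constraints this version omits.

```python
# Toy lineup optimization: pick the subset of hypothetical players that
# maximizes projected points while staying under a salary cap.
# Names, salaries, and projections are placeholder values for illustration.
from itertools import combinations

players = [  # (name, salary, projected_points)
    ("A", 9000, 42.0), ("B", 7500, 35.5), ("C", 6000, 28.0),
    ("D", 5000, 22.5), ("E", 4000, 15.0),
]
SALARY_CAP = 20000
LINEUP_SIZE = 3

# Enumerate every lineup of the right size, keep those under the cap,
# and take the one with the highest total projection.
best = max(
    (c for c in combinations(players, LINEUP_SIZE)
     if sum(p[1] for p in c) <= SALARY_CAP),
    key=lambda c: sum(p[2] for p in c),
)
print([p[0] for p in best])
```

Brute force is fine for five players; at real roster sizes this becomes an integer programming problem, which is why a modeled approach is needed.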


While that may sound simple, creating a modeled approach is a pretty complex exercise, largely because all relevant data can’t be found in one place. I sourced historic at-bat information from the pitchRx package in R, which houses a tremendous wealth of data (I can’t overstate how cool this package is). Unfortunately daily fantasy salary information is not included, and (to my knowledge at least) can only be obtained in an automated fashion through a custom web scraper. Even after that there’s a tremendous amount of manipulation required to get the data into the right format, not just for the at-bat level predictions, but also for events that count towards scoring but aren’t the definitive at-bat outcome (runs scored or driven in).


Today I won’t talk too much about the web scraping or the model itself, but everything from scraping (rvest), to modeling (xgboost), to my personal dashboard (shiny), and everything in between (dplyr, splitstackshape, etc.) was built in R.


Instead, I’ll be talking about how I’m responsibly evaluating my performance. Daily Fantasy sites take a large cut on each bet (in 50/50 competitions on Fanduel you ‘win’ 80% of your investment, so a $10 bet returns $18). If you don’t have a systematic way of betting, it’s very easy to lose money fast. Until my model proves to significantly outperform the breakeven threshold (a 55.56% win rate), I’m not willing to put in any meaningful investment.
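
The breakeven threshold follows directly from the payout structure described above: a win returns 0.8x the stake as profit and a loss forfeits the stake, so breakeven requires p * 0.8 - (1 - p) = 0.

```python
# Breakeven win rate for a 50/50 contest: wins pay 0.8x profit on the stake,
# losses forfeit the full stake, so solve p * payout - (1 - p) = 0.
payout = 0.8
breakeven = 1 / (1 + payout)
print(f"{breakeven:.2%}")  # → 55.56%
```

Any sustained win rate below that line loses money, which is why the site’s cut makes unsystematic betting so costly.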


This first chart shows my performance (blue) by day versus the winning score (orange) in each game I play in, weighted by investment in each game. As you can see, I got off to a hot start but have been about even with the competition since. The transparent blue & orange fill represents the 95% confidence interval of scores based on total performance weighted by investment per day. Right now there’s a lot of variability day-by-day for both my scores & winning scores, and there always will be some due to randomness in baseball, but I expect over time these lines will become much closer to horizontal, and ideally my scores would be noticeably above the competition.

My Score vs. Winning Score - Weighted by Investment per Day


The second chart shows essentially the same thing, but over the whole time frame in classic normal distribution form – each vertical line represents a standard deviation. My score distribution is to the right of the competition’s, which is obviously positive, but it’s also wider, meaning my scores are more variable (negative). I want to stay to the right, but I also want my distribution to tighten, which would mean my scores are both consistently better than the competition and less variable. Consistency is arguably more important – in these competitions, whether I win by 50 or win by 1 I win the same amount, so consistency matters more than high scoring potential. If I find over time that the existing trend continues (better but more variable scores), it would be worth considering another game format that rewards higher scores with a higher payout.

Probability Distribution - My Score vs. Winning Score Weighted by Investment per Day

Right now, in a comparison of 1 MM random outcomes for my score vs. the winning scores based on existing means & standard deviations, I’m above the necessary threshold (~62%), but it’s still too early to tell if that’s meaningful or not. Overall this has been a ton of fun, and I’ll continue to update on performance; hopefully it picks back up!
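
The 1 MM-outcome comparison described above can be sketched like this. Note that the means and standard deviations here are made-up placeholders, not my actual score data – with real numbers the resulting rate is what would be compared to the 55.56% breakeven threshold.

```python
# Sketch of the Monte Carlo comparison: draw 1 MM scores for "me" and for the
# winning score from normal distributions, then estimate how often I win.
# The means and standard deviations are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(42)
my_scores = rng.normal(loc=310, scale=45, size=1_000_000)       # assumed mean/SD
winning_scores = rng.normal(loc=300, scale=35, size=1_000_000)  # assumed mean/SD

win_rate = np.mean(my_scores > winning_scores)
print(f"{win_rate:.1%}")
```

This also shows the variance tradeoff from the chart: a higher mean pushes the win rate up, while a wider "my score" distribution pulls it back toward 50%.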

Visualizing March Madness


Here’s some Python code for visualizing predictions from the Kaggle March Madness 2016 competition; full code can be found on my Github page at the link below. The purpose of this chart is to show the volume of predictions for my model by prediction percentage, as well as how accurate the model is by prediction percentage. I’ve used random numbers as placeholders for game predictions and actual outcomes, but posted the actual results from my model in the image below. Simply replace ‘pred’ and ‘Win’ with your game-by-game predictions and the actual outcomes of the games.

As expected, higher probabilities align with a greater chance of winning games. An interesting insight is how heavily weighted extreme predictions are – this shows that there are many very lopsided match-ups in NCAA basketball that are very easy to predict. Note that this chart should be perfectly symmetrical in terms of counts, but is slightly off simply because predictions were rounded to 2 decimals.



import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

## Purpose of this chart is to get an understanding of how predicted probabilities translate to actual wins
## As well as an understanding of the prediction distribution across percentages

## NOTE – in order to run this you need a completed model, then create a dataframe
## that contains the predicted chance of winning & actual outcome for each game in the test set

## Create dataframe with predicted win % and actual result for each game from predicted test set
## predTest DataFrame houses my predictions, and X_testWin houses the outcome of the predicted games

predTestComp = pd.DataFrame({'pred': predTest[:, 1],
                             'Win': X_testWin['Win']})

## Round prediction % for ease of interpretation

predTestComp['predRound'] = np.round(predTestComp['pred'], decimals=2)

## Group predictions by predicted win percentage, then find the total number of wins, games & average win percentage
## For each rounded prediction value (0-100%)

grouped = predTestComp.groupby(['predRound'])['Win'].agg(['sum', 'count', 'mean']).reset_index()

## Create subplots that have win % on the x axis, and number of games on the secondary axis

fig, tsax = plt.subplots(figsize=(12, 5))
barax = tsax.twinx()

## Create bar chart based on the count of games for each predicted percentage

barax.bar(grouped.index, grouped['count'], facecolor=(0.5, 0.5, 0.5), alpha=0.3)

## Create line chart that shows the average win percentage by predicted percentage

tsax.plot(grouped.index, grouped['mean'], color='b')

## Set axis & data point labels as well as tick distribution

barax.set_ylabel('Number of Games')
tsax.set_ylabel('Win %')
tsax.set_xlabel('Predicted Win %')
tsax.set_xlim([0, 101])
tsax.set_ylim([0, 1])
plt.xticks(np.arange(0, 101, 10))
percListX = ['0%', '10%', '20%', '30%', '40%', '50%', '60%', '70%', '80%', '90%', '100%']
percListY = ['0%', '20%', '40%', '60%', '80%', '100%']
tsax.set_xticklabels(percListX)
tsax.set_yticks(np.arange(0, 1.01, 0.2))
tsax.set_yticklabels(percListY)

## Put line graph in front of bar chart

tsax.set_zorder(barax.get_zorder() + 1)
tsax.patch.set_visible(False)  # hide the 'canvas' so the bars remain visible

## Create legend labels – necessary because it's a subplot

line_patch = mpatches.Patch(color='blue', label='Percentage of Games Won')
bar_patch = mpatches.Patch(color='gray', label='Number of Games')
plt.legend(handles=[line_patch, bar_patch], loc='upper center')
plt.show()

Data Science – Getting Started

Last post we talked about Data Science from a conceptual perspective – what it is exactly, and why it’s becoming so popular in the business world. Often people interested in Data Science gain this conceptual understanding, but simply don’t know where to start from a practical perspective. There are seemingly endless resources to learn from, which on one hand is amazing, but on the other can be overwhelming (curse of dimensionality!), leading to inaction.

In my opinion, the best way to learn is to get started with either R or Python, and after getting comfortable with either language, choose an interesting problem to solve. That’s not a novel concept by any means, but it’s worth reiterating that the most efficient way to learn data science is to actually do it, not passively learn about it. It’s important to understand some of the fundamental concepts of modeling & programming in general, but if you have those under your belt, the best move is to just get started.

In this post, we’ll talk about getting started with programming, and since I am predominantly an R user, I’ve provided some practical steps to getting started with R.

  • Download R

  • Download R Studio

R Studio is a really wonderful IDE that makes writing code & visualizing data significantly easier than the base R interface itself.

  • Get to Know install.packages()

One of the biggest perks of R is the endless amount of packages that users have created over the years. Packages contain reusable code that makes extremely complicated data manipulation, modeling & visualization functions significantly easier. The install.packages function allows users to download any available package in one simple line of code (e.g. install.packages('plyr')).

Below are some sample packages to get started – note that this isn’t even close to a comprehensive list, just some packages that I use frequently. I won’t go into detail on these packages since documentation can be found all over the web, or by typing ?'Insert Package Name' into R Studio directly. Keep in mind that to actually execute the functions within these packages, it’s necessary to load them into your R session using library('Insert Package').

All Around Amazingness 

Caret – installation: install.packages('caret', dependencies = c('Depends', 'Suggests')). Truly an invaluable package and a must-have for modeling & data preparation.

Data Manipulation/Loading:

Data Visualization

Base R

These are a part of base R so no packages need to be installed, but these are very useful functions.
Again – this is not even CLOSE to a comprehensive list, just some super helpful packages to get started. The next installment will talk about how to read data into R, and start to get into some of the basic tenets of the R language.