NFL Teams are Terrified of Going for Two

Time and time again we see football coaches make risk-averse decisions because it protects them from criticism. The most obvious of these choices are whether to go for it on 4th & 1 when in field goal range, whether to go for it on 4th & 2-5 between the opponent’s 40- and 50-yard lines, and my personal favorite, whether to go for two.

Announcers will almost always say coaches “shouldn’t chase the points” early in the game, or should “kick the extra point and make it a one-possession game” when a team down 15 scores a touchdown to get within 8. Both of these assertions, especially the second, can be objectively incorrect, yet if coaches made the opposite decisions they would almost certainly be vilified in the media when unsuccessful.

During the 2016 season, the success rate on extra points was 93.6%, meaning on average teams scored 0.936 points per extra point attempt. The success rate on two point conversions was 48.6%, so on average teams scored 0.972 points per two point conversion attempt. Based on this alone, teams would score more if they went for two every time.
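To make the arithmetic explicit, here’s the expected-points comparison in R using the 2016 rates quoted above (a trivial sketch, nothing more):

## Expected points per attempt, 2016 league-wide rates
xp_rate  <- 0.936   # extra point success rate
two_rate <- 0.486   # two point conversion success rate

xp_rate * 1    # 0.936 expected points per extra point attempt
two_rate * 2   # 0.972 expected points per two point conversion attempt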

Obviously looking at this purely as an average over the course of the whole season across all teams is not fair for every team’s individual situation. If your team has a terrible offense and an amazing kicker, your odds may flip in the other direction, but if you have a top tier offense why not consider going for two every time? Or more specifically, say your team has a solid offense that struggles to make big plays – why not go for two each time to maximize the potential of each scoring opportunity? Either way, there’s a discussion to be had and a coach can make a rational argument in either direction in almost all situations.

On the other hand, the second scenario of kicking the extra point to go down 8 points, especially well into the 4th quarter, is often objectively the wrong decision. Everyone agrees that down 8 a team needs a touchdown and two point conversion to tie the game. If you believe that a team’s chance of actually converting the two point conversion is the same no matter when the play occurs, then postponing that two point conversion until later in the game serves no purpose.

Let’s operate under the scenario that the team will not convert the two point conversion. If the team decides to go for two after the first score and doesn’t get it, they will be down 9 and the coach will get destroyed by the announcers since it is no longer a “one-possession” game. But in reality, the coach now knows he needs two scores in order to win the game and plays accordingly, likely taking more risks & playing faster. If the coach waits to go for two later and doesn’t convert, he will have likely coached as if they only needed one more score to tie the game, which diminishes the chances of winning.

This can easily be described through a practical situation. A team is down 28-13 and scores a touchdown to go down 28-19 with 4 minutes left. The coach subsequently decides to kick the extra point, leaving his team at a 28-20 deficit. His defense then produces a stop, and the offense gets the ball back, methodically drives all the way down the field, burning all of their timeouts in the process, and scores a touchdown as time expires to go down 28-26. Unfortunately, the two point conversion comes up short and the game ends. No one bats an eye or considers criticizing the coach.

In an alternate reality, the coach decides to go for two right away and fails to get it, leaving his team down 28-19. His defense gets a stop, and the offense scores quickly, conserving all their timeouts, to go down 28-26 with 1:30 remaining. The coach then tries an onside kick, fails to recover it, and ultimately comes up short when the opposing team gets the first down they need to end the game. The coach is universally criticized for his decision to go for two early.

Everyone would agree that being down 28-26 with 1:30 remaining and all your timeouts is better than being down 28-26 with no time left, correct? The information gained from attempting the two point conversion earlier gave the alternate reality coach a better chance of winning than the first coach who followed convention. Going for two early gives you the highest probability of the worst outcome, scoring no points after the touchdown, but it also gives you the highest probability of the best outcome, and allows you to coach fully understanding the nature of the situation. The article below does a better job explaining this concept than I ever could:

http://www.footballperspective.com/trailing-by-15-in-the-middle-of-the-4th-quarter-teams-are-foolish-to-not-go-for-2-after-touchdowns/

The day an announcer accurately assesses this situation will be a very, very happy day.

 

Derek Norris 2016 – A Season to Forget


While it may not be the most exciting Nationals story of the offseason, Wilson Ramos signing with the Rays and the subsequent trade for Derek Norris to replace him is a very big change for the Nats. Prior to tearing his ACL in September, Ramos was having an incredible 2016 and really carried the Nationals offense through the first part of the year (with the help of Daniel Murphy, of course) when Harper was scuffling and Anthony Rendon was still working back from last season’s injury. Given Ramos’ injury history it makes sense to let him walk, but Nationals fans have reasons to be concerned about Norris.

After a few seasons of modest success, including an All-Star appearance in 2014, Norris batted well under the Mendoza line (.186) in 2016 with a significant increase in strikeout rate. What was the cause of this precipitous decline? Others have dug into this lost season as well; this article will focus on PitchFx pitch-by-pitch data accessed through the pitchRx package in R, along with Statcast batted-ball data manually downloaded as CSV files from baseballsavant.com and loaded into R. Note that the Statcast data has some missing values, so it is not comprehensive, but it still tells enough to paint a meaningful story.
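As a rough illustration of the data setup, here’s a minimal sketch of pulling the PitchFx data with pitchRx and loading a Statcast CSV (the date range, file name, and in-memory scrape are assumptions for illustration – a full season is usually written to a database via scrape()’s connect argument rather than held in memory):

## Pull 2016 PitchFx data and join the pitch & at-bat tables
library(pitchRx)
library(dplyr)

pfx <- scrape(start = "2016-04-03", end = "2016-10-02")   # list of data frames: pitch, atbat, ...
pitches <- inner_join(pfx$pitch, pfx$atbat, by = c("num", "gameday_link"))

## Statcast batted-ball data, downloaded manually from baseballsavant.com
statcast <- read.csv("norris_statcast_2016.csv")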

To start, Norris’ strikeout rate increased from 24% in 2015 to 30% in 2016, but that’s not the entire story. Norris’ BABIP dropped from .310 in 2015 to .238 in 2016 as well, while his ISO stayed relatively flat (.153 in 2015 vs. .142 in 2016). Given the randomness that can be associated with BABIP, this could be good news for Nats fans, but upon further investigation there’s reason to believe this drop was not an aberration.

Using the batted-ball Statcast data, it doesn’t appear that Norris is making weaker contact, at least from a velocity standpoint (chart shows values in MPH):

[Figure: exit velocity of batted balls, 2015 vs. 2016 (MPH)]

Distance, on the other hand, does show a noticeable difference (chart shows values in feet):

[Figure: batted-ball distance, 2015 vs. 2016 (feet)]

So Norris is hitting the ball farther in 2016, but with less success, which translates to lazy fly balls. This is borne out by the angle of the balls he put in play in 2015 vs. 2016 (values represent the vertical angle of the ball at contact).

[Figure: vertical launch angle of balls in play, 2015 vs. 2016]

The shifts in distance & angle year over year are both statistically significant (velocity is not), indicating these are meaningful changes, and they appear to be caused at least in part by the way pitchers are attacking Norris.
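For reference, these year-over-year significance checks are the sort of thing a simple two-sample t-test handles; a minimal sketch (the season, hit_distance, angle, and hit_speed column names are assumptions about the Statcast file, not the exact fields used here):

## Test whether average batted-ball distance shifted between seasons
t.test(hit_distance ~ season, data = statcast)

## Repeat for launch angle & exit velocity
t.test(angle ~ season, data = statcast)
t.test(hit_speed ~ season, data = statcast)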

Switching to the PitchFx data, it appears pitchers have begun attacking Norris up and out of the zone more in 2016. The below chart shows the percentage frequency of all pitches thrown to Derek Norris in 2015 & 2016 based on pitch location. Norris has seen a noticeable increase in pitches in Zones 11 & 12, which are up and out of the strike zone.

[Figure: pitch location frequency by zone, 2015 vs. 2016]

Norris has also seen a corresponding jump in fastballs, which makes sense given this change in location. The shift isn’t as noticeable as location, but Norris has seen fewer change-ups (CU) & sinkers (SI) and an increase in two-seam (FT) & four-seam fastballs (FF).

[Figure: pitch type frequency, 2015 vs. 2016]

The net results are striking. The below chart shows Norris’ “success” rate on pitches in Zones 11 & 12 (represented by “Yes” values, the bars on the right below) compared to all other zones, looking only at outcome pitches, i.e. the last pitch of a given at-bat. In this case a success is a hit of any kind, and a failure is any non-productive out (so excluding sacrifices). All other plate appearances were excluded.

[Figure: success rate on outcome pitches, Zones 11 & 12 vs. all other zones, 2015 vs. 2016]

While Norris was less effective overall in 2016, the drop in effectiveness on zone 11 & 12 pitches is extremely noticeable. Looking at the raw numbers makes this even more dramatic:

[Tables: raw success/failure counts on outcome pitches, Zones 11 & 12 vs. all other zones, 2015 (left) and 2016 (right)]

So not only did more at-bats end on pitches in Zones 11 & 12, Norris went a shocking 2-for-81 in these situations in 2016.
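A rough sketch of how that zone-by-zone tabulation can be built from the joined PitchFx tables (the column names and the simplified success definition here are assumptions for illustration, not the exact logic behind the charts above):

## Keep the last pitch of each Norris at-bat and tabulate hits by zone group
library(dplyr)

norris_outcomes <- pitches %>%
  filter(batter_name == "Derek Norris") %>%
  group_by(gameday_link, num) %>%
  filter(row_number() == n()) %>%        # outcome pitch = last pitch of the at-bat
  ungroup() %>%
  mutate(season    = substr(gameday_link, 5, 8),
         high_zone = zone %in% c(11, 12),
         hit       = event %in% c("Single", "Double", "Triple", "Home Run"))

norris_outcomes %>%
  group_by(season, high_zone) %>%
  summarise(hits = sum(hit), abs = n(), rate = mean(hit))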

In short, Norris should expect a steady stream of fastballs up in the zone in 2017, and if he can’t figure out how to handle them, the Nationals may seriously regret handing him the keys to the catcher position.

wERA – Rethinking Inherited Runners in the ERA Calculation


There are many things to harp on about traditional ERA, but one thing that has always bothered me is the inherited runner portion of the base ERA calculation. Why do we treat it in such a binary fashion? Shouldn’t the pitcher who actually allows the inherited runner to score shoulder some of the accountability?

As a Nationals fan, the seminal example of the fallacy of this calculation was Game 2 of the 2014 Division Series vs. the Giants. Jordan Zimmermann had completely dominated all day, and after a borderline ball-four call Matt Williams replaced him with Drew Storen, who entered the game with a runner on first and two outs in the top of the 9th and the Nats clinging to a one-run lead. Storen proceeded to give up a single to Buster Posey & a double to Pablo Sandoval to tie the game, but escaped the inning when Posey was thrown out at the plate. So taking a look at the box score, Zimmermann, who allowed an innocent two-out walk, takes the ERA hit and is accountable for the run, while Storen, who was responsible for the lion’s share of the damage, gets completely off the hook. That doesn’t seem fair to me!

I’ve seen other statistics target flawed elements of ERA (park factors, defense), but RE24 is the closest thing I’ve found to a more context-based approach to relief pitcher evaluation. RE24 calculates the change in run expectancy over the course of a single at-bat, so it’s applicable beyond relief pitchers, and pitchers in general, and is an excellent way to determine how impactful a player is on the overall outcome of the game. But at the same time, it does not tackle the notion of assignment, simply the change in run expectancy from a given situation.

wERA is an attempt to retain the positive components of ERA (assignment, interpretability), but do so in a fashion that better represents a pitcher’s true role in allowing the run.

The calculation works in exactly the same way as traditional ERA, but assigns inherited runs according to the probability that the runner will score given the position of the runner & the number of outs when the relief pitcher enters the game. These probabilities were calculated using every outcome from the 2016 season where inherited runners were involved.

Concretely, here is a chart showing the probability, and thus the run responsibility, in each possible situation. So in the top example – if there’s a runner on 3rd and no one out when the RP enters the game, the replaced pitcher is assigned 0.72 of the run, and the pitcher who inherits the situation is assigned 0.28 of the run. On the flip side, if the relief pitcher enters the game with 2 outs & a runner on first, they will be assigned 0.89 of the run, since it is primarily the relief pitcher’s fault the runner scored.
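A minimal sketch in R of how those responsibility weights could be derived and applied (the inherited data frame and its columns are hypothetical placeholders; the 0.11 weight is simply read off the chart below):

## Probability an inherited runner scores, by base/out state (2016 play-by-play)
library(dplyr)

run_prob <- inherited %>%
  group_by(base_state, outs) %>%
  summarise(p_score = mean(scored))

## For an inherited runner who scores, charge the departing pitcher p_score
## of the run and the relieving pitcher 1 - p_score
p_score          <- 0.11              # e.g. runner on 1st, 2 outs
departing_charge <- p_score           # 0.11 of the run to the replaced pitcher
reliever_charge  <- 1 - p_score       # 0.89 of the run to the reliever

## wERA then plugs the weighted run totals into the familiar formula
wERA <- function(weighted_earned_runs, innings) 9 * weighted_earned_runs / innings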

[Table: run responsibility assigned to the replaced pitcher vs. the relieving pitcher, by base/out state]

Let’s take a look at the 2016 season, and see which starting & relief pitchers would be least & most affected by this version of the ERA calculation (note: only showing starters with at least 100 IP, and relievers with over 30 IP).

[Table: starting pitchers – traditional ERA vs. wERA, 2016 (min. 100 IP)]

The Diamondbacks starting pitchers had a rough year, and they were not helped by their bullpen. Patrick Corbin would shave off almost 10 runs, and over half a run of season-long ERA, using the wERA calculation instead of the traditional ERA calculation.

On the relief pitcher side the ERA figures shift much more severely.

[Table: relief pitchers – traditional ERA vs. wERA, 2016 (min. 30 IP)]

Cam Bedrosian had, by normal standards, an amazing year with an ERA of just 1.12. Factoring in inherited runners who scored, his ERA jumps up over 2 runs to a still solid 3.18, but clearly he was the “beneficiary” of the traditional ERA calculation. To be concrete about the wERA calculation – it is saying that Bedrosian was responsible for an additional 9.22 runs this season stemming directly from his “contribution” to inherited runners who ultimately scored.

The below graph shows relief pitchers’ wERA vs. traditional ERA in scatter-plot form. The blue line shows the slope of the relationship between regular ERA and wERA, and the black line shows a perfectly linear relationship. It’s clear that the result of this new metric is an overall increase in RP ERA, albeit to varying degrees based on individual pitcher performance.

[Figure: relief pitcher wERA vs. traditional ERA scatter plot]

While I believe this represents an improvement over traditional ERA, there are two flaws in this approach:

  • In complete opposite fashion to traditional ERA, wERA disproportionately “harms” relief pitcher ERA, because relievers enter games in situations that starters do not – situations that are more likely to cause a run to be allocated against them.
  • It does not factor in pitchers who allow inherited runners to advance but not score. Essentially a pitcher could leave a situation worse off than he found it, yet not be negatively impacted.

The possible solution to both of these would be to employ a calculation similar to RE24 and compute both RP & SP expected vs. actual runs. This would lose the nature of run assignment to a degree, but would be a less biased way to evaluate how much better or worse a pitcher is compared to expectation. I will attempt to refactor this code to perform those calculations over the holidays this year.

All analysis was performed using the incredible pitchRx package within R, and the code can be found at the Github page below.

Baseball/wERA.R

 

 

 

Methodology Deep Dive – Gradient Descent

An excellent way to better understand technical subjects is to write about them, so today I’m going to do a deep dive into the gradient descent algorithm. I’ve utilized gradient descent for years, largely with gradient boosted decision & regression trees (gbm, xgboost, etc.), but this will be a discussion of the mathematical underpinnings that make gradient descent an efficient optimization algorithm. I’m stealing a lot of this from Andrew Ng’s excellent Machine Learning course, which I highly recommend.

 

Gradient descent is a broadly applied algorithm in machine learning designed to minimize a given cost function, taking steps governed by a user-specified constant learning rate (alpha). The update rule is as follows:

θ_j <- θ_j - α * ∂/∂θ_j J(θ)     (updating every parameter θ_j simultaneously)

I’m using the <- notation to indicate assignment (a la R). In this simplified example there are two major factors that dictate how large each “step” in gradient descent is – the cost function itself, and the learning rate chosen.

 

The larger the learning rate, the larger each “step” in the descent becomes. The advantage of a smaller learning rate is a higher likelihood of converging on the local minimum. The downside is that smaller steps mean more steps are necessary to actually reach that minimum, which depending on the size of the data set you’re working with can be extremely time consuming (and if the surface is not convex, the algorithm still only converges to a local rather than global minimum – not an issue for convex surfaces). On the flip side, a larger learning rate can speed up the process, but risks overshooting and never actually converging on the local minimum.

 

The derivative portion of the equation captures how quickly the cost function is changing with respect to each parameter – so, as a practical example from a modeling context, if in step one there’s a huge minimization opportunity, the rate of change of the cost function will be quite large, so the impact of the derivative term will be large. If in the next step there is a smaller minimization opportunity, the derivative term will have a smaller impact. So even though alpha remains constant in each step of gradient descent, the step sizes shrink as the impact of the derivative decreases and the algorithm approaches the minimum.
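To make the mechanics concrete, here is a minimal sketch of gradient descent in R for simple linear regression with a squared-error cost (the simulated data, learning rate, and iteration count are illustrative choices, not tied to anything in this post):

## Simulated data: y = 2 + 3x + noise
set.seed(42)
x <- runif(200)
y <- 2 + 3 * x + rnorm(200, sd = 0.5)

alpha <- 0.1          # learning rate
theta <- c(0, 0)      # intercept & slope, initialized at zero

for (i in 1:5000) {
  pred  <- theta[1] + theta[2] * x
  error <- pred - y
  ## Partial derivatives of the mean squared error cost
  grad  <- c(mean(error), mean(error * x))
  ## Simultaneous update of both parameters
  theta <- theta - alpha * grad
}

theta                  # should land near c(2, 3)

Note how each update shrinks as the gradient shrinks, even though alpha is held constant throughout.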

 

There’s far more to dive into on this subject, but this should serve as a nice introduction to the intuition behind how the gradient descent algorithm works.

 

Why Tableau Should be Worried

Let me start by saying Tableau will be totally fine for a long, long time. They have software that produces beautiful visualizations & have established themselves at many companies (big & small) already. They have done a really excellent job of making data visualization approachable for users who are comfortable in an Excel or SQL environment.

BUT – even with all of that said, if Tableau wants to be seen as an end-to-end analytics & data science tool I think they should be worried about 2 things:

  • Open source visualization packages that are compatible with common data science languages, e.g. Plotly.

Tableau’s model is that you need to take your data, stick it in their tool and visualize from there. When your data starts to get really big, or you are interested in applying significant manipulation or some type of machine learning or statistical modeling, you can get stuck.

With something like Plotly, which operates as a package within common data science languages (R, Python, Scala), I can perform analyses & manipulate data in any language I want, not worry about porting the output into another system, and simply use Plotly to visualize. Data science as an industry iterates so fast that the most important aspect of any tool, in my mind, is flexibility, and that’s exactly what Plotly provides.
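As a small illustration of that workflow in R (the built-in economics data set stands in for whatever output your analysis produces):

## Manipulate with dplyr, visualize with plotly – without leaving the R session
library(ggplot2)   # for the economics data set
library(dplyr)
library(plotly)

economics %>%
  mutate(unemploy_rate = unemploy / pop) %>%
  plot_ly(x = ~date, y = ~unemploy_rate, type = "scatter", mode = "lines")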

  • New end-to-end tools that start at cluster management and end at dashboard creation, e.g. Databricks’ new platform

Databricks recently released a tool that allows users to do everything from cluster management, to flexible data manipulation (at scale since it’s Spark, with the ability to code in R/Python for certain tasks), to producing dashboards that can be shared via URL. They came in and gave a demo at work this week, and it was extremely impressive.

Again – Tableau has nothing to worry about in the immediate future, and they can still become the next version of Excel, but right now they are sold as an end-to-end analytics solution, and I just don’t agree with that. I know drag-and-drop clustering is coming in Tableau 10 (which is a conversation for another day, but potentially very dangerous), but as of right now Tableau is ideal for smaller datasets that just need visualization, or simple analysis.

But – for my personal workflow, I very much prefer Plotly for the flexibility it provides, as well as for allowing me to stay in one environment to complete an analysis. In addition, a tool like Databricks’ new platform is a better fit for how data science will be performed in the future. This is 100% just my opinion, but I think Tableau needs to either revamp its offering to better handle large data sets & allow more flexibility, or commit to being the successor to Excel.

Daily Fantasy Predictions – MLB


I finished up my MSPA (MS in Predictive Analytics) from Northwestern this past month, so I now have more free time to follow my own selfish data interests … and actually update this blog. One of those interests is predicting fantasy sports – I started developing a model with a team in my final course in the MSPA, and have extended it from there. The aim of the model is to use historic at-bat level information to create an optimal lineup that outperforms the competition in 50/50 MLB competitions on Daily Fantasy sites – I put the model into production (at a small scale) in mid-June, and so far I’ve been using Fanduel.

 

For those unfamiliar with MLB Daily Fantasy: each day you pick a new team based on positional & salary constraints provided by the daily fantasy sites, and each player on your team gets points for favorable baseball outcomes such as getting hits & scoring runs for batters, and striking batters out for pitchers. Each site has its own salary & scoring system, and even after normalizing for overall salary constraints the value of a player differs site-to-site. So essentially this boils down to an optimization problem for each site – you need to assemble a team that maximizes possible point output under the salary & positional constraints provided. The ‘50/50’ means the top 50% of entries win, and the bottom 50% lose.
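For a sense of what that optimization looks like, here’s a minimal sketch using the lpSolve package in R (the projections data frame, its columns, and the salary cap / roster constraints are all hypothetical placeholders, not any site’s actual settings):

## Pick the lineup that maximizes projected points subject to a salary cap
library(lpSolve)

## projections: one row per player with projected points, salary, and position flags
## e.g. columns: player, proj_pts, salary, is_P, is_C, is_1B, ...
obj <- projections$proj_pts

con <- rbind(projections$salary,            # total salary used
             projections$is_P,              # number of pitchers
             rep(1, nrow(projections)))     # total roster size

dir <- c("<=", "==", "==")
rhs <- c(35000, 1, 9)                       # hypothetical cap, 1 pitcher, 9 roster spots

sol <- lp("max", obj, con, dir, rhs, all.bin = TRUE)
projections[sol$solution == 1, "player"]    # the selected lineup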

 

While that may sound simple, creating a modeled approach is a pretty complex exercise, largely because all the relevant data can’t be found in one place. I sourced historic at-bat information from the pitchRx package in R, which houses a tremendous wealth of data from Baseball-Reference (I can’t overstate how cool this package is). Unfortunately daily fantasy salary information is not included, and (to my knowledge at least) can only be obtained in an automated fashion through a custom web scraper. Even after that there’s a tremendous amount of manipulation required to get the data into the right format, not just for the at-bat level predictions, but also for events that count towards scoring but aren’t the definitive at-bat outcome (runs scored or driven in).

 

Today I won’t talk too much about the web scraping or the model itself; everything from scraping (rvest), to modeling (xgboost), to my personal dashboard (shiny), and everything in between (dplyr, splitstackshape, etc.) was built in R.

 

Instead, I’ll be talking about how I’m responsibly evaluating my performance. Daily fantasy sites take a large cut of each bet (in 50/50 competitions on Fanduel you ‘win’ 80% of your investment, so a $10 bet returns $18). If you don’t have a systematic way of betting, it’s very easy to lose money fast. Until my model proves to significantly outperform the breakeven threshold (a 55.55% win rate), I’m not willing to put in any meaningful investment.
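Where that 55.55% comes from, sketched in R: a win nets $8 on a $10 entry while a loss costs the full $10, so the win rate p must satisfy 8p - 10(1 - p) = 0.

## Breakeven win rate for a 50/50 with an 80% payout on the entry fee
payout <- 8     # profit on a winning $10 entry
entry  <- 10    # amount lost on a losing entry

breakeven <- entry / (entry + payout)   # solves 8p - 10(1 - p) = 0
breakeven                               # 0.5555...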

 

This first chart shows my performance (blue) by day versus the winning score (orange) in each game I play, weighted by investment in each game. As you can see, I got off to a hot start but have been about even with the competition since. The transparent blue & orange fill represents the 95% confidence interval of scores based on total performance weighted by investment per day. Right now there’s a lot of variability day-by-day for both my scores & the winning scores, and there always will be some due to randomness in baseball, but I expect over time these lines will become much closer to horizontal, and ideally my scores would be noticeably above the competition.

[Figure: My Score vs. Winning Score – Weighted by Investment per Day]

 

The second chart shows essentially the same thing, but over the whole time frame in classic normal distribution form; each vertical line represents a standard deviation. My score distribution is to the right of the competition, which obviously is positive, but it is also wider, meaning my scores are more variable (negative). I want to stay to the “right”, but I also want my score distribution to move up vertically, which would mean my scores are more consistent. More consistent is arguably more important – in these competitions if I win by 50 or win by 1, I win the same amount, so consistency matters more than high scoring potential. If I find over time that the existing trend continues (better but more variable scores), it would be worth considering another game format that rewards higher scores with a higher payout.

[Figure: Probability Distribution – My Score vs. Winning Score, Weighted by Investment per Day]

Right now, in a comparison of 1 MM random outcomes for my score vs. the winning scores based on the existing means & standard deviations, I’m above the necessary threshold (~62%), but it’s still too early to tell if that’s meaningful or not. Overall this has been a ton of fun, and I’ll continue to update on performance – hopefully it picks back up!
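For reference, that ~62% figure comes from a comparison that’s easy to reproduce with a quick simulation in R (the means & standard deviations below are placeholders – they would come from the observed score distributions):

## Simulate 1 MM games from each score distribution and compare
set.seed(1)
n <- 1e6

my_scores  <- rnorm(n, mean = 38, sd = 9)   # placeholder mean/sd for my scores
win_scores <- rnorm(n, mean = 36, sd = 6)   # placeholder mean/sd for winning scores

mean(my_scores > win_scores)   # share of simulated games where my score wins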

Visualizing March Madness


Here’s some Python code for visualizing predictions from the Kaggle March Madness 2016 competition; full code can be found on my Github page at the link below. The purpose of this chart is to show the volume of predictions from my model by prediction percentage, as well as how accurate the model is by prediction percentage. I’ve used random numbers as placeholders for game predictions and actual outcomes, but posted the actual results from my model in the image below. Simply replace ‘pred’ and ‘Win’ with your game-by-game predictions and actual outcomes.

As expected, higher probabilities align with a greater chance of winning games. An interesting insight is how heavily weighted extreme predictions are – this shows that there are many very lopsided match-ups in NCAA basketball that are very easy to predict. Note that this chart should be totally symmetrical in terms of counts, but is slightly off because predictions were rounded to 2 decimals for charting purposes.

Enjoy!

https://github.com/WesleyPasfield/March_Madness/blob/master/MarchMadness.ipynb

[Figure: number of games & win rate by predicted win percentage]

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

## Purpose of this chart is to get an understanding of how predicted probabilities translate to actual wins
## As well as an understanding of the prediction distribution across percentages

## NOTE – in order to run this need to have a completed model, then create a dataframe
## that contains the predicted chance of winning & actual outcome for each game in the test set

## Create dataframe with predicted win % and actual result for each game from predicted test set
## predTest DataFrame houses my predictions, and X_testWin houses the outcome of the predicted games

predTestComp = pd.DataFrame({'pred': predTest[:,1],
                             'Win': X_testWin['Win']})
## Round prediction % for ease of interpretation

predTestComp['predRound'] = np.round(predTestComp['pred'], decimals=2)

## Group predictions by predicted win percentage, then find the total number of wins, games & average win percentage
## For each rounded prediction value (0-100%)

grouped = predTestComp.groupby(['predRound'])['Win'].agg(['sum', 'count', 'mean']).reset_index()

## Create subplots that have win% on the x axis, and number of games on the secondary axis

fig, tsax = plt.subplots(figsize=(12,5))
barax = tsax.twinx()

## Create bar chart based on the count of games for each predicted percentage

barax.bar(grouped.index, grouped['count'], facecolor=(0.5, 0.5, 0.5), alpha=0.3)

## Create line chart that shows the average win percentage by predicted percentage

fig.tight_layout()
tsax.plot(grouped.index, grouped['mean'], color='b')

## Set axis & data point labels as well as tick distribution

barax.set_ylabel('Number of Games')
barax.xaxis.tick_top()
tsax.set_ylabel('Win %')
tsax.set_xlabel('Predicted Win %')
tsax.set_xlim([0, 101])
tsax.set_ylim([0, 1])
plt.xticks(np.arange(0, 101, 10))
percListX = ['0%', '10%', '20%', '30%', '40%', '50%', '60%', '70%', '80%', '90%', '100%']
percListY = ['0%', '20%', '40%', '60%', '80%', '100%']
tsax.set_xticklabels(percListX)
#tsax.set_yticklabels(percListY)

## Put line graph in front of bar chart

tsax.set_zorder(barax.get_zorder()+1)
tsax.patch.set_visible(False) # hide the 'canvas'

## Create legend labels - necessary because it's a subplot

line_patch = mpatches.Patch(color='blue', label='Percentage of Games Won')
bar_patch = mpatches.Patch(color='gray', label='Number of Games')
plt.legend(handles=[line_patch, bar_patch], loc='upper center')

Data Science – Getting Started

Last post we talked about data science from a conceptual perspective – what it is exactly, and why it’s becoming so popular in the business world. Often people interested in data science gain this conceptual understanding but simply don’t know where to start from a practical perspective. There are seemingly endless resources to learn from, which on one hand is amazing, but on the other can be overwhelming (curse of dimensionality!), leading to inaction.

In my opinion, the best way to learn is to get started with either R or Python, and after getting comfortable with either language, choose an interesting problem to solve. That’s not a novel concept by any means, but it’s worth reiterating that the most efficient way to learn data science is to actually do it, not passively learn about it. It’s important to understand some of the fundamental concepts of modeling & programming in general, but if you have those under your belt, the best move is to just get started.

In this post, we’ll talk about getting started with programming, and since I am predominantly an R user, I’ve provided some practical steps to getting started with R.

  • Download R

https://cran.rstudio.com

  • Download R Studio

https://www.rstudio.com/products/rstudio/download/

R Studio is a really wonderful IDE that makes writing code & visualizing data significantly easier than the R interface itself

  • Get to Know install.packages()

One of the biggest perks of R is the endless number of packages that users have created over the years. Packages contain reusable code that makes extremely complicated data manipulation, modeling & visualization tasks significantly easier. The install.packages() function allows users to download any available package in one simple line of code (e.g. install.packages('plyr')).

Below are some sample packages to get started – note that this isn’t even close to a comprehensive list, just some packages that I use frequently. I won’t go into detail on these packages since documentation can be found all over the web, or by typing ?'Insert Package Name' into R Studio directly. Keep in mind that to actually execute the functions within these packages, it’s necessary to load them into your R session using library('Insert Package'), as in the short example below.
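A quick illustration of the install/load pattern just described (dplyr is simply an example package here):

## Install once per machine, load once per session
install.packages('dplyr')   # download the package from CRAN
library('dplyr')            # load it into the current R session
?mutate                     # pull up help for one of its functions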

All Around Amazingness 

Caret – installation: install.packages('caret', dependencies = c('Depends', 'Suggests')). Truly an invaluable package, a must-have for modeling & data preparation

Data Manipulation/Loading:

plyr

dplyr

reshape2

RCurl

data.table 

Modeling

gbm

nnet

randomForest

Data Visualization

ggplot2

RColorBrewer

colorspace

Base R

These are a part of base R so no packages need to be installed, but these are very useful functions.

cbind

rbind

head

tail

summary

str

merge

class

Again – this is not even CLOSE to a comprehensive list, just some super helpful packages to get started. The next installment will talk about how to read data into R, and start to get into some of the basic tenets of the R language.

What is Data Science, who are Data Scientists, and what does the Future of Data Science Hold?

What is Data Science & why is it Growing so Rapidly?

Data Science as a field carries an enormous amount of hype and no real formal definition. At its core, data science is solving business problems through quantitative means. The reasons for its massive growth in popularity & recognition are two-fold.

  • Technology/Open Source Community

Recent advances in computational power have opened up a whole new world for advanced business analysis. Things that can be done extremely easily today were, in the past, either impossible or illogical from a cost/benefit perspective. In addition to computational power, the open-source community has done a lot of the heavy lifting for the end user – rather than having to build algorithms from scratch, endless highly effective and efficient libraries are ready for use.

  • Ambiguity of Definition & Limitless Potential

Unlike more concrete fields, data science can be applied practically anywhere. Every business in the world needs to solve business problems, so choosing not to pursue data science as an organization creates the perception of falling behind the competition.

Who are Data Scientists?

The net of this situation is practicing data scientists who come from a wide variety of fields and perform a wide variety of functions. In its broadest definition, a data scientist requires a skill set that would almost necessitate multiple PhDs in computer science, mathematics or physics.

Lately, however, the data scientist position has begun to morph into two groups – the data engineer/architect, who is more focused on feature creation and data warehousing, and the data modeler/scientist, who is concerned with using quantitative methods to extract insights from data. It’s difficult to find a single person who can perform all the tasks necessary to store, clean, model & present the data, and if he or she could, they likely would not have the time to do it all.

Future of Data Science, and Data Scientists

In my opinion, in the future the “data architect/engineer” will just be known as IT, and the “data scientist/modeler” will just be known as an analyst. The skills necessary to fit into each group may not be 100% necessary to fill traditional IT & analyst roles today, but in 10-15 years I believe they will be. As a society, we are continuously growing more tech savvy & tech reliant, and we rely on quantitative measurement to make decisions more than ever. Look for colleges & universities to catch up with an emphasis on quantitative degrees, as increased competition from boot camps & free coursework has cheapened the value of a traditional college degree. At some point the hype around data science will dissipate, and that will mean data science has truly become part of the fabric of the business world.