week4_slides_061421.pptx

Simple Linear Regression
Chapter 9

Key Points of the Chapter:
Databases contain lots of information, and we’ve covered some basic ways to analyze data and to draw conclusions or insights – e.g. univariate analyses (1 variable), cross-tabs (2, or pairs, of variables). But what do you do if you want to do more sophisticated analyses?
There are methods to use pieces of info as predictors to predict an outcome of interest (response, purchase, cancel, etc.)
Regression is an analytic approach that captures (models) that relationship between variable(s) that are predictive of an outcome of interest
The fundamentals of simple linear regression are the topic for this week, but keep in mind that the fundamentals extend to multiple regression

Chapter 9: Simple Linear Regression

‹#›

‹#›

Is there a relationship between customer income (x) and
customer spend (y)? Can you quantify this relationship?

Can this relationship be expressed as an equation or formula?
Why do we care about having an equation or formula?

Source: Exhibit 9.1-9.2, of text

Income (x) Spend (y)

58 76

42 45

24 26

76 102

33 42

69 97

31 33

46 49

51 52

38 40

Income ($k)
Chapter 9: Simple Linear Regression
Yes, as you did in hw3 by “r”
Yes. We care because it lets us go from “reporting” to “predicting”

‹#›

‹#›

Note to class – Think about the “boost” in analytic power you get that comes from being able to predict. . . .the future. It is much more powerful that “just” being able to report or describe present state. Going from correlation to regression allows us to make this transition.
3

Spend (y) 58 42 24 76 33 69 31 46 51 38 76 45 26 102 42 97 33 49 52 40
Yes, there is a relationship between income and spend
Yes, we can quantify (express, measure, etc.) this via “r”, the correlation
We can also model this as an equation (customer spend is the outcome of interest, customer income is a predictor of this outcome)
Having an equation allows us to understand the relationship and to make predictions

The Equation for this is: Y’ = -15.395 + 1.53(x) or Spend = (1.53*Income) – 15.4

Income (x) Spend (y)

58 76

42 45

24 26

76 102

33 42

69 97

31 33

46 49

51 52

38 40

  Coefficients

Intercept -15.395

X Variable 1 1.529809

Income ($k)
Spend ($)
Chapter 9: Simple Linear Regression

‹#›

‹#›

Note to Class: We’ll go through how we arrived at the particular equation shown here, and how to come up with the regression line.
4

Spend (y) 58 42 24 76 33 69 31 46 51 38 76 45 26 102 42 97 33 49 52 40
How do we get to a regression line that models the relationship?
Hint – Start with something that measures an association between 2 variables

Start with correlation, or “r”
Note that a linear regression equation is also an equation for a line . . . We can write equations in terms of z-scores:
y’ = a + bx Usual form of a straight line equation. Note the terms for intercept (a), and slope (b)
Zy’ = 0 + r Zx Re-written in terms of z-scores, note that intercept is 0 when using z-scores
Zy’ = r Zx Note what happens to Zy’ if r =1, (perfect correlation)
if r = 0, (no correlation)
if r = -1 (perfect negative correlation)

For Income (x) and Spend (y), note what happens when “r” is something other than -1,0, or 1.
What we want to know (show) is that the regression equation, model, line, is equivalent to and based upon our understanding and use of correlations as z-scores:Zy’ = r Zx

Ch 9: Regression: The intuition & the math

‹#›

‹#›

zx zy zxzy

0.701424769 0.79114 0.554925

-0.30061062 -0.44751 0.134527

-1.42790042 -1.20669 1.72303

1.828714576 1.83001 3.346565

-0.86425552 -0.56738 0.490364

1.390324096 1.630227 2.266544

-0.98950994 -0.92699 0.917268

-0.05010177 -0.28769 0.014414

0.263034288 -0.16782 -0.04414

-0.55111946 -0.6473 0.356737

  Coefficients

Intercept -3.2E-16

X Variable 1 0.976027

Zy’ = r Zx

0.976023 correlation

Source: Exhibit 9.1-9.2, data converted to z-scores
Ch 9: Regression: The intuition & the math
See slide note below re: something very cool about getting this equation

‹#›
‹#›

Note to Class: If we take our raw scores, and transform them into z-scores and calculate the correlation as you did in HW3, notice that you can generate the regression equation (via software or algebra), that is equivalent to Zy’ = r Zx. Or in “English”, the regression is that the “predicted Y value (as a zscore) = the correlation multiplied by the known X-value (as a zscore).”
6

0.70142476906215701 -0.300610615312353 -1.427900422733678 1.828714576483482 -0.86425551902301501 1.3903240958196339 -0.98950994206982901 -5.0101769218725403E-2 0.263034288398309 -0.551119461405981 0.791139556618222 -0.44751328455172201 -1.2066876065591059 1.8300096814704321 -0.56738291434236099 1.6302269651526999 -0.92699180371428103 -0.28768711149753501 -0.167817481706896 -0.64729600086945505

To find the regression line that models the relationship between X & Y (without software), here’s what’s happening:
i.e. We want to go from: Zy’ = r Zx to an equation of the form: Y’ = a + bX
(see the “white board” steps that algebraically take us from z-scores to the standard form of an equation for a line, and the “AHA moment”…..
See: file: correlation_to_regression_eq.pdf

Steps as follows:
Rewrite z-scores to raw scores for Y and for X…
Multiply (move) the sigma_y from the left side to the right side of the equation…
Add (move) “y_bar” (average_Y) from the left to the right, to isolate a predicted_y on the left . . .
Rewrite the right side of the equation, distributing terms and getting rid of parentheses . . .
Declare ‘b’ to be equal to part of the expression . . .
Rewrite using ‘b’ …
Declare ‘a’ to be equal to part of the expression . . .
Rewrite using ‘a’ and ‘b’ terms, . . . and we have our regression equation !

Notes:
‘b’ is the slope of the regression line adjusted by ratio of population standard deviations.
When dealing w/z-scores, the slope ‘b’ is just the correlation value, “r”

Ch 9: Regression: The intuition & the math

‹#›

‹#›

What is explained vs. unexplained variation in the context of regression?

If there is no relationship between “x” and “y”, what’s your prediction for “y”?
Answer: The Average, or y’ = $56.2

If there is a relationship, then knowing “x” allows you to predict “y”
From the equation, If income = $69, then predicted spend = $90.18

But we have data where income=$69, spend actually = $97. Not $90 ??
Note that this $7 difference is an unexplained (pun intended) variation
y’ = -15.395 + 1.53(x)
Y-Yavg

Y’-Yavg

Y-Y’

Ch 9: Explained and unexplained variation

‹#›

‹#›

Spend (y) 58 42 24 76 33 69 31 46 51 38 76 45 26 102 42 97 33 49 52 40
How “well” does our regression line (model) fit?
Although it’s a best-fit effort, the regression line will always have unexplained error (unless there’s a perfect correlation). Too “much” unexplained error means it’s not a “good” fit
R2 is the amount of variation in Y that is explained by the predictor X.
It’s a measure of how well the regression line models the data, or the “coefficient of determination”.
Mathematically it is literally the correlation, “R” squared
In the example of income and spend, R2 = 95.26%,
So, a bit more than 95% of the variation in Spend, can be accounted for by knowing Income
… and less than 5% is unexplained by the model. It’s a pretty good fit.

Ch 9: Regression: Assessing Fit

‹#›

‹#›

SQL variants and the latest in databases
State of Art

State of the Art in SQL
The basics (review):
Relational tables decompose information into parts and sub-tables. As you did with Student, Job, and Match info as a design exercise (HW2), and as a preliminary build (HW3). Very cool!
Any good design stores information efficiently, has relations based on keys, and has a structure reflected as a schema
Any RDBMS system uses SQL

‹#›

‹#›

State of the Art in SQL
Typical Commercial databases
Currently, most large scale industrial RDBMS are typically from Oracle, Teradata, Microsoft SQL Server.
These are typically “expensive” to license, but have extensive commercial support.
Newer Open Source databases
PostgreSQL, mySQL, SQLite some popular variations of SQL
As open-source, they are “free” but support is typically the responsibility of the user(s)

‹#›

‹#›

State of the Art in SQL
Newest Leading-Edge Database Systems
Advanced databases combine the power of a BigData paradigm with RDBM’s popularity and/or also take advantage of Open Source
Apache Hive is a BigData environment that uses HiveQL to “translate” Java and MapReduce() into SQL-like statements
Snowflake uses is a newer database technology for the cloud, and uses SnowSQL

‹#›

‹#›

State of the Art in SQL
Important differences among all of these SQL variants:
Fundamentally, SQL works for any and all relational databases (analogous to English as a language that is spoken in many countries, but with some localized variations)
Variations to be aware of typically have to do with:
How tables are joined,
How data are created/added,
How a selection of a few records from the “top” is handled
How a PIVOT table is supported or implemented

‹#›

‹#›

Extend correlation into regression, & build a regression model from data
(See Assignments for details)
HW

Exercise: Using the Email marketing (EMMA) dataset, build a simple regression model that predicts “Total customer value” (Y) from “Tenure” (X). . .

Use the Excel capabilities for Simple Regression (hint: data  scatterplot  fit trendline)

Use the algebra that we walked through to come up with the simple regression equation
You will need the following components to determine the slope ‘b’ and intercept ‘a’ values of the equation:
Correlation, r (you can use Excel’s CORR( ) function to calculate this),
Standard deviations for X and Y (use population standard deviations, stdev.p( ),
Means for X and Y,

What would you conclude about the relationship and ability to predict Value from Tenure?

Note: that answers to ‘1’ and ‘2’ above should all match if calculated correctly.

HW4: Regression Homework

‹#›

‹#›

Place your order
(550 words)

Approximate price: $22

Calculate the price of your order

550 words
We'll send you the first draft for approval by September 11, 2018 at 10:52 AM
Total price:
$26
The price is based on these factors:
Academic level
Number of pages
Urgency
Basic features
  • Free title page and bibliography
  • Unlimited revisions
  • Plagiarism-free guarantee
  • Money-back guarantee
  • 24/7 support
On-demand options
  • Writer’s samples
  • Part-by-part delivery
  • Overnight delivery
  • Copies of used sources
  • Expert Proofreading
Paper format
  • 275 words per page
  • 12 pt Arial/Times New Roman
  • Double line spacing
  • Any citation style (APA, MLA, Chicago/Turabian, Harvard)

Our guarantees

Delivering a high-quality product at a reasonable price is not enough anymore.
That’s why we have developed 5 beneficial guarantees that will make your experience with our service enjoyable, easy, and safe.

Money-back guarantee

You have to be 100% sure of the quality of your product to give a money-back guarantee. This describes us perfectly. Make sure that this guarantee is totally transparent.

Read more

Zero-plagiarism guarantee

Each paper is composed from scratch, according to your instructions. It is then checked by our plagiarism-detection software. There is no gap where plagiarism could squeeze in.

Read more

Free-revision policy

Thanks to our free revisions, there is no way for you to be unsatisfied. We will work on your paper until you are completely happy with the result.

Read more

Privacy policy

Your email is safe, as we store it according to international data protection rules. Your bank details are secure, as we use only reliable payment systems.

Read more

Fair-cooperation guarantee

By sending us your money, you buy the service we provide. Check out our terms and conditions if you prefer business talks to be laid out in official language.

Read more

Order your essay today and save 30% with the discount code HAPPY