Housing Prices across university/non-university towns
Hello World!
First post ever on GitHub Pages!
The project covered in this post is based on Week 4 of UMichigan’s Introduction to Data Science in Python on Coursera. The repository and the source data files are on my GitHub. I hope to extend this project a little further, perhaps by creating linear regression models.
The point of this project is to compare the impact of a recession on housing prices in university towns versus non-university towns.
Hypothesis: Housing prices in university towns are less affected by a recession than housing prices in non-university towns.
A few notes:
- A quarter is 3 months; Jan-March is Q1, Apr-Jun is Q2, Jul-Sept is Q3, Oct-Dec is Q4.
- A recession starts when GDP declines in two consecutive quarters, and ends with GDP growth in two consecutive quarters.
- The recession bottom is the quarter with the lowest GDP in a recession.
- A university town is a city with a high percentage of college students compared to the total population of said city.
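The quarter convention from the first note can be written as a small helper. This is only a sketch: the function name `month_to_quarter` and the `2008Q3` label style are assumptions for illustration, not something the course prescribes.

```python
def month_to_quarter(year: int, month: int) -> str:
    """Return a label such as '2008Q3' for a calendar year and a month (1-12)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    quarter = (month - 1) // 3 + 1  # Jan-Mar -> 1, Apr-Jun -> 2, Jul-Sep -> 3, Oct-Dec -> 4
    return f"{year}Q{quarter}"


# One month from each quarter of 2008
labels = [month_to_quarter(2008, m) for m in (1, 4, 9, 12)]
print(labels)
```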
I like to break things down into intermediate steps:
- Find out when recession period started & ended.
- Find out when recession was at the bottom.
- Make list of university towns.
- Make dataframe of housing prices.
- Using housing prices dataframe, calculate ratio of housing prices at start/end of recession period.
- Using ratio of housing prices, run a t-test to test for a significant difference across university/non-university towns.
Hope you’re still with me!
First things first, import your libraries. You’ll need pandas, numpy, and an independent t-test from the scipy.stats library.
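The imports would look roughly like this. The aliases `pd` and `np` are the usual community conventions, and `ttest_ind` is the independent two-sample t-test in scipy.stats.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind  # independent two-sample t-test
```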
We’re using GDP in chained 2009 dollars, from 2000 onward. Get your recession_start, in string format:
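A minimal sketch of that search, run on a made-up GDP series rather than the real data file. The function name `find_recession_start`, the quarter labels, and the GDP numbers are all invented for illustration, and I'm assuming the start is the first of the two declining quarters.

```python
def find_recession_start(quarters, gdp):
    """Return the first quarter of the first run of two consecutive GDP declines, or None."""
    for i in range(1, len(gdp) - 1):
        # Quarter i declines from i-1, and quarter i+1 declines again
        if gdp[i] < gdp[i - 1] and gdp[i + 1] < gdp[i]:
            return quarters[i]
    return None


# Toy data (invented numbers, not the real GDP figures)
quarters = ["2008Q1", "2008Q2", "2008Q3", "2008Q4", "2009Q1", "2009Q2", "2009Q3", "2009Q4"]
gdp = [100.0, 101.0, 100.5, 99.0, 98.2, 98.0, 98.4, 99.1]

recession_start = find_recession_start(quarters, gdp)
print(recession_start)
```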
And get the recession_end:
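Continuing with the same kind of toy series, the end can be found by looking for two consecutive quarters of growth after the start. Again, `find_recession_end` and the numbers are hypothetical, and I'm assuming the end is the second of the two growth quarters.

```python
def find_recession_end(quarters, gdp, start):
    """Return the second of the first two consecutive growth quarters after `start`, or None."""
    begin = quarters.index(start)
    for i in range(begin + 2, len(gdp)):
        # Quarter i-1 grew over i-2, and quarter i grew over i-1
        if gdp[i - 1] > gdp[i - 2] and gdp[i] > gdp[i - 1]:
            return quarters[i]
    return None


# Toy data (invented numbers, not the real GDP figures)
quarters = ["2008Q1", "2008Q2", "2008Q3", "2008Q4", "2009Q1", "2009Q2", "2009Q3", "2009Q4"]
gdp = [100.0, 101.0, 100.5, 99.0, 98.2, 98.0, 98.4, 99.1]

recession_end = find_recession_end(quarters, gdp, start="2008Q3")
print(recession_end)
```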
From recession_start to recession_end, you need to find the bottom:
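The bottom is just the quarter with the smallest GDP inside that window. A sketch on the same invented series, with a hypothetical helper `find_recession_bottom`:

```python
def find_recession_bottom(quarters, gdp, start, end):
    """Return the quarter with the lowest GDP between `start` and `end`, inclusive."""
    first, last = quarters.index(start), quarters.index(end)
    lowest = min(range(first, last + 1), key=lambda i: gdp[i])
    return quarters[lowest]


# Toy data (invented numbers, not the real GDP figures)
quarters = ["2008Q1", "2008Q2", "2008Q3", "2008Q4", "2009Q1", "2009Q2", "2009Q3", "2009Q4"]
gdp = [100.0, 101.0, 100.5, 99.0, 98.2, 98.0, 98.4, 99.1]

recession_bottom = find_recession_bottom(quarters, gdp, start="2008Q3", end="2009Q4")
print(recession_bottom)
```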
Get a list of university_towns:
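How you parse the list depends on the layout of the source file. The sketch below assumes a Wikipedia-style text layout in which state lines end with `[edit]` and town lines carry the university name in parentheses; that layout, the sample lines, and the helper name `parse_university_towns` are assumptions, so adjust them to the file you actually have.

```python
import re

# Hand-typed sample in the assumed layout (not read from the real file)
sample_lines = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Tuscaloosa (University of Alabama)",
    "Michigan[edit]",
    "Ann Arbor (University of Michigan)[2]",
]


def parse_university_towns(lines):
    """Return a list of (state, town) tuples from lines in the assumed layout."""
    towns = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith("[edit]"):
            # A state header: remember it for the town lines that follow
            state = line[: -len("[edit]")].strip()
        else:
            # A town line: keep only the text before the first "(" or "["
            town = re.split(r"\s*[\(\[]", line, maxsplit=1)[0].strip()
            towns.append((state, town))
    return towns


university_towns = parse_university_towns(sample_lines)
print(university_towns)
```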
And then convert the raw housing data into quarters.
We’re looking for the average of each three-month block (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec), stored in a dataframe with a multi-index in the shape of [“State”, “RegionName”]. You should have 67 columns and 10,730 rows.
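One way to do the monthly-to-quarterly averaging, sketched on a tiny made-up dataframe. The `YYYY-MM` column labels, the towns, and the prices are invented; the real file has many more rows and columns. I transpose before grouping so the quarters can be used as row groups.

```python
import pandas as pd

# Toy monthly prices (invented), one column per month
months = ["2000-01", "2000-02", "2000-03", "2000-04", "2000-05", "2000-06"]
index = pd.MultiIndex.from_tuples(
    [("Michigan", "Ann Arbor"), ("Michigan", "Detroit")],
    names=["State", "RegionName"],
)
monthly = pd.DataFrame(
    [[100.0, 110.0, 120.0, 130.0, 140.0, 150.0],
     [50.0, 50.0, 50.0, 40.0, 40.0, 40.0]],
    index=index,
    columns=months,
)

# Map every month column to its quarter, e.g. '2000-02' -> 2000Q1
quarter_of_month = pd.to_datetime(monthly.columns, format="%Y-%m").to_period("Q")

# Average the months within each quarter, then flip back to one row per town
quarterly = monthly.T.groupby(quarter_of_month).mean().T
print(quarterly)
```

Note that the resulting columns are a PeriodIndex rather than plain strings, which matters later when adding a new column.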
Call all the functions from earlier, and drop the NaN values in the dataframe.
Make a copy of your housingdata_df, then create a ratio of housing prices:
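As a sketch of the ratio itself, here is the price at the recession start divided by the price at the recession end, following the step list above. The dataframe, the quarter labels, and the prices are invented, and the columns are plain strings here to keep the example simple; if you define the ratio with different quarters, swap them in.

```python
import pandas as pd

# Toy quarterly prices (invented) with string column labels
housingdata_df = pd.DataFrame(
    {
        "2008Q3": [200.0, 150.0, 300.0],
        "2009Q2": [190.0, 120.0, 270.0],
        "2009Q4": [195.0, 125.0, 250.0],
    },
    index=pd.MultiIndex.from_tuples(
        [("Michigan", "Ann Arbor"), ("Michigan", "Detroit"), ("Alabama", "Mobile")],
        names=["State", "RegionName"],
    ),
)

hdf = housingdata_df.copy()  # work on a copy so the original stays untouched
recession_start, recession_end = "2008Q3", "2009Q4"

# A ratio above 1 means prices were lower at the end than at the start
ratio = hdf[recession_start] / hdf[recession_end]
print(ratio)
```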
This was where I struggled. I could not join ratio as a column on hdf; it returned a DateParseError: Unknown datetime string format, unable to parse ratio.
Remember when we converted the housingdata_df into quarters using PeriodIndex?
The label ratio was not recognized as a datetime. The solution I chose was to change the hdf dataframe’s columns into strings, then concatenate ratio to the multiple strings… and then convert it back to a dataframe.
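A simpler variant of the same idea (not the exact code I used) is to convert the column labels themselves to strings before adding the new column. The data below is invented, and I'm not reproducing the error here because its exact behaviour may depend on your pandas version.

```python
import pandas as pd

# Toy dataframe whose columns are quarterly periods, as after the quarterly conversion
hdf = pd.DataFrame(
    [[200.0, 195.0], [150.0, 125.0]],
    index=pd.MultiIndex.from_tuples(
        [("Michigan", "Ann Arbor"), ("Michigan", "Detroit")],
        names=["State", "RegionName"],
    ),
    columns=pd.PeriodIndex(["2008Q3", "2009Q4"], freq="Q"),
)

# Turn the Period labels into plain strings such as '2008Q3'
hdf.columns = hdf.columns.astype(str)

# With string labels, a non-date column name is no longer a problem
hdf["ratio"] = hdf["2008Q3"] / hdf["2009Q4"]
print(hdf)
```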
Then split the dataframe into university towns and non-university towns, take the ratio for each, and dropna:
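A sketch of the split, assuming the university towns are stored as (State, RegionName) tuples that match the dataframe's multi-index. The towns and ratios are invented.

```python
import pandas as pd

# Toy ratios (invented); one town has a missing value
hdf = pd.DataFrame(
    {"ratio": [1.02, 1.30, float("nan"), 1.15]},
    index=pd.MultiIndex.from_tuples(
        [("Michigan", "Ann Arbor"), ("Michigan", "Detroit"),
         ("Alabama", "Auburn"), ("Alabama", "Mobile")],
        names=["State", "RegionName"],
    ),
)
university_towns = [("Michigan", "Ann Arbor"), ("Alabama", "Auburn")]

# Boolean mask: True where the (State, RegionName) pair is a university town
is_university = hdf.index.isin(university_towns)

uni_ratio = hdf.loc[is_university, "ratio"].dropna()
non_uni_ratio = hdf.loc[~is_university, "ratio"].dropna()
print(uni_ratio)
print(non_uni_ratio)
```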
The last bit is to run the t-test. Skip the next two paragraphs if you’re familiar with t-tests.
So what exactly does a t-test do? An independent t-test checks whether the means of two independent groups (or two different conditions) differ by more than chance alone would explain.
For example, let’s say we want to compare the mean weight of two different groups of peaches, with fertilizer and without. We know there’s going to be a difference in the average weight of the two groups, but is the difference in the average weight ENOUGH/sufficiently large to say that these peaches were drawn from different populations?
A common significance threshold is p < 0.05; that is, IF there were no difference between the means of the two groups, the probability of obtaining a difference at least as large as the one in this sample would be less than 0.05.
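To make the peach example concrete, here is a small sketch using scipy's `ttest_ind`. The weights are made up purely for illustration.

```python
from scipy.stats import ttest_ind

# Invented peach weights in grams
fertilized = [152.0, 160.0, 158.0, 165.0, 170.0, 162.0]
unfertilized = [140.0, 145.0, 150.0, 138.0, 148.0, 142.0]

result = ttest_ind(fertilized, unfertilized)
t_stat, p_value = result.statistic, result.pvalue

# Compare the p-value against the 0.05 threshold
significant = p_value < 0.05
print(t_stat, p_value, significant)
```

With groups this far apart, the p-value comes out well under 0.05, so we would say the two sets of peaches differ in mean weight.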
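Running the t-test on the two groups of housing-price ratios returns True (the difference is significant), the p-value, and university town (the group that was less affected).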
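A sketch of that final comparison on invented ratios. The helper name `compare_groups` is hypothetical, and it assumes the ratio is start price divided by end price, so a lower mean ratio means a smaller price drop.

```python
from scipy.stats import ttest_ind


def compare_groups(uni_ratio, non_uni_ratio, alpha=0.05):
    """Return (different, p_value, better) for two groups of price ratios."""
    p_value = float(ttest_ind(uni_ratio, non_uni_ratio).pvalue)
    different = p_value < alpha
    uni_mean = sum(uni_ratio) / len(uni_ratio)
    non_uni_mean = sum(non_uni_ratio) / len(non_uni_ratio)
    # Lower start/end ratio = prices held up better through the recession
    better = "university town" if uni_mean < non_uni_mean else "non-university town"
    return different, p_value, better


# Invented ratios, not the real results
uni = [1.01, 1.03, 1.02, 1.04, 1.00, 1.02]
non_uni = [1.10, 1.15, 1.08, 1.12, 1.20, 1.11]

different, p_value, better = compare_groups(uni, non_uni)
print(different, p_value, better)
```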
Based on the t-test, we can reject the null hypothesis that there is no difference between the two groups; i.e., housing prices in university towns are less impacted by the recession than housing prices in non-university towns.
I hope you found this post useful!