Edward Rowland
Big Data - Methodology
Jessica Lawrence, Nathan Davis, Anthony Fitzroy and Ben Vince
Data Engineering - DaaS
Duncan Elliot
Time series analysis - Methodology
Office for National Statistics, UK
# source: https://stackoverflow.com/questions/27934885/how-to-hide-code-from-cells-in-ipython-notebook-visualized-with-nbviewer
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
#source: https://stackoverflow.com/questions/9031783/hide-all-warnings-in-ipython
from IPython.display import HTML
HTML('''<script>
code_show_err=false;
function code_toggle_err() {
if (code_show_err){
$('div.output_stderr').hide();
} else {
$('div.output_stderr').show();
}
code_show_err = !code_show_err
}
$( document ).ready(code_toggle_err);
</script>
<form action="javascript:code_toggle_err()"><input type="submit" value="Click here to toggle on/off warnings."></form>''')
Other things I may not have time to talk about but are in these slides
1. More localised estimates
Identify areas of low or negative growth that would be lost if looking at the national figures
2. Potential early indicator
A recession is defined as having two or more quarters of negative GDP growth.
The UK won't officially know if it is in recession until around 6-7 months after it started.
3. Identify impacts of specific events
With high-frequency data, we could identify the impacts of events (weather, terrorist attacks, Brexit etc.)
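The recession definition in point 2 can be written as a simple rule over a quarterly growth series. The sketch below is illustrative only: the function name `recession_quarters` and the growth figures are made up for the example and are not ONS data.

```python
def recession_quarters(growth):
    """Return indices of quarters that sit in a run of two or more
    consecutive quarters of negative growth."""
    flagged = set()
    for i in range(1, len(growth)):
        # A quarter counts once it and its predecessor are both negative.
        if growth[i] < 0 and growth[i - 1] < 0:
            flagged.update((i - 1, i))
    return sorted(flagged)


# Hypothetical quarter-on-quarter GDP growth figures (%), for illustration only.
example_growth = [0.5, 0.3, -0.2, 0.4, -0.6, -1.7, -0.3, 0.1]
in_recession = recession_quarters(example_growth)
```

Note that the isolated negative quarter at index 2 is not flagged, and that the first flagged quarter can only be identified once the second negative figure is available, which is part of why official confirmation arrives late.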
Research using data from an extensive road sensor network in the Netherlands shows correlations between traffic flow and a number of economic measures at a lag of 3 months
Measure | Total Traffic | Cat 1 (< 5.6m) Traffic | Cat 2 (5.6m to 12.2m) Traffic | Cat 3 (> 12.2m) Traffic |
---|---|---|---|---|
Inflation | -0.42 | -0.43 | -0.19 | -0.43 |
Unemployment | -0.47 | -0.41 | -0.55 | -0.22 |
Income | 0.74 | 0.74 | 0.45 | 0.65 |
GDP | 0.54 | 0.63 | -0.01 | 0.70 |
Killan, Ros, "Road Traffic Correlations with Economic Variables: The Big Data Perspective", 2017, https://pdfs.semanticscholar.org/1f1d/b563d229bdd4fd8c90ad8dd6c5cd3487f76b.pdf
Črt Grahonja, 2018, Use of alternative data sources as flash estimates of economic indicators, European Conference on Quality in Official Statistics, June 2018: https://www.q2018.pl/papers-presentations/?drawer=Sessions*Session%2022
Henri Luomaranta, 2018, Nowcasting Finnish Real Economic Activity: a Machine Learning Approach, European Conference on Quality in Official Statistics, June 2018: https://www.q2018.pl/papers-presentations/?drawer=Sessions*Session%2022
Highways England provides an open dataset containing traffic flow counts at 15-minute intervals for motorways and major A-roads across England.
The Data Engineering team in DaaS have scraped and ingested this data.
A really exciting high-frequency dataset, ideally suited for this work
Make the data manageable
A range of different approaches were used including...
Ridge regression looked to be the "best" model of these
from IPython.display import IFrame
IFrame("figures/full_ts_ridge_model_small.html",
       width=1000,
       height=600)
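As a rough illustration of the ridge approach mentioned above, here is a minimal closed-form ridge regression in NumPy. It is a sketch under assumptions, not the code behind the figure: the function `fit_ridge`, the penalty values, and the synthetic features and target are all hypothetical stand-ins for traffic features and GDP growth.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centring removes the intercept
    n_features = X.shape[1]
    # Solve (Xc'Xc + alpha * I) coef = Xc'yc
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


# Synthetic data: 40 "quarters", 3 made-up traffic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
true_coef = np.array([0.5, -0.2, 0.1])
y = X @ true_coef + 0.3 + rng.normal(scale=0.05, size=40)

coef_ols, _ = fit_ridge(X, y, alpha=0.0)        # no penalty: ordinary least squares
coef_ridge, intercept_ridge = fit_ridge(X, y, alpha=10.0)
```

The penalty `alpha` shrinks the coefficients toward zero, which helps when there are few quarterly observations relative to the number of traffic features.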
The approaches used here are similar to the work carried out in Finland and Slovenia.
Data time period used by...
... only one of these includes a global financial crisis. Let us see what happens when we try this with data from 2010-2014.
IFrame("figures/no_rec_ridge_model_small.html",
width=800,
height=600)
Metric | Value | Fitted to | Tested on | Features |
---|---|---|---|---|
MSE | 0.353 | 2006-2014 | 2006-2014 | 7 day moving average |
R squared | 0.354 | 2006-2014 | 2006-2014 | 7 day moving average |
Metric | Value | Fitted to | Tested on | Features |
---|---|---|---|---|
MSE | 0.137 | 2006-2014 | 2010-2014 | 7 day moving average |
R squared | -0.126 | 2006-2014 | 2010-2014 | 7 day moving average |
Metric | Value | Fitted to | Tested on | Features |
---|---|---|---|---|
MSE | 0.024 | 2010-2014 | 2010-2014 | Trend and seasonal components |
R squared | 0.796 | 2010-2014 | 2010-2014 | Trend and seasonal components |
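For reference, the two metrics in these tables can be computed as below, assuming the usual definitions (mean squared error, and R squared as one minus the ratio of residual to total sum of squares). Under that definition R squared goes negative whenever the predictions are worse than simply predicting the mean of the test period, which is one way to read the second table. The numbers in the example are invented.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Invented test period with little variation in the target.
y_true = np.array([0.6, 0.5, 0.7, 0.6])
y_pred = np.array([0.3, 0.8, 0.4, 0.9])

test_mse = mse(y_true, y_pred)
test_r2 = r_squared(y_true, y_pred)
```

A low MSE and a negative R squared can coexist: when the target barely moves over the test period, even small errors exceed the variation there is to explain.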
If nowcasts/flash estimates from these data sources are used, it is possible they will not be able to anticipate drastic changes, like a recession.
This means your model may fail just when policymakers and businesses need it the most.
In turn, this could introduce complacency, which is worse than if your estimates didn't exist.
Try some other economic indicators?
Well, the data we have means we may be able to solve this problem.
We have only scratched the surface; there may be a recession "signature" in the data. With careful decomposition and analysis we may be able to find it, such as...
Spectral analysis from fitting an AR(35) model to the data
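One way to obtain a spectrum from an AR model is sketched below: fit the AR coefficients by least squares, then evaluate the AR spectral density on a grid of frequencies. This is an assumption about the general technique rather than the code used for the slides. The helper names `fit_ar` and `ar_spectrum` are hypothetical, and the input is a synthetic daily series with a weekly cycle, not the traffic data.

```python
import numpy as np


def fit_ar(x, order):
    """Least-squares AR(order) fit: returns coefficients and residual variance."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Column k holds the series lagged by k + 1 steps.
    lags = np.column_stack([x[order - k - 1 : n - k - 1] for k in range(order)])
    target = x[order:]
    phi, *_ = np.linalg.lstsq(lags, target, rcond=None)
    resid = target - lags @ phi
    return phi, float(resid.var())


def ar_spectrum(phi, sigma2, freqs):
    """AR spectral density at frequencies given in cycles per sample."""
    k = np.arange(1, len(phi) + 1)
    transfer = 1.0 - np.exp(-2j * np.pi * np.outer(freqs, k)) @ phi
    return sigma2 / np.abs(transfer) ** 2


# Synthetic daily series: a 7-day cycle plus noise.
rng = np.random.default_rng(1)
t = np.arange(2000)
series = np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.5, size=t.size)

phi, sigma2 = fit_ar(series, order=35)
freqs = np.linspace(0.01, 0.5, 491)  # cycles per day
spectrum = ar_spectrum(phi, sigma2, freqs)
peak_freq = freqs[np.argmax(spectrum)]  # expected near 1/7 for this input
```

Peaks in the spectrum point to periodic structure (for example weekly cycles) that could then be modelled or removed before looking for anything recession-related.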
easter_df
Easter | -2 | -1 | 0 | 1 | 2 |
---|---|---|---|---|---|
16/04/06 | -4.97 | 1.97 | 0.58 | -1.83 | 2.18 |
08/04/07 | -4.82 | 2.16 | 0.63 | -1.70 | 2.74 |
23/03/08 | -5.51 | 1.33 | -0.44 | -1.27 | 2.99 |
12/04/09 | -5.37 | 1.96 | 0.18 | -1.46 | 2.44 |
04/04/10 | -4.99 | 1.96 | 0.34 | -2.16 | 2.55 |
24/04/11 | -5.38 | 0.63 | 1.43 | -0.54 | 1.73 |
08/04/12 | -4.99 | 2.05 | 0.11 | -2.39 | 2.84 |
31/03/13 | -4.73 | 2.03 | -0.39 | -2.59 | 2.21 |
20/04/14 | -4.56 | 1.90 | 0.08 | -1.37 | 2.63 |
Including the effects of rain and temperature
days_df.sort_values("Coefficient")
Variable | Coefficient | Standard Error |
---|---|---|
Additional Holiday for Wedding of Kate and William | -136.10 | 10.07 |
Christmas Day | -113.26 | 5.03 |
New Year Bank Holiday | -108.68 | 3.67 |
Good Friday | -103.16 | 3.36 |
Late May Bank Holiday | -91.11 | 3.36 |
Additional Holiday for Queen's Diamond Jubilee | -89.29 | 10.07 |
Christmas Bank Holiday | -84.30 | 5.05 |
Early May Bank Holiday | -80.01 | 3.36 |
August Bank Holiday (England and Wales) | -77.27 | 3.35 |
Boxing Day Bank Holiday | -70.55 | 3.50 |
Easter Monday | -67.69 | 3.36 |
Christmas Eve | -35.32 | 3.50 |
Deviation from mean daily rainfall | -0.40 | 0.06 |
Deviation from mean daily temperature | 0.45 | 0.13 |
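A table like the one above, with a coefficient and standard error per holiday and weather term, can be produced by regressing daily traffic on indicator variables and weather deviations. The sketch below uses plain ordinary least squares on synthetic data. The model behind the slide may differ (for example it may include time series error terms), and every name and number here is hypothetical.

```python
import numpy as np


def ols_with_se(X, y):
    """Ordinary least squares: coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof              # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)     # coefficient covariance
    return beta, np.sqrt(np.diag(cov))


rng = np.random.default_rng(0)
n_days = 365

holiday = np.zeros(n_days)
holiday[[0, 100, 200, 358]] = 1.0               # made-up holiday dates
rain_dev = rng.normal(scale=3.0, size=n_days)   # deviation from mean rainfall
temp_dev = rng.normal(scale=5.0, size=n_days)   # deviation from mean temperature

# Synthetic traffic flow with known effects plus noise.
flow = (500.0 - 100.0 * holiday - 0.4 * rain_dev + 0.45 * temp_dev
        + rng.normal(scale=5.0, size=n_days))

X = np.column_stack([np.ones(n_days), holiday, rain_dev, temp_dev])
beta, se = ols_with_se(X, flow)
# beta is ordered as: intercept, holiday, rain deviation, temperature deviation
```

One-off holidays, which occur only once in the sample, naturally get larger standard errors than holidays observed every year.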
Even with all of the above, there is still a lot of unexplained structure in the time series.
See these slides again!
GitHub Pages: https://onsbigdata.github.io/RSS-2018/
GitHub repo: https://github.com/ONSBigData/RSS-2018/
Email: edward.rowland@ons.gov.uk
Annual figures are taken to match with traffic flow
Pearson's correlations of AADF traffic and economic measures lagged by one year
Vehicle type | CPI | Change in unemployment (% pts) | Change in weekly earnings (£) | GDP Growth |
---|---|---|---|---|
All HGVs | 0.08 | -0.15 | 0.95 | 0.51 |
All Motor Vehicles | 0.09 | -0.50 | 0.83 | 0.76 |
Buses and Coaches | 0.11 | -0.56 | 0.72 | 0.76 |
Cars and Taxis | 0.10 | -0.46 | 0.84 | 0.77 |
LGVs | -0.05 | -0.67 | 0.18 | 0.44 |
Motorbikes and Scooters | 0.07 | 0.12 | 0.83 | 0.26 |
Pedal Cycles | 0.04 | -0.57 | -0.18 | 0.37 |
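The lagged correlations in the AADF table can be sketched with pandas as follows. The direction of the lag is an assumption here (traffic in year t paired with the indicator in year t + 1). The function `lagged_correlation` and the annual figures are invented for illustration.

```python
import pandas as pd


def lagged_correlation(traffic, indicator, lag=1):
    """Pearson correlation of traffic with the indicator `lag` periods later."""
    # shift(-lag) moves the indicator value from year t + lag onto year t;
    # Series.corr ignores the resulting missing pairs.
    return traffic.corr(indicator.shift(-lag))


years = range(2006, 2014)
# Invented annual series: the indicator follows traffic with a one-year delay.
traffic = pd.Series([3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0, 6.0], index=years)
indicator = pd.Series([0.0, 7.0, 3.0, 9.0, 4.0, 11.0, 19.0, 5.0], index=years)

r_lagged = lagged_correlation(traffic, indicator, lag=1)
r_same_year = lagged_correlation(traffic, indicator, lag=0)
```

With only a handful of annual observations, correlations like these are very noisy, so they are best read as a reason to look further rather than as evidence of a stable relationship.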
Measure | Total Traffic | Cat 1 (< 5.6m) Traffic | Cat 2 (5.6m to 12.2m) Traffic | Cat 3 (> 12.2m) Traffic |
---|---|---|---|---|
Inflation | -0.42 | -0.43 | -0.19 | -0.43 |
Unemployment | -0.47 | -0.41 | -0.55 | -0.22 |
Income | 0.74 | 0.74 | 0.45 | 0.65 |
GDP | 0.54 | 0.63 | -0.01 | 0.70 |
IFrame("figures/actual_vs_AR_predictions.html",
width=800,
height=600
)
Daily traffic flow: mean 15-minute traffic counts, averaged across the previous 91 days
IFrame("figures/quarterly_smoothed_daily_time_series_multi_metrics_small.html",
width=800,
height=600
)
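The quarterly smoothing described for this figure can be sketched with pandas: take the mean 15-minute count per day, then average over a trailing 91-day window. The counts below are synthetic, and treating the window as ending on the current day is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute counts for 120 days (96 intervals per day).
rng = np.random.default_rng(0)
index = pd.date_range("2014-01-01", periods=96 * 120, freq="15min")
counts = pd.Series(rng.poisson(lam=200, size=len(index)), index=index)

# Mean 15-minute count for each day.
daily_mean = counts.resample("D").mean()

# Trailing 91-day average; the first 90 days have no full window and are NaN.
smoothed = daily_mean.rolling(window=91).mean()
```

A 91-day window is roughly one quarter, which puts the smoothed traffic series on a comparable timescale to quarterly GDP.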
IFrame("figures/trends_traffic_flow_decomposition.html",
width=800,
height=400
)
IFrame("figures/seasons_trends_traffic_flow_decomposition.html",
width=800,
height=400
)
IFrame("figures/resids_trends_traffic_flow_decomposition.html",
width=800,
height=400
)