Key Business Factors in Machine Learning: Part 2- Testing 3 Machine Learning Techniques to Predict Divvy Bike Ridership

Authored by: 
Annie Liu, Consultant, TruQua Enterprises
JS Irick, Lead Developer and Principal Consultant, TruQua Enterprises
Daniel Settanni, Senior Cloud Architect, TruQua Enterprises
Geoffrey Tang, Consultant, TruQua Enterprises

In part one of “Key Business Factors in Machine Learning” (https://www.truqua.com/key-business-factors-machine-learning-part-1-predicting-employee-turnover/), we explored how Machine Learning can categorize data.  We also reviewed the business’s role in model development.  In this blog, we will look at creating Machine Learning algorithms to predict values.  In particular, we will be looking at Sales Demand for Bicycle rentals.  Divvy Bikes is Chicago’s own bike sharing system and, with over 10 million unique rides taken a year, the largest in North America.

The dataset used in this article combines all of Divvy Bikes' 10+ million rides from 2015 with hourly weather data and the Chicago Cubs schedule to observe the effect of external factors on rider traffic. (Adapting Machine Learning models to discrete locations is covered in a separate presentation.)

In this example, we are going to test three popular Machine Learning techniques – Logistic Regression, Support Vector Machines, and Random Forests – to predict the number of bikes that will be in service at a given Divvy station for a given hour.

Refining the dataset

The Divvy Bike/Weather/Cubs dataset in this article is much more complicated than the Employee Attrition dataset in part 1, featuring over 55 different factors.

Key Business Factors in Machine Learning

Two of the factors in the model can be generalized into groups to help train the model more efficiently.  These factors express time as integers – Day of the Year and Day of the Week.  When expressed as plain integers, their meaning is obscured from the model.

Certain algorithmic techniques can work around this obfuscation, but it can be much more efficient to perform an initial grouping to accelerate the model development.

Day of the week and day of the year have obvious groupings; however, your business data may have groupings that are not immediately obvious to the data scientists.

Here we see the impact of creating a “Season” category for day of the year:

Key Business Factors in Machine Learning

Similarly, we can see that there is a large effect on demand based on the day of the week.  Weekdays have a huge 5PM spike that is not seen on weekends.  Therefore, we can greatly increase the initial accuracy of our models by changing day of the week into a Weekday/Weekend grouping.

Key Business Factors in Machine Learning
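To make the grouping concrete, here is a minimal sketch in R of how the two time factors could be recoded, assuming a data frame called rides with integer columns day_of_year and day_of_week (the column names, season break points, and weekday numbering are illustrative, not the actual dataset's):

```r
# Map day of the year to a coarse season bucket (break points are illustrative)
season_of <- function(doy) {
  ifelse(doy <= 59 | doy >= 335, "Winter",
  ifelse(doy <= 151, "Spring",
  ifelse(doy <= 243, "Summer", "Fall")))
}
rides$season <- factor(season_of(rides$day_of_year))

# Collapse day of the week into a Weekday/Weekend grouping
# (assumes 1 = Monday ... 7 = Sunday)
rides$day_type <- factor(ifelse(rides$day_of_week %in% c(6, 7),
                                "Weekend", "Weekday"))
```

Both new columns are plain factors, so any of the algorithms discussed below can consume them directly.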

Investigating the Data

Key Business Factors in Machine Learning

Removing outliers can be a critical step in increasing model fit.  However, it is important to define just what an outlier is in the context of your business process.  For example, removing outliers from our bike ride dataset can help refine our model: the Fourth of July causes a demand spike that is obvious to anyone familiar with the US, while the demand spike for the "Air and Water Show" would not be apparent to non-Chicagoans.  We don't want our model to try to fit a factor that is not captured in our dataset.

Key Business Factors in Machine Learning
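One hedged sketch of what business-aware outlier handling could look like in R: rather than dropping statistical outliers blindly, remove (or flag) the hours that fall on known event days the model is not meant to capture. The dates and column names below are purely illustrative:

```r
# Known event days the model should not try to fit (dates are examples only)
event_days <- as.Date(c("2015-07-04",    # Fourth of July
                        "2015-08-15"))   # Air and Water Show weekend (illustrative)

rides$is_event_day <- as.Date(rides$timestamp) %in% event_days

# Either exclude those hours from training...
rides_train <- rides[!rides$is_event_day, ]
# ...or keep them and let the flag become an explicit factor in the model
```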

However, compare this to fraud detection algorithms, which exist only to detect outliers.  If we were to remove the outliers from that model, we would end up with a model that is 100% accurate, since all the fraudulent transactions would have been removed from the dataset.

Modeling with the Dataset

Once the dataset has been prepared, it is time to develop, train and test with different Machine Learning algorithms.

In this example, we looked at three very different algorithms – Logistic Regression, Support Vector Machines, and Random Forests.

Logistic Regression

Logistic Regression is a statistical method for analyzing a dataset where there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (MEDCALC).
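A minimal glm() sketch in R, for orientation only. Because logistic regression predicts a dichotomous outcome, one illustrative way to frame it for this dataset is a binary "high-demand hour" flag; the column names and the 50-ride threshold are assumptions, not values from the actual model:

```r
# Derive a dichotomous outcome from the hourly ride counts (threshold is illustrative)
train$high_demand <- as.integer(train$rides_per_hour > 50)

# Fit a logistic regression on the grouped factors
logit_model <- glm(high_demand ~ season + day_type + temperature + raining,
                   data = train, family = binomial)

# Predicted probability that a future hour is a high-demand hour
prob_high <- predict(logit_model, newdata = test, type = "response")
```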

Support Vector Machines

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other (Wikipedia).
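A comparable sketch using the e1071 package, one common R interface to libsvm. Here the SVM is used in its regression form to predict the hourly ride count directly; the columns mirror the illustrative names above:

```r
library(e1071)

# Support vector regression on the same illustrative predictors
svm_model <- svm(rides_per_hour ~ season + day_type + temperature + raining,
                 data = train, type = "eps-regression", kernel = "radial")

svm_pred <- predict(svm_model, newdata = test)
```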

Random Forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set (Wikipedia).
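And the equivalent sketch with the randomForest package; the variable-importance output is a useful side benefit when discussing the model with the business:

```r
library(randomForest)

rf_model <- randomForest(rides_per_hour ~ season + day_type + temperature + raining,
                         data = train, ntree = 500)

rf_pred <- predict(rf_model, newdata = test)

# Which factors drive demand, according to the forest
importance(rf_model)
```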

Modeling Results

Key Business Factors in Machine Learning
As can be seen, Logistic Regression was the clear winner for this scenario, but the optimal model typically isn’t obvious at the start, and may even be a surprise at the end. Further accuracy gains required separating out the data by departure station, as there are significant model differences between each station (for example, downtown station usage is very resistant to changes in weather and is almost entirely dependent on the weekday/weekend grouping).
For more information on how you can put Machine Learning to work at your own organization, contact us today at info@truqua.com. Our team of consultants and data scientists is on hand and ready to assist.  For companies with robust data science organizations, we offer several project accelerators to easily and securely combine your business data with your Data Scientists' Machine Learning algorithms.

Key Business Factors in Machine Learning: Part 1- Predicting Employee Turnover

Authored by:
Annie Liu, Consultant, TruQua Enterprises
JS Irick, Lead Developer and Principal Consultant, TruQua Enterprises
Daniel Settanni, Senior Cloud Architect, TruQua Enterprises

As industry leaders ramp up their investments in Machine Learning, there is a growing need to communicate effectively with Data Scientists. Without a true understanding of both the technology and business factors involved in the Machine Learning scenario, it is impossible to create long term solutions.

In Part 1 of this 2-part blog series, we will work through the first of two Machine Learning examples and describe the communication and collaboration necessary to successfully leverage Machine Learning for business scenarios.

Machine Learning algorithms are very good at predicting outcomes for many different types of scenarios by analyzing existing data and learning how it relates to the known outcomes (what you’re trying to predict).  Two of the most common types of machine learning algorithms are classification and regression.

With classification, the predicted values are fixed, meaning there are a limited number of outcomes, such as determining if a customer will make a purchase or not.  Regressions on the other hand, make continuous numerical predictions, such as determining the lifetime value of a customer. In each case, it is critical that the Data Scientist understands both the inputs (the source of the individual factors and how they are created) and the business event you are trying to categorize or predict.

Next-Gen Technologies Investment

Categorizing Example: Employee Turnover

 

Understanding Machine Learning and business goals

First, let’s look at an example that demonstrates how to use Machine Learning to perform categorization. In this case, we are trying to better predict Employee Turnover. So, the goal of the machine learning algorithm is to categorize current employees as “Likely to Leave” or “Unlikely to Leave”. The categorization will be based on factors we have about each employee.

However, our goal is slightly different. Our business requirement is to identify the employees likely to leave so that actions can be taken to retain the employees. Before we continue, it is important to understand the cost of both a false positive and a false negative with regards to your business.

False Positive: An employee that is not going to leave is flagged as likely to leave.

False Negative: An employee leaves despite no indication from the machine learning algorithm.

In this case, False Negatives are costlier than False Positives. The algorithm with the best fit (overall performance) may not be the most effective for your business if it does not appropriately weigh the cost of the outcomes.
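One simple way to make that weighting explicit is to score candidate models on a business cost rather than raw accuracy. The sketch below assumes 0/1 vectors actual and predicted from a fitted model, and the relative costs are purely illustrative:

```r
cost_fp <- 1   # cost of flagging an employee who would have stayed
cost_fn <- 5   # cost of missing an employee who actually leaves (assumed higher)

fp <- sum(predicted == 1 & actual == 0)   # false positives
fn <- sum(predicted == 0 & actual == 1)   # false negatives

total_cost <- cost_fp * fp + cost_fn * fn
total_cost   # compare this across candidate models, not just their accuracy
```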

Business Requirement 1
Communicating the available data with the Data Scientists

Machine learning algorithms need to be developed and trained on historical data, so for each historical employee we have features that we believe are related to whether an employee stays or leaves, as well as whether they remain at the company.

When undertaking a Machine Learning project, it's critical to work with a partner who will take the time to understand the various features that can be used within the model. If the data scientist does not understand the inputs into the model, you are likely to end up with models that perform well in testing but poorly in production. This is called "overfitting".

This communication with the Data Scientist can also lead to the inclusion of additional valuable external data that were initially missing from the model.

Business Requirement 2

Let’s look at the factors in the Employee Turnover dataset.

SAP Analytics

There are three important items to note here:

1. Satisfaction level is self-reported and people are notoriously poor self-reporters.
2. The job role column is labeled “sales” in the input dataset. While descriptive column names are nice, they are no replacement for a good data dictionary.
3. Salary is a simple “High/Medium/Low” value, but is not normalized for job role.

Refining the dataset

Once we have reviewed the factors, as well as the business event we are trying to model, we need to better understand how they relate to each other. An analysis should occur on the relationships between factors and results, as well as between individual factors. Here we see a chart describing the correlations between our various factors, and whether the employee stayed with the company.

Employee Satisfaction Level

When looking at the relationships, we start to understand the correlations between our data. This step should reveal a number of data relationships which make intuitive sense, and may show some surprising results.

1. Number of current projects and number of hours worked are related. [Intuitive]
2. Employees with a longer tenure are less likely to leave. [Intuitive]
3. There is a slight negative relationship between satisfaction and retention. [Surprising]

When looking at the relationships between data, we can also find highly correlated associations. This can help determine factors to either combine or remove.
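In R, this review can start with a simple correlation matrix over the numeric factors; the data frame name hr and the 0.8 cut-off below are illustrative:

```r
num_cols    <- sapply(hr, is.numeric)
corr_matrix <- cor(hr[, num_cols], use = "pairwise.complete.obs")

# Flag strongly correlated factor pairs as candidates to combine or remove
which(abs(corr_matrix) > 0.8 & abs(corr_matrix) < 1, arr.ind = TRUE)
```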

Additionally, it is necessary to look at the numerical data to determine if we should change certain values to ranges/buckets. For example, look at the relationship between monthly hours and employee retention.

Predicting Employee Turnover

Note the monthly hours for employees that were not retained. This should make intuitive sense, as the only thing worse than working too much is working too little. Rather than use monthly hours as a raw value, our model would be better served by defining categories for monthly hours, as sketched below.
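A hedged sketch of that bucketing step in R; the break points and column names are illustrative rather than tuned values:

```r
hr$hours_band <- cut(hr$monthly_hours,
                     breaks = c(0, 150, 230, Inf),
                     labels = c("Under-utilized", "Normal", "Over-worked"))

# Cross-tab the new bands against retention to sanity-check the grouping
table(hr$hours_band, hr$left_company)
```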

 

Predicting Employee Turnover
Model Development

Once the data set has been analyzed, model development can begin. This is generally an iterative process, going through a number of different model types, as well as re-examining the initial data set.

While this iterative process is being performed, it is important to look at the output of the models, not just their fit. This is where the definition of your business goal, as well as communication with an experienced Data Scientist, is critical. For example, a fraud detection algorithm that never detects fraud is over 99% accurate. Fit is not enough.

Predicting Employee Turnover

For our employee retention example, we tested three popular machine learning algorithms. Below you can see the fit of each of the three models and, more importantly, the output for a subset of the testing data.

Predicting Employee Turnover

We have taken an abbreviated look at how a data scientist might approach this scenario, but in the real world this is only a part of the solution. There are still questions surrounding how the model is served, how it is consumed within the business process and how a strategy is devised in order to retrain the model with updated data.

If you have questions, we have the answers. TruQua’s team of consultants and data scientists merge theory and practice to help customers gain deeper insights into their data for more informed decision making. For more information or to schedule a complimentary workshop that identifies what Machine Learning scenarios make sense for your business, contact us today at info@truqua.com.

Machine Learning… Demystified

By Daniel Settanni, Senior Cloud Development Architect, TruQua Enterprises

Artificial Intelligence (AI), Machine Learning (ML), Predictive Analytics, Blockchain – with so many different buzzwords, it can be a challenge to understand how they are applicable to your business. Here’s a short primer to help customers make sense of Machine Learning in the Enterprise.

What is Machine Learning?
There are two definitions that seem to be the most commonly referenced when discussing the meaning of Machine Learning. Arthur Samuel coined the phrase Machine Learning in 1959 as a “Field of study that gives computers the ability to learn without being explicitly programmed.”

In 1998, Tom Mitchell added some clarity by stating: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”  Or in short – Machine Learning means that a computer’s performance improves with experience.

While both definitions are accurate, neither provides absolute clarity, so let’s explore the concept of Machine Learning through an example.

An Example of Machine Learning
A large credit organization is struggling to detect fraudulent transactions. They have years' worth of historical transactions, including those that have been identified as fraudulent. Their goal is to detect suspicious transactions in real time. Putting this example into Tom Mitchell's formula gives the following:

E = the experience of analyzing historical transactions
T = the task of reviewing transactions
P = the performance of the program in identifying fraudulent transactions

This is an example of supervised learning, which is another way of saying that the algorithm will be taught by the historical data. This is possible because the data is labeled (i.e., historical fraudulent transactions are known).

When the data isn’t labeled, unsupervised learning is required. An example of unsupervised learning would be market segment analysis. Here, Machine Learning is learning from the data and making its own connections and insights, as opposed to being taught from past outcomes.

The response, or output of Machine Learning can be described as regression or classification.

With regression, the response is a set of continuous values – think of a curve that predicts home prices based on size. A home price could be found for a home of any size.

Classification identifies group membership. For example, Machine Learning could classify images based on their content (i.e., this image contains a car, this one contains a rooster, etc.). Our fraudulent transaction example above is an example of classification.  The Machine Learning algorithm is classifying transactions as either fraudulent, or non-fraudulent.

Machine Learning Implementation Process
Now that you have a high-level overview of what Machine Learning can do, you might be wondering what it takes to implement. The basic process includes the following steps:

1. Understanding the problem that needs to be solved
2. Analyzing and preparing the data
3. Identifying potential algorithm(s)
4. Training, testing and tweaking several Machine Learning models
5. Integrating Machine Learning with existing systems and processes

Another key question you’ll need to ask is, who can do all of this? In some cases, a software vendor can deliver Machine Learning capabilities out-of-the-box. This works best when a problem is well defined and common within a specific business or industry process.

SAP's Cash Management application, for example, is a solution that can harness the full benefits of machine learning.

But what if an out-of-the-box solution doesn't exist? This is where you'll need to go a step further and employ the skills of a data scientist – an area where TruQua can help.

Conclusion
A key thing to keep in mind with Machine Learning is that, as with most data analysis projects, if your data is inaccurate or full of discrepancies, you won't achieve a positive end result. As with any project, it's critical to pick the right partner and solution provider.

How TruQua can help
TruQua’s team of consultants and data scientists merge theory and practice to help customers gain deeper insights and better visibility into their data for more informed decision-making, utilizing the latest predictive analytic and Machine Learning capabilities from SAP. Contact us today and learn how TruQua can help:

  • Improve business processes
  • Enhance decision making
  • Direct, optimize, and automate decisions to meet defined business goals

Leverage Predictive Analytics to Improve Expense Planning in SAP BPC

JS Irick, TruQua Enterprises
Eric Weine, TruQua Enterprises


This blog post is built on content from JS Irick’s recent session at Financials 2017 in Amsterdam.

As businesses grow, so do their expenses, making it more important than ever to create a balanced budget. Using TruQua's simple scenario of Rosie's Lemonade (referenced in previous posts: https://www.truqua.com/tips-planning-sap-bpc-part-1-bring-external-pricing-data-sap-bpc-better-sales-planning-using-sap-cloud-platform/), we'll walk through an example that illustrates how Rosie uses linear regression with R, SAP BPC and SAP HANA to better track uniform expenses for her employees.

Rosie wants to use historical data from her expense planning model and forecast data from the headcount model to more accurately forecast her uniform expenses for employees. Using linear regression with R, SAP BPC 10.1 and SAP HANA, her hope is to see that information in a quicker, more streamlined way.

Rosie predicts that the number of employees she hires will also have a linear relationship with the number of uniforms she requires. She also predicts that her sales will have a linear relationship with the number of uniforms she requires, because more sales correlates with more accidents at work.

In order to accurately forecast the expense of her uniforms, Rosie will need to create a predictive model and within this model, Rosie will:

  • Utilize historical data to create her statistical model
  • Use forecast data to populate the model and get an output
  • Leverage BW technologies to read the data model and combine it with the rest of her forecast

So here is a simple report with Rosie’s actuals for the last year including her three Cost of Goods Sold Accounts: Headcount, Sales, and Uniforms.  In addition, you can see the data that has already been forecasted. Rosie will use this data to build the model to predict her spend on uniforms.

 

SAP Predictive Analytics Demo

Now for Rosie to be able to accurately predict the number of uniforms she will need to provide for her employees, she needs to create a model to carry out forecasting. One of the simplest, but most useful models is called linear regression. The purpose of linear regression is to make a prediction about the future, using data from the past. This method makes several assumptions, but the most important and obvious one is that there is a linear relationship between two (or more) variables. If, for example, every two dollars spent by a company on advertising leads to a hundred dollar increase in sales, regardless of the amount of money spent on advertising, then these variables have a linear relationship.

So now, Rosie is going to use the statistical software R to pull in her data from BPC and use that data to build a linear model. R is open-source statistical software with tight integration into SAP HANA, which lets you leverage its statistical models directly from the database.

So first, she’ll call her Historical Data the “SEED” for the model.  Here, we can see the data has been populated. However, Rosie wants these accounts to be columns as opposed to rows. In order to do this, she’ll need to use a reshape library that’s available in R.

SAP Predictive Analytics Demo

Now when Rosie looks at her linear model input, it's pivoted nicely. Next, she'll update her column names.

SAP Predictive Analytics Demo
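For readers following along in R, the pivot and rename could look roughly like this using the reshape2 package; the SEED data frame and its column names are illustrative stand-ins for the BPC extract, and the rename assumes the accounts pivot out in this order:

```r
library(reshape2)

# SEED: one row per ACCOUNT / PERIOD / AMOUNT coming from BPC
lm_input <- dcast(SEED, PERIOD ~ ACCOUNT, value.var = "AMOUNT")

# Friendlier column names for the model step
colnames(lm_input) <- c("Period", "Headcount", "Sales", "Uniforms")
```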

Once the column names have been adjusted, Rosie can see 15 trials (rows) in her input data. She'll then build a model using her "SEED" data that expresses uniform spend as a function of headcount and sales.

SAP Predictive Analytics Demo

So here we can see that for every $1,000 of sales, about $3 is spent on uniforms; in addition, $10 is spent on uniforms per employee, along with a flat startup cost of $31.
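In R terms, the model build and the coefficient read-out are a few lines; the snippet below is a sketch against the illustrative lm_input frame, and the $31 / $10 / $3 figures quoted above come from the demo data, not from anything this snippet would reproduce on its own:

```r
uniform_model <- lm(Uniforms ~ Headcount + Sales, data = lm_input)

coef(uniform_model)
# (Intercept) ~ flat startup cost
# Headcount   ~ uniform spend per additional employee
# Sales       ~ uniform spend per additional dollar of sales
```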


So now, let’s take that data and plot it on a linear model:

SAP Predictive Analytics Demo

Now that the linear model is built, Rosie will need to create some feed data. You'll see she's simply created a record with $50,000 in sales and a headcount of 50. Then, using the predict function, the linear model, and the feed data, Rosie is able to predict $718 in uniform spend.

SAP Predictive Analytics Demo
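The scoring step is equally short: build a one-row feed frame and pass it to predict() along with the fitted model. The values mirror the demo; the frame name is illustrative:

```r
feed <- data.frame(Headcount = 50, Sales = 50000)

predict(uniform_model, newdata = feed)   # the demo arrives at roughly $718
```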

So now, let’s go ahead and look at this from the SAP HANA side and view how this model can be integrated with BPC.

Below is the code that was built in R brought into Rosie’s BPC environment.

SAP Predictive Analytics Demo

Rosie has built a calculation view that does a read of the exact same data that was in BPC. It’s reading in both the FEED data and SEED data and passing it along to the linear model and returning those results.

SAP Predictive Analytics Demo

Now, when a data preview is generated, here are the results Rosie will receive.

SAP Predictive Analytics Demo

So how can Rosie get that into her BPC system? Now that she's built a calculation view, Rosie can build a virtual provider in SAP BW on top of that view. The great thing about a virtual provider is that it can expose your HANA views as InfoProviders, meaning less change management and the ability to integrate this data with your reporting tools.

SAP Predictive Analytics Demo

Here, Rosie can integrate this view with Analysis for Office and see the predicted uniform spend for those 4 months.

SAP Predictive Analytics Demo

Now, if Rosie is expecting to double in size that month, she can go ahead and double her numbers and save that back to BPC. When she calls the HANA view, it will now bring in the updated forecast data and recalculate.

SAP Predictive Analytics Demo

As another example, if Rosie wants to add another month, she'll update her data to 100 employees and her first-ever $100,000 sales month.

SAP Predictive Analytics Demo

Once saved, the output of the view will update and Rosie can pull in the new data through Analysis for Office.

SAP Predictive Analytics Demo

Utilizing these predictive analytics, Rosie is now able to pull in BPC data in real time, create the model, and better predict her uniform spend based on historical data.

To view the full demo of this solution, visit: https://youtu.be/ypXDtovbv4s

Tips for Planning: How to Use Predictive Modeling Capabilities Alongside SAP HANA

Eric Weine, TruQua Enterprises
JS Irick, TruQua Enterprises
This article builds on content presented by JS Irick at SAP Insider Financials, Amsterdam

SAP HANA provides access to excellent modelling tools. For the purposes of this blog we'll be utilizing R, one of the largest open source statistical software environments, which can be accessed in SAP HANA Studio by installing an R server. For more information on how this can be achieved, visit https://help.sap.com/viewer/a78d7f701c3341339fafe4031b64f015/2.0.01/enUS.

R is excellent for modelling data, but has limited visualization capabilities. For data visualization, there are several BI tools that can be utilized such as SAP Lumira, Design Studio, or Tableau. All have native integration capabilities with SAP HANA databases, so visualization can be done quickly and easily.

DIVVY Bikes

To illustrate these predictive modelling capabilities, we'll be using Divvy Bike as an example. Divvy Bike is Chicago's own bike sharing system and is the largest in North America. In 2016, they serviced more than 3.5 million rides with almost 6,000 bikes in Chicago and the surrounding suburbs. Bike sharing programs have existed since the 1960s, but they were not viable on a large scale until the advent of advanced computer systems that allowed bike sharing companies to keep track of their bikes.

One of Divvy Bike's greatest challenges is effectively meeting the demand for bikes around the city. Because bikes are expensive to purchase and maintain, Divvy Bike tries to minimize the number of bikes in service while still meeting most of the demand. This balance can be difficult to maintain because demand is often sporadic, so the supply of bikes in one area can quickly be depleted. To avoid customers being left without bikes, Divvy Bike must transport bikes from areas of low demand to areas of high demand. However, this cannot easily be done without ample preparation ahead of time, because transporting bikes often takes a long time. To aid in this process, Divvy Bike should build a predictive model so that it can identify ahead of time the areas that will require more bikes.

Using R alongside SAP HANA, TruQua has built a model to demonstrate how Divvy Bike can take advantage of weather forecasts and information about events happening around Chicago to make more accurate predictions of where bikes will be needed. Because the variation in bike rentals is so large depending upon where one is in the city, it made the most sense to create a model that is specific to an individual neighborhood.

For the purposes of this example, we have decided to focus on data in a 1.5-mile radius surrounding Wrigley Field, home of the Chicago Cubs (Wrigleyville), to demonstrate how Divvy Bike can use the Cubs' schedule to make demand predictions. To have a fully robust model, we would need to model the change in the number of bikes in an area per hour, so Divvy Bike would know exactly where more bikes need to be serviced. However, we modeled just the number of bikes that are taken from stations in Wrigleyville, because a complete model can easily be made using the same logic.

The model was trained on all Divvy Bike data from 2016. We attempted to predict the total number of Divvy Bikes checked out in Wrigleyville for every hour of every day of the year. To predict this, we used the time of day, the day of the week, the temperature, whether it was raining, and whether a Cubs game was happening within a 6-hour time window.
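As a rough sketch of what that model specification could look like in R (the data frame and column names are stand-ins for the prepared 2016 dataset, and the hour-of-day by weekday/weekend interaction discussed further below is written out explicitly):

```r
wrigley_model <- lm(rentals ~ factor(hour) * is_weekend +
                      temperature + is_raining + cubs_game_within_6h,
                    data = wrigleyville)

summary(wrigley_model)
```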

Yearly Rentals by Zip Code

Weather Effects

Temperature was a very important factor in determining the number of bike rentals. It is apparent in the graph below that temperature and the average number of bike rentals have a relatively linear relationship. Specifically, for every 1 degree increase in temperature, about 2.5 more bikes are rented per hour.

 

Temperature vs Bike Rentals

The rain also has a substantial effect upon the number of individuals who decide to rent bikes. In our dataset, there seemed to be several outliers that were likely due to a measurement error. The weather data came from Oak Street Beach, which is about 4 miles from Wrigley Field. If there were local variations in the weather, then this could have caused some of the strange effects we see. Rain is measured in rain intensity, which assesses how much rain is falling at once. Below is the graph of rain intensity vs. the number of bike rentals.

Rain Intensity vs Bike Rentals

We cannot fairly exclude these outliers, because we don’t know where they come from. However, we can transform the data to just use a variable that determines if it is raining at all to make a prediction. If we look at the average number of bikes rented when it is raining as opposed to when it is not raining, we see a clear trend. About 35 more bikes are rented on average when it is not raining.

Rain Present vs Bike Rentals

Time Effects

The day of the week is another very important factor in determining the number of riders within an hour. As expected, far more people rent bikes on the weekend than on any other day of the week. On an average weekend day, about 15 more bikes are rented than on an average weekday. However, there is some variation among the weekdays themselves. Specifically, Wednesday and Thursday see substantial drops in the number of bikes that are rented.

Histogram of Bike Rentals

The hour of the day also has a strong influence upon the number of bikes that are rented per hour. However, this effect is a bit more difficult to model because the effect is highly non-linear. The overall distribution of average bike rentals is shown below.

Average Bike Rentals

The relationship between the hour of the day and the number of bike checkouts is clearly non-linear. However, the above graph, since it is merely an average over an entire week, does not tell the whole story. The day of the week is very influential upon what hours people rent bikes. The distribution of bike rentals on weekdays is shown below. Around 7-8 AM and 5-6 PM are the peak hours of bike riding, and there is a substantial drop in the middle of the day. This can likely be attributed to the fact that most people work on weekdays.

Average Bike Rentals on Weekdays

However, the distribution of bike rentals per hour looks much different on weekend days. Here, the peak bike rental time is 12-1 PM and there is no drop in the middle of the day. This interaction between the day of the week and the hour of the day is accounted for in our model.

Average Bike Rentals on Weekends

Cubs Games

A large part of this study focuses on how to make better predictions of the demand for Divvy Bikes based upon data about local events in the city. In Wrigleyville, a major recurring event is Cubs games. We wanted to investigate how much of an effect a Cubs game occurring within a 6-hour window would have upon the demand for Divvy Bikes. To test this, we controlled for the time of year and the hour of the day, so that we could isolate the effect of Cubs games upon the demand for Divvy Bike rentals. On the left is the distribution of bike rentals when there is a Cubs game. On the right is the distribution of bike rentals during the day, in the summer, when there is not a Cubs game. The biggest difference between these distributions is that when a Cubs game is happening, it is very rare that few people are renting bikes. Beyond this, it is not easy to describe the difference between these distributions using visual aids alone, especially because Cubs games are somewhat infrequent relative to the number of hours in the year. However, the quantitative measures of the data are clear: on average, about 36 more bikes are rented per hour when there is a Cubs game.

Average Bike Rentals when Cubs Play

Conclusion

Taking these factors into account, we built a strong model to predict the demand for Divvy Bikes in Wrigleyville using R in conjunction with SAP HANA. The average number of bikes rented in an hour is about 50. Our model, on average, was off by a little under 16 bikes. While this model is by no means perfect, it gives a fairly accurate estimate of the number of bikes that will be checked out, which is enough to give Divvy Bike the ability to prepare for instances of mass demand.
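For reference, the "off by a little under 16 bikes" figure corresponds to a mean absolute error; a sketch of that check in R, assuming a held-back set of hourly actuals, would be:

```r
pred <- predict(wrigley_model, newdata = holdout)

mean(abs(holdout$rentals - pred))   # mean absolute error, in bikes per hour
```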

 

**This blog is for informational purposes. TruQua has no affiliation with Divvy Bikes.
