
How to plan a successful Data Science Project


Planning is an important aspect of any project, ensuring that its strategic objectives are accomplished. It is critical to break a project down into manageable phases or tasks so that you can monitor progress and ensure timely completion. As a data scientist, it is important to have a realistic, well-thought-out plan for the end-to-end execution of a project, both to keep yourself motivated and to have tangible outputs to show stakeholders.

Tools like Jira and Microsoft's Azure DevOps can be used to break a data science project down into tasks for your team, while giving team members visibility and progress updates. They make collaboration and code sharing much easier and help prevent two people from solving the same problem. In one of my previous roles, Jira made collaboration easier because everyone with access to the project could raise issues/user stories and explain what they wanted addressed and how.

Also, it is good practice to write acceptance criteria in user stories so that you know when a task is complete. As with all software development, you should use an agile methodology to manage the project and accept that it will be an iterative process.
The steps involved are:
  1. Understand the big picture/ Business Understanding
  2. Obtain the data
  3. Visualise the data to gain insights
  4. Prepare the data obtained for Machine Learning algorithms
  5. Select a model and train it
  6. Fine-tune your model
  7. Present your solution
  8. Launch, monitor, and maintain your system
Understand the big picture/ Business Understanding: As a data scientist, you will often be brought in to solve a business problem or enhance an existing platform with Machine Learning or predictive statistics. It is important to set the data and tools aside and first understand the main business issue to be resolved. For instance, if you are building a forecasting tool, it is important to know where this tool sits among the other solutions the business offers and how the new feature affects the overall platform. Is the forecasting tool in the middle or at the end of the platform ecosystem? If it is in the middle, how does it relate to the next program in the chain?

Getting the business objective right will give you a big picture of the final output and how best to approach the project. As part of understanding the big picture, you need to see and understand what the current solution looks like. This gives you a reference point and also a clue as to where the system needs to be after you complete the project.

Once you understand the business objective and have evaluated the current system, if there is one, you will be able to tell whether the task is a supervised, unsupervised or reinforcement learning problem. At this early stage, it is also good to think about the performance measure. How are you going to verify the accuracy of the predicted figures? If it is a regression task and you are using the Root Mean Square Error (RMSE), then you want this value to be small, as it signifies that your model's predictions are close to the actual values.
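
As a rough sketch, assuming a regression task, the RMSE can be computed with NumPy and Scikit-Learn (the figures below are made up purely for illustration):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([200000, 350000, 180000])  # actual values (hypothetical)
y_pred = np.array([210000, 340000, 150000])  # model predictions (hypothetical)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # lower is better
print(rmse)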

Lastly, as part of your intelligence gathering, make sure you ask what in particular would count as a successful output and in what format it should be presented. Do they want the predictions shown as raw figures, or do you need to convert the output into categorical values to be fed into another system? From my experience, and as a matter of best practice, it is important to clarify this early.

Obtain the data: Once you have a good understanding of the project, the methodology and the desired result, it is time to get the data. Most of the time, the data is in a data warehouse or relational database, in the cloud or on legacy systems. Studying the data schema will give you a better understanding of the data types, primary keys and other features. You should also take a quick look at the data structure by selecting the first 5 rows with head(5).

Next, use the info() method to get a quick description of the data: the number of rows, the attribute types and the number of non-null values. You should also use describe() to get the descriptive statistics of the dataset.

hist() will plot a histogram of each numeric attribute so that you can see the range and distribution of the data. It is also best practice to create a test set at this point, typically 20% of your data, and set it aside before exploring the training data further. This is often a very important aspect of the project, and I will write another blog post about it.
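
As a minimal sketch of these quick-look steps, assuming the data has been loaded into a pandas DataFrame from a hypothetical CSV file:

import pandas as pd
from sklearn.model_selection import train_test_split

housing = pd.read_csv("housing.csv")  # hypothetical file name

print(housing.head(5))     # first five rows
housing.info()             # row count, attribute types, non-null counts
print(housing.describe())  # descriptive statistics
housing.hist(bins=50, figsize=(12, 8))  # histograms of the numeric attributes

# set aside a 20% test set before exploring the data further
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)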

Visualise the data to gain insights: Depending on the size of the data, you can either select a sample of the training set or visualise the whole training set, for a high-level or granular view of the data. Once you decide, create a copy of the training set, for example with housing = strat_train_set.copy(), so that you don't alter the original data.

There is a general consensus that human brains are good at identifying patterns in charts, so it is important to tune your visualisation parameters to produce pictures with easy-to-spot patterns. Often, this will give you a general idea about your data, some of the things you might find in your analysis, and even the areas worth predicting with machine learning algorithms.

It is also good practice to use the corr() method to compute the correlation coefficients between attributes. This will give you a general idea of the linear relationships between the features and the target value. For instance, it can be used to confirm the theoretical relationship between income and energy consumption: a correlation coefficient close to 1 indicates a strong positive linear relationship, so as income increases, energy consumption also increases, possibly through consumers acquiring more gadgets. You can also use pandas' scatter_matrix function.
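
For example, a quick sketch using the hypothetical housing DataFrame and column names from the snippet below:

corr_matrix = housing.corr()  # pairwise correlation coefficients of the numeric attributes
print(corr_matrix["average_house_price"].sort_values(ascending=False))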

from pandas.plotting import scatter_matrix
attributes = ["average_house_price", "age_of_property", "number_of_rooms", "average_income"]
scatter_matrix(housing[attributes], figsize=(12, 8))

To focus on a single pair of attributes, use:
housing.plot(kind="scatter", x="average_income", y="average_house_price", alpha=0.1)

Data Preparation: Before you feed your data to machine learning models, some data transformation might be necessary depending on the dataset you are working with. For example, you might be able to create a more informative column by combining data from two existing columns, giving a new attribute with a more meaningful or significant correlation with the target attribute, as sketched below.
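
A small, hypothetical illustration of such a derived attribute (total_rooms and households are assumed columns, not ones from the earlier examples):

# hypothetical derived attribute: average rooms per household
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
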
It is good practice to prepare your data by writing functions, as this enables you to do four important things:
  • The functions can be used on new datasets easily in the transformation stage
  • Enables you to build a library of transformation functions to use on future projects
  • You can use the functions on new data in your live system before feeding them into your algorithm
  • You can try different transformations to identify the ones that work best on your dataset
Data cleaning should also be part of this process, as most ML models cannot handle missing features. Depending on the situation, you can use the DataFrame's dropna(), drop() and fillna() methods, as in the sketch below.
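
For instance, three options for a hypothetical "number_of_rooms" column with missing values (note that these pandas methods return modified copies rather than changing the DataFrame in place):

housing.dropna(subset=["number_of_rooms"])   # option 1: drop the rows with missing values
housing.drop("number_of_rooms", axis=1)      # option 2: drop the whole attribute
housing["number_of_rooms"].fillna(housing["number_of_rooms"].median())  # option 3: fill with the median
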
To complete this step, write your data transformation pipeline scripts to clean and prepare your data for ML algorithms automatically.
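
A minimal sketch of such a pipeline with Scikit-Learn, assuming the remaining feature columns are all numeric (adapt the column handling to your own dataset):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values with the median
    ("scaler", StandardScaler()),                   # standardise the numeric features
])

housing_features = housing.drop("average_house_price", axis=1)  # drop the target column
housing_prepared = num_pipeline.fit_transform(housing_features)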

Model selection and training: At this point you have a business problem to solve, you have explored the data with samples of the training and test sets and some visualisations, and you have an automated data transformation pipeline. The next step is to select a model and train it on the prepared training data. You can use regression, classification, a random forest or neural nets depending on the problem. It is also possible to train multiple algorithms if you have a large project with different target attributes.
Use cross-validation to evaluate your model's performance. Scikit-Learn's cross-validation utilities (such as cross_val_score) are a good tool for this exercise. Cross-validation also gives you a more reliable estimate of how accurate your predictions are.
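
A sketch of 10-fold cross-validation, reusing the hypothetical housing_prepared from the pipeline above and a hypothetical housing_labels holding the target values (RandomForestRegressor is just one possible choice of model):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

model = RandomForestRegressor(random_state=42)
scores = cross_val_score(model, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # convert negative MSE scores back to RMSE
print(rmse_scores.mean(), rmse_scores.std())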

Fine tune your model: After you have trained a selection of models, the next step is to fine-tune them and select the best ones. You have a few options here. The first is Scikit-Learn's GridSearchCV, which exhaustively tries the combinations of hyperparameter values you specify. The second option is a randomised search (RandomizedSearchCV), which evaluates a fixed number of random combinations, sampling a random value for each hyperparameter at every iteration; this is preferable when the search space is large. Lastly, you can also use an ensemble method such as the Random Forest algorithm, which will often perform better than an individual decision tree. After fine-tuning, it is time to evaluate the final models on the test set.
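
A sketch of both search strategies, reusing the hypothetical housing_prepared and housing_labels from above (the parameter values are purely illustrative):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]}

# exhaustive search over every combination in param_grid
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=5, scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)

# randomised search samples a fixed number of combinations instead of trying them all
rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distributions=param_grid, n_iter=5,
                                cv=5, scoring="neg_mean_squared_error", random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
print(rnd_search.best_params_)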

Present your solution: After all the steps above, you need to document the whole process thoroughly and prepare presentation slides. You should state the problem, the assumptions made, the limitations of the solution and what you found. It is also good practice to state what could be done better or improved as you move on to production.

Launch, monitor and maintain your system: Hooray! The board has approved your data product and you are now going into production. To deploy, you need to plug in the input data sources and write testing scripts. It is also important to re-train your models periodically on new data to avoid performance degradation, and to use monitoring code that checks your system's live performance at regular intervals, with triggers in place if there are issues. Likewise, keep tabs on the input data to ensure it is of good quality.


All in all, the bulk of the work is in the data preparation step, setting up human evaluation pipelines and automating regular model training. Clearly, selecting a good algorithm is important, but you also need to make sure you have a working pipeline and that the platform architecture is well designed and built for performance and monitoring.
