
How to plan a successful Data Science Project


Planning is an important aspect of any project, ensuring that its strategic objectives are accomplished. It is critical to break a project down into manageable phases or tasks so that you can monitor progress and ensure timely completion. As a data scientist, it is important to have a realistic, well-thought-out plan for the end-to-end execution of a project, both to keep yourself motivated and to have tangible outputs to show stakeholders.

Tools like Jira and Microsoft's Azure DevOps can be used to break a data science project down into tasks for your team, while giving team members visibility and progress updates. They make collaboration and code sharing much easier and help prevent two people from solving the same problem. In one of my previous roles, Jira made collaboration easier because everyone with access to the project could raise issues/user stories and explain what they wanted addressed and how.

Also, it is good practice to write acceptance criteria in user stories so that you know when a task is complete. As with all software development, you should use an agile methodology to manage the project and accept that it will be an iterative process.
The steps involved are:
  1. Understand the big picture/ Business Understanding
  2. Obtain the data
  3. Visualise the data to gain insights
  4. Prepare the data obtained for Machine Learning algorithms
  5. Select a model and train it
  6. Fine-tune your model
  7. Present your solution
  8. Launch, monitor, and maintain your system
Understand the big picture/ Business Understanding: As a data scientist, you will often be brought in to solve a business problem or enhance an existing platform with Machine Learning or predictive statistics. It is important to set the data and tools aside and first understand the main business issue to be resolved. For instance, if you are building a forecasting tool, it is important to know where this tool sits among the other solutions the business offers and how the new feature affects the overall platform. Is the forecasting tool in the middle or at the end of the platform ecosystem? If it is in the middle, how does it relate to the next program in the chain?

Getting the business objective right will give you a big picture of the final output and how best to approach the project. As part of understanding the big picture, you need to see and understand what the current solution looks like. This gives you a reference point and also a clue as to where the system needs to be after you complete the project.

Once you understand the business objective and have evaluated the current system, if there is one, you will be able to tell whether the task is a supervised, unsupervised or reinforcement learning problem. At this early stage, it is also good to think about the performance measure. How are you going to verify the accuracy of the predicted figures? If it is a regression task and you are using the Root Mean Square Error (RMSE), then you want this value to be small, as it signifies that your model's predictions are close to the actual values.
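
As a rough sketch, assuming a regression task, the RMSE can be computed with NumPy and Scikit-Learn (the figures below are made up purely for illustration):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([200000, 350000, 180000])  # actual values (hypothetical)
y_pred = np.array([210000, 340000, 150000])  # model predictions (hypothetical)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # lower is better
print(rmse)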

Lastly, as part of your intelligence gathering, make sure you ask what in particular would count as a successful output and in what format it should be presented. Do they want the predictions shown as raw figures, or do you need to convert the output into categorical values to be fed into another system? From my experience, and as a matter of best practice, it is important to clarify this early.

Obtain the data: Once you have a good understanding of the project, the methodology and the desired result, it is time to get the data. Most of the time, the data is in a data warehouse or relational database, in the cloud or on legacy systems. Studying the data schema will give you a better understanding of the data types, primary keys and other features. You should also take a quick look at the data structure by selecting the first 5 rows with head(5).

Next, use the info() method to get a quick description of the data: the number of rows, the attribute types and the number of non-null values. You should also use describe() to get the descriptive statistics of the dataset.

hist() will plot a histogram of each numeric attribute so that you can see the range and distribution of the data. It is also best practice to create a test set at this point, typically 20% of your data, and set it aside before exploring the training data further. This is often a very important aspect of the project, and I will write another blog post about it.
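
As a minimal sketch of these quick-look steps, assuming the data has been loaded into a pandas DataFrame from a hypothetical CSV file:

import pandas as pd
from sklearn.model_selection import train_test_split

housing = pd.read_csv("housing.csv")  # hypothetical file name

print(housing.head(5))     # first five rows
housing.info()             # row count, attribute types, non-null counts
print(housing.describe())  # descriptive statistics
housing.hist(bins=50, figsize=(12, 8))  # histograms of the numeric attributes

# set aside a 20% test set before exploring the data further
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)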

Visualise the data to gain insights: Depending on the size of the data, you can either select a sample of the training set or visualise the whole training set, for a high-level or granular view of the data. Once you decide, create a copy of the training set, for example with housing = strat_train_set.copy(), so that you don't alter the original data.

There is a general consensus that human brains are good at identifying patterns in charts, so it is important to tune your visualisation parameters to produce pictures with easy-to-spot patterns. Often, this will give you a general idea about your data, some of the things you might find in your analysis, and even the areas worth predicting with machine learning algorithms.

It is also good practice to use the corr() method to compute the correlation coefficients between attributes. This will give you a general idea of the linear relationships between the features and the target value. For instance, it can be used to confirm the theoretical relationship between income and energy consumption: a correlation coefficient close to 1 indicates a strong positive linear relationship, so as income increases, energy consumption also increases, possibly through consumers acquiring more gadgets. You can also use pandas' scatter_matrix function.
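
For example, a quick sketch using the hypothetical housing DataFrame and column names from the snippet below:

corr_matrix = housing.corr()  # pairwise correlation coefficients of the numeric attributes
print(corr_matrix["average_house_price"].sort_values(ascending=False))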

from pandas.plotting import scatter_matrix
attributes = ["average_house_price", "age_of_property", "number_of_rooms", "average_income"]
scatter_matrix(housing[attributes], figsize=(12, 8))

To focus on a single pair of attributes, use:
housing.plot(kind="scatter", x="average_income", y="average_house_price", alpha=0.1)

Data Preparation: Before you feed your data to machine learning models, some data transformation might be necessary depending on the dataset you are working with. For example, you might be able to create a more informative column by combining data from two existing columns, giving a new attribute with a more meaningful or significant correlation with the target attribute, as sketched below.
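
A small, hypothetical illustration of such a derived attribute (total_rooms and households are assumed columns, not ones from the earlier examples):

# hypothetical derived attribute: average rooms per household
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
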
It is good practice to prepare your data by writing functions, as this enables you to do four important things:
  • The functions can be used on new datasets easily in the transformation stage
  • Enables you to build a library of transformation functions to use on future projects
  • You can use the functions on new data in your live system before feeding them into your algorithm
  • You can try different transformations to identify the ones that work best on your dataset
Data cleaning should also be part of this process, as most ML models cannot handle missing features. Depending on the situation, you can use the DataFrame's dropna(), drop() and fillna() methods, as in the sketch below.
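
For instance, three options for a hypothetical "number_of_rooms" column with missing values (note that these pandas methods return modified copies rather than changing the DataFrame in place):

housing.dropna(subset=["number_of_rooms"])   # option 1: drop the rows with missing values
housing.drop("number_of_rooms", axis=1)      # option 2: drop the whole attribute
housing["number_of_rooms"].fillna(housing["number_of_rooms"].median())  # option 3: fill with the median
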
To complete this step, write your data transformation pipeline scripts to clean and prepare your data for ML algorithms automatically.
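
A minimal sketch of such a pipeline with Scikit-Learn, assuming the remaining feature columns are all numeric (adapt the column handling to your own dataset):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values with the median
    ("scaler", StandardScaler()),                   # standardise the numeric features
])

housing_features = housing.drop("average_house_price", axis=1)  # drop the target column
housing_prepared = num_pipeline.fit_transform(housing_features)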

Model selection and training: At this point you have a business problem to solve, you have explored the data with samples of the training and test sets and some visualisations, and you have an automated data transformation pipeline. The next step is to select a model and train it on the prepared training data. You can use regression, classification, a random forest or neural nets depending on the problem. It is also possible to train multiple algorithms if you have a large project with different target attributes.
Use cross-validation to evaluate your model's performance. Scikit-Learn's cross-validation utilities (such as cross_val_score) are a good tool for this exercise. Cross-validation also gives you a more reliable estimate of how accurate your predictions are.
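
A sketch of 10-fold cross-validation, reusing the hypothetical housing_prepared from the pipeline above and a hypothetical housing_labels holding the target values (RandomForestRegressor is just one possible choice of model):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

model = RandomForestRegressor(random_state=42)
scores = cross_val_score(model, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # convert negative MSE scores back to RMSE
print(rmse_scores.mean(), rmse_scores.std())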

Fine tune your model: After you have trained a selection of models, the next step is to fine-tune them and select the best ones. You have a few options here. The first is Scikit-Learn's GridSearchCV, which exhaustively tries the combinations of hyperparameter values you specify. The second option is a randomised search (RandomizedSearchCV), which evaluates a fixed number of random combinations, sampling a random value for each hyperparameter at every iteration; this is preferable when the search space is large. Lastly, you can also use an ensemble method such as the Random Forest algorithm, which will often perform better than an individual decision tree. After fine-tuning, it is time to evaluate the final models on the test set.
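
A sketch of both search strategies, reusing the hypothetical housing_prepared and housing_labels from above (the parameter values are purely illustrative):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]}

# exhaustive search over every combination in param_grid
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=5, scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)

# randomised search samples a fixed number of combinations instead of trying them all
rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distributions=param_grid, n_iter=5,
                                cv=5, scoring="neg_mean_squared_error", random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
print(rnd_search.best_params_)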

Present your solution: After all the steps above, you need to document the whole process thoroughly and prepare presentation slides. You should state the problem, the assumptions made, the limitations of the solution and what you found. It is also good practice to state what could be done better or improved as you move on to production.

Launch, monitor and maintain your system: Hooray! The board has approved your data product and you are now going into production. To deploy, you need to plug in the input data sources and write testing scripts. It is also important to re-train your models periodically on new data to avoid performance degradation, and to use monitoring code that checks your system's live performance at regular intervals, with triggers in place if there are issues. Likewise, keep tabs on the input data to ensure it is of good quality.


All in all, the bulk of the work is in the data preparation step, setting up human evaluation pipelines and automating regular model training. Clearly, selecting a good algorithm is important, but you also need to make sure you have a working pipeline and that the platform architecture is well designed and built for performance and monitoring.
