What is best practice for Data Science projects?

Difference between data science and data analysis

Data Science: You have a question and are trying to get to an answer, but you don’t necessarily know at the beginning whether it is going to work.

“You have a question, you don’t know if you can find an answer”.

Examples of data science could be image recognition tasks or predictive models. A question is proposed, we have data, and the data science task is to experiment in order to find a possible solution.

Data Analysis: You have a question, which you know is answerable. You are applying known methods to answer the question.

“You are answering a question”

Examples of data analysis could be quantifiable sales metrics, e.g. sales from various channels over different time frames. Or it could be metrics of the quality of a data set, e.g. missingness or population statistics.
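As a rough illustration of the data quality metrics mentioned above, the sketch below computes missingness and a few population statistics with the Python standard library. The records, the column names and the helper functions `missingness` and `summary` are all made up for this example; a real project would more likely use a data frame library.

```python
from statistics import mean, median

# Hypothetical toy records; None marks a missing value.
records = [
    {"channel": "web", "sales": 120.0},
    {"channel": "store", "sales": None},
    {"channel": "web", "sales": 80.0},
    {"channel": None, "sales": 100.0},
]

def missingness(rows, column):
    """Return the fraction of rows where `column` is missing (None)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)

def summary(rows, column):
    """Return basic population statistics for the non-missing values."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return {"count": len(values), "mean": mean(values), "median": median(values)}

sales_missing = missingness(records, "sales")
sales_summary = summary(records, "sales")
print(sales_missing, sales_summary)
```

The point is that the question ("how complete is this column?") is known to be answerable and the method is standard, which is what makes this analysis rather than experimentation.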

Managing a project

Agile project management (e.g. Scrum) is said to be better suited to data analysis than to data science.

If you know what the end result should be, then agile is a good practice for implementing the solution. Agile management structures the work needed to reach that solution: because the tasks are known, they can be planned into sprints and given estimates.

When the project is more of an ‘experiment and evaluate’ type of project, agile practices might not be the best fit. We could run the experimentation in a sprint-like style, but (1) working out the exact set of all experiments at the beginning would be difficult without some initial results and evaluation, and (2) assigning complexity estimates to these experiments would not be easy either.

[Ref: The Data Science Process]

Data Science

In the discipline of data science it is important to ‘frame the problem’. This is where a lot of the work should go.

Data science can tell us what to expect or what might happen, but it often cannot tell us why. To understand why, you have to talk to people, in particular domain experts.

We need to embrace the importance of human relationships. They are very important for the data systems we are building.

Selection bias

Every project report should have a section on selection bias. Writing it gets you to think critically about this area.

Selection bias can happen in many areas for many reasons. An interesting example is survivorship bias. A historical example of this bias was the analysis of bullet holes in returning planes in World War II. The intuitive proposal was to reinforce the areas with the most bullet holes. But the sample of planes analysed was biased: it contained only planes that returned with damage, not the planes that were shot down and did not return. Damage in the areas that looked clean on the returning planes was therefore the damage most likely to bring a plane down.
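The effect can be reproduced with a small simulation. The sketch below is purely illustrative: the three plane sections, the per-hit probabilities of being shot down and the function names are assumptions invented for this example, not historical figures.

```python
import random

SECTIONS = ["engine", "fuselage", "wings"]
# Assumed probability that a single hit in each section downs the plane.
DOWN_PROB = {"engine": 0.6, "fuselage": 0.1, "wings": 0.1}

def simulate(n_planes=10000, hits_per_plane=3, seed=0):
    """Count hits per section for all planes and for returning planes only."""
    rng = random.Random(seed)
    all_hits = {s: 0 for s in SECTIONS}
    returned_hits = {s: 0 for s in SECTIONS}
    for _ in range(n_planes):
        # Hits land uniformly at random across the sections.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every hit it took.
        survived = all(rng.random() >= DOWN_PROB[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                returned_hits[s] += 1
    return all_hits, returned_hits

def shares(counts):
    """Convert raw hit counts into each section's share of all hits."""
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

all_hits, returned_hits = simulate()
share_all = shares(all_hits)
share_returned = shares(returned_hits)
print("all planes:     ", share_all)
print("returning only: ", share_returned)
```

Across all planes the hits are spread roughly evenly, but among the returning planes the engine is under-represented, because planes hit there rarely make it back. An analyst who only sees the returning sample would conclude the engine is the least exposed section.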

Ethical

Ethical considerations are another very important area that should be reported on. It is important to think critically about this area and assess the impact of the system’s capabilities. Ethical considerations need to be a part of the product design and planning process.

Data science is an emerging discipline. It will most likely evolve over time. Follow interesting problems, people and technologies into the future of what data science will become.

Reference: https://www.datacamp.com/community/podcast/data-science-past-present-and-future section ‘The Data Science Process’
