Journal Entry Week 1 (Aug 30 and Sep 1)

Zhiming Zhong edited this page Sep 13, 2022 · 1 revision

My first journal entry for R4R

In the first week, we mainly focused on two topics. On Tuesday, we held a group discussion to understand some basic concepts, such as data science, open science, and open data science. We also discussed which skills matter most for becoming an expert in the data science field. On Thursday, Dr. Greg Chism was invited to give an introductory talk on how to use GitHub, focusing on how to use the Wiki feature to create a notebook for our learning in this program.

What are data science, open science, and open data science?

There is no standard answer to this question. However, several of us raised factors that we consider critical to defining these concepts. Some cohort members said that data science should use experimental and investigative methods to study real-world problems. Others think that data science focuses on methodologies for processing large-scale data sets of dozens or even hundreds of GB. I also took part in this discussion from the perspective of my own research. In my view, data science combines quantitative methods, such as probability, statistics, and optimization theory, with computer programming skills to help us gain deeper insight into applied problems and thus make better decisions.

As for open science, I think its key is to increase accessibility and promote the dissemination of our research. The spirit of open science is reflected in every stage of research. For instance, we need to ensure that the data collected are eligible for use in our research and that our readers can easily access the paper. The software and code we use to process the data should also be provided to readers so that they can reproduce the results. The results and conclusions should be clearly presented to our readers. Open-access publication is also a viable option to accelerate the dissemination of our research.

The concept of open data science is thus a combination of data science and open science. That is, it is research in the data science field that adopts open science methods to improve the openness of the research.

Important skills in data science

Most of us think computer programming skills are of the utmost importance. For instance, we must have a working knowledge of at least one popular programming language, such as Python, R, or Julia, all of which are widely used by researchers in the data science field. For me, a mathematical foundation is also very important: we must have a solid background in calculus, linear algebra, and probability and statistics.

Some preliminaries on how to use GitHub

In last Thursday's lecture, we learned how to use some basic features of GitHub, which is one of the most popular platforms for sharing the data and code of our research. From this lecture, I learned how to create and share my notebook for this program via the Wiki feature in GitHub. The steps are as follows. First, we create a GitHub account if we do not already have one. Then we log in to this account and go to the homepage. Next, we create a new repository. We can also create a README file at the same time, which describes the repository. A Wiki tab will then appear in the repository. The Wiki page can be edited in Markdown.
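Since the Wiki page is edited in Markdown, a weekly entry can also be drafted locally and pasted into the editor. The sketch below is only an illustration under my own assumptions: the function name `render_journal_entry` and the placeholder section bodies are hypothetical and were not part of the lecture. It assembles a page from a title and a set of section headings.

```python
from typing import Mapping


def render_journal_entry(title: str, sections: Mapping[str, str]) -> str:
    """Build a Markdown page with one top-level title and one
    second-level heading per section (hypothetical helper)."""
    lines = [f"# {title}", ""]
    for heading, body in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        lines.append(body.strip())
        lines.append("")
    # Drop trailing blank lines and end the page with a single newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Section headings follow this entry; the bodies are placeholders.
    page = render_journal_entry(
        "Journal Entry Week 1 (Aug 30 and Sep 1)",
        {
            "Important skills in data science": "Notes go here.",
            "Successes and challenges in my last week's learning": "Notes go here.",
        },
    )
    print(page)
```

The printed text can be pasted into the Wiki editor, where each `##` line is rendered as a section heading.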

Successes and challenges in my last week's learning

I have not met any challenges so far, since we are still in the introductory part of the program, and I have successfully created a new repository and Wiki page for my notebook. More updates will follow after we learn more advanced techniques in the related fields.

How do I apply the knowledge that I learned

I think the Wiki feature in GitHub is useful for me and my lab members to disseminate our research. Previously, we only used regular academic journals to publish papers on the findings of our research. Now I find that we can use the Wiki feature to create tutorials that introduce our methodologies to other researchers in the same areas in a more convenient, prompt, and intuitive way. In addition to sharing the data sets used in the numerical experiments of our papers, we may also put some of the less essential parts of our papers into a PDF file, upload it to GitHub, and include the link in the paper, which is potentially useful, especially when the journal has a page limit.