Data Mining for Predicting the Covid-19 Pattern


Aman Jatain

Department of Computer Science, Amity University, Haryana.

*Corresponding Author E-mail:



Knowledge discovery in databases (KDD) is concerned with the development of methods and techniques for making use of data. Data mining, one of the most critical phases of KDD, is a method of discovering and extracting patterns from large amounts of data. Electronic health records are becoming increasingly common in healthcare organizations. With increased access to a substantial amount of patient data, healthcare companies are now in a position to optimize the efficiency and quality of their businesses through data mining. COVID-19 is a new global epidemic affecting 186 countries around the world, and as a result of this pandemic, patient data is being generated at an ever quicker rate. Search engines also hold valuable population-level data that can be useful for the study of epidemics. Applying data mining tools to the available data will provide deeper insight into the management of the coronavirus outbreak for each country and for the world. In order to contribute to the well-being of the population, this research analyzes coronavirus activity in the previous months and presents statistics using different models and data mining techniques. These models and methods demonstrate the pattern of COVID-19 over the year.


KEYWORDS: COVID-19, KDD, Data Mining, Health, Pattern.




Data mining is one of the most successful and inspiring fields of study. It contributes not only to healthcare services but also to almost every other sector, such as business and education. Data mining is the discovery of hidden information, frequent trends, and useful data in a database or data warehouse. In standard terms, "Data mining can be defined as a process of finding previously unknown patterns and trends in databases and using this information to construct predictive models".[1] Alternatively, it can be defined as a process of data selection, exploration, and modelling that uses vast data stores to discover previously unknown patterns. Because of these benefits, data mining is steadily gaining popularity. Healthcare plays a critical role in every person's life. "Health is Wealth" is a well-known proverb whose relevance has been underlined by the Covid-19 pandemic, and its meaning is now well understood even by people who used to be careless about their health. For many years researchers have devoted considerable effort to improving health services, and data mining is one of the approaches they use. Data mining techniques offer a range of advantages to the medical industry, such as grouping patients with similar conditions, discovering alternative medical options, and tracking fluctuations in healthcare policies. Using various data mining techniques and algorithms, descriptive and informative data visualizations can be produced.


Pandemics have a huge influence on the human race and generate data on a scale that calls for Big Data Analytics. When data increases, so does its complexity. As a consequence, it becomes difficult to identify trends, to extract the data that can support the medical field, and to make forecasts for future use. Data mining comes in handy in this case, because these problems can be overcome or reduced by using data mining techniques. The coronavirus pandemic has already reached the data mining world: data is added on a regular basis, which makes Big Data Analytics necessary. It is also necessary to keep a close eye on this vast volume of data so that patterns can be detected as soon as possible and forecasts can be made. The goal of this research, entitled "Data Mining in Healthcare Services", is to contribute to the well-being of the population by analyzing coronavirus activity in the previous months and presenting statistics using different models and data mining methods. These models and methods reveal the pattern of Covid-19 over the year. The objective of the study is to identify useful and understandable patterns in a large amount of data and to examine the statistics and trends over time by using data mining models.



Data mining is a relatively recent technique and technology that came to prominence in 1994. Various studies show that data mining benefits almost every sector by uncovering relationships in a database and providing the knowledge needed for decision-making. Knowledge Discovery in Databases (KDD) is a related term that is often used interchangeably with data mining. The two are distinct, however: data mining is just one of the stages of Knowledge Discovery in Databases. KDD refers to the overall process of discovering patterns and trends in large databases and using that information for predictions and decision-making.[2] According to researchers, KDD is organized into stages in order to identify common patterns and knowledge. The first stage involves the collection of data from different sources. The second involves the pre-processing of data, which includes the handling of incomplete data, the elimination of noisy data, and so on. The third stage involves the transformation of data into an understandable form. After that, data mining is performed to determine patterns, and finally these patterns are interpreted to obtain information. The stages of KDD are shown in Figure 1.


Figure 1. Stages of the Knowledge Discovery Process.
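The five stages described above can be sketched as a small pipeline. This is only an illustrative sketch, not the implementation used in this study: the function names, the record layout, and the counts are hypothetical and made up for the example.

```python
# Hypothetical records from two sources; the values are invented for illustration.
RAW_SOURCES = [
    [{"country": "A", "confirmed": "120"}, {"country": "B", "confirmed": ""}],
    [{"country": "A", "confirmed": "80"}, {"country": "C", "confirmed": "-5"}],
]


def collect(sources):
    """Stage 1: gather records from several sources into one list."""
    return [record for source in sources for record in source]


def preprocess(records):
    """Stage 2: drop incomplete records (empty value) and noisy ones (negative count)."""
    cleaned = []
    for record in records:
        value = record["confirmed"].strip()
        if value and not value.startswith("-"):
            cleaned.append(record)
    return cleaned


def transform(records):
    """Stage 3: convert records into (country, integer count) pairs."""
    return [(record["country"], int(record["confirmed"])) for record in records]


def mine(pairs):
    """Stage 4: extract a simple pattern, here the total count per country."""
    totals = {}
    for country, count in pairs:
        totals[country] = totals.get(country, 0) + count
    return totals


def interpret(totals):
    """Stage 5: turn the mined pattern into a readable statement."""
    country = max(totals, key=totals.get)
    return f"{country} has the most confirmed cases ({totals[country]})"


def run_kdd(sources):
    """Chain the five stages in order."""
    return interpret(mine(transform(preprocess(collect(sources)))))
```

Calling `run_kdd(RAW_SOURCES)` drops the incomplete record for B and the negative record for C before aggregating, so only country A remains in the result.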


Skills and expertise are essential for carrying out a data mining task, as the success or failure of a data mining project depends heavily on the person who manages the process, because no standard framework is available. Today, numerous public and private health organizations generate large amounts of data that are very difficult to process and manage. There is therefore a need to analyze and interpret the useful information in this data with powerful automated data mining tools. This knowledge is very useful for healthcare specialists in understanding the causes of illness and providing better and more cost-effective care and treatment for patients. Data mining offers novel health information, which in turn helps in making administrative and medical decisions, such as estimating medical staff requirements, deciding on health insurance policies, choosing treatments, and predicting diseases. The COVID-19 pandemic is the most recent scenario to enter the world of data mining, and the associated data is growing day by day.


On 8 December 2019, the Chinese government announced the death of one patient and the hospitalization of 41 others with an illness of unknown etiology in Wuhan.[3] This cluster triggered the novel coronavirus (COVID-19) epidemic of respiratory disease. Though early cases of the disease were linked to a wet market, human-to-human transmission led to widespread outbreaks of the virus across China. On 30 January 2020, the World Health Organization (WHO) declared the emergence of COVID-19 a public health emergency of international concern (PHEIC). On 11 March 2020, on the basis of the global distribution and magnitude of the disease, the Director-General of the WHO officially called the outbreak of COVID-19 a pandemic.[4] The COVID-19 pandemic then entered a new phase of rapid spread in countries outside China. Owing to the widespread and growing prevalence of COVID-19 worldwide, several studies have investigated various aspects of the disease. These include the identification of the source of the virus and the analysis of its gene sequences, the analysis of patient information, the analysis of first cases in the countries concerned, methods for the detection of the virus, the analysis of epidemiological outbreaks, and the prediction of COVID-19 cases. This has generated health data on an even greater scale, so that Big Data and data mining methods have become even more crucial. Predicting the course of the pandemic is necessary to control the danger, because it supports prompt operational and medical decisions.


Figure 2. Timeline of COVID-19 across the nations.



Only limited data are available on COVID-19 outbreaks, which leaves forecasts highly uncertain. Previous studies have shown that the timing and location of the outbreak allowed the rapid spread of the virus within a highly mobile population. Table 2.1 summarizes the contributions of researchers who analyzed the COVID-19 epidemic or related topics using data mining and machine learning, together with their recommendations for future work.


Table 2.1: Summary of Researchers' Contributions in the Literature



| Author | Contribution | Method Used | Future Plan |
|---|---|---|---|
| Narinder et al. | Discusses the analysis of the COVID-19 pandemic | Machine learning techniques are used for the analysis | Need to extend the study to other machine learning and deep learning models |
| Divya et al. | Brief introduction to data mining techniques | Comparative study of various data mining techniques | The essential need for effective data mining techniques |
| Hian et al. | Discusses data mining and its applications | Classification methods are used for better understanding | Need more contributions to the data mining and healthcare literature |
|  | Addresses recent studies that apply ML and AI technology | A selective assessment of information related to the application of ML and AI technology to Covid-19 | Need more study, as the current urgency requires an improved model with high-end performance accuracy |
| Andrea et al. | Summarizes the evidence regarding chloroquine for the treatment of COVID-19 | PubMed and three trial registries were searched for studies on the use of chloroquine in patients with COVID-19 | Safety data and data from high-quality clinical trials are urgently needed |
| Peter A. Bath | A brief explanation of data mining tools and approaches | Selected data mining and statistical techniques have been used | On a routine basis, DM will be seen as a systematic process with clear, precise, and realistic objectives |
| Rameshwar et al. | Survey of major research articles from 2010 to 2017 on predicting outbreaks and epidemics | Classification of data based on machine learning techniques | Need to propose a model that describes the best ways of data collection and data filtering |
| Michael et al. | In-depth study of data mining and the steps involved | KDD and various specific models | Data mining research needs to incorporate the characteristics of a given task domain |
| Tanujit et al. | Data-driven analysis that can provide deep insights into early risk assessments for 50 immensely affected countries | Optimal regression tree algorithm techniques | The proposed model can be used as an early warning system to fight the COVID-19 pandemic |

As the table shows, researchers have made substantial contributions to the study of the COVID-19 pandemic using data mining techniques. Knowledge Discovery in Databases, various machine learning algorithms, and many other approaches are used to find related and common patterns. The authors have also stated their future plans in their respective papers, which helps readers follow the direction of the field so that further research can be carried out effectively.



Based on the above literature, it is evident that sufficient work is available on exploratory data analysis to understand the existing trend of the epidemic, but there is still considerable scope to develop and test efficient machine learning-based prediction models so that proactive strategies can be identified to cater to immediate needs. Machine learning is one of the approaches used in data mining, and here it is used to analyze COVID-19 globally over time. Different models yield different levels of accuracy, and every predictive model uses its own approach to prediction. The approach used to build machine learning algorithms is: i) understanding the problem and the final goal, ii) data collection, iii) data preparation and pre-processing, iv) modelling and testing, and v) model deployment and monitoring.


Figure 3.1 Steps followed by Machine Learning Algorithms.


The key characteristic of machine learning is building systems that can identify patterns in data and learn from them without being explicitly programmed.
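Steps ii) to iv) above can be illustrated with a minimal sketch. The text does not state which models were used in this study, so the example assumes a simple least-squares line fitted to log-transformed case counts. The data are synthetic (counts that double every day), and the names `fit_line` and `predict` are hypothetical helpers.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept


# ii) Data collection: synthetic counts that double each day (made up).
days = list(range(10))
cases = [10 * 2 ** d for d in days]

# iii) Preparation: a log transform turns exponential growth into a straight line.
log_cases = [math.log(c) for c in cases]

# iv) Modelling and testing: fit on the first 7 days, test on the last 3.
train_x, test_x = days[:7], days[7:]
train_y, test_y = log_cases[:7], log_cases[7:]
model = fit_line(train_x, train_y)
test_error = max(abs(predict(model, x) - y) for x, y in zip(test_x, test_y))
```

Because the synthetic data are exactly exponential, the held-out error is essentially zero and the fitted slope equals the natural logarithm of 2. Real case counts are noisy, so a held-out error of this kind would be the quantity to monitor in step v).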



The Covid-19 dataset is taken from a very popular website. It contains various CSV files named confirmed.csv, covid_19_data.csv, patients_data.csv, etc. Variables include Observation date, Province/State, Country/Region, Confirmed, Deaths, and Recovered. The whole dataset covers the months from January to September.
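A file with the variables listed above could be read and aggregated per country as in the following sketch. Several details are assumptions and not taken from the dataset itself: the exact header spellings, the MM/DD/YYYY date format, the counts being cumulative, and all sample values, which are invented. The function name `latest_totals_by_country` is hypothetical.

```python
import csv
import io
from datetime import datetime

# Invented sample in the assumed layout; not real data.
SAMPLE_CSV = """Observation date,Province/State,Country/Region,Confirmed,Deaths,Recovered
09/01/2020,New York,US,100,10,50
09/01/2020,Texas,US,80,5,40
09/01/2020,,Italy,60,6,30
09/02/2020,New York,US,110,11,55
09/02/2020,Texas,US,90,6,45
09/02/2020,,Italy,65,7,33
"""


def latest_totals_by_country(csv_text):
    """Sum cumulative counts over provinces for the most recent observation date."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["date"] = datetime.strptime(row["Observation date"], "%m/%d/%Y")
    latest = max(row["date"] for row in rows)

    totals = {}
    for row in rows:
        if row["date"] != latest:
            continue
        counts = totals.setdefault(
            row["Country/Region"], {"Confirmed": 0, "Deaths": 0, "Recovered": 0}
        )
        for key in counts:
            # int(float(...)) also accepts counts stored as "110.0".
            counts[key] += int(float(row[key]))
    return totals
```

With a real file, the text would come from `open(path, newline="").read()` instead of the in-memory sample.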



To implement data mining, that is, to extract knowledge from a large database, various machine learning algorithms are applied to different COVID-19 datasets. The results show the numbers of confirmed cases, recoveries, and deaths all over the world. Figure 4.1 is a pie chart of the top 10 countries affected by COVID-19 from January to September. The US tops the list with 37% of the cases shown in the chart, Brazil takes second position with 20%, and India is third with 16%. Italy, South Africa, the UK, and Mexico each account for approximately 3%.


Figure 4.1 Top 10 countries affected globally.
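The shares in a chart such as Figure 4.1 could be computed as in the sketch below. It assumes that each percentage is a country's share of the combined total of the top-ranked countries, which is how a pie chart normalizes its slices. The text does not state how the figure was produced, and the function name and sample counts are hypothetical.

```python
def top_n_shares(confirmed_by_country, n=10):
    """Return the n largest entries with their percentage share of the top-n total."""
    top = sorted(confirmed_by_country.items(), key=lambda item: item[1], reverse=True)[:n]
    total = sum(count for _, count in top)
    return [(name, round(100 * count / total, 1)) for name, count in top]


# Invented counts for four placeholder countries.
example = {"A": 50, "B": 30, "C": 20, "D": 1}
shares = top_n_shares(example, n=3)
```

Note that country D is excluded from both the ranking and the denominator, so the shares describe the chart and not the global total.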


After the top 10 affected countries, Figure 4.2 shows the top 10 affected states globally. New York and Sao Paulo top the list with about 6% of the cases shown each. Two Indian states, Maharashtra and Tamil Nadu, also appear in the top 10, with 5% and 3% respectively.


Figure 4.2 Top 10 states affected globally.


Figure 4.3 shows the confirmed cases, recoveries, and deaths for COVID-19 in India from February to May 2020. The plot shows that confirmed cases (in blue) rise from 0 to about 200k over this period, recovered cases are somewhat more than 50k, and the number of deaths remains comparatively low.


Figure 4.3 COVID-19 Cases, Recovery and Deaths in India from Feb till May.

Figure 4.4 shows the distribution of Covid-19 active cases, recoveries, and deaths across the world as a pie chart. The plot shows 27.6% active cases, 69.6% recovered cases, and 2.9% deaths.


Figure 4.4 COVID-19 Active, Recovered and Deaths across the world.
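A breakdown like the one in Figure 4.4 can be derived from three cumulative totals, as in the sketch below. It assumes the common convention that active cases are confirmed cases minus recoveries and deaths. The text does not state the formula used, and the function name and input numbers are hypothetical.

```python
def case_breakdown(confirmed, recovered, deaths):
    """Percentage of confirmed cases that are active, recovered, or deaths."""
    # Assumed convention: whatever is neither recovered nor dead is still active.
    active = confirmed - recovered - deaths
    parts = [("active", active), ("recovered", recovered), ("deaths", deaths)]
    return {label: round(100 * value / confirmed, 1) for label, value in parts}


# Invented totals, chosen so the percentages are round numbers.
breakdown = case_breakdown(confirmed=1000, recovered=700, deaths=30)
```

Because each share is rounded independently, the three percentages need not add up to exactly 100, which is consistent with the values reported for Figure 4.4 summing to 100.1.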



The world is in the grip of the COVID-19 virus, and early prediction of transmission can help in taking the required action. The pandemic generates data on a large scale, and the volume is increasing rapidly. Data mining is necessary at this point to extract information from this large amount of data so that the analysis can be carried out correctly. This research focuses on a global review of COVID-19 patterns from January to September, and various machine learning algorithms have been used for disease analysis. If the spread continues as seen in the various plots, it will lead to a massive loss of life, since the plots show exponential growth of transmission worldwide. As seen in China, this rise in COVID-19 can be minimized and brought under control by reducing contact between susceptible and infected individuals. This can be done by practising social distancing and following lockdown measures in an orderly way. Data mining applications in health care have enormous potential and value. However, the effectiveness of health data mining depends on the availability of clean health data. In this respect, it is important for the healthcare industry to understand how data can be better collected, processed, prepared, and mined. Furthermore, as health data are not limited to quantitative data but also include text such as medical notes or clinical records, it is important to explore the use of text mining in order to extend the reach of what health data mining can currently do. It is especially useful to be able to combine data mining and text mining. It is also useful to look at how digital diagnostic images can be integrated into health data mining applications.



The author declares no conflict of interest.



1.      H. C. Koh and G. Tan. Data Mining Application in Healthcare. Journal of Healthcare Information Management. 2005, 19(2): 64-72.

2.      D. Hand, H. Mannila and P. Smyth. Principles of data mining. MIT Press. 2001: 546.

3.      Stoecklin, Sibylle Bernard, Patrick Rolland, Yassoungo Silue, Alexandra Mailles, Christine Campese, Anne Simondon, Matthieu Mechain. First cases of coronavirus disease 2019 (COVID-19) in France: Surveillance, Investigations and Control Measures. Eurosurveillance. 2020, 25(6): 2000094.

4.      WHO coronaviruses (COVID-19). Retrieved March 30, 2020 from

5.      Singh, R., Singh, R., Bhatia, A. Sentiment analysis using Machine Learning technique to predict outbreaks and epidemics. International Journal of Advance Science and Research. 2018, 3(2):19-24.

6.      Lauer, Stephen A., Kyra H. Grantz, Qifang Bi, Forrest K. Jones, Qulu Zheng, Hannah R. Meredith, Andrew S. Azman, Nicholas G. Reich, and Justin Lessler. The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Annals of internal medicine. 2020, 172(9):577-582.

7.      Singer, H. M. Short-term predictions of country-specific Covid-19 infection rates based on power law scaling exponents. arXiv preprint arXiv:2003. 2020: 1-6.

8.      Milley, A. Healthcare and data mining. Health Management Technology. 2000, 21(8):44-47.

9.      Trybula, W.J. Data mining and knowledge discovery. Annual Review of Information Science and Technology. 1997, 32: 197-229.

10.   Chakraborty, T., and Ghosh, I. Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis. Chaos, Solitons & Fractals. 2020, 135: 1-7. 109850.



Received on 16.04.2021           Accepted on 25.09.2021

© A&V Publications. All rights reserved.

Research J. Engineering and Tech. 2021;12(3):79-84.

DOI: 10.52711/2321-581X.2021.00013