Abstract
The emergence of big data has necessitated the development of new analytical methods in the interdisciplinary field of knowledge discovery and data mining. These methods go beyond traditional statistical approaches, employing both deductive and inductive processes to extract new knowledge from vast amounts of data. By considering a larger number of joint, interactive, and independent predictors, data mining addresses causal heterogeneity and enhances prediction capabilities. Rather than challenging conventional model-building approaches, data mining complements them by improving model goodness of fit, uncovering hidden patterns, identifying nonlinear and non-additive effects, and providing insights into developments in data, methods, and theory. Data mining also enriches scientific discovery by revealing valid and significant findings.
Machine learning, in turn, uses models and algorithms to learn from data, particularly when the explicit model structure is unclear or good performance is difficult to achieve. Recent developments integrate this predictive modeling paradigm with the classical approach of parameter-estimation regression, yielding improved models that combine explanation and prediction.
In this era of big data, knowledge discovery and data mining have revolutionized research processes across various fields, including the social sciences. Such research projects require domain knowledge from experts in diverse disciplines, as well as expertise in data processing, database technology, and statistical and computational algorithms. Data mining technologies enable the discovery of previously hidden patterns, fostering innovation and the development of new theories.
This paper explores the epistemological contributions of data mining to theory innovation and discusses the implications of big data. By situating knowledge discovery and data mining within the philosophical and methodological traditions of scientific research, we highlight their strengths and challenges. We provide a systematic explanation of key procedures in supervised and unsupervised machine