A beginner’s guide to data mining



As the technology landscape becomes increasingly dependent on data analytics, the term “data mining” has become a popular buzzword in many corporate circles. Generally, this term refers to “any form of large-scale data or information processing,” including, but not limited to, extraction, analysis, warehousing, and collection.


Here is a quick guide to this relevant and storied process.


How did data mining begin?


Early data mining techniques can be traced back to the 1960s, though the statistical methods behind them go as far back as the 1700s, with Bayes’ theorem. Statisticians initially coined terms like “data fishing” and “data dredging” to dismiss mining as a bad practice that lacked an a priori hypothesis. Later terms to describe the process included “data archaeology” and “information harvesting.” It was not until the early 1990s that “data mining” became the general catch-all term for such practices.


In 1943, Warren McCulloch and Walter Pitts became the first people to create a conceptual model of a neural network, presenting it in a paper titled “A logical calculus of the ideas immanent in nervous activity.” The paper essentially boiled down to three capabilities: neurons in a network are able to “receive inputs, process inputs, and generate output.”
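
The “receive inputs, process inputs, and generate output” idea can be illustrated with a minimal sketch of a threshold neuron in the spirit of the McCulloch-Pitts model. The function names, weights, and thresholds below are illustrative assumptions, not taken from the original paper: the neuron receives binary inputs, sums them with fixed weights, and outputs 1 only if the sum reaches a threshold.

```python
def threshold_neuron(inputs, weights, threshold):
    """Hypothetical helper: fire (return 1) if the weighted sum of
    binary inputs reaches the threshold, otherwise return 0."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # "Process inputs": combine them into a single weighted sum.
    total = sum(x * w for x, w in zip(inputs, weights))
    # "Generate output": a binary decision against the threshold.
    return 1 if total >= threshold else 0


def logical_and(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return threshold_neuron([a, b], [1, 1], threshold=2)


def logical_or(a, b):
    # A single active input is enough to reach a threshold of 1.
    return threshold_neuron([a, b], [1, 1], threshold=1)
```

With suitable thresholds, the same simple unit can behave like a logic gate, which is roughly why the model was seen as a building block for computation.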


McCulloch and Pitts’s findings were later implemented in sophisticated database management systems during the 1970s. It then became possible to “store and query terabytes and petabytes of data,” moving users away from a transaction-based data mindset and towards a more analytical one.


Large-scale academic research on data mining took off in the mid-1990s, and data science became recognized as an independent discipline in the early 2000s. Publications like Michael Lewis’s “Moneyball” brought the concept to the general public, showing how data analytics could be applied to new and interesting fields, in this case professional baseball. From there, data mining became much more of a household name in tech, growing in prevalence across a variety of industries, including business, engineering, and medicine.


How is data mining used today?


In most cases, today’s data mining is used to discover key patterns in large data sets at “the intersection of machine learning, statistics, and database systems.” Perhaps the most active current data mining technique is deep learning, a broad branch of machine learning based on “learning data representations” without the need for full supervision. It is just one significant example of mining’s potential in a modern context.
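
To make “discovering patterns” a little more concrete, here is a small sketch of one classic, much simpler idea: counting which pairs of items show up together often in a set of transactions. The function name, the sample baskets, and the `min_support` cutoff are all made up for illustration, and real data mining tools use far more scalable methods.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Hypothetical helper: return item pairs that appear together
    in at least `min_support` transactions, with their counts."""
    counts = Counter()
    for items in transactions:
        # Deduplicate and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Invented sample data: each list is one shopping basket.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

patterns = frequent_pairs(baskets, min_support=2)
```

Here `patterns` holds every pair bought together in at least two baskets, such as bread with milk. This is the kind of pattern a retailer might use to decide which products to shelve side by side.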


Overall, data mining stands as a major component of big data’s growth into a widespread, commonplace tech term.