Learn data mining from scratch

Data mining is a big subject that covers many areas including predictive analytics, prescriptive analytics, text mining and information discovery. Obviously, we can’t cover the full spectrum of everything data mining offers in just one day, but we can teach basic data mining principles, along with a simple method that is suitable for beginners and straightforward to understand.
Give this basic data mining tutorial a try, and once you understand how it works and what to expect, we can help you build on these fundamentals in small increments, leading you to more advanced skills that achieve some remarkable results.

What is the purpose of data mining?

Data mining is the search for patterns in data that model or represent a topic of interest. The results of data mining are both theoretical and practical. The patterns are useful for the information they provide about the topics mined (theory), and the models are the application of the patterns to create desired outcomes (practical).

Let’s try an experiment.

Here is a table of data that records attempts to solicit charitable donations on the street in front of different types of establishments. The data was collected by a small group of people over a short period, ahead of a large-scale campaign, to identify the best places to solicit donations for the maximum benefit to the charity.

Sample data for a charity campaign

The table looks like a grid in which separate requests for donations are recorded in the rows, and the data about those requests is recorded in the columns.

In this case, the data is the four locations sampled in the study: Grocery store, Beer store, Pizza parlour and Gas pump, as well as the result of whether or not a donation was received. At the points where the rows and columns intersect is the value of the data for each request, where Y is yes and N is no. In total, seven donations were received from twenty solicitations.
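
If it helps to see the same layout expressed in code, here is a minimal sketch of the table as a list of records in Python. The individual rows are not reproduced above, so the counts below are reconstructed from the percentages quoted later in this tutorial; treat them as an illustration of the format rather than the original records.

```python
# Each record is one solicitation: (location, donation received?).
# Counts reconstructed from the percentages quoted later in the text:
# 7 donations (Y) and 13 refusals (N) across 20 solicitations.
records = (
    [("Beer store", "Y")] * 4 + [("Grocery store", "Y")] * 2 + [("Gas pump", "Y")] * 1
    + [("Gas pump", "N")] * 5 + [("Pizza parlour", "N")] * 4
    + [("Beer store", "N")] * 2 + [("Grocery store", "N")] * 2
)

donations = sum(1 for _, outcome in records if outcome == "Y")
print(f"{donations} donations out of {len(records)} solicitations")  # 7 out of 20
```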

The objective of mining this data is to gather information about soliciting charitable donations at these four types of locations. The information gathered will be used to choose where to place canvassers to maximize the results of the upcoming charitable campaign. This is a typical project for data mining, and even though the table has little data and few samples, the format is typical for a data mining project and will provide an excellent learning example.

How does data mining work at a high level?

First, let me explain that data cannot explain everything. If there is no information value in the data that explains the topic we are studying, then there will be no patterns in the data to be found. The strength of our model depends on how much information about the topic is contained in the data.

Our subject has a topic of interest with two outcomes: donation made, yes or no. We need to find and isolate data that relates strongly to when donations were made, and also to when they were not. This related data can be used to understand the occurrence of both outcomes we are interested in. We must also eliminate any data that is strongly shared by both outcomes, as it will not help us differentiate between them or explain why each outcome occurred. Shared data is coincidental, and not useful for modelling.

With the final list of useful data in hand, we study the relationships of the data to the two outcomes of donation and no donation. We call these relationships a model. The donation model can be used to decide which locations the charity should use during the upcoming campaign.

Don’t worry if the high-level explanation is not completely clear yet; the step-by-step example that follows will show you what the theory looks like in practice.

Hands-on data mining.

This data mining example uses four steps.

1. Split the data table in two: one table for success samples and one for failure samples. (Preparation)
2. Look for the data with the strongest connections to the outcomes in each table. (Information)
3. Remove data that adds weight to both tables. (Coincidental)
4. Explain the relationships between the final data and donations. (Modelling)

These steps are typically automated using data mining software, which offers many tools and options for analysis but also hides some of the data mining process. For this example, we will mine a solution by hand, using everyday math, so the process is transparent and easy to understand.

First, split the table into two smaller tables, one for rows with donation success, and the other for rows with donation failure. Here is what they look like:

Success rows

Failure rows
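
As a rough sketch of this step in code, reusing the `records` list from the earlier example, the split takes only a couple of lines:

```python
# Split the records into two lists, one per outcome.
successes = [loc for loc, outcome in records if outcome == "Y"]
failures = [loc for loc, outcome in records if outcome == "N"]
print(len(successes), len(failures))  # 7 13
```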

Dividing the tables this way makes it easier to compare and contrast the data for the two outcomes. Visually, it is easier to see that there are relationships between the data and the outcomes. So far, this is very much like a reporting exercise, which is a very good step to take before beginning any data mining project. Often, reports help us spot problems or topics of interest that we will want to know more about using data mining techniques. These reports also form a baseline of current capability, which we can use to compare with changes suggested by data mining, to make sure that the models are adding value.

Reports are essential to most organizations, and a good beginning for us, but we are not data mining yet. Often, data sets are much larger than what we can fit onto one screen, making it difficult to see clearly what is going on. Data mining can process enormous data sets, which makes it a more efficient and objective means of finding the information you need.

Data mining and information discovery.

Now that we have organized the data into two useful groups, we can begin mining the data. The first phase of data mining will assess the information value that each piece of data has for the topic studied; in other words, we will determine how much each piece of data helps us to explain the topic. As we already said, there are many sophisticated tools available, such as neural nets and logistic regression, but we promised that the math in this example would be easy to understand, so let’s use the average occurrence of the data for each outcome, donation (7 rows) and no donation (13 rows), to mine what we need to know.

The results are in the following tables:

Just the success records

Location         Share of the 7 successes
Beer store       57.1%
Grocery store    28.6%
Gas pump         14.3%
Pizza parlour     0.0%

Just the failure rows

Location         Share of the 13 failures
Gas pump         38.5%
Pizza parlour    30.8%
Beer store       15.4%
Grocery store    15.4%

The first thing we look for is data with no information value, meaning data at 0% (or close to 0%) for both the successes and the failures. In larger data sets, this happens very often. In this example, Pizza has a 0% value for successful donations, meaning it cannot help us understand successful donations at all; but it has a value of 30.8% for failures, which means it can help us understand almost one third of the failures. That is a lot of information value, so we keep Pizza. There is no data in this example that is 0% in both tables, so there is no need to remove or ignore any of the data for that reason.

All the data in this table has some value, providing information about donations, both when they are successful and when they are not. If we rank the information in each table, we see that the best indicators of success are Beer store (57.1%), Grocery store (28.6%), Gas pump (14.3%) and Pizza parlour (0.0%). Similarly, we can rank the failures as Gas pump (38.5%), Pizza parlour (30.8%), Beer store (15.4%) and Grocery store (15.4%). This level of data mining has provided us with information about our subject, which we call information discovery. This is the simplest objective of data mining.
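
Here is a minimal sketch of how these percentages and rankings can be computed, reusing the `successes` and `failures` lists from the split above; the `occurrence_rates` helper is just an illustrative name:

```python
from collections import Counter

locations = ["Beer store", "Grocery store", "Gas pump", "Pizza parlour"]

def occurrence_rates(outcome_rows):
    """Share of an outcome's rows accounted for by each location."""
    counts = Counter(outcome_rows)
    return {loc: counts[loc] / len(outcome_rows) for loc in locations}

success_rates = occurrence_rates(successes)  # Beer store: 4/7 = 57.1%, ...
failure_rates = occurrence_rates(failures)   # Gas pump: 5/13 = 38.5%, ...

# Rank locations by how much of each outcome they explain.
for loc in sorted(locations, key=lambda l: -success_rates[l]):
    print(f"{loc}: {success_rates[loc]:.1%} of successes, "
          f"{failure_rates[loc]:.1%} of failures")
```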

If we were to use this information to maximize success for the upcoming campaign, I would recommend that we solicit at beer stores and grocery stores, but that we avoid gas pumps and pizza places. This information alone provides substantial value to the charitable campaign, as it will position canvassers in the best places to collect the most donations.

Let’s build a model.

Many data mining projects stop at this point, because useful information has been discovered. There are projects, however, where this is not enough. For example, in projects with enormous data sets, there might still be too much data to get a clear picture of what to do. Even in this small example, we have data that appears in both the success and the failure tables, and a reconciliation of when and where to rely on that data might be useful to decision makers. That is why most data mining efforts go past information gathering to organize the data into models.

A data mining model organizes the data to tell us a clear story of what happened and provides guidance for similar decisions in the future. It also reconciles the data conflicts between the success and failure cases in a meaningful way. Data mining software is very good at doing this objectively, but once again we will give a rough idea of what the software can do by working it out by hand using simple averages.

We already ordered the data within each table by the amount of information it provided, but for this exercise, we will need to reconcile the data between the tables. For each location in our study, we will calculate its rate of occurrence in each table by dividing its number of solicitations in that table by its total number of solicitations across both tables together. We will then choose the larger of the two values for each location, and order the results from most important to least. In this example:

1. Pizza parlour location: 100% failure
2. Gas pump location: 83% failure
3. Beer store location: 67% success
4. Grocery store location: 50% success

In other words, pizza parlour solicitations failed 100% of the time and gas pump solicitations failed 83% of the time, while beer store solicitations succeeded 67% of the time and grocery store solicitations succeeded half of the time. Going down the list in order, the model says stick to beer and grocery locations for success. I bet many people could have picked out the right answers just by staring at the original table, and this process might look like the hard way to go about the problem, but it really wasn’t.
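
For those following along in code, here is a sketch of the same reconciliation, again reusing the `successes` and `failures` lists from the earlier split:

```python
from collections import Counter

success_counts = Counter(successes)
failure_counts = Counter(failures)

model = {}
for loc in set(successes) | set(failures):
    total = success_counts[loc] + failure_counts[loc]
    success_rate = success_counts[loc] / total
    # Keep whichever outcome dominates at this location.
    if success_rate >= 0.5:
        model[loc] = ("success", success_rate)
    else:
        model[loc] = ("failure", 1 - success_rate)

# Order from most decisive to least, as in the list above.
for loc, (outcome, rate) in sorted(model.items(), key=lambda kv: -kv[1][1]):
    print(f"{loc}: {rate:.0%} {outcome}")
```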

The answer to the question about where to solicit donations wasn’t what was important here. What we demonstrated was an overview of the method using an easy-to-understand example. Remember that most data sets are too large to pick out all the important information by eye, and also that most of this process is automated, so in reality you don’t need to split tables and do the calculations yourself. If you understand that the outcomes are modelled for maximum information value, while also differentiating between outcomes to provide information for clear choices, then you understand the basic principles.

So, if most of the data mining process is automated, then what’s the catch? Couldn’t I just attach some software to my data and get the same answers automatically? Well, somewhat. There are people working on that type of solution, but it’s just not there yet. There are some aspects of data mining that we glossed over in this example, which I will touch on now.

First, selecting the right topic to study is not as easy as it seems. In this data set, the outcome of each solicitation, donation yes/no, was recorded for us, but often the topic is not as clearly defined as it is in this example. The data mining term for the topic you are studying is the output variable, or target variable.

Partnered with the target variable is selecting a good population for mining. Depending on which samples are chosen, the representation of the target variable will change as well. You will want to choose a population that matches, as closely as possible, the population to which the model will be applied.

Probably the most interesting and complex part of the process comes from getting the data ready. Often, data does not have clear labels like it does in this example, and there are usually missing values, bizarre values and completely wrong values stored in the data. Even with clean data, you will spend a lot of time turning data of limited use, like a birth date, into something more useful, like an age. At a conservative estimate, 80% of the human effort in data mining goes into data understanding, data preparation and data staging before the data is ready to be presented to modelling algorithms for analysis.
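
As a small illustration of that kind of preparation work, here is a sketch of turning a birth date into an age in Python; the function name and field are hypothetical, not part of the charity example:

```python
from datetime import date

def age_from_birth_date(birth_date: date, today: date | None = None) -> int:
    """Derive an age in whole years from a birth date."""
    today = today or date.today()
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)

print(age_from_birth_date(date(1990, 6, 15), today=date(2024, 3, 1)))  # 33
```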

And of course, we already mentioned that there is a blizzard of tools and techniques you can apply to data once it is ready. That said, there are some everyday “go-to” options that I use time and again, which would serve any new data miner very well. You can add the other tools slowly, over time, as they are needed. There is absolutely no need to rush in and try to learn everything right away.

And that is the takeaway from learning data mining in a day. Yes, there are a lot of tools and options out there to learn, but you do not need to know all of them to data mine with success. Start small and build on what you know over time; even starting small, you are data mining. Using the simplest of skills, like averaging, we found the information we needed in the data today, and we built a simple model from what we learned.

If you have no interest in data mining yourself, but wanted to know more about how it works before potentially hiring someone else to do the work for you, hopefully this walkthrough will help you avoid one of the biggest misunderstandings people have about data mining. Data mining does not create success or failure; it is the science of finding and understanding the success and failure that is hiding in your data. If success isn’t there now, data mining can’t create it.

And that is data mining in a day. Congratulations on getting through to this point. If you want to delve a little deeper into some of the more powerful aspects of data mining, improving your skills beyond simple averages, we have other courses and materials that incrementally add to the basic knowledge you gained here. Just click on the button below and browse what we offer.

If you want to look at more data mining theory, we have published the following articles in our library that you might find interesting:


Monkey Path